
[CI] Chaotic devnet test #4069

Closed
ljedrz wants to merge 4 commits into staging from ci/network_delay_test

Conversation

ljedrz (Collaborator) commented Jan 13, 2026

This PR introduces a chaotic-network-runner.sh script which can be used to simulate chaotic network conditions while running other scripts. As an example, it is currently applied to devnet_ci.sh as a new CI job.

All the other CI jobs are temporarily removed until this is ready.

Filing this as a draft until it's decided which CI should be used for it, and which jobs to enhance with it (and whether those would be new jobs, or updates to the existing ones).
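A minimal sketch of the wrapper pattern, assuming only what the usage and cleanup snippets quoted below show (the chaos loop body, apply_chaos_loop, and the argument handling are placeholders, not the actual script):

#!/bin/bash
# Hypothetical shape of chaotic-network-runner.sh; not the real implementation.
TARGET_PORTS="$1"; shift

apply_chaos_loop() {
    # Placeholder: repeatedly apply random network stress to $TARGET_PORTS.
    while true; do
        echo "[Chaos Runner] stressing ports $TARGET_PORTS"
        sleep 5
    done
}

apply_chaos_loop &     # chaos runs in the background...
CHAOS_PID=$!

"$@"                   # ...while the wrapped test command runs in the foreground
EXIT_CODE=$?

# Cleanup
echo "[Chaos Runner] 🏁 Main command finished with exit code $EXIT_CODE."
kill "$CHAOS_PID" 2>/dev/null
exit "$EXIT_CODE"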

ljedrz force-pushed the ci/network_delay_test branch 9 times, most recently from f73f7e0 to 78c3242 on January 13, 2026 13:18
vicsn self-requested a review on January 14, 2026 21:30

- name: Run Tests with Chaos
  # Arguments: <PORT_RANGE> <TEST_COMMAND>
  run: ./scripts/chaotic-network-runner.sh 5000-5003 ./.ci/devnet_ci.sh 4 2 0 45
Collaborator

The original issue asks for a specific set of behaviours which is not the same as what devnet_ci.sh does: #4062

ljedrz (Collaborator, Author) Jan 20, 2026

Do we already have a test which restarts validators and potentially removes some of their ledgers? If not, we should have one that's separate from this script; it's designed to apply random network stress to existing scenarios, which allows us to use it in tandem with any base scenario.

Collaborator

Not yet, can you write such a script? .ci/upgrade_nodes_ci.sh comes closest in functionality.

There's an open question of what machine size / number of validators / runtime reliably triggers issues. If we can reliably trigger issues on a heavy machine within 30 minutes, that would be amazing, and a 40x improvement over our status quo (where we launch 40 machines).

I would first test the basic functionality on the free default GitHub Actions machine, and then migrate it back to CircleCI, where we can make it part of the release job (and can manually trigger it).

Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
ljedrz force-pushed the ci/network_delay_test branch from 78c3242 to 4cd115a on January 22, 2026 08:39
ljedrz (Collaborator, Author) commented Jan 22, 2026

Updated the description and rebased now that the PR this had been built on (#4059) has been merged. There were no changes to the contents of the script or the job.

Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
ljedrz (Collaborator, Author) commented Jan 22, 2026

I added a script according to the following:

repeatedly remove the ledger of a quorum minority (f out of (3f+1)) of the validators and restart them.
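A minimal sketch of that loop (the validator count, the .ledger-<i> path, restart_validator, and the round count are all hypothetical here):

total_validators=7
f=$(( (total_validators - 1) / 3 ))      # quorum minority: f out of 3f+1

for round in 1 2 3; do
    # Pick f validators at random, wipe their ledgers, and restart them.
    for i in $(shuf -i 0-$(( total_validators - 1 )) -n "$f"); do
        rm -rf ".ledger-$i"              # hypothetical ledger path
        restart_validator "$i"           # hypothetical helper
    done
    sleep 30
done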

vicsn requested a review from meddle0x53 on January 22, 2026 13:12
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
ljedrz force-pushed the ci/network_delay_test branch from b49b451 to 53af408 on January 23, 2026 09:47
ljedrz (Collaborator, Author) commented Jan 23, 2026

I added a script according to the following:

stop a quorum majority (f+1 out of (3f+1)) once for 30 seconds, then restart them.
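A minimal sketch of that single round (stop_validator and start_validator are hypothetical helpers; which f+1 validators get picked is arbitrary here):

total_validators=7
majority=$(( (total_validators - 1) / 3 + 1 ))   # f+1 out of 3f+1

# Stop f+1 validators once, wait 30 seconds, then bring them back.
for i in $(seq 0 $(( majority - 1 ))); do
    stop_validator "$i"     # hypothetical helper
done
sleep 30
for i in $(seq 0 $(( majority - 1 ))); do
    start_validator "$i"    # hypothetical helper
done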

meddle0x53 (Contributor) left a comment

So some suggestions/comments from me.

@@ -0,0 +1,90 @@
#!/bin/bash

Contributor

It is a good idea to add strict mode at the top to catch unset vars, pipeline failures, and whitespace issues. If there is a bug, this will make it easier to fail fast and debug. I learned this from bad experience with my own recent scripts.

Suggested change
set -euo pipefail
IFS=$'\n\t'

# ==========================================

# 1. Require Port Range (Argument 1)
TARGET_PORTS=$1
Contributor

It is good to quote all the vars (as I did in the PR for delay-network.sh), because unquoted variables change meaning depending on their contents, and Bash won't warn us. We could pass them in a bad way in CI and then end up debugging where stray whitespace came from.
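For instance, the assignment above and the eventual call site could look like this (a sketch, assuming the remaining arguments form the test command):

TARGET_PORTS="$1"
shift
TEST_COMMAND=("$@")

# Quoted expansions keep whitespace inside arguments intact:
"${TEST_COMMAND[@]}"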


# 1. Require Port Range (Argument 1)
TARGET_PORTS=$1

Contributor

We can add a die() function for the errors:

die() {
  echo "Error: $*" >&2
  exit 1
}

and also a log function for clear logs, like I did for the delay-network.sh script. Helps debugging; not important, of course.

Also

need_cmd sudo
need_cmd shuf

NETWORK_SCRIPT="./scripts/delay-network.sh"
[[ -f "$NETWORK_SCRIPT" ]] || die "Network script not found: $NETWORK_SCRIPT"
[[ -x "$NETWORK_SCRIPT" ]] || die "Network script is not executable: $NETWORK_SCRIPT"

would be a good check.
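(need_cmd is not defined in this diff; presumably it comes from utils.sh. A minimal sketch of such a helper, reusing the die() above:

need_cmd() {
  command -v "$1" >/dev/null 2>&1 || die "Required command not found: $1"
}
)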


# Cleanup
echo "[Chaos Runner] 🏁 Main command finished with exit code $EXIT_CODE."
kill $CHAOS_PID 2>/dev/null
Contributor

Maybe it can be part of a trap. It is not important here, as this is a CI run and the CI job will just exit and everything will be cleaned up, I guess, but for local runs later it is a good way to clean up:

cleanup() {
    if [ -n "${CHAOS_PID:-}" ]; then
        kill "$CHAOS_PID" 2>/dev/null || true
        wait "$CHAOS_PID" 2>/dev/null || true
    fi
    reset_network
}

trap cleanup EXIT

# Check if PID exists (is not empty) and is currently running
if [[ -n "$pid" ]] && kill -0 "$pid" 2>/dev/null; then
    echo "Killing PIDS[$i] -> $pid"
    kill -9 "$pid" 2>/dev/null || true
Contributor

Do you want to kill the nodes with -9? Wouldn't SIGTERM be better? Maybe -9 can lead to some problems with the ledger and random failures, maybe not; just asking.

Otherwise these two functions seem great to me.
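If -9 does turn out to corrupt ledgers, a gentler escalation could look like this (a sketch; the 5-second grace period is arbitrary):

# Ask the node to shut down cleanly, escalate only if it hangs.
kill "$pid" 2>/dev/null || true          # SIGTERM first
for _ in $(seq 1 5); do
    kill -0 "$pid" 2>/dev/null || break  # already gone
    sleep 1
done
if kill -0 "$pid" 2>/dev/null; then
    kill -9 "$pid" 2>/dev/null || true   # SIGKILL as a last resort
fi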

@@ -0,0 +1,84 @@
#!/bin/bash

Contributor

The script itself does what it sets out to do, and does it well.

If you want, you can apply some of the suggestions I added for chaotic-network-runner.sh: strict mode, quoting the $ vars, and the shared helper functions for logging and exiting, which come from utils.sh in this PR.

@@ -0,0 +1,85 @@
#!/bin/bash

Contributor

Same comment as on the minority script.


# Network parameters
total_validators=7
majority=$(( (total_validators - 1) / 3 + 1 ))
Contributor

Shouldn't this be over 50% to be majority?
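For reference, with the values above this works out as follows (f+1 can break the 2f+1 quorum even though it is under 50% of the nodes):

total_validators=7
f=$(( (total_validators - 1) / 3 ))   # f = 2
majority=$(( f + 1 ))                 # 3 out of 7: below 50%,
                                      # but enough to drop the network below the 2f+1 = 5 quorum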

kaimast (Contributor) commented Feb 5, 2026

I am closing this in favor of #4095.

kaimast closed this on Feb 5, 2026