
[CI] Chaotic devnet test #4069

Closed
ljedrz wants to merge 4 commits into staging from ci/network_delay_test

Conversation

ljedrz (Collaborator) commented Jan 13, 2026

This PR introduces a chaotic-network-runner.sh script which can be used to simulate chaotic network conditions while running other scripts. As an example, it is currently applied to devnet_ci.sh as a new CI job.

All the other CI jobs are temporarily removed until this is ready.

Filing this as a draft until it's decided which CI should be used for it, and which jobs to enhance with it (and whether those would be new jobs, or updates to the existing ones).
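A minimal sketch of the wrapper pattern, assuming only what the usage and cleanup snippets quoted below show (the chaos loop body, apply_chaos_loop, and the argument handling are placeholders, not the actual script):

#!/bin/bash
# Hypothetical shape of chaotic-network-runner.sh; not the real implementation.
TARGET_PORTS="$1"; shift

apply_chaos_loop() {
    # Placeholder: repeatedly apply random network stress to $TARGET_PORTS.
    while true; do
        echo "[Chaos Runner] stressing ports $TARGET_PORTS"
        sleep 5
    done
}

apply_chaos_loop &     # chaos runs in the background...
CHAOS_PID=$!

"$@"                   # ...while the wrapped test command runs in the foreground
EXIT_CODE=$?

# Cleanup
echo "[Chaos Runner] 🏁 Main command finished with exit code $EXIT_CODE."
kill "$CHAOS_PID" 2>/dev/null
exit "$EXIT_CODE"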

ljedrz force-pushed the ci/network_delay_test branch 9 times, most recently from f73f7e0 to 78c3242 on January 13, 2026 13:18
vicsn self-requested a review on January 14, 2026 21:30

- name: Run Tests with Chaos
  # Arguments: <PORT_RANGE> <TEST_COMMAND>
  run: ./scripts/chaotic-network-runner.sh 5000-5003 ./.ci/devnet_ci.sh 4 2 0 45
Collaborator

The original issue asks for a specific set of behaviours which is not the same as what devnet_ci.sh does: #4062

ljedrz (Collaborator, Author) Jan 20, 2026

Do we already have a test which restarts validators and potentially removes some of their ledgers? If not, we should have one that's separate from this script; it's designed to apply random network stress to existing scenarios, which allows us to use it in tandem with any base scenario.

Collaborator

Not yet, can you write such a script? .ci/upgrade_nodes_ci.sh comes closest in functionality.

There's an open question of what machine size / number of validators / runtime reliably triggers issues. If we can reliably trigger issues on a heavy machine within 30 minutes, that would be amazing, and a 40x improvement over our status quo (where we launch 40 machines).

I would first test the basic functionality on the free default GitHub Actions machine, and then migrate it back to CircleCI, where we can make it part of the release job (and can manually trigger it).

Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
ljedrz force-pushed the ci/network_delay_test branch from 78c3242 to 4cd115a on January 22, 2026 08:39
ljedrz (Collaborator, Author) commented Jan 22, 2026

Updated the description and rebased now that the PR this had been built on (#4059) has been merged. There were no changes to the contents of the script or the job.

Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
ljedrz (Collaborator, Author) commented Jan 22, 2026

I added a script according to the following:

repeatedly remove the ledger of a quorum minority (f out of (3f+1)) of the validators and restart them.
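A minimal sketch of that loop (the validator count, the .ledger-<i> path, restart_validator, and the round count are all hypothetical here):

total_validators=7
f=$(( (total_validators - 1) / 3 ))      # quorum minority: f out of 3f+1

for round in 1 2 3; do
    # Pick f validators at random, wipe their ledgers, and restart them.
    for i in $(shuf -i 0-$(( total_validators - 1 )) -n "$f"); do
        rm -rf ".ledger-$i"              # hypothetical ledger path
        restart_validator "$i"           # hypothetical helper
    done
    sleep 30
done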

vicsn requested a review from meddle0x53 on January 22, 2026 13:12
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
ljedrz force-pushed the ci/network_delay_test branch from b49b451 to 53af408 on January 23, 2026 09:47
ljedrz (Collaborator, Author) commented Jan 23, 2026

I added a script according to the following:

stop a quorum majority (f+1 out of (3f+1)) once for 30 seconds, then restart them.
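A minimal sketch of that single round (stop_validator and start_validator are hypothetical helpers; which f+1 validators get picked is arbitrary here):

total_validators=7
majority=$(( (total_validators - 1) / 3 + 1 ))   # f+1 out of 3f+1

# Stop f+1 validators once, wait 30 seconds, then bring them back.
for i in $(seq 0 $(( majority - 1 ))); do
    stop_validator "$i"     # hypothetical helper
done
sleep 30
for i in $(seq 0 $(( majority - 1 ))); do
    start_validator "$i"    # hypothetical helper
done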

meddle0x53 (Contributor) left a comment

So some suggestions/comments from me.

@@ -0,0 +1,90 @@
#!/bin/bash

Contributor

It is a good idea to add strict mode at the top to catch unset vars, pipeline failures, and whitespace issues. If there is a bug, this will make it easier to fail fast and debug. I learned this from bad experience with my own recent scripts.

Suggested change
set -euo pipefail
IFS=$'\n\t'

# ==========================================

# 1. Require Port Range (Argument 1)
TARGET_PORTS=$1
Contributor

It is good to quote all the vars (as I did in the PR for delay-network.sh), because unquoted variables change meaning depending on their contents, and Bash won't warn us. We could pass them in a bad way in CI and then end up debugging where stray whitespace came from.
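For instance, the assignment above and the eventual call site could look like this (a sketch, assuming the remaining arguments form the test command):

TARGET_PORTS="$1"
shift
TEST_COMMAND=("$@")

# Quoted expansions keep whitespace inside arguments intact:
"${TEST_COMMAND[@]}"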


# 1. Require Port Range (Argument 1)
TARGET_PORTS=$1

Contributor

We can add a die() function for the errors:

die() {
  echo "Error: $*" >&2
  exit 1
}

and also a log function for clear logs, like I did for the delay-network.sh script. Helps debugging; not important, of course.

Also

need_cmd sudo
need_cmd shuf

NETWORK_SCRIPT="./scripts/delay-network.sh"
[[ -f "$NETWORK_SCRIPT" ]] || die "Network script not found: $NETWORK_SCRIPT"
[[ -x "$NETWORK_SCRIPT" ]] || die "Network script is not executable: $NETWORK_SCRIPT"

would be a good check.
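(need_cmd is not defined in this diff; presumably it comes from utils.sh. A minimal sketch of such a helper, reusing the die() above:

need_cmd() {
  command -v "$1" >/dev/null 2>&1 || die "Required command not found: $1"
}
)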


# Cleanup
echo "[Chaos Runner] 🏁 Main command finished with exit code $EXIT_CODE."
kill $CHAOS_PID 2>/dev/null
Contributor

Maybe it can be part of a trap. It is not important here, as this is a CI run and the CI job will just exit and everything will be cleaned up, I guess, but for local runs later it is a good way to clean up:

cleanup() {
    if [ -n "${CHAOS_PID:-}" ]; then
        kill "$CHAOS_PID" 2>/dev/null || true
        wait "$CHAOS_PID" 2>/dev/null || true
    fi
    reset_network
}

trap cleanup EXIT

# Check if PID exists (is not empty) and is currently running
if [[ -n "$pid" ]] && kill -0 "$pid" 2>/dev/null; then
    echo "Killing PIDS[$i] -> $pid"
    kill -9 "$pid" 2>/dev/null || true
Contributor

Do you want to kill the nodes with -9? Wouldn't SIGTERM be better? Maybe -9 can lead to some problems with the ledger and random failures, maybe not; just asking.

Otherwise these two functions seem great to me.
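If -9 does turn out to corrupt ledgers, a gentler escalation could look like this (a sketch; the 5-second grace period is arbitrary):

# Ask the node to shut down cleanly, escalate only if it hangs.
kill "$pid" 2>/dev/null || true          # SIGTERM first
for _ in $(seq 1 5); do
    kill -0 "$pid" 2>/dev/null || break  # already gone
    sleep 1
done
if kill -0 "$pid" 2>/dev/null; then
    kill -9 "$pid" 2>/dev/null || true   # SIGKILL as a last resort
fi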

@@ -0,0 +1,84 @@
#!/bin/bash

Contributor

The script itself does what it sets out to do, and does it well.

If you want, you can apply some of the suggestions I added for chaotic-network-runner.sh: strict mode, quoting the $ vars, and the shared helper functions for logging and exiting, which come from utils.sh in this PR.

@@ -0,0 +1,85 @@
#!/bin/bash

Contributor

Same comment as on the minority script.


# Network parameters
total_validators=7
majority=$(( (total_validators - 1) / 3 + 1 ))
Contributor

Shouldn't this be over 50% to be majority?
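For reference, with the values above this works out as follows (f+1 can break the 2f+1 quorum even though it is under 50% of the nodes):

total_validators=7
f=$(( (total_validators - 1) / 3 ))   # f = 2
majority=$(( f + 1 ))                 # 3 out of 7: below 50%,
                                      # but enough to drop the network below the 2f+1 = 5 quorum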

kaimast (Contributor) commented Feb 5, 2026

I am closing this in favor of #4095.

kaimast closed this on Feb 5, 2026