Conversation
f73f7e0 to
78c3242
Compare
.github/workflows/chaotic-devnet.yml
Outdated
|
|
||
| - name: Run Tests with Chaos | ||
| # Arguments: <PORT_RANGE> <TEST_COMMAND> | ||
| run: ./scripts/chaotic-network-runner.sh 5000-5003 ./.ci/devnet_ci.sh 4 2 0 45 |
There was a problem hiding this comment.
The original issue asks for a specific set of behaviours which is not the same as what devnet_ci.sh does: #4062
There was a problem hiding this comment.
do we already have a test which restarts validators and potentially removes some of their ledgers? if not, we should have one that's separate from this script - it's designed to apply random network stress to existing scenarios, which allows us to use it in tandem with any base scenario
There was a problem hiding this comment.
Not yet, can you write such a script? .ci/upgrade_nodes_ci.sh comes closest in functionality.
There's an open question of what machine size / number of validators / runtime reliably triggers issues. If we can reliably trigger issues on a heavy machine within 30 minutes, that would be amazing, and a 40x improvement over our status quo (where we launch 40 machines).
I would first test the basic functionality on the free default github actions machine, and then migrate it back to CircleCI where we make part of the release job (and we can manually trigger it).
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
78c3242 to
4cd115a
Compare
|
Updated the description and rebased now that the PR this had been built on (#4059) has been merged. There were no changes to the contents of the script or the job. |
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
|
I added a script according to the following:
|
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
b49b451 to
53af408
Compare
|
I added a script according to the following:
|
meddle0x53
left a comment
There was a problem hiding this comment.
So some suggestions/comments from me.
| @@ -0,0 +1,90 @@ | |||
| #!/bin/bash | |||
|
|
|||
There was a problem hiding this comment.
It is a good idea to add strict mode at the top to catch unset vars/pipeline failures/whitespace issues. If there is a bug it will make it easier to fail and debug. I learned this from bad experience with my last scripts.
| set -euo pipefail | |
| IFS=$'\n\t' |
| # ========================================== | ||
|
|
||
| # 1. Require Port Range (Argument 1) | ||
| TARGET_PORTS=$1 |
There was a problem hiding this comment.
It is good to quote all the vars (as I did in the PR for the delay-network.sh), because unquoted variables change meaning depending on their contents, and Bash won’t warn us. We can pass them in a bad way in the CI and then debug why some white spaces are there.
|
|
||
| # 1. Require Port Range (Argument 1) | ||
| TARGET_PORTS=$1 | ||
|
|
There was a problem hiding this comment.
We can add
die() {
echo "Error: $*" >&2
exit 1
}
function for the errors and also log function for clear logs - like I did for the delay-network.sh script. Helps debugging, not important, of course.
Also
need_cmd sudo
need_cmd shuf
NETWORK_SCRIPT="./scripts/delay-network.sh"
[[ -f "$NETWORK_SCRIPT" ]] || die "Network script not found: $NETWORK_SCRIPT"
[[ -x "$NETWORK_SCRIPT" ]] || die "Network script is not executable: $NETWORK_SCRIPT"
would be a good check.
|
|
||
| # Cleanup | ||
| echo "[Chaos Runner] 🏁 Main command finished with exit code $EXIT_CODE." | ||
| kill $CHAOS_PID 2>/dev/null |
There was a problem hiding this comment.
Maybe it can be part of a trap. It is not important as it is a CI run and the CI job will just exit and all will be cleared, I guess, but for local runs later is a good way to cleanup:
cleanup() {
if [ -n "${CHAOS_PID:-}" ]; then
kill "$CHAOS_PID" 2>/dev/null || true
wait "$CHAOS_PID" 2>/dev/null || true
fi
reset_network
}
trap cleanup EXIT
| # Check if PID exists (is not empty) and is currently running | ||
| if [[ -n "$pid" ]] && kill -0 "$pid" 2>/dev/null; then | ||
| echo "Killing PIDS[$i] -> $pid" | ||
| kill -9 "$pid" 2>/dev/null || true |
There was a problem hiding this comment.
Do you want to kill the nodes with -9, better SIGTERM? Maybe it can lead to some problems with the ledger and random failures, maybe not, just asking?
Otherwise these two functions seem great to me.
| @@ -0,0 +1,84 @@ | |||
| #!/bin/bash | |||
|
|
|||
There was a problem hiding this comment.
The script itself seems to do its idea right - in a great way.
If you want you can do some of my suggestions I added to chaotic-network-runner.sh.
The strict mode, variables use helpers for logging, actually shared helper functions for logging and exiting are coming from utils.sh in this PR. Maybe use them + strict mode + quoting the $ vars.
| @@ -0,0 +1,85 @@ | |||
| #!/bin/bash | |||
|
|
|||
There was a problem hiding this comment.
Same comment as on the minority script.
|
|
||
| # Network parameters | ||
| total_validators=7 | ||
| majority=$(( (total_validators - 1) / 3 + 1 )) |
There was a problem hiding this comment.
Shouldn't this be over 50% to be majority?
|
I am closing this in favor of #4095. |
This PR introduces a
chaotic-network-runner.shscript which can be used to simulate chaotic network conditions while running other scripts. As a an example, it is currently applied todevnet_ci.shas a new CI job.All the other CI jobs are temporarily removed until this is ready.
Filing this as a draft until it's decided which CI should be used for it, and which jobs to enhance with it (and whether those would be new jobs, or updates to the existing ones).