chore: add disaster simulation script #259

Merged
jason-lynch merged 3 commits into main from chore/disaster-simulation on Mar 5, 2026

@jason-lynch jason-lynch (Member) commented Jan 28, 2026

Summary

Adds a script to simulate losing a host. This script has three different ways of simulating that loss to enable us to develop recovery steps for Swarm and Control Plane/Etcd in parallel.

Testing

NOTE - You can get this script without checking out this whole branch by doing:

git fetch origin chore/disaster-simulation:chore/disaster-simulation
git restore --source chore/disaster-simulation hack/simulate-disaster.sh

Then to use the script:

# Simulate losing Swarm on two hosts in order to lose quorum
./hack/simulate-disaster.sh swarm host-1 host-3

# Simulate losing Control Plane/Etcd on two hosts in order to lose quorum
./hack/simulate-disaster.sh etcd host-1 host-3

# Reset the fixture back to its initial state
./hack/simulate-disaster.sh reset

# Remember to include the fixture variant if you're using a non-default one
FIXTURE_VARIANT=small ./hack/simulate-disaster.sh reset

# Print the included help text to see more examples
./hack/simulate-disaster.sh --help

Adds a script to simulate losing a host. This script has three different
ways of simulating that loss to enable us to develop recovery steps for
Swarm and Control Plane/Etcd in parallel.

coderabbitai bot commented Jan 28, 2026

Walkthrough

Adds a new Bash script hack/simulate-disaster.sh to automate disaster simulations against Lima-based test fixtures, providing commands to simulate Swarm node loss, etcd node loss, full host loss, and a reset workflow for teardown and redeployment.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Disaster Simulation Script: `hack/simulate-disaster.sh` | New Bash script implementing simulation functions: simulate_swarm_node_loss, simulate_etcd_node_loss, simulate_full_loss, reset, usage, and main. Handles SSH commands, Swarm/etcd service cleanup, Lima host stop/delete, fixture deployment, and control-plane redeployment via make targets. |
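
As a rough illustration of the structure summarized above, the sketch below wires the named functions to a `case` dispatch. Only the function names and the modes come from this PR; the bodies are placeholder `echo` stubs, and the script's actual SSH, Lima, and make logic is not reproduced. It uses `shift` followed by `"$@"`, which forwards the same arguments as the `"${@:2}"` slices mentioned in the review.

```shell
#!/usr/bin/env bash
# Hypothetical skeleton: names follow the walkthrough, bodies are stubs.

simulate_swarm_node_loss() {
  for host_id in "$@"; do echo "swarm loss: ${host_id}"; done
}

simulate_etcd_node_loss() {
  for host_id in "$@"; do echo "etcd loss: ${host_id}"; done
}

simulate_full_loss() {
  for host_id in "$@"; do echo "full loss: ${host_id}"; done
}

# Shadows the terminal `reset` command inside this script only.
reset() {
  echo "reset fixture"
}

usage() {
  echo "Usage: $1 <swarm|etcd|full> <host-id> [host-id ...]" >&2
  echo "       $1 reset" >&2
}

main() {
  mode=${1:-}
  # Drop the mode so that "$@" holds only the host IDs.
  [ "$#" -gt 0 ] && shift
  case "$mode" in
    swarm) simulate_swarm_node_loss "$@" ;;
    etcd)  simulate_etcd_node_loss "$@" ;;
    full)  simulate_full_loss "$@" ;;
    reset) reset ;;
    *)     usage "simulate-disaster.sh"; return 1 ;;
  esac
}

main swarm host-1 host-3
```

Keeping each mode in its own function means the dispatch stays a thin mapping from mode name to function, which is where a misnamed call would surface.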

Poem

I hop in the night with a clipboard and code,
Pulling wires where Lima test fixtures bode,
Nodes vanish, services scatter—then bloom,
I patch and I reset, bring order to gloom,
A rabbit, a script, and a cluster renewed. 🐇✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ❓ Inconclusive | The description covers Summary and Testing sections with clear usage examples, but lacks Changes list, Checklist completion, and Notes for Reviewers as specified in the template. | Add a Changes section with a bulleted list of what was added, complete the Checklist, and optionally include Notes for Reviewers about limitations or special usage considerations. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately and concisely summarizes the main change: adding a disaster simulation script for development workflows. |


Adds an option to reset the Lima E2E fixture back to its initial state
without tearing it down entirely. This can save a significant amount of
time between tests.
@jason-lynch jason-lynch marked this pull request as ready for review February 24, 2026 15:43

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hack/simulate-disaster.sh`:
- Around line 171-173: The dispatch case labeled "full" calls a nonexistent
function simulate_full_node_loss; update that case to call the actual function
simulate_full_loss (replace simulate_full_node_loss with simulate_full_loss) so
the "full" branch invokes the defined function and won't fail at runtime.
- Around line 121-125: Update the usage text in the simulate-disaster.sh header:
fix the duplicated word "different different" and include the missing `reset`
option in the synopsis string (change "Usage: $1 <swarm|etcd|full> <host-id>
[host-id ...]" to include `reset`, e.g. "Usage: $1 <swarm|etcd|full|reset>
<host-id> [host-id ...]") and adjust the descriptive paragraph to remove the
duplicate word so it reads "three different types of disasters" (or similar).
Ensure you update both the usage line and the description near that header.
- Line 17: Fix the unquoted $@ expansions in the for-loops and correct the
misnamed function call: change the three loops that iterate host_id (the ones
using for host_id in $@) to use quoted expansion (for host_id in "$@") to
prevent word-splitting and globbing, update the calls that forward args to use
quoted slices where shown (simulate_swarm_node_loss "${@:2}",
simulate_etcd_node_loss "${@:2}"), replace the invalid simulate_full_node_loss
call with the actual function name simulate_full_loss and pass quoted args
(simulate_full_loss "${@:2}"), and ensure the script's entry call uses main "$@"
instead of unquoted arguments.
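
To make the quoting finding concrete, here is a minimal sketch of how unquoted `$@` and quoted `"$@"` differ when an argument contains whitespace. The two helper functions are hypothetical and exist only for this demonstration; they are not part of `hack/simulate-disaster.sh`.

```shell
#!/usr/bin/env bash
# count_unquoted and count_quoted are hypothetical demo helpers.

count_unquoted() {
  n=0
  # Unquoted: each argument is split again on whitespace and glob-expanded.
  for host_id in $@; do n=$((n + 1)); done
  echo "$n"
}

count_quoted() {
  n=0
  # Quoted: each argument is passed through exactly as the caller gave it.
  for host_id in "$@"; do n=$((n + 1)); done
  echo "$n"
}

count_unquoted "host 1" host-3   # prints 3
count_quoted "host 1" host-3     # prints 2
```

Host IDs such as `host-1` contain no whitespace, so the unquoted form happens to work for them; quoting removes the dependence on that.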

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aceb011 and 07d41dc.

📒 Files selected for processing (1)
  • hack/simulate-disaster.sh

- Fix etcd simulation for client-mode servers
- Remove database services in etcd simulation
- Rebuild control-plane in reset
- Remove `set -x`
@jason-lynch jason-lynch force-pushed the chore/disaster-simulation branch from 07d41dc to decc80d March 5, 2026 14:54

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hack/simulate-disaster.sh`:
- Line 121: Update the usage synopsis in simulate-disaster.sh to show
mode-specific forms rather than a single line that implies every mode needs
<host-id>; replace the current "Usage: $1 <swarm|etcd|full|reset> <host-id>
[host-id ...]" with two clearer variants (e.g. "Usage: $1 <swarm|etcd|full>
<host-id> [host-id ...]" and "       $1 reset") or an equivalent multi-line help
message so that "reset" is shown as not requiring host-id(s); ensure the updated
usage string appears where the script prints help/usage so callers see the
correct mode-specific argument requirements.
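
One possible shape for the mode-specific synopsis suggested above is a here-document inside `usage`, sketched below. Taking the program name as `$1` mirrors the "Usage: $1 ..." string quoted in the review; the rest of the script's help text is not shown in this PR and is omitted rather than guessed.

```shell
#!/usr/bin/env bash
# Sketch of a two-form synopsis; $1 is the program name to display.

usage() {
  cat <<EOF
Usage: $1 <swarm|etcd|full> <host-id> [host-id ...]
       $1 reset
EOF
}

usage "./hack/simulate-disaster.sh"
```

Listing `reset` on its own line makes it visible that this mode takes no host IDs.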
- Around line 163-175: The case branches for "swarm", "etcd", and "full" should
validate that at least one host ID argument was passed before calling
simulate_swarm_node_loss, simulate_etcd_node_loss, or simulate_full_loss; if no
host IDs are provided, print a clear error to stderr (e.g., "error: <mode>
requires at least one <host-id>") and exit with a non-zero status instead of
silently returning success. Update the case block to check the argument count or
"${@:2}" emptiness before invoking those functions and call exit 1 on failure so
the script fails fast and surfaces the misuse.
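
The fail-fast check described above could be factored into a small helper, sketched here. `require_hosts` is a hypothetical name, not a function from the script; in a dispatch it would be called as `require_hosts swarm "${@:2}" || exit 1` before the simulation function.

```shell
#!/usr/bin/env bash
# require_hosts is a hypothetical helper: it takes the mode name followed by
# the host IDs and fails when no host IDs were given.

require_hosts() {
  mode=$1
  shift
  if [ "$#" -eq 0 ]; then
    echo "error: ${mode} requires at least one <host-id>" >&2
    return 1
  fi
}

# Demonstration without exiting the shell:
require_hosts swarm host-1 && echo "ok: swarm host-1"
require_hosts etcd || echo "rejected: etcd with no hosts"
```

Returning a status from the helper and leaving `exit 1` to the caller keeps the function usable from other contexts.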

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: aff32cbc-3cf5-4b59-9e9a-1678dbea60dd

📥 Commits

Reviewing files that changed from the base of the PR and between 07d41dc and decc80d.

📒 Files selected for processing (1)
  • hack/simulate-disaster.sh

@jason-lynch jason-lynch merged commit 88864d7 into main Mar 5, 2026
3 checks passed
@jason-lynch jason-lynch deleted the chore/disaster-simulation branch March 5, 2026 16:08

shiftyp commented Mar 5, 2026

LGTM

[image: Fallout Thumbs up]
