
fix(bootstrap,server): persist sandbox state across gateway stop/start cycles #739

Merged
drew merged 3 commits into main from 738-sandbox-persistence-across-gateway-restart
Apr 3, 2026

Conversation


@drew drew commented Apr 2, 2026

Summary

Sandbox pod data was lost whenever the gateway was stopped and restarted. Two independent bugs caused this: k3s used the container ID as its node name (which changes on container recreation, triggering PVC deletion), and sandbox pods had no persistent storage by default.

Related Issue

Fixes #738

Changes

  • Deterministic k3s node name: Added node_name() to constants.rs and passed the OPENSHELL_NODE_NAME env var to the gateway container. The entrypoint script uses --node-name so the k3s node identity survives container recreation. clean_stale_nodes() now compares against the expected node name instead of running hostname inside the container.
  • Default workspace PVC: Sandbox pods now get a default 1Gi volumeClaimTemplate named "workspace" mounted at /sandbox. This ensures user files, installed packages, etc. survive pod rescheduling across gateway restarts.
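The deterministic-name idea can be sketched in shell (a hypothetical illustration; the real node_name() lives in constants.rs, and the openshell-{name} format is taken from the commit message in this PR):

```shell
# Hypothetical sketch: derive a stable k3s node name from the gateway
# name, so recreating the Docker container does not change node identity.
node_name() {
  printf 'openshell-%s\n' "$1"
}

# The cluster entrypoint would then pin the identity explicitly, e.g.:
#   exec k3s server --node-name "$OPENSHELL_NODE_NAME" ...
node_name my-gateway   # prints: openshell-my-gateway
```

Because the name depends only on the gateway name, clean_stale_nodes() can compare against this expected value instead of shelling into the container to run `hostname`.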

Testing

  • mise run pre-commit passes
  • Unit tests added/updated (2 new tests for workspace mount injection logic)
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

@drew drew requested a review from a team as a code owner April 2, 2026 19:21
@drew drew self-assigned this Apr 2, 2026
@drew drew added the test:e2e Requires end-to-end coverage label Apr 2, 2026
pimlock previously approved these changes Apr 2, 2026
@drew drew force-pushed the 738-sandbox-persistence-across-gateway-restart branch 6 times, most recently from 871adc9 to 6da0e77 Compare April 3, 2026 04:37
fix(bootstrap,server): persist sandbox state across gateway stop/start cycles

Two changes to preserve sandbox state across gateway restarts:

1. Deterministic k3s node identity: Set the Docker container hostname to
   a deterministic name derived from the gateway name (openshell-{name}).
   Pass OPENSHELL_NODE_NAME env var and --node-name flag to k3s via the
   cluster entrypoint as belt-and-suspenders.  Update clean_stale_nodes()
   to prefer the deterministic name with a fallback to the container
   hostname for backward compatibility with older cluster images.

   This prevents clean_stale_nodes() from deleting PVCs (including the
   server's SQLite database) when the container is recreated after an
   image upgrade.
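The prefer-with-fallback selection described above might look like this in shell (a sketch assuming newer images export OPENSHELL_NODE_NAME; the function name is illustrative, not the repo's):

```shell
# Hypothetical sketch: prefer the deterministic node name when the env
# var is set; otherwise fall back to the container hostname so older
# cluster images keep working.
expected_node_name() {
  if [ -n "${OPENSHELL_NODE_NAME:-}" ]; then
    echo "$OPENSHELL_NODE_NAME"
  else
    hostname
  fi
}

OPENSHELL_NODE_NAME=openshell-demo
expected_node_name   # prints: openshell-demo
unset OPENSHELL_NODE_NAME
expected_node_name   # prints the container hostname
```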

2. Default workspace persistence: Inject a 2Gi PVC and init container
   into every sandbox pod so the /sandbox directory survives pod
   rescheduling.  The init container uses the same sandbox image, mounts
   the PVC at a temporary path, and copies the image's /sandbox contents
   (Python venv, dotfiles, skills) into the PVC on first use — guarded
   by a sentinel file so subsequent restarts are instant.  The agent
   container then mounts the populated PVC at /sandbox.  Users who
   supply custom volumeClaimTemplates are unaffected — the default
   workspace is skipped.

Fixes #738
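The sentinel-guarded copy described in point 2 can be sketched as a small shell function (hypothetical; the real init container uses the sandbox image's own tooling, and the paths and sentinel name here are illustrative):

```shell
# Hypothetical sketch of the init container's one-time copy: populate the
# workspace PVC from the image's baked-in /sandbox contents, guarded by a
# sentinel file so later restarts skip the copy and are instant.
populate_workspace() {
  src="$1"    # image's /sandbox contents
  dest="$2"   # workspace PVC, mounted at a temporary path
  sentinel="$dest/.workspace-initialized"
  if [ -f "$sentinel" ]; then
    echo "skip"      # already populated on a previous boot
  else
    cp -a "$src/." "$dest/"
    touch "$sentinel"
    echo "copied"
  fi
}

# Demo against temporary directories:
src=$(mktemp -d); dest=$(mktemp -d)
echo 'home = /usr' > "$src/pyvenv.cfg"
populate_workspace "$src" "$dest"   # prints: copied
populate_workspace "$src" "$dest"   # prints: skip
```

The agent container then mounts the same PVC at /sandbox and sees the populated contents.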
@drew drew force-pushed the 738-sandbox-persistence-across-gateway-restart branch from 6da0e77 to 38892c3 Compare April 3, 2026 04:40
drew added 2 commits April 2, 2026 22:24
On resume, skip the image pull only when the container still exists.
When only the volume survives (container was removed), the image pull
must proceed so the container can be recreated. This fixes the
'container removal resumes' e2e scenario where the image was not
available after the container was force-removed.
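The resume-time decision reduces to a small predicate (hypothetical sketch; the real check presumably inspects Docker state, e.g. via a `docker inspect` probe, which is not reproduced here):

```shell
# Hypothetical sketch: skip the image pull only when the container still
# exists; if only the volume survived, pull so the container can be
# recreated from the image.
should_pull_image() {
  container_exists="$1"   # "yes"/"no", e.g. from a docker inspect probe
  if [ "$container_exists" = "yes" ]; then
    echo "skip-pull"
  else
    echo "pull"
  fi
}

should_pull_image no   # prints: pull
```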
@drew drew merged commit 491c5d8 into main Apr 3, 2026
15 of 16 checks passed
@drew drew deleted the 738-sandbox-persistence-across-gateway-restart branch April 3, 2026 06:15
ericksoa added a commit to NVIDIA/NemoClaw that referenced this pull request Apr 4, 2026
OpenShell v0.0.22 adds sandbox persistence across gateway restarts:
- Deterministic k3s node name (NVIDIA/OpenShell#739)
- Default workspace PVC at /sandbox (NVIDIA/OpenShell#739)
- Gateway resume from Docker volume state (NVIDIA/OpenShell#488)
- SSH handshake secret persistence (NVIDIA/OpenShell#488)

This unblocks sandbox survival when Docker restarts (e.g., laptop
close/open) — workspace data, SSH keys, and sandbox pods all survive.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
ericksoa added a commit to NVIDIA/NemoClaw that referenced this pull request Apr 4, 2026
## Summary

- **Bump minimum OpenShell to v0.0.22** — enables sandbox persistence
across gateway restarts (deterministic k3s node name + workspace PVC +
gateway resume from volume)
- **Auto-recover OpenClaw processes** — when the sandbox pod survives a
restart but the OpenClaw gateway didn't re-run, `nemoclaw connect` and
`nemoclaw status` detect it and transparently restart via SSH
- **E2E test** — proves the full survival scenario with real NVIDIA
inference: onboard → baseline inference → plant marker file → stop
gateway → restart gateway → verify sandbox survived → verify marker
persisted → verify inference works post-restart (24/24 tests passed)

### How it works (joint OpenShell + NemoClaw solution)

OpenShell v0.0.22 persists the infrastructure layer:
- Gateway resumes from Docker volume state (PR NVIDIA/OpenShell#488)
- SSH handshake secrets survive as K8s Secrets (PR NVIDIA/OpenShell#488)
- Deterministic k3s node name prevents PVC orphaning (PR
NVIDIA/OpenShell#739)
- Default 1Gi workspace PVC at `/sandbox` (PR NVIDIA/OpenShell#739)

NemoClaw restores the application layer:
- Detects "sandbox alive, OpenClaw dead" via HTTP probe (curl
localhost:18789)
- Cleans stale lock/temp files, restarts gateway via SSH
- Re-establishes dashboard port forward (18789)
- `nemoclaw status` shows `OpenClaw: running | recovered | not running`
with guidance
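The detection step above amounts to a fast HTTP liveness probe (a minimal sketch; the port 18789 comes from this PR's text, while the curl flags and function name are illustrative, not NemoClaw's exact implementation):

```shell
# Hypothetical sketch of the "sandbox alive, OpenClaw dead" probe:
# a quick HTTP check against the dashboard port.
openclaw_state() {
  if curl -fsS --max-time 2 "$1" >/dev/null 2>&1; then
    echo "running"
  else
    echo "not running"   # would trigger the SSH restart + port-forward re-setup
  fi
}

openclaw_state "http://localhost:18789"
```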

### User experience after this PR

```
laptop closes → Docker stops → laptop opens
→ Docker auto-restarts container
→ OpenShell gateway resumes, sandbox pod reschedules with workspace intact
→ user runs: nemoclaw my-assistant connect
→ NemoClaw detects OpenClaw not running, auto-restarts, reconnects port forward
→ user is back where they left off
```

### Context

Reported by @SenthilKumar-Ravichandran after testing OpenShell v0.0.22
on Brev VMs — PVC persistence works, but the user-defined ENTRYPOINT
(`nemoclaw-start`) does not re-run after pod restart on some platforms.

## Test plan

- [x] Unit tests pass (826/826 in main working directory)
- [x] E2E sandbox survival test passes with real NVIDIA inference
(24/24)
- [x] `nemoclaw status` shows `OpenClaw: running` when gateway is alive
- [x] `nemoclaw status` shows `OpenClaw: recovered` after auto-restart
- [x] ShellCheck passes on new E2E test
- [ ] Validate on Brev VM (where ENTRYPOINT doesn't re-run)

## Summary by CodeRabbit

* **New Features**
  * Adds automatic gateway health monitoring with in-sandbox recovery attempts, re-establishes dashboard port-forwarding, and provides clearer status output with actionable recovery instructions.

* **Tests**
  * Adds an end-to-end test validating sandbox persistence and continuity across gateway stop/start cycles, including live inference and marker-file persistence checks.

* **Chores**
  * Bumped minimum OpenShell version requirement to 0.0.22.

---------

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>



Development

Successfully merging this pull request may close these issues.

fix: sandbox pod state lost across gateway stop/start cycles

2 participants