
fix(bootstrap,server): persist sandbox state across gateway stop/start cycles #739

Merged
drew merged 3 commits into main from 738-sandbox-persistence-across-gateway-restart
Apr 3, 2026

Conversation


@drew drew commented Apr 2, 2026

Summary

Sandbox pod data was lost whenever the gateway was stopped and restarted. Two independent bugs caused this: k3s used the container ID as its node name (which changes on container recreation, triggering PVC deletion), and sandbox pods had no persistent storage by default.

Related Issue

Fixes #738

Changes

  • Deterministic k3s node name: Added node_name() to constants.rs and passed the OPENSHELL_NODE_NAME env var to the gateway container. The entrypoint script uses --node-name so the k3s node identity survives container recreation. clean_stale_nodes() now compares against the expected node name instead of running hostname inside the container.
  • Default workspace PVC: Sandbox pods now get a default 1Gi volumeClaimTemplate named "workspace" mounted at /sandbox. This ensures user files, installed packages, etc. survive pod rescheduling across gateway restarts.
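The deterministic-name idea can be sketched in shell (a hypothetical illustration; the real node_name() lives in constants.rs, and the openshell-{name} format is taken from the commit message in this PR):

```shell
# Hypothetical sketch: derive a stable k3s node name from the gateway
# name, so recreating the Docker container does not change node identity.
node_name() {
  printf 'openshell-%s\n' "$1"
}

# The cluster entrypoint would then pin the identity explicitly, e.g.:
#   exec k3s server --node-name "$OPENSHELL_NODE_NAME" ...
node_name my-gateway   # prints: openshell-my-gateway
```

Because the name depends only on the gateway name, clean_stale_nodes() can compare against this expected value instead of shelling into the container to run `hostname`.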

Testing

  • mise run pre-commit passes
  • Unit tests added/updated (2 new tests for workspace mount injection logic)
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

@drew drew requested a review from a team as a code owner April 2, 2026 19:21
@drew drew self-assigned this Apr 2, 2026
@drew drew added the test:e2e Requires end-to-end coverage label Apr 2, 2026
pimlock previously approved these changes Apr 2, 2026
@drew drew force-pushed the 738-sandbox-persistence-across-gateway-restart branch 6 times, most recently from 871adc9 to 6da0e77 Compare April 3, 2026 04:37
fix(bootstrap,server): persist sandbox state across gateway stop/start cycles

Two changes to preserve sandbox state across gateway restarts:

1. Deterministic k3s node identity: Set the Docker container hostname to
   a deterministic name derived from the gateway name (openshell-{name}).
   Pass OPENSHELL_NODE_NAME env var and --node-name flag to k3s via the
   cluster entrypoint as belt-and-suspenders.  Update clean_stale_nodes()
   to prefer the deterministic name with a fallback to the container
   hostname for backward compatibility with older cluster images.

   This prevents clean_stale_nodes() from deleting PVCs (including the
   server's SQLite database) when the container is recreated after an
   image upgrade.
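The prefer-with-fallback selection described above might look like this in shell (a sketch assuming newer images export OPENSHELL_NODE_NAME; the function name is illustrative, not the repo's):

```shell
# Hypothetical sketch: prefer the deterministic node name when the env
# var is set; otherwise fall back to the container hostname so older
# cluster images keep working.
expected_node_name() {
  if [ -n "${OPENSHELL_NODE_NAME:-}" ]; then
    echo "$OPENSHELL_NODE_NAME"
  else
    hostname
  fi
}

OPENSHELL_NODE_NAME=openshell-demo
expected_node_name   # prints: openshell-demo
unset OPENSHELL_NODE_NAME
expected_node_name   # prints the container hostname
```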

2. Default workspace persistence: Inject a 2Gi PVC and init container
   into every sandbox pod so the /sandbox directory survives pod
   rescheduling.  The init container uses the same sandbox image, mounts
   the PVC at a temporary path, and copies the image's /sandbox contents
   (Python venv, dotfiles, skills) into the PVC on first use — guarded
   by a sentinel file so subsequent restarts are instant.  The agent
   container then mounts the populated PVC at /sandbox.  Users who
   supply custom volumeClaimTemplates are unaffected — the default
   workspace is skipped.

Fixes #738
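The sentinel-guarded copy described in point 2 can be sketched as a small shell function (hypothetical; the real init container uses the sandbox image's own tooling, and the paths and sentinel name here are illustrative):

```shell
# Hypothetical sketch of the init container's one-time copy: populate the
# workspace PVC from the image's baked-in /sandbox contents, guarded by a
# sentinel file so later restarts skip the copy and are instant.
populate_workspace() {
  src="$1"    # image's /sandbox contents
  dest="$2"   # workspace PVC, mounted at a temporary path
  sentinel="$dest/.workspace-initialized"
  if [ -f "$sentinel" ]; then
    echo "skip"      # already populated on a previous boot
  else
    cp -a "$src/." "$dest/"
    touch "$sentinel"
    echo "copied"
  fi
}

# Demo against temporary directories:
src=$(mktemp -d); dest=$(mktemp -d)
echo 'home = /usr' > "$src/pyvenv.cfg"
populate_workspace "$src" "$dest"   # prints: copied
populate_workspace "$src" "$dest"   # prints: skip
```

The agent container then mounts the same PVC at /sandbox and sees the populated contents.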
@drew drew force-pushed the 738-sandbox-persistence-across-gateway-restart branch from 6da0e77 to 38892c3 Compare April 3, 2026 04:40
drew added 2 commits April 2, 2026 22:24
On resume, skip the image pull only when the container still exists.
When only the volume survives (container was removed), the image pull
must proceed so the container can be recreated. This fixes the
'container removal resumes' e2e scenario where the image was not
available after the container was force-removed.
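The resume-time decision reduces to a small predicate (hypothetical sketch; the real check presumably inspects Docker state, e.g. via a `docker inspect` probe, which is not reproduced here):

```shell
# Hypothetical sketch: skip the image pull only when the container still
# exists; if only the volume survived, pull so the container can be
# recreated from the image.
should_pull_image() {
  container_exists="$1"   # "yes"/"no", e.g. from a docker inspect probe
  if [ "$container_exists" = "yes" ]; then
    echo "skip-pull"
  else
    echo "pull"
  fi
}

should_pull_image no   # prints: pull
```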
@drew drew merged commit 491c5d8 into main Apr 3, 2026
15 of 16 checks passed
@drew drew deleted the 738-sandbox-persistence-across-gateway-restart branch April 3, 2026 06:15
ericksoa added a commit to NVIDIA/NemoClaw that referenced this pull request Apr 4, 2026
OpenShell v0.0.22 adds sandbox persistence across gateway restarts:
- Deterministic k3s node name (NVIDIA/OpenShell#739)
- Default workspace PVC at /sandbox (NVIDIA/OpenShell#739)
- Gateway resume from Docker volume state (NVIDIA/OpenShell#488)
- SSH handshake secret persistence (NVIDIA/OpenShell#488)

This unblocks sandbox survival when Docker restarts (e.g., laptop
close/open) — workspace data, SSH keys, and sandbox pods all survive.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
ericksoa added a commit to NVIDIA/NemoClaw that referenced this pull request Apr 4, 2026
## Summary

- **Bump minimum OpenShell to v0.0.22** — enables sandbox persistence
across gateway restarts (deterministic k3s node name + workspace PVC +
gateway resume from volume)
- **Auto-recover OpenClaw processes** — when the sandbox pod survives a
restart but the OpenClaw gateway didn't re-run, `nemoclaw connect` and
`nemoclaw status` detect it and transparently restart via SSH
- **E2E test** — proves the full survival scenario with real NVIDIA
inference: onboard → baseline inference → plant marker file → stop
gateway → restart gateway → verify sandbox survived → verify marker
persisted → verify inference works post-restart (24/24 tests passed)

### How it works (joint OpenShell + NemoClaw solution)

OpenShell v0.0.22 persists the infrastructure layer:
- Gateway resumes from Docker volume state (PR NVIDIA/OpenShell#488)
- SSH handshake secrets survive as K8s Secrets (PR NVIDIA/OpenShell#488)
- Deterministic k3s node name prevents PVC orphaning (PR
NVIDIA/OpenShell#739)
- Default 1Gi workspace PVC at `/sandbox` (PR NVIDIA/OpenShell#739)

NemoClaw restores the application layer:
- Detects "sandbox alive, OpenClaw dead" via HTTP probe (curl
localhost:18789)
- Cleans stale lock/temp files, restarts gateway via SSH
- Re-establishes dashboard port forward (18789)
- `nemoclaw status` shows `OpenClaw: running | recovered | not running`
with guidance
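The detection step above amounts to a fast HTTP liveness probe (a minimal sketch; the port 18789 comes from this PR's text, while the curl flags and function name are illustrative, not NemoClaw's exact implementation):

```shell
# Hypothetical sketch of the "sandbox alive, OpenClaw dead" probe:
# a quick HTTP check against the dashboard port.
openclaw_state() {
  if curl -fsS --max-time 2 "$1" >/dev/null 2>&1; then
    echo "running"
  else
    echo "not running"   # would trigger the SSH restart + port-forward re-setup
  fi
}

openclaw_state "http://localhost:18789"
```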

### User experience after this PR

```
laptop closes → Docker stops → laptop opens
→ Docker auto-restarts container
→ OpenShell gateway resumes, sandbox pod reschedules with workspace intact
→ user runs: nemoclaw my-assistant connect
→ NemoClaw detects OpenClaw not running, auto-restarts, reconnects port forward
→ user is back where they left off
```

### Context

Reported by @SenthilKumar-Ravichandran after testing OpenShell v0.0.22
on Brev VMs — PVC persistence works, but the user-defined ENTRYPOINT
(`nemoclaw-start`) does not re-run after pod restart on some platforms.

## Test plan

- [x] Unit tests pass (826/826 in main working directory)
- [x] E2E sandbox survival test passes with real NVIDIA inference
(24/24)
- [x] `nemoclaw status` shows `OpenClaw: running` when gateway is alive
- [x] `nemoclaw status` shows `OpenClaw: recovered` after auto-restart
- [x] ShellCheck passes on new E2E test
- [ ] Validate on Brev VM (where ENTRYPOINT doesn't re-run)

## Summary by CodeRabbit

* **New Features**
  * Adds automatic gateway health monitoring with in-sandbox recovery attempts, re-establishes dashboard port-forwarding, and provides clearer status output with actionable recovery instructions.

* **Tests**
  * Adds an end-to-end test validating sandbox persistence and continuity across gateway stop/start cycles, including live inference and marker-file persistence checks.

* **Chores**
  * Bumped minimum OpenShell version requirement to 0.0.22.

---------

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>



Development

Successfully merging this pull request may close these issues.

fix: sandbox pod state lost across gateway stop/start cycles

2 participants