OpenShell gateway start fails on Ubuntu 22.04.5 ARM64: K8s namespace not ready, flannel subnet.env missing #716
Description
Agent Diagnostic
empty
Description
What happened? What did you expect to happen?
- On an Apple Silicon Mac, create an Ubuntu 22.04.5 ARM64 VM in UTM/QEMU.
- Install Docker in the VM and verify Docker is running.
- Install OpenShell via the NemoClaw onboarding flow, or run `openshell gateway start` directly.
- Wait for gateway initialization to begin.
- Observe that gateway startup fails with:
  - `K8s namespace not ready`
  - `timed out waiting for namespace 'openshell' to exist`
- While `openshell gateway start` is still running, inspect the embedded k3s cluster inside the gateway container:
  - `kubectl get ns` shows only the default namespaces and `agent-sandbox-system`, but not `openshell`
  - `kubectl get pods -A` shows core pods stuck in `ContainerCreating`
- Describe the stuck pods and observe repeated sandbox/network errors such as:
  - `FailedCreatePodSandBox`
  - `plugin type="flannel" failed (add)`
  - `failed to load flannel 'subnet.env' file: open /run/flannel/subnet.env: no such file or directory`
- After the timeout, the gateway is torn down and `openshell status` reports: `Status: No gateway configured.`
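The inspection step above has to happen in the short window while the gateway container exists. A minimal sketch of how that could be scripted follows. The container name `openshell-cluster-openshell` is the one reported in this issue; the helper names (`gateway_kubectl`, `wait_for_gateway`, `snapshot_cluster`) are hypothetical, and the sketch assumes `kubectl` is on the `PATH` inside the gateway container.

```shell
# Container name as observed in this issue; override via the environment if yours differs.
GATEWAY_CONTAINER="${GATEWAY_CONTAINER:-openshell-cluster-openshell}"

# Run kubectl inside the gateway container.
# Assumption: kubectl is available on PATH inside the container image.
gateway_kubectl() {
    docker exec "$GATEWAY_CONTAINER" kubectl "$@"
}

# Poll once per second until the gateway container shows up in `docker ps`,
# or give up after the given number of seconds (default 120).
wait_for_gateway() {
    timeout_s="${1:-120}"
    elapsed=0
    while [ "$elapsed" -lt "$timeout_s" ]; do
        if docker ps --format '{{.Names}}' | grep -qx "$GATEWAY_CONTAINER"; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}

# Print namespaces and pod states while the container is still alive.
snapshot_cluster() {
    gateway_kubectl get ns
    gateway_kubectl get pods -A -o wide
}
```

With `openshell gateway start` running in another terminal, `wait_for_gateway 120 && snapshot_cluster` would capture the namespace and pod listing before the timeout tears the container down.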
Reproduction Steps
I expected `openshell gateway start` to initialize the gateway successfully and create a working OpenShell environment, including the `openshell` Kubernetes namespace.
Instead, gateway startup consistently fails during initialization with:
- `K8s namespace not ready`
- `timed out waiting for namespace 'openshell' to exist`

While the gateway container is briefly running, the embedded k3s cluster only creates the default namespaces plus `agent-sandbox-system`, but never creates the `openshell` namespace. Core pods remain stuck in `ContainerCreating`, and pod inspection shows repeated sandbox/network failures caused by flannel not being able to load `/run/flannel/subnet.env`.
After the timeout, the gateway container is torn down and `openshell status` reports `No gateway configured`.
In short: I expected a successful gateway bootstrap, but instead the embedded k3s/flannel networking appears to fail during startup, which prevents the openshell namespace and related services from ever becoming ready.
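To confirm the missing file directly rather than inferring it from pod events, one could test for `/run/flannel/subnet.env` inside the gateway container while it is still up. The sketch below is illustrative only: `flannel_subnet_env_status` is a hypothetical helper name, and it assumes the container image ships a `test` utility.

```shell
# Report whether flannel's subnet.env exists inside the gateway container.
# Prints one of: present, missing, no-container.
# Assumption: the container image provides the `test` utility.
flannel_subnet_env_status() {
    container="${1:-openshell-cluster-openshell}"

    # Distinguish "container already torn down" from "file missing".
    if ! docker container inspect "$container" >/dev/null 2>&1; then
        echo "no-container"
        return 2
    fi

    if docker exec "$container" test -f /run/flannel/subnet.env; then
        echo "present"
    else
        echo "missing"
        return 1
    fi
}
```

A result of `missing` while pods sit in `ContainerCreating` would match the kubelet errors quoted in this report; `no-container` means the check ran after teardown and says nothing about flannel.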
Environment
Host:
- Apple Silicon Mac
- UTM QEMU VM
Guest OS:
- Ubuntu 22.04.5 LTS
- Kernel: 5.15.0-173-generic
- Architecture: aarch64 / ARM64
Docker:
- Docker Engine / Server Version: 28.2.2
OpenShell:
- openshell CLI: 0.0.19
- Gateway image: ghcr.io/nvidia/openshell/cluster:0.0.19
VM resources:
- 8 CPUs
- 15.59 GiB RAM
- 4 GiB swap
- ~84 GiB free disk during testing
Networking / runtime notes:
- Docker runtime was working and accessible to the non-root user after adding the user to the docker group
- The OpenShell gateway container briefly appeared as `openshell-cluster-openshell` during startup, then disappeared after the timeout
Additional context:
- I also previously tried the macOS host path with Docker Desktop and Colima
Logs
Gateway startup repeatedly fails with:
openshell gateway start
✓ Checking Docker
✓ Downloading gateway
x Initializing environment
x Gateway failed: openshell
Gateway failed to start
Error: × K8s namespace not ready
╰─▶ timed out waiting for namespace 'openshell' to exist: Error from server
(NotFound): namespaces "openshell" not found
Representative container log lines:
time="2026-03-31T20:39:35Z" level=info msg="Connecting to proxy" url="wss://172.18.0.2:6443/v1-k3s/connect"
time="2026-03-31T20:39:35Z" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
E0331 20:40:15.722444 117 handler_proxy.go:143] error resolving kube-system/metrics-server: no endpoints available for service "metrics-server"
E0331 20:40:30.975563 117 controller.go:102] "Unhandled Error" err=<loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable>
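Because the container is removed after the timeout, its logs are no longer retrievable afterwards. One way to keep them is to stream them to a file while the gateway is starting. This is a sketch under assumptions: `capture_gateway_logs` and the default output file name are hypothetical; `docker logs --follow --timestamps` is standard Docker CLI usage.

```shell
# Stream the gateway container's logs to a file so they survive teardown.
# The default container name is the one reported in this issue.
capture_gateway_logs() {
    container="${1:-openshell-cluster-openshell}"
    outfile="${2:-gateway-container.log}"   # hypothetical default file name

    # --follow keeps streaming until the container stops;
    # --timestamps prefixes each line with Docker's own timestamp.
    docker logs --follow --timestamps "$container" >"$outfile" 2>&1
}
```

Started in a second terminal once the container appears, the function returns on its own when the container stops, leaving the full log in the output file.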
Live pod inspection while the gateway container was still running showed pods stuck in ContainerCreating. The most useful pod-level errors were:
coredns:
Warning FailedCreatePodSandBox kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "...": plugin type="flannel" failed (add): failed to load flannel 'subnet.env' file: open /run/flannel/subnet.env: no such file or directory. Check the flannel pod log for this node.
helm-install-openshell:
Warning FailedCreatePodSandBox kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "...": plugin type="flannel" failed (add): failed to load flannel 'subnet.env' file: open /run/flannel/subnet.env: no such file or directory. Check the flannel pod log for this node.
local-path-provisioner:
Warning FailedMount kubelet MountVolume.SetUp failed for volume "config-volume" : configmap "local-path-config" not found
Warning FailedCreatePodSandBox kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "...": plugin type="flannel" failed (add): failed to load flannel 'subnet.env' file: open /run/flannel/subnet.env: no such file or directory. Check the flannel pod log for this node.
metrics-server:
Warning FailedCreatePodSandBox kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "...": plugin type="flannel" failed (add): failed to load flannel 'subnet.env' file: open /run/flannel/subnet.env: no such file or directory. Check the flannel pod log for this node.
agent-sandbox-controller:
Warning FailedCreatePodSandBox kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "...": plugin type="flannel" failed (add): failed to load flannel 'subnet.env' file: open /run/flannel/subnet.env: no such file or directory. Check the flannel pod log for this node.
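The same flannel error repeats across every pod, so it can help to pull just those events and count them instead of describing pods one by one. A rough sketch follows; the function names are hypothetical, and it again assumes `kubectl` is on the `PATH` inside the gateway container. The `reason` field selector is a standard selector for Kubernetes events.

```shell
# List only FailedCreatePodSandBox events across all namespaces.
# Assumption: kubectl is available inside the gateway container.
sandbox_failure_events() {
    container="${1:-openshell-cluster-openshell}"
    docker exec "$container" kubectl get events -A \
        --field-selector reason=FailedCreatePodSandBox
}

# Read event or `kubectl describe` text on stdin and print how many lines
# carry the flannel subnet.env error quoted in this issue.
count_flannel_subnet_errors() {
    # grep -c exits non-zero when the count is 0, so mask the exit status.
    grep -c -F "failed to load flannel 'subnet.env' file" || true
}
```

For example, `sandbox_failure_events | count_flannel_subnet_errors` gives a single number to compare between runs or between hosts.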
Additional notes:
- The gateway container briefly exists, then disappears after the timeout.
- After the timeout, `openshell status` reports: `Status: No gateway configured.`
- When trying to inspect events after teardown, Docker returns:
  `Error response from daemon: No such container: openshell-cluster-openshell`
Agent-First Checklist
- I pointed my agent at the repo and had it investigate this issue
- I loaded relevant skills (e.g., `debug-openshell-cluster`, `debug-inference`, `openshell-cli`)
- My agent could not resolve this; the diagnostic above explains why