OpenShell gateway start fails on Ubuntu 22.04.5 ARM64: K8s namespace not ready, flannel subnet.env missing #716

@extanglement

Agent Diagnostic

empty

Description

What happened? What did you expect to happen?

  1. On an Apple Silicon Mac, create an Ubuntu 22.04.5 ARM64 VM in UTM/QEMU.
  2. Install Docker in the VM and verify Docker is running.
  3. Install OpenShell via the NemoClaw onboarding flow, or run `openshell gateway start` directly.
  4. Wait for gateway initialization to begin.
  5. Observe that gateway startup fails with:
    • K8s namespace not ready
    • timed out waiting for namespace 'openshell' to exist
  6. While `openshell gateway start` is still running, inspect the embedded k3s cluster inside the gateway container:
    • `kubectl get ns` shows only the default namespaces and `agent-sandbox-system`, but not `openshell`
    • `kubectl get pods -A` shows core pods stuck in ContainerCreating
  7. Describe the stuck pods and observe repeated sandbox/network errors such as:
    • FailedCreatePodSandBox
    • plugin type="flannel" failed (add)
    • failed to load flannel 'subnet.env' file: open /run/flannel/subnet.env: no such file or directory
  8. After the timeout, the gateway is torn down and `openshell status` reports:
    • Status: No gateway configured.
Reproduction Steps

I expected `openshell gateway start` to initialize the gateway successfully and create a working OpenShell environment, including the `openshell` Kubernetes namespace.

Instead, gateway startup consistently fails during initialization with:

  • K8s namespace not ready
  • timed out waiting for namespace 'openshell' to exist

While the gateway container is briefly running, the embedded k3s cluster only creates the default namespaces plus `agent-sandbox-system`, but never creates the `openshell` namespace. Core pods remain stuck in ContainerCreating, and pod inspection shows repeated sandbox/network failures caused by flannel not being able to load `/run/flannel/subnet.env`.

After the timeout, the gateway container is torn down and `openshell status` reports No gateway configured.

In short: I expected a successful gateway bootstrap, but instead the embedded k3s/flannel networking appears to fail during startup, which prevents the openshell namespace and related services from ever becoming ready.

Environment

Host:

  • Apple Silicon Mac
  • UTM QEMU VM

Guest OS:

  • Ubuntu 22.04.5 LTS
  • Kernel: 5.15.0-173-generic
  • Architecture: aarch64 / ARM64

Docker:

  • Docker Engine / Server Version: 28.2.2

OpenShell:

  • openshell CLI: 0.0.19
  • Gateway image: ghcr.io/nvidia/openshell/cluster:0.0.19

VM resources:

  • 8 CPUs
  • 15.59 GiB RAM
  • 4 GiB swap
  • ~84 GiB free disk during testing

Networking / runtime notes:

  • Docker runtime was working and accessible to the non-root user after adding the user to the docker group
  • The OpenShell gateway container briefly appeared as openshell-cluster-openshell during startup, then disappeared after the timeout

Additional context:

  • I also previously tried the macOS host path with Docker Desktop and Colima

Logs

Gateway startup repeatedly fails with:

openshell gateway start
✓ Checking Docker
✓ Downloading gateway
x Initializing environment
x Gateway failed: openshell

Gateway failed to start

Error: × K8s namespace not ready
╰─▶ timed out waiting for namespace 'openshell' to exist: Error from server
    (NotFound): namespaces "openshell" not found

Representative container log lines:
time="2026-03-31T20:39:35Z" level=info msg="Connecting to proxy" url="wss://172.18.0.2:6443/v1-k3s/connect"
time="2026-03-31T20:39:35Z" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
E0331 20:40:15.722444     117 handler_proxy.go:143] error resolving kube-system/metrics-server: no endpoints available for service "metrics-server"
E0331 20:40:30.975563     117 controller.go:102] "Unhandled Error" err=<loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable>
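
The container log mixes two line formats: logrus-style `key="value"` lines and klog-style lines that start with a severity letter. For anyone triaging a longer capture, here is a small Python sketch that normalizes both into dicts. The function name `parse_line` and the output keys are hypothetical, and the patterns are only fitted to the lines quoted in this issue, not to every format k3s can emit.

```python
import re

# key=value or key="quoted value" pairs, as in the time=/level=/msg= lines.
LOGRUS_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')

# klog header: severity letter, MMDD, time, PID, file:line, then the message.
KLOG_LINE = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>[\d:.]+)\s+(?P<pid>\d+) "
    r"(?P<src>[^\]]+)\] (?P<msg>.*)$"
)

KLOG_LEVELS = {"I": "info", "W": "warning", "E": "error", "F": "fatal"}


def parse_line(line):
    """Return a dict for a recognized log line, or None otherwise."""
    m = KLOG_LINE.match(line)
    if m:
        return {
            "format": "klog",
            "level": KLOG_LEVELS[m["sev"]],
            "source": m["src"],
            "msg": m["msg"],
        }
    if line.startswith("time="):
        fields = {}
        for key, value in LOGRUS_PAIR.findall(line):
            # Drop surrounding quotes; escape sequences are left untouched.
            fields[key] = value[1:-1] if value.startswith('"') else value
        return {"format": "logrus", **fields}
    return None
```

Note that the `remotedialer` line above is logged at `level=info` even though its message describes an error, so filtering on the level field alone would miss it; matching on the message text as well is safer.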

Live pod inspection while the gateway container was still running showed pods stuck in ContainerCreating. The most useful pod-level errors were:

coredns:
Warning  FailedCreatePodSandBox  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "...": plugin type="flannel" failed (add): failed to load flannel 'subnet.env' file: open /run/flannel/subnet.env: no such file or directory. Check the flannel pod log for this node.

helm-install-openshell:
Warning  FailedCreatePodSandBox  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "...": plugin type="flannel" failed (add): failed to load flannel 'subnet.env' file: open /run/flannel/subnet.env: no such file or directory. Check the flannel pod log for this node.

local-path-provisioner:
Warning  FailedMount             kubelet  MountVolume.SetUp failed for volume "config-volume" : configmap "local-path-config" not found
Warning  FailedCreatePodSandBox  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "...": plugin type="flannel" failed (add): failed to load flannel 'subnet.env' file: open /run/flannel/subnet.env: no such file or directory. Check the flannel pod log for this node.

metrics-server:
Warning  FailedCreatePodSandBox  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "...": plugin type="flannel" failed (add): failed to load flannel 'subnet.env' file: open /run/flannel/subnet.env: no such file or directory. Check the flannel pod log for this node.

agent-sandbox-controller:
Warning  FailedCreatePodSandBox  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "...": plugin type="flannel" failed (add): failed to load flannel 'subnet.env' file: open /run/flannel/subnet.env: no such file or directory. Check the flannel pod log for this node.
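
To make the shared root cause across these pods explicit, the warning lines can be grouped by event reason and by the missing file they mention. This is a rough Python sketch under the assumption that events look like the trimmed lines pasted above (type, reason, source, message, with no age column); `summarize` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Matches the trimmed event lines quoted in this issue:
#   Warning  <Reason>  <source>  <message>
EVENT_RE = re.compile(
    r"^\s*Warning\s+(?P<reason>\S+)\s+(?P<source>\S+)\s+(?P<message>.*)$"
)

# Extracts the path from "open <path>: no such file or directory".
MISSING_FILE_RE = re.compile(r"open (?P<path>\S+): no such file or directory")


def summarize(events_by_pod):
    """Group pods by (event reason, missing file path or None).

    events_by_pod maps a pod name to the warning lines copied from
    `kubectl describe pod` for that pod.
    """
    groups = defaultdict(set)
    for pod, text in events_by_pod.items():
        for line in text.splitlines():
            event = EVENT_RE.match(line)
            if not event:
                continue
            missing = MISSING_FILE_RE.search(event["message"])
            key = (event["reason"], missing["path"] if missing else None)
            groups[key].add(pod)
    return {key: sorted(pods) for key, pods in groups.items()}
```

Fed with the five pods above, every pod would land under the key `("FailedCreatePodSandBox", "/run/flannel/subnet.env")`, with `local-path-provisioner` additionally listed under `("FailedMount", None)`. That pattern points at a single node-level networking failure rather than five independent pod problems.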

Additional notes:
- The gateway container briefly exists, then disappears after the timeout.
- After the timeout, `openshell status` reports: `Status: No gateway configured.`
- When trying to inspect events after teardown, Docker returns:
  `Error response from daemon: No such container: openshell-cluster-openshell`

Agent-First Checklist

  • I pointed my agent at the repo and had it investigate this issue
  • I loaded relevant skills (e.g., debug-openshell-cluster, debug-inference, openshell-cli)
  • My agent could not resolve this — the diagnostic above explains why
