
[Bug]: Container loses access to GPU with error: "Failed to initialize NVML: Unknown Error" #1744

@jasonbeach


Describe the bug
After a container has been running for a while, it loses access to the GPU. Running nvidia-smi inside the container prints the "Failed to initialize NVML: Unknown Error" message, while running nvidia-smi on the host works fine. Rebooting the laptop temporarily corrects the issue. It seems similar to issue #1133, but running sudo systemctl daemon-reload doesn't trigger the behavior, and all of the software versions (listed below) are considerably newer than in that post. Starting a new container with GPU access also works; only long-running existing containers fail.

To Reproduce
Start a persistent container with access to the GPU. Verify that nvidia-smi works. Wait a while (12-24 hours). At some point access to the GPU is lost and nvidia-smi no longer works inside the container.
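
Since the failure only shows up after many hours, a small polling loop could help pin down when it happens. The following is a minimal sketch, not something I have run against this setup: the helper names gpu_ok and watch_gpu, the container name gpu-test, and the 10-minute interval are all hypothetical.

```shell
#!/bin/sh
# Sketch: poll a GPU check command and report the first failure with a timestamp.
# gpu_ok, watch_gpu, and the container name "gpu-test" are illustrative names only.

# Run the given command quietly and return its exit status.
gpu_ok() {
    "$@" >/dev/null 2>&1
}

# watch_gpu INTERVAL MAX_CHECKS CMD [ARGS...]
# Runs CMD up to MAX_CHECKS times, sleeping INTERVAL seconds after each success.
# Prints a timestamped line and returns 1 on the first failure;
# returns 0 if every check passes.
watch_gpu() {
    wg_interval=$1
    wg_max=$2
    shift 2
    wg_i=1
    while [ "$wg_i" -le "$wg_max" ]; do
        if ! gpu_ok "$@"; then
            echo "$(date '+%Y-%m-%dT%H:%M:%S%z') check $wg_i failed: $*"
            return 1
        fi
        sleep "$wg_interval"
        wg_i=$((wg_i + 1))
    done
    echo "$(date '+%Y-%m-%dT%H:%M:%S%z') all $wg_max checks passed"
    return 0
}

# Example (assumes a container named "gpu-test" started with GPU access):
# poll every 10 minutes for up to 48 hours.
# watch_gpu 600 288 docker exec gpu-test nvidia-smi
```

The timestamp of the first failed check can then be compared against host logs from around the same time.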

Expected behavior
GPU access maintained.

Environment (please provide the following information):

  • nvidia-container-toolkit version: 1.18.1-1
  • NVIDIA Driver Version: 580.126.20
  • Host OS: Ubuntu 22.04
  • Kernel Version: 6.17.0-1011-oem
  • Container Runtime Version: 2.2.1
  • CPU Architecture: x86_64
  • GPU Model(s): RTX A2000 8GB
  • CUDA Version: 13.0
$ cat /etc/nvidia-container-runtime/config.toml
#accept-nvidia-visible-devices-as-volume-mounts = false
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
#swarm-resource = "DOCKER_RESOURCE_GPU"

[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
#no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "auto"
runtimes = ["runc", "crun"]

[nvidia-container-runtime.modes]

[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]

[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

[nvidia-container-runtime.modes.legacy]
cuda-compat-mode = "ldconfig"

[nvidia-container-runtime-hook]
path = "nvidia-container-runtime-hook"
skip-mode-detection = false

[nvidia-ctk]
path = "nvidia-ctk"
# outside of container
$ nvidia-smi
Tue Mar 24 10:10:16 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.20             Driver Version: 580.126.20     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A2000 8GB Lap...    On  |   00000000:01:00.0  On |                  N/A |
| N/A   48C    P8              5W /   60W |     217MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           12581      G   /usr/lib/xorg/Xorg                       84MiB |
|    0   N/A  N/A           12992      G   /usr/bin/gnome-shell                     78MiB |
+-----------------------------------------------------------------------------------------+
