Describe the bug
After a container has been running for a while, access to the GPU appears to be lost. Running nvidia-smi inside the container produces the message "Failed to initialize NVML: Unknown Error". Running nvidia-smi outside the container works fine. Rebooting the laptop temporarily corrects the issue. It seems similar to issue #1133, but running sudo systemctl daemon-reload doesn't trigger the behavior, and all of the software versions (listed below) are considerably newer than in that post. Starting a new container with GPU access also works; it's only long-running existing containers that fail.
To Reproduce
Start a persistent container with access to the GPU. Verify that nvidia-smi works inside it. Wait a while (12-24 hours). At some point GPU access is lost and nvidia-smi no longer works.
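A minimal repro sketch of the steps above, assuming Docker with the NVIDIA runtime configured; the container name and image tag are illustrative, not the exact ones used:

```shell
# Start a persistent (long-running) container with GPU access.
# Image tag is an assumption; any CUDA-capable image reproduces it.
docker run -d --name gpu-test --gpus all \
    nvidia/cuda:13.0.0-base-ubuntu22.04 sleep infinity

# Verify GPU access works initially.
docker exec gpu-test nvidia-smi

# Re-run after 12-24 hours; at some point this starts failing with
# "Failed to initialize NVML: Unknown Error", while nvidia-smi on
# the host still works.
docker exec gpu-test nvidia-smi
```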
Expected behavior
GPU access maintained.
Environment (please provide the following information):
- nvidia-container-toolkit version: 1.18.1-1
- NVIDIA Driver Version: 580.126.20
- Host OS: Ubuntu 22.04
- Kernel Version: 6.17.0-1011-oem
- Container Runtime Version: 2.2.1
- CPU Architecture: x86_64
- GPU Model(s): RTX A2000 8GB
- CUDA Version: 13.0
$ cat /etc/nvidia-container-runtime/config.toml
#accept-nvidia-visible-devices-as-volume-mounts = false
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
#swarm-resource = "DOCKER_RESOURCE_GPU"
[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
#no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"
[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "auto"
runtimes = ["runc", "crun"]
[nvidia-container-runtime.modes]
[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]
[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
[nvidia-container-runtime.modes.legacy]
cuda-compat-mode = "ldconfig"
[nvidia-container-runtime-hook]
path = "nvidia-container-runtime-hook"
skip-mode-detection = false
[nvidia-ctk]
path = "nvidia-ctk"
# outside of container
$ nvidia-smi
Tue Mar 24 10:10:16 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.20 Driver Version: 580.126.20 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A2000 8GB Lap... On | 00000000:01:00.0 On | N/A |
| N/A 48C P8 5W / 60W | 217MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 12581 G /usr/lib/xorg/Xorg 84MiB |
| 0 N/A N/A 12992 G /usr/bin/gnome-shell 78MiB |
+-----------------------------------------------------------------------------------------+