[Bug]: NVIDIADriver CRs remain stuck after deleting conflicting default CR due to missing requeue of peer NVIDIADriver resources #2259
Description
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
Describe the bug
When multiple NVIDIADriver CRs are present, conflict validation is evaluated globally across the node selectors of all CRs. If the default NVIDIADriver CR (which has no explicit selector) overlaps with custom CRs, the custom CRs fail validation and exit reconciliation without being requeued.
After the default/conflicting CR is deleted, the previously blocked CRs are not reconciled automatically, so they remain stuck until a separate, unrelated event triggers their reconciliation.
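The overlap described above can be illustrated with a small toy model. This is only a sketch of the idea, not the operator's actual validation code: the `Driver` class, the `matches` and `conflicting` helpers, and the `gpu-pool` label are all hypothetical names chosen for illustration. It assumes that an empty selector matches every node and that two CRs conflict when at least one node matches both.

```python
from dataclasses import dataclass, field


@dataclass
class Driver:
    """Hypothetical stand-in for an NVIDIADriver CR: a name plus a nodeSelector."""
    name: str
    node_selector: dict = field(default_factory=dict)  # empty selector matches all nodes


def matches(selector, node_labels):
    """True if every key/value pair in the selector is present in the node labels."""
    return all(node_labels.get(key) == value for key, value in selector.items())


def conflicting(drivers, nodes):
    """Return the names of drivers that share at least one node with another driver."""
    conflicts = set()
    for labels in nodes.values():
        owners = [d.name for d in drivers if matches(d.node_selector, labels)]
        if len(owners) > 1:
            conflicts.update(owners)
    return conflicts


if __name__ == "__main__":
    # Assumed cluster: two nodes, each labelled for one custom CR.
    nodes = {"node-a": {"gpu-pool": "a"}, "node-b": {"gpu-pool": "b"}}
    drivers = [
        Driver("default"),  # no selector, so it overlaps with everything
        Driver("custom-a", {"gpu-pool": "a"}),
        Driver("custom-b", {"gpu-pool": "b"}),
    ]
    print(conflicting(drivers, nodes))      # all three CRs are in conflict
    print(conflicting(drivers[1:], nodes))  # no conflicts once "default" is gone
```

In this model the conflict disappears as soon as the default CR is removed, but nothing re-runs the check for the custom CRs unless something requeues them, which is the gap this issue reports.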
To Reproduce
1. Deploy the GPU Operator with the default NVIDIADriver CR.
2. Create two or more custom NVIDIADriver CRs, each with a unique nodeSelector label.
3. Label nodes to match the custom CR selectors.
4. Observe that the custom CRs are blocked due to a conflict with the default CR.
5. Delete the default NVIDIADriver CR.
6. Observe that the blocked custom CRs do not progress automatically.
Expected behavior
Deleting or updating an NVIDIADriver CR in a way that changes the global conflict state should trigger reconciliation of the other NVIDIADriver CRs, so that they revalidate and proceed.
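One way to picture the expected behavior is a mapping from an event on one CR to reconcile requests for all of its peers. The following is a minimal sketch under stated assumptions, not a proposed patch: `WorkQueue`, `peers_to_requeue`, and `on_driver_event` are hypothetical names, and the queue is a simplified de-duplicating FIFO rather than the operator's real work queue.

```python
from collections import deque


class WorkQueue:
    """Tiny de-duplicating FIFO, loosely modelled on a controller work queue."""

    def __init__(self):
        self._items = deque()
        self._pending = set()

    def add(self, name):
        # Skip names that are already waiting to be reconciled.
        if name not in self._pending:
            self._pending.add(name)
            self._items.append(name)

    def drain(self):
        """Return all queued names in insertion order and empty the queue."""
        items = list(self._items)
        self._items.clear()
        self._pending.clear()
        return items


def peers_to_requeue(changed_name, all_names):
    """Map an event on one CR to reconcile requests for every other CR."""
    return [name for name in all_names if name != changed_name]


def on_driver_event(queue, changed_name, remaining_names, deleted=False):
    """Enqueue the changed CR (unless it was deleted) and all of its peers."""
    if not deleted:
        queue.add(changed_name)
    for peer in peers_to_requeue(changed_name, remaining_names):
        queue.add(peer)


if __name__ == "__main__":
    queue = WorkQueue()
    # The default CR is deleted; the two custom CRs still exist.
    on_driver_event(queue, "default", ["custom-a", "custom-b"], deleted=True)
    print(queue.drain())  # both custom CRs are scheduled for revalidation
```

In a controller-runtime based operator, a mapping of this kind could plausibly be wired up through a watch that uses `handler.EnqueueRequestsFromMapFunc`. Whether that is the right fix for the GPU Operator is not established in this report.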
Environment (please provide the following information):
- GPU Operator Version: v26.3.0
- OS: ubuntu24.04, ubuntu22.04
- Kubernetes Distro and Version: kubeadm 1.34
Collecting full debug bundle (optional):
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: operator_feedback@nvidia.com