
[Bug]: NVIDIADriver CRs remain stuck after deleting conflicting default CR due to missing requeue of peer NVIDIADriver resources #2259

@rahulait

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

Describe the bug
When multiple NVIDIADriver CRs are present, conflict validation is evaluated globally across the node selectors of all CRs. If the default NVIDIADriver CR (which has no explicit selector) overlaps with custom CRs, the custom CRs fail validation and exit the reconcile loop without being requeued.
After the default (conflicting) CR is deleted, the previously blocked CRs are not reconciled automatically, so they remain stuck until a separate, unrelated event triggers a reconcile for them.
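
To make the overlap concrete, here is a minimal, self-contained Go sketch of a global conflict check. It is a simplified model and not the operator's actual validation code: the `Driver` and `Node` types and the `matches` and `conflictingNodes` functions are hypothetical names introduced for illustration. It assumes an empty selector matches every node, which is how the default CR ends up overlapping with each custom CR.

```go
package main

import (
	"fmt"
	"sort"
)

// Driver is a hypothetical, simplified stand-in for an NVIDIADriver CR.
type Driver struct {
	Name         string
	NodeSelector map[string]string // assumption: an empty selector matches every node
}

// Node is a hypothetical, simplified stand-in for a Kubernetes node.
type Node struct {
	Name   string
	Labels map[string]string
}

// matches reports whether every key/value pair in selector is present in labels.
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		got, ok := labels[k]
		if !ok || got != v {
			return false
		}
	}
	return true
}

// conflictingNodes returns, for each node selected by more than one driver,
// the sorted names of the drivers that select it.
func conflictingNodes(drivers []Driver, nodes []Node) map[string][]string {
	conflicts := map[string][]string{}
	for _, n := range nodes {
		var owners []string
		for _, d := range drivers {
			if matches(d.NodeSelector, n.Labels) {
				owners = append(owners, d.Name)
			}
		}
		if len(owners) > 1 {
			sort.Strings(owners)
			conflicts[n.Name] = owners
		}
	}
	return conflicts
}

func main() {
	nodes := []Node{
		{Name: "node-a", Labels: map[string]string{"driver": "a"}},
		{Name: "node-b", Labels: map[string]string{"driver": "b"}},
	}
	drivers := []Driver{
		{Name: "default"}, // no selector
		{Name: "custom-a", NodeSelector: map[string]string{"driver": "a"}},
		{Name: "custom-b", NodeSelector: map[string]string{"driver": "b"}},
	}
	fmt.Println(len(conflictingNodes(drivers, nodes)))     // default CR present: 2 conflicting nodes
	fmt.Println(len(conflictingNodes(drivers[1:], nodes))) // default CR removed: 0 conflicting nodes
}
```

In this model the conflict disappears as soon as the default CR is removed from the list, but nothing re-runs the check for `custom-a` and `custom-b` unless they are reconciled again, which is the gap this issue describes.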

To Reproduce
1. Deploy the GPU Operator with the default NVIDIADriver CR.
2. Create two or more custom NVIDIADriver CRs, each with a unique nodeSelector label.
3. Label nodes to match the custom CR selectors.
4. Observe that the custom CRs are blocked due to a conflict with the default CR.
5. Delete the default NVIDIADriver CR.
6. Observe that the blocked custom CRs do not progress automatically.

Expected behavior
Deleting or updating an NVIDIADriver CR in a way that changes the global conflict state should trigger reconciliation of the other NVIDIADriver CRs, so that they revalidate and proceed.
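
One possible shape for this behavior is a mapping step that turns a change to any one NVIDIADriver CR into reconcile requests for all of its peers. The sketch below is self-contained and uses only hypothetical names (`Request`, `peersToRequeue`); it is not taken from the operator. In a controller-runtime based controller, similar logic could plausibly be wired in through `handler.EnqueueRequestsFromMapFunc`, but that wiring is an assumption and is not shown here.

```go
package main

import "fmt"

// Request is a hypothetical stand-in for a reconcile request that names
// one NVIDIADriver CR.
type Request struct {
	Name string
}

// peersToRequeue returns one request for every known CR other than the one
// that changed. The changed CR is skipped because its own event already
// triggers its reconcile (or, on delete, it no longer exists).
func peersToRequeue(changed string, all []string) []Request {
	reqs := make([]Request, 0, len(all))
	for _, name := range all {
		if name == changed {
			continue
		}
		reqs = append(reqs, Request{Name: name})
	}
	return reqs
}

func main() {
	// After "default" is deleted it is no longer listed, so every remaining
	// CR is a peer that should revalidate.
	fmt.Println(peersToRequeue("default", []string{"custom-a", "custom-b"}))

	// On an update, the changed CR is still listed and is filtered out.
	fmt.Println(peersToRequeue("custom-a", []string{"default", "custom-a", "custom-b"}))
}
```

Requeueing every peer is the simplest choice and is cheap when only a handful of NVIDIADriver CRs exist. A narrower variant could requeue only the CRs currently in a conflict state, at the cost of tracking that state.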

Environment (please provide the following information):

  • GPU Operator Version: v26.3.0
  • OS: Ubuntu 24.04, Ubuntu 22.04
  • Kubernetes Distro and Version: kubeadm 1.34

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: operator_feedback@nvidia.com

Labels

bug
