reconcile all nvidiadrivers when any nvidiadriver is changed by rahulait · Pull Request #2258 · NVIDIA/gpu-operator

rahulait · 2026-03-31T14:34:02Z

Description

Issue

When multiple NVIDIADriver CRs exist, conflict validation is global: it checks for overlapping node selection across all driver CRs. A common case is the default CR (no nodeSelector) selecting all GPU nodes, which blocks custom CRs with targeted selectors.
Blocked CRs currently exit reconcile from the validation path with no requeue, so they remain stuck until an external event re-triggers them.
Deleting the conflicting default CR changes global validity, but previously only the deleted object was enqueued by the primary watch, so remaining CRs were not retried immediately.
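
For readers unfamiliar with the conflict rule, here is a minimal sketch of what "overlapping node selection" means. It is a simplified stdlib-only model, not the operator's actual validation code: the `Driver` type, the `matches` and `conflictingDrivers` helpers, and the `tier` label used in the usage note are all hypothetical.

```go
package main

import "sort"

// Driver is a hypothetical, trimmed-down stand-in for an NVIDIADriver CR.
type Driver struct {
	Name         string
	NodeSelector map[string]string // an empty selector matches every node
}

// matches reports whether every key/value pair in selector appears in labels.
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// conflictingDrivers returns the sorted names of drivers that select at
// least one node that is also selected by another driver.
func conflictingDrivers(drivers []Driver, nodes map[string]map[string]string) []string {
	conflicted := map[string]bool{}
	for _, labels := range nodes {
		var selecting []string
		for _, d := range drivers {
			if matches(d.NodeSelector, labels) {
				selecting = append(selecting, d.Name)
			}
		}
		if len(selecting) > 1 {
			for _, name := range selecting {
				conflicted[name] = true
			}
		}
	}
	names := make([]string, 0, len(conflicted))
	for name := range conflicted {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}
```

With a `default` driver (empty selector) and a `demo-gold` driver selecting `tier=gold`, both are reported as conflicting as long as a `tier=gold` node exists. Once `default` is removed, the result is empty, which is exactly the global state change the remaining CRs need to be told about.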

Options Considered

Option 1 (implemented):

On NVIDIADriver events, enqueue reconcile for all NVIDIADriver CRs via TypedEnqueueRequestsFromMapFunc (fan-out).
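
The fan-out idea can be sketched without controller-runtime. The model below is only an illustration under assumptions: `Request` stands in for `reconcile.Request`, and `fanOut` and its `list` callback are hypothetical names, not the code in this PR. The point is that the map function ignores which object changed and returns one request per NVIDIADriver that currently exists.

```go
package main

import "sort"

// Request is a stand-in for controller-runtime's reconcile.Request,
// reduced to just the object name.
type Request struct{ Name string }

// fanOut builds a map function in the spirit of the one handed to
// TypedEnqueueRequestsFromMapFunc. list is a hypothetical callback that
// returns the names of all NVIDIADriver CRs currently known to the cache.
func fanOut(list func() []string) func(changed string) []Request {
	return func(_ string) []Request {
		// Copy before sorting so the caller's slice is not mutated.
		names := append([]string(nil), list()...)
		sort.Strings(names)
		reqs := make([]Request, 0, len(names))
		for _, name := range names {
			reqs = append(reqs, Request{Name: name})
		}
		return reqs
	}
}
```

In this model, deleting `default` while `demo-gold` and `demo-silver` still exist yields requests for both remaining CRs, so each one is revalidated against the new global state.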

Option 2 (not implemented):

Return error from validation failure so each conflicting CR self-requeues via the rate-limiter.
That is, switch the return value from nil to an error at this line:

gpu-operator/controllers/nvidiadriver_controller.go

Line 151 in d5750f2

return reconcile.Result{}, nil
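
To make the difference between the two return values concrete, here is a small stdlib-only model of a work queue. It is a sketch under assumptions, not controller-runtime itself: `drain`, `currentBehavior`, and `option2Behavior` are hypothetical names, and real controllers requeue with rate-limited backoff rather than in fixed rounds.

```go
package main

import "errors"

var errConflict = errors.New("nodeSelector conflict")

// reconcileFn models a reconciler's outcome for one object:
// nil means "done, do not requeue"; an error means "requeue".
type reconcileFn func(name string) error

// currentBehavior mirrors returning a nil error on validation failure.
func currentBehavior(string) error { return nil }

// option2Behavior mirrors returning the conflict as an error.
func option2Behavior(string) error { return errConflict }

// drain processes the queue for at most maxRounds passes and returns how
// many times each item was attempted. Items whose reconcile returns an
// error are put back on the queue for the next pass.
func drain(initial []string, reconcile reconcileFn, maxRounds int) map[string]int {
	attempts := map[string]int{}
	queue := append([]string(nil), initial...)
	for round := 0; round < maxRounds && len(queue) > 0; round++ {
		var next []string
		for _, name := range queue {
			attempts[name]++
			if err := reconcile(name); err != nil {
				next = append(next, name)
			}
		}
		queue = next
	}
	return attempts
}
```

With `currentBehavior`, each blocked CR is attempted once and then sits idle until some event enqueues it again. With `option2Behavior`, every blocked CR is retried on every pass for as long as the conflict persists.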

Why Option 1

Retries happen when global conflict state actually changes (create/update/delete of relevant CRs), which is the right signal.
Avoids continuous error-driven retry loops while conflict is expected/user-caused.
Reduces log noise and unnecessary reconcile churn, especially with many conflicting CRs and MaxConcurrentReconciles=1.
Provides immediate unstick behavior after deleting the blocking CR, without waiting for backoff retry windows.

Why Option 2 is less useful

Treats a configuration conflict as a transient controller failure, causing repeated retries that usually cannot succeed until user action.
Generates sustained error logs and queue pressure; with single-worker reconcile this can delay useful work for other requests.
Many conflicting CRs could keep that single worker busy with repeated retries.
Converges eventually, but less efficiently and with poorer operational signal-to-noise.

Expected Outcome

After removing a conflicting NVIDIADriver (for example, deleting the default CR), the remaining previously blocked CRs are automatically revalidated and can proceed without manual nudges.

Checklist

No secrets, sensitive information, or unrelated changes
Lint checks passing (make lint)
Generated assets in-sync (make validate-generated-assets)
Go mod artifacts in-sync (make validate-modules)
Test cases are added for new code paths

Testing

Signed-off-by: Rahul Sharma <rahulsharm@nvidia.com>

copy-pr-bot · 2026-03-31T19:58:26Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

rahulait · 2026-03-31T20:01:43Z

/ok to test 6497fe8

karthikvetrivel · 2026-04-01T13:35:43Z

This mostly looks good to me, Rahul. Have you tested the case where instead of deleting the default CR, you edit it to narrow the scope to limit overlap with a new, custom CR?

rahulait · 2026-04-01T14:50:03Z

> This mostly looks good to me, Rahul. Have you tested the case where instead of deleting the default CR, you edit it to narrow the scope to limit overlap with a new, custom CR?

Yup, I tested this scenario as well and it works fine. I tested on a three-node cluster with the default nvd, then added two more nvds targeting individual nodes, and then limited the scope of the default to one node. All nvds were able to bring up drivers correctly on each node.

# kubectl get nvidiadriver
NAME          STATUS   AGE
default       ready    2026-04-01T14:36:54Z
demo-gold     ready    2026-04-01T14:45:44Z
demo-silver   ready    2026-04-01T14:45:44Z

rahulait requested review from cdesiniotis, karthikvetrivel, rajathagasthya, shivamerla and tariq1890 as code owners March 31, 2026 14:34

rahulait force-pushed the fix-nvidiadriver-reconcile branch from bf127d3 to 44050bf Compare March 31, 2026 14:46

rahulait added the bug label Mar 31, 2026

rahulait added this to the v26.3.1 milestone Mar 31, 2026

myeolenv assigned rahulait Mar 31, 2026

reconcile all nvidiadriver CRs when any nvidiadriver CR is changed

6497fe8

Signed-off-by: Rahul Sharma <rahulsharm@nvidia.com>

rahulait force-pushed the fix-nvidiadriver-reconcile branch from 44050bf to 6497fe8 Compare March 31, 2026 19:58

rahulait wants to merge 1 commit into NVIDIA:main from rahulait:fix-nvidiadriver-reconcile
