
reconcile all nvidiadrivers when any nvidiadriver is changed #2258

Open
rahulait wants to merge 1 commit into NVIDIA:main from
rahulait:fix-nvidiadriver-reconcile

Conversation

@rahulait
Contributor

@rahulait rahulait commented Mar 31, 2026

Description

Fixes: #2259

Issue

When multiple NVIDIADriver CRs exist, conflict validation is global: it checks for overlapping node selection across all driver CRs. A common case is the default CR (no nodeSelector), which selects all GPU nodes and therefore blocks custom CRs with targeted selectors.
Blocked CRs currently exit reconcile from the validation path without a requeue, so they remain stuck until an external event re-triggers them.
Deleting the conflicting default CR changes global validity, but previously the primary watch enqueued only the deleted object, so the remaining CRs were not retried immediately.
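
To make the overlap rule concrete, here is a minimal, stdlib-only sketch of a global conflict check. The `Driver` and `Node` types, the `tier` label, and the `conflicts` helper are hypothetical simplifications for illustration; the operator's actual validation may differ in which CRs it marks as blocked.

```go
package main

import "sort"

// Driver is a hypothetical, simplified stand-in for an NVIDIADriver CR.
// An empty NodeSelector means "select every GPU node", like the default CR.
type Driver struct {
	Name         string
	NodeSelector map[string]string
}

// Node is a hypothetical stand-in for a GPU node and its labels.
type Node struct {
	Name   string
	Labels map[string]string
}

// selects reports whether every selector key/value is present on the node.
// An empty selector matches all nodes.
func selects(d Driver, n Node) bool {
	for k, want := range d.NodeSelector {
		got, ok := n.Labels[k]
		if !ok || got != want {
			return false
		}
	}
	return true
}

// conflicts returns the sorted names of all drivers that share at least
// one node with another driver. The check is global: it looks at every
// driver, not only the one being reconciled.
func conflicts(drivers []Driver, nodes []Node) []string {
	inConflict := map[string]bool{}
	for _, n := range nodes {
		var owners []string
		for _, d := range drivers {
			if selects(d, n) {
				owners = append(owners, d.Name)
			}
		}
		if len(owners) > 1 {
			for _, o := range owners {
				inConflict[o] = true
			}
		}
	}
	out := make([]string, 0, len(inConflict))
	for name := range inConflict {
		out = append(out, name)
	}
	sort.Strings(out)
	return out
}
```

In this model, a `default` driver with no selector overlaps any targeted driver, and removing `default` from the input clears the conflict. That is why a change to one CR can alter the validity of its peers.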

Options Considered

Option 1 (implemented):

On NVIDIADriver events, enqueue reconcile for all NVIDIADriver CRs via TypedEnqueueRequestsFromMapFunc (fan-out).
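
As a rough illustration of the fan-out idea, the stdlib-only sketch below contrasts the previous behavior (enqueue only the changed object) with a map function that returns one request per existing CR. `Request`, `enqueueSelf`, and `fanOut` are hypothetical names; in the real controller the mapping function would be the one passed to TypedEnqueueRequestsFromMapFunc and would presumably list the CRs through the client.

```go
package main

import "sort"

// Request is a hypothetical stand-in for a reconcile request keyed by CR name.
type Request struct {
	Name string
}

// enqueueSelf models the previous behavior: only the object that
// triggered the event is enqueued.
func enqueueSelf(changed string) []Request {
	return []Request{{Name: changed}}
}

// fanOut models Option 1: whichever object triggered the event, return
// one request for every driver that currently exists. On a delete event
// the deleted object is no longer in `existing`, so only its peers are
// enqueued. Duplicates are dropped and the result is sorted by name.
func fanOut(_ string, existing []string) []Request {
	seen := map[string]bool{}
	names := make([]string, 0, len(existing))
	for _, name := range existing {
		if !seen[name] {
			seen[name] = true
			names = append(names, name)
		}
	}
	sort.Strings(names)
	reqs := make([]Request, 0, len(names))
	for _, name := range names {
		reqs = append(reqs, Request{Name: name})
	}
	return reqs
}
```

For example, deleting `default` while `demo-gold` and `demo-silver` remain yields requests for both remaining CRs under `fanOut`, whereas `enqueueSelf` yields only a request for the object that no longer exists.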

Option 2 (not implemented):

Return an error on validation failure so each conflicting CR self-requeues via the rate limiter.
Concretely, switch the return below from nil to an error:

return reconcile.Result{}, nil

Why Option 1

  • Retries happen when global conflict state actually changes (create/update/delete of relevant CRs), which is the right signal.
  • Avoids continuous error-driven retry loops while conflict is expected/user-caused.
  • Reduces log noise and unnecessary reconcile churn, especially with many conflicting CRs and MaxConcurrentReconciles=1.
  • Provides immediate unstick behavior after deleting the blocking CR, without waiting for backoff retry windows.

Why Option 2 is less useful

  • Treats a configuration conflict as a transient controller failure, causing repeated retries that usually cannot succeed until user action.
  • Generates sustained error logs and queue pressure; with single-worker reconcile this can delay useful work for other requests.
  • Many conflicting CRs could keep that single worker busy with repeated retries.
  • Converges eventually, but less efficiently and with poorer operational signal-to-noise.

Expected Outcome

After a conflicting NVIDIADriver is removed (for example, by deleting the default CR), the remaining, previously blocked CRs are automatically revalidated and can proceed without manual nudges.

Checklist

  • No secrets, sensitive information, or unrelated changes
  • Lint checks passing (make lint)
  • Generated assets in-sync (make validate-generated-assets)
  • Go mod artifacts in-sync (make validate-modules)
  • Test cases are added for new code paths

Testing

@rahulait rahulait force-pushed the fix-nvidiadriver-reconcile branch from bf127d3 to 44050bf Compare March 31, 2026 14:46
@rahulait rahulait added the bug Issue/PR to expose/discuss/fix a bug label Mar 31, 2026
@rahulait rahulait added this to the v26.3.1 milestone Mar 31, 2026
Signed-off-by: Rahul Sharma <rahulsharm@nvidia.com>
@rahulait rahulait force-pushed the fix-nvidiadriver-reconcile branch from 44050bf to 6497fe8 Compare March 31, 2026 19:58
@copy-pr-bot

copy-pr-bot bot commented Mar 31, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@rahulait
Contributor Author

/ok to test 6497fe8

@karthikvetrivel
Member

karthikvetrivel commented Apr 1, 2026

This mostly looks good to me, Rahul. Have you tested the case where instead of deleting the default CR, you edit it to narrow the scope to limit overlap with a new, custom CR?

@rahulait
Contributor Author

rahulait commented Apr 1, 2026

This mostly looks good to me, Rahul. Have you tested the case where instead of deleting the default CR, you edit it to narrow the scope to limit overlap with a new, custom CR?

Yup, I tested this scenario as well and it works fine. I tested it on a 3-node cluster with the default nvd, then added two more nvds targeting individual nodes, and then limited the scope of the default to one node. All nvds were able to bring up drivers correctly on each node.

# kubectl get nvidiadriver
NAME          STATUS   AGE
default       ready    2026-04-01T14:36:54Z
demo-gold     ready    2026-04-01T14:45:44Z
demo-silver   ready    2026-04-01T14:45:44Z


Labels

bug Issue/PR to expose/discuss/fix a bug

Development

Successfully merging this pull request may close these issues.

[Bug]: NVIDIADriver CRs remain stuck after deleting conflicting default CR due to missing requeue of peer NVIDIADriver resources

2 participants