dcgm-exporter: enable ServiceMonitor by default and skip gracefully when Prometheus CRD is absent by VincentG1234 · Pull Request #2262 · NVIDIA/gpu-operator

VincentG1234 · 2026-04-01T18:43:48Z

Problem

The DCGM Exporter ServiceMonitor has been opt-in (enabled: false) since the feature was introduced in 2022. Every user who wants Prometheus to scrape GPU metrics must remember to set serviceMonitor.enabled: true. Forgetting it is a silent misconfiguration that can cost hours of debugging (#305, #363).

On top of that, when enabled: true is set and the Prometheus ServiceMonitor CRD is absent, the operator returns NotReady and blocks the entire reconcile loop (re-queuing every 5 seconds). This was documented in release 23.3 with added logging, but the underlying blocking behavior was never fixed. As a result, missing Prometheus CRDs prevent GFD pods from starting.

Solution

Two minimal changes:

1. Set serviceMonitor.enabled: true by default in values.yaml.
2. Fix the blocking behavior: when the Prometheus ServiceMonitor CRD is absent, the operator now returns Ready and skips ServiceMonitor creation instead of blocking the reconcile loop. This aligns state-dcgm-exporter with the existing behavior of state-operator-metrics, which already handles this case correctly.

An explicit enabled: false continues to disable and remove the resource.

| Scenario | Before | After |
| --- | --- | --- |
| `enabled: true` + CRD absent | `NotReady` (blocks reconcile loop, GFD stalls) | `Ready` (silent skip) |
| `enabled: true` + CRD present | created | created (unchanged) |
| `enabled: false` explicit | `Disabled` | `Disabled` (unchanged) |
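
The scenarios above can be sketched as a small decision function. This is a minimal illustration, not the code in controllers/object_controls.go: the State type, the serviceMonitorState function, and the skipWhenCRDAbsent parameter are hypothetical names introduced here to contrast the behavior before and after this change.

```go
package main

import "fmt"

// State mirrors the three outcomes named in the PR description.
type State string

const (
	Ready    State = "Ready"
	NotReady State = "NotReady"
	Disabled State = "Disabled"
)

// serviceMonitorState is a hypothetical, simplified stand-in for the
// decision made in ServiceMonitor(). skipWhenCRDAbsent selects the
// behavior after this PR (true) or before it (false).
func serviceMonitorState(enabled, crdPresent, skipWhenCRDAbsent bool) State {
	if !enabled {
		// An explicit opt-out always wins.
		return Disabled
	}
	if !crdPresent {
		if skipWhenCRDAbsent {
			// After: nothing to create, but the reconcile loop is not blocked.
			return Ready
		}
		// Before: the state is reported as not ready and gets re-queued.
		return NotReady
	}
	// Stands in for "ServiceMonitor created".
	return Ready
}

func main() {
	for _, after := range []bool{false, true} {
		fmt.Printf("skipWhenCRDAbsent=%v\n", after)
		fmt.Println("  enabled, CRD absent: ", serviceMonitorState(true, false, after))
		fmt.Println("  enabled, CRD present:", serviceMonitorState(true, true, after))
		fmt.Println("  disabled explicitly: ", serviceMonitorState(false, true, after))
	}
}
```

Only the first row differs between the two modes, which matches the claim that the change is limited to the CRD-absent path.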

Changes

deployments/gpu-operator/values.yaml — serviceMonitor.enabled: false → enabled: true.
controllers/object_controls.go — ServiceMonitor(): CRD-absent path returns Ready instead of NotReady, consistent with state-operator-metrics.
api/nvidia/v1/clusterpolicy_types.go — DCGMExporterServiceMonitorConfig.IsEnabled(): nil defaults to true, consistent with the rest of the codebase.
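
The "nil defaults to true" rule can be illustrated with a pointer-to-bool field. This is a sketch under assumptions: serviceMonitorConfig is a hypothetical, trimmed-down stand-in for DCGMExporterServiceMonitorConfig, and treating both a nil config and a nil Enabled field as enabled is one plausible reading of the description, not a copy of the actual method.

```go
package main

import "fmt"

// serviceMonitorConfig is a hypothetical stand-in for
// DCGMExporterServiceMonitorConfig with only the field needed here.
type serviceMonitorConfig struct {
	// Enabled is a pointer so that "unset" (nil) can be told apart
	// from an explicit false.
	Enabled *bool
}

// IsEnabled reports true when the value is unset and otherwise
// returns the explicit setting.
func (c *serviceMonitorConfig) IsEnabled() bool {
	if c == nil || c.Enabled == nil {
		return true
	}
	return *c.Enabled
}

func main() {
	off := false
	unset := &serviceMonitorConfig{}
	explicitOff := &serviceMonitorConfig{Enabled: &off}

	fmt.Println("unset:", unset.IsEnabled())
	fmt.Println("explicit false:", explicitOff.IsEnabled())
}
```

Using a pointer keeps an explicit enabled: false distinguishable from an omitted field, which is what lets the default flip to true without overriding users who opted out.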

Testing

go test ./controllers/... -run TestServiceMonitor -v

copy-pr-bot · 2026-04-01T18:43:53Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

dcgm-exporter: auto-enable ServiceMonitor when ServiceMonitor CRD is present

Signed-off-by: Vincent Gimenes <vincent.gimenes@gmail.com>

939dd3c

VincentG1234 requested review from cdesiniotis, karthikvetrivel, rahulait, rajathagasthya, shivamerla and tariq1890 as code owners April 1, 2026 18:43

VincentG1234 wants to merge 1 commit into NVIDIA:main from VincentG1234:feat/dcgm-exporter-servicemonitor-auto-mode