dcgm-exporter: enable ServiceMonitor by default and skip gracefully when Prometheus CRD is absent #2262

Open
VincentG1234 wants to merge 1 commit into NVIDIA:main from VincentG1234:feat/dcgm-exporter-servicemonitor-auto-mode

@VincentG1234

Problem

The DCGM Exporter ServiceMonitor has been opt-in (enabled: false) since the feature was introduced in 2022. Every user who wants GPU metrics scraped by Prometheus must remember to set serviceMonitor.enabled: true; forgetting it is a silent misconfiguration that causes hours of debugging (#305, #363).

On top of that, when enabled: true is set and the Prometheus ServiceMonitor CRD is absent, the operator returns NotReady and blocks the entire reconcile loop (re-queuing every 5 seconds). This was documented in release 23.3 with added logging, but the underlying blocking behavior was never fixed. As a result, a missing Prometheus CRD prevents GFD pods from starting.

Solution

Two minimal changes:

  1. Set serviceMonitor.enabled: true by default in values.yaml.
  2. Fix the blocking behavior: when the Prometheus ServiceMonitor CRD is absent, the operator now returns Ready and skips gracefully instead of blocking the reconcile loop. This aligns state-dcgm-exporter with the existing behavior of state-operator-metrics, which already handled this case correctly.

An explicit enabled: false continues to disable and remove the resource.

| Scenario | Before | After |
| --- | --- | --- |
| enabled: true + CRD absent | NotReady (blocks reconcile loop, GFD stalls) | Ready (silent skip) |
| enabled: true + CRD present | created | created (unchanged) |
| enabled: false (explicit) | Disabled | Disabled (unchanged) |
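
The scenarios above can be read as a small decision function. The following is only an illustrative sketch, not the code in controllers/object_controls.go: the names decideServiceMonitor, Outcome, and the Outcome constants are hypothetical, and the real ServiceMonitor() control works on Kubernetes objects rather than two booleans.

```go
package main

import "fmt"

// Outcome is a hypothetical label for the "After" column of the scenario table.
type Outcome string

const (
	OutcomeSkipped  Outcome = "Ready (silent skip)"
	OutcomeCreated  Outcome = "created"
	OutcomeDisabled Outcome = "Disabled"
)

// decideServiceMonitor is a hypothetical helper that models the behavior
// described in this PR. It is not the operator's actual function.
func decideServiceMonitor(enabled, crdPresent bool) Outcome {
	switch {
	case !enabled:
		// An explicit enabled: false still disables the resource.
		return OutcomeDisabled
	case !crdPresent:
		// Previously this path reported NotReady and blocked the reconcile
		// loop; with this change it reports Ready and skips.
		return OutcomeSkipped
	default:
		return OutcomeCreated
	}
}

func main() {
	fmt.Println(decideServiceMonitor(true, false))
	fmt.Println(decideServiceMonitor(true, true))
	fmt.Println(decideServiceMonitor(false, true))
}
```

Checking the disabled case first keeps an explicit opt-out independent of whether the CRD happens to be installed.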

Changes

  • deployments/gpu-operator/values.yaml: serviceMonitor.enabled changed from false to true.
  • controllers/object_controls.go, ServiceMonitor(): CRD-absent path returns Ready instead of NotReady, consistent with state-operator-metrics.
  • api/nvidia/v1/clusterpolicy_types.go, DCGMExporterServiceMonitorConfig.IsEnabled(): nil defaults to true, consistent with the rest of the codebase.
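
The nil-defaults-to-true behavior of IsEnabled() could look roughly like the sketch below. It assumes the setting is stored as a *bool so that "unset" can be told apart from an explicit false; the struct serviceMonitorConfig and its Enabled field are hypothetical stand-ins, not the actual definition in clusterpolicy_types.go.

```go
package main

import "fmt"

// serviceMonitorConfig is a hypothetical stand-in for
// DCGMExporterServiceMonitorConfig. The pointer field is an assumption
// that lets an unset value be distinguished from an explicit false.
type serviceMonitorConfig struct {
	Enabled *bool
}

// IsEnabled returns true when Enabled is unset (nil), and the stored
// value otherwise. Only the nil-field case is modeled here.
func (c serviceMonitorConfig) IsEnabled() bool {
	if c.Enabled == nil {
		return true
	}
	return *c.Enabled
}

func main() {
	off := false
	fmt.Println(serviceMonitorConfig{}.IsEnabled())              // unset
	fmt.Println(serviceMonitorConfig{Enabled: &off}.IsEnabled()) // explicit false
}
```

With this shape, users who never mention serviceMonitor.enabled get the new default, while an explicit false is still honored.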

Testing

go test ./controllers/... -run TestServiceMonitor -v

…present

Signed-off-by: Vincent Gimenes <vincent.gimenes@gmail.com>
@copy-pr-bot

copy-pr-bot bot commented Apr 1, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
