
Fabric Manager shared-nvswitch virtualization model support#166

Open
mresvanis wants to merge 7 commits into NVIDIA:master from mresvanis:fabric-manager-support

Conversation

@mresvanis

Summary

This PR adds the following changes:

  • Add NVIDIA Fabric Manager integration for multi-GPU NVSwitch-based systems (e.g., DGX/HGX), enabling automatic fabric partition management during device allocation
  • Introduce CGO bindings for libnvfm and a partition manager that coordinates GPU grouping via NVLink fabric partitions
  • Refactor GetPreferredAllocation to prefer devices belonging to the same fabric partition when FM is enabled, falling back to NUMA-based selection otherwise

Related NVIDIA GPU Operator changes: NVIDIA/gpu-driver-container#538 and NVIDIA/gpu-operator#2045

Changes

This change adds optional Fabric Manager support behind the ENABLE_FABRIC_MANAGER environment variable (disabled by default). When enabled, the device plugin:

  1. Connects to the FM daemon over a Unix socket at startup
  2. Uses a PCI-to-module mapping to resolve GPU physical IDs to FM module IDs
  3. Selects preferred allocations that align with FM partition boundaries and NUMA locality
  4. Activates the appropriate fabric partition during Allocate, ensuring NVLink connectivity between allocated GPUs

New packages:

  • pkg/nvfm -- CGO bindings for the libnvfm shared library
  • pkg/fabricmanager -- High-level FM client, partition manager, and PCI module mapping utilities

Test plan

  • Unit tests added for pkg/fabricmanager (client, partition manager) and pkg/device_plugin
  • Verify device plugin starts and operates normally with ENABLE_FABRIC_MANAGER=false (default)
  • Verify FM partition activation on an NVSwitch node with ENABLE_FABRIC_MANAGER=true

@copy-pr-bot

copy-pr-bot bot commented Feb 19, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@alaypatel07

@mresvanis I am interested in reviewing this PR; once it is ready, please ping me.

@mresvanis mresvanis force-pushed the fabric-manager-support branch 3 times, most recently from a487b1f to 0761aa3 on March 17, 2026 16:54
Signed-off-by: Michail Resvanis <mresvani@redhat.com>
… image

Signed-off-by: Michail Resvanis <mresvani@redhat.com>
Signed-off-by: Michail Resvanis <mresvani@redhat.com>
Signed-off-by: Michail Resvanis <mresvani@redhat.com>
@mresvanis mresvanis force-pushed the fabric-manager-support branch from 0761aa3 to 41f840e on March 17, 2026 17:04
@mresvanis mresvanis changed the title from "Fabric manager Shared NVSwitch virt model support" to "Fabric Manager shared-nvswitch virtualization model support" on Mar 17, 2026
@mresvanis mresvanis marked this pull request as ready for review March 17, 2026 17:06
@mresvanis
Author

@mresvanis I am interested in reviewing this PR; once it is ready, please ping me.

@alaypatel07 this PR is up for review.

@fanzhangio
Contributor

@mresvanis Thanks for updating the PR.
I reviewed again, the direction is correct for passthrough: partition-aware selection in GetPreferredAllocation(), activation in Allocate(), and serialized FM mutations in PartitionManager are the right primitives.
I had a few open-questions:

  1. In startup, FM connect or PCI-module-mapping failures are downgraded to warnings log.Print("Falling back to legacy device plugin mode") and the device plugin silently falls back to legacy allocation. Then the following Allocate() proceeds without any FM partition activation at all, even FM is enabled. On a shared-NVSwitch HGX node, that means the plugin can hand out GPUs while no component is authoritatively managing partition state.. Is it expected by design ?
  2. ActivateForDevices() only deactivates active partitions that overlap the current request. Stale partitions from pod deletion, kubelet restart, DP restart, or external FM changes are never cleaned up unless a later overlapping allocation happens. This is fine as the start. As the next phase I would hope adding a reconciliation loop (treats kubelet PodResources as the desired/observed workload state and FM as the actuated state.) is both defensible and, in my opinion, necessary if the KubeVirt GPU device plugin must be the partition control plane to avoid partition status inconsistency (Until we have fully moved to DRA) 😄

@mresvanis
Author

mresvanis commented Mar 18, 2026

@fanzhangio thank you for bringing those points up! My intention was to start those discussions as soon as you had reviewed at least the direction we started with :)

  1. At startup, FM-connect or PCI-module-mapping failures are downgraded to warnings (log.Print("Falling back to legacy device plugin mode")) and the device plugin silently falls back to legacy allocation. Subsequent Allocate() calls then proceed without any FM partition activation, even though FM is enabled. On a shared-NVSwitch HGX node, that means the plugin can hand out GPUs while no component is authoritatively managing partition state. Is this expected by design?

That is indeed by design, as I thought I might be missing some use cases for this device plugin. But I agree — when the node is set up with FM shared-NVSwitch virtualization mode, letting kubelet randomly allocate GPUs is a no-go. I'll change this behavior from a warning to an error and add a clear log message explaining what's happening.

  2. ActivateForDevices() only deactivates active partitions that overlap the current request. Stale partitions from pod deletion, kubelet restart, DP restart, or external FM changes are never cleaned up unless a later overlapping allocation happens. This is fine as a start. As a next phase, I would hope to add a reconciliation loop (treating kubelet PodResources as the desired/observed workload state and FM as the actuated state), which is both defensible and, in my opinion, necessary if the KubeVirt GPU device plugin must be the partition control plane, to avoid partition-state inconsistency (until we have fully moved to DRA) 😄

That is spot on, this is exactly our intention — start by ensuring that all partitions needed at any time can be activated properly, then move to a more robust solution: a reconciler watching Node PodResources that deactivates partitions based on observed workloads. I chose not to include the Node PodResources reconciler in this PR to keep it small and easily reviewable. That said, I'm happy to add it here if you think the complete solution belongs in a single PR. WDYT?

@mresvanis mresvanis force-pushed the fabric-manager-support branch from 41f840e to 1d728db on March 19, 2026 10:12
Signed-off-by: Michail Resvanis <mresvani@redhat.com>
…hen FM enabled

Extract NUMA-based device selection into a standalone preferDevicesByNUMA
method. When a partition manager is active, GetPreferredAllocation now
delegates to it for FM-aware selection with NUMA locality; otherwise it
falls back to the original NUMA-only logic. Add comprehensive tests for
the FM-aware path covering partition matching, NUMA tie-breaking, error
cases, unavailable GPUs, and must-include device ordering.

Signed-off-by: Michail Resvanis <mresvani@redhat.com>
When the fabric manager is enabled, the Allocate handler now
activates partitions for the requested device IDs before
returning the allocation response, failing the request if the
connection is lost or activation errors out.

Signed-off-by: Michail Resvanis <mresvani@redhat.com>
@mresvanis mresvanis closed this Mar 19, 2026
@mresvanis mresvanis force-pushed the fabric-manager-support branch from 1d728db to a83992a on March 19, 2026 10:17
@mresvanis mresvanis reopened this Mar 19, 2026
