
Fabric Manager shared-nvswitch virtualization model support#166

Open
mresvanis wants to merge 7 commits into NVIDIA:master from mresvanis:fabric-manager-support

Conversation

@mresvanis

Summary

This PR adds the following changes:

  • Add NVIDIA Fabric Manager integration for multi-GPU NVSwitch-based systems (e.g., DGX/HGX), enabling automatic fabric partition management during device allocation
  • Introduce CGO bindings for libnvfm and a partition manager that coordinates GPU grouping via NVLink fabric partitions
  • Refactor GetPreferredAllocation to prefer devices belonging to the same fabric partition when FM is enabled, falling back to NUMA-based selection otherwise

Related NVIDIA GPU Operator changes: NVIDIA/gpu-driver-container#538 and NVIDIA/gpu-operator#2045

Changes

This change adds optional Fabric Manager support behind the ENABLE_FABRIC_MANAGER environment variable (disabled by default). When enabled, the device plugin:

  1. Connects to the FM daemon over a Unix socket at startup
  2. Uses a PCI-to-module mapping to resolve GPU physical IDs to FM module IDs
  3. Selects preferred allocations that align with FM partition boundaries and NUMA locality
  4. Activates the appropriate fabric partition during Allocate, ensuring NVLink connectivity between allocated GPUs

New packages:

  • pkg/nvfm -- CGO bindings for the libnvfm shared library
  • pkg/fabricmanager -- High-level FM client, partition manager, and PCI module mapping utilities

Test plan

  • Unit tests added for pkg/fabricmanager (client, partition manager) and pkg/device_plugin
  • Verify device plugin starts and operates normally with ENABLE_FABRIC_MANAGER=false (default)
  • Verify FM partition activation on an NVSwitch node with ENABLE_FABRIC_MANAGER=true

@copy-pr-bot

copy-pr-bot bot commented Feb 19, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@alaypatel07

@mresvanis I am interested in reviewing this PR; once it is ready, please ping me.

@mresvanis mresvanis force-pushed the fabric-manager-support branch 3 times, most recently from a487b1f to 0761aa3 on March 17, 2026 16:54
Signed-off-by: Michail Resvanis <mresvani@redhat.com>
… image

Signed-off-by: Michail Resvanis <mresvani@redhat.com>
Signed-off-by: Michail Resvanis <mresvani@redhat.com>
Signed-off-by: Michail Resvanis <mresvani@redhat.com>
@mresvanis mresvanis force-pushed the fabric-manager-support branch from 0761aa3 to 41f840e on March 17, 2026 17:04
@mresvanis mresvanis changed the title from "Fabric manager Shared NVSwitch virt model support" to "Fabric Manager shared-nvswitch virtualization model support" on Mar 17, 2026
@mresvanis mresvanis marked this pull request as ready for review March 17, 2026 17:06
@mresvanis
Author

@mresvanis I am interested in reviewing this PR; once it is ready, please ping me.

@alaypatel07 this PR is up for review.

@fanzhangio
Contributor

@mresvanis Thanks for updating the PR.
I reviewed again, the direction is correct for passthrough: partition-aware selection in GetPreferredAllocation(), activation in Allocate(), and serialized FM mutations in PartitionManager are the right primitives.
I had a few open-questions:

  1. In startup, FM connect or PCI-module-mapping failures are downgraded to warnings log.Print("Falling back to legacy device plugin mode") and the device plugin silently falls back to legacy allocation. Then the following Allocate() proceeds without any FM partition activation at all, even FM is enabled. On a shared-NVSwitch HGX node, that means the plugin can hand out GPUs while no component is authoritatively managing partition state.. Is it expected by design ?
  2. ActivateForDevices() only deactivates active partitions that overlap the current request. Stale partitions from pod deletion, kubelet restart, DP restart, or external FM changes are never cleaned up unless a later overlapping allocation happens. This is fine as the start. As the next phase I would hope adding a reconciliation loop (treats kubelet PodResources as the desired/observed workload state and FM as the actuated state.) is both defensible and, in my opinion, necessary if the KubeVirt GPU device plugin must be the partition control plane to avoid partition status inconsistency (Until we have fully moved to DRA) 😄

@mresvanis
Author

mresvanis commented Mar 18, 2026

@fanzhangio thank you for bringing those points up! My intention was to start those discussions as soon as you had reviewed at least the direction we started with :)

  1. At startup, FM-connect or PCI-module-mapping failures are downgraded to warnings (log.Print("Falling back to legacy device plugin mode")) and the device plugin silently falls back to legacy allocation. Subsequent Allocate() calls then proceed without any FM partition activation, even though FM is enabled. On a shared-NVSwitch HGX node, that means the plugin can hand out GPUs while no component is authoritatively managing partition state. Is this expected by design?

That is indeed by design, as I thought I might be missing some use cases for this device plugin. But I agree — when the node is set up with FM shared-NVSwitch virtualization mode, letting kubelet randomly allocate GPUs is a no-go. I'll change this behavior from a warning to an error and add a clear log message explaining what's happening.

  2. ActivateForDevices() only deactivates active partitions that overlap the current request. Stale partitions from pod deletion, kubelet restart, DP restart, or external FM changes are never cleaned up unless a later overlapping allocation happens. This is fine as a start. As a next phase, I would hope to add a reconciliation loop (treating kubelet PodResources as the desired/observed workload state and FM as the actuated state), which is both defensible and, in my opinion, necessary if the KubeVirt GPU device plugin must be the partition control plane, to avoid partition-state inconsistency (until we have fully moved to DRA) 😄

That is spot on, this is exactly our intention — start by ensuring that all partitions needed at any time can be activated properly, then move to a more robust solution: a reconciler watching Node PodResources that deactivates partitions based on observed workloads. I chose not to include the Node PodResources reconciler in this PR to keep it small and easily reviewable. That said, I'm happy to add it here if you think the complete solution belongs in a single PR. WDYT?

@mresvanis mresvanis force-pushed the fabric-manager-support branch from 41f840e to 1d728db on March 19, 2026 10:12
Signed-off-by: Michail Resvanis <mresvani@redhat.com>
…hen FM enabled

Extract NUMA-based device selection into a standalone preferDevicesByNUMA
method. When a partition manager is active, GetPreferredAllocation now
delegates to it for FM-aware selection with NUMA locality; otherwise it
falls back to the original NUMA-only logic. Add comprehensive tests for
the FM-aware path covering partition matching, NUMA tie-breaking, error
cases, unavailable GPUs, and must-include device ordering.

Signed-off-by: Michail Resvanis <mresvani@redhat.com>
When the fabric manager is enabled, the Allocate handler now
activates partitions for the requested device IDs before
returning the allocation response, failing the request if the
connection is lost or activation errors out.

Signed-off-by: Michail Resvanis <mresvani@redhat.com>
@mresvanis mresvanis closed this Mar 19, 2026
@mresvanis mresvanis force-pushed the fabric-manager-support branch from 1d728db to a83992a on March 19, 2026 10:17
@mresvanis mresvanis reopened this Mar 19, 2026
