From e5cf9556be5a50c525de33c44bd496fb045db3d0 Mon Sep 17 00:00:00 2001
From: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>
Date: Wed, 25 Mar 2026 11:16:12 -0400
Subject: [PATCH 1/2] Add Kata docs
Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>
---
gpu-operator/deploy-kata-containers.rst | 456 ++++++++++++++++++++++++
gpu-operator/getting-started.rst | 7 +-
gpu-operator/index.rst | 1 +
3 files changed, 463 insertions(+), 1 deletion(-)
create mode 100644 gpu-operator/deploy-kata-containers.rst
diff --git a/gpu-operator/deploy-kata-containers.rst b/gpu-operator/deploy-kata-containers.rst
new file mode 100644
index 000000000..8d107e635
--- /dev/null
+++ b/gpu-operator/deploy-kata-containers.rst
@@ -0,0 +1,456 @@
+.. license-header
+ SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ SPDX-License-Identifier: Apache-2.0
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+.. headings (h1/h2/h3/h4/h5) are # * = -
+
+..
+ lingo:
+
+ It is "Kata Containers" when referring to the software component.
+ It is "Kata container" when it is a container that uses the Kata Containers runtime.
+ Treat our operands as proper nouns and use title case.
+
+###########################
+Deploy with Kata Containers
+###########################
+
+
+***************************************
+About the Operator with Kata Containers
+***************************************
+
+`Kata Containers <https://katacontainers.io/>`_ is an open source project that creates lightweight virtual machines (VMs) that feel and perform like traditional containers, such as Docker containers.
+A traditional container packages software for user-space isolation from the host,
+but the container runs on the host and shares the operating system kernel with the host.
+Sharing the operating system kernel is a potential vulnerability.
+
+A Kata container runs in a virtual machine on the host.
+The virtual machine has a separate operating system and operating system kernel.
+Hardware virtualization and a separate kernel provide improved workload isolation
+in comparison with traditional containers.
+
+The NVIDIA GPU Operator works with the Kata Containers runtime.
+Kata uses a hypervisor, such as QEMU, to provide a lightweight virtual machine with a single purpose: to run a Kubernetes pod.
+
+The following diagram shows the software components that Kubernetes uses to run a Kata container.
+
+.. mermaid::
+ :caption: Software Components with Kata Container Runtime
+ :alt: Logical diagram of software components between Kubelet and containers when using Kata Containers.
+
+ flowchart LR
+ a[Kubelet] --> b[CRI] --> c[Kata\nRuntime] --> d[Lightweight\nQEMU VM] --> e[Lightweight\nGuest OS] --> f[Pod] --> g[Container]
+
+
+To enable Kata Containers for GPUs on your cluster, install the upstream ``kata-deploy`` Helm chart, which deploys all Kata runtime classes, including NVIDIA-specific runtime classes.
+The ``kata-qemu-nvidia-gpu`` runtime class is used with Kata Containers.
+Other runtime classes such as ``kata-qemu-nvidia-gpu-snp`` and ``kata-qemu-nvidia-gpu-tdx`` are also installed with ``kata-deploy``, but are used to deploy Confidential Containers.
+
+Then you configure the GPU Operator to use Kata mode for sandbox workloads and, optionally, configure the Kata device plugin.
+
+.. tip::
+   This page describes deploying with Kata Containers.
+ Refer to the `Confidential Containers documentation `_ if you are interested in deploying Confidential Containers with Kata Containers.
+
+*********************************
+Benefits of Using Kata Containers
+*********************************
+
+The primary benefits of Kata Containers are as follows:
+
+* Running untrusted workloads in a container.
+ The virtual machine provides a layer of defense against the untrusted code.
+
+* Limiting access to hardware devices such as NVIDIA GPUs.
+ The virtual machine is provided access to specific devices.
+ This approach ensures that the workload cannot access additional devices.
+
+* Transparent deployment of unmodified containers.
+
+****************************
+Limitations and Restrictions
+****************************
+
+* GPUs are available to containers in passthrough mode only, one GPU per container.
+  vGPU is not supported.
+
+* Support is limited to initial installation and configuration only.
+ Upgrade and configuration of existing clusters for Kata Containers is not supported.
+
+* Support for Kata Containers is limited to the implementation described on this page.
+ The Operator offers Technology Preview support for Red Hat OpenShift sandbox containers.
+
+* NVIDIA supports the Operator and Kata Containers with the containerd runtime only.
+
+
+*******************************
+Cluster Topology Considerations
+*******************************
+
+You can configure all the worker nodes in your cluster for Kata Containers or you can configure some nodes for Kata Containers and others for traditional containers.
+Consider the following example where node A is configured to run traditional containers and node B is configured to run Kata Containers.
+
+.. list-table::
+ :widths: 50 50
+ :header-rows: 1
+
+   * - Node A (traditional containers) receives the following software components:
+     - Node B (Kata Containers) receives the following software components:
+ * - * ``NVIDIA Driver Manager for Kubernetes`` -- to install the data-center driver.
+ * ``NVIDIA Container Toolkit`` -- to ensure that containers can access GPUs.
+ * ``NVIDIA Device Plugin for Kubernetes`` -- to discover and advertise GPU resources to kubelet.
+ * ``NVIDIA DCGM and DCGM Exporter`` -- to monitor GPUs.
+ * ``NVIDIA MIG Manager for Kubernetes`` -- to manage MIG-capable GPUs.
+ * ``Node Feature Discovery (NFD)`` -- to detect CPU, kernel, and host features and label worker nodes.
+ * ``NVIDIA GPU Feature Discovery`` -- to detect NVIDIA GPUs and label worker nodes.
+ - * ``NVIDIA Kata Sandbox Device Plugin`` -- to discover and advertise the passthrough GPUs to kubelet.
+ * ``NVIDIA VFIO Manager`` -- to load the vfio-pci device driver and bind it to all GPUs on the node.
+ * ``Node Feature Discovery`` -- to detect CPU security features, NVIDIA GPUs, and label worker nodes.
+ * ``NVIDIA Confidential Computing Manager for Kubernetes`` -- to set the confidential computing (CC) mode on the NVIDIA GPUs.
+ This component is deployed to all nodes configured for Kata Containers, even if you are not planning to run Confidential Containers.
+ Refer to `Confidential Containers `_ for more details.
+
+
+**********************************************
+Configure the GPU Operator for Kata Containers
+**********************************************
+
+Overview of Installation and Configuration
+===========================================
+
+The high-level workflow for installing and configuring your cluster to support the NVIDIA GPU Operator with Kata Containers is as follows:
+
+#. Configure prerequisites.
+
+#. Label the worker nodes that you want to use with Kata Containers.
+ If you want Kata as the default sandbox workload on every worker node, set ``sandboxWorkloads.defaultWorkload=vm-passthrough`` when you install the GPU Operator and skip this step.
+
+#. Install the ``kata-deploy`` Helm chart.
+
+#. Install the NVIDIA GPU Operator.
+ During install, you can specify options to deploy the operands that are required for Kata Containers.
+
+After installation, you can run a sample workload that uses the Kata runtime class.
+
+Prerequisites
+=============
+
+* Your hosts are configured to enable hardware virtualization and Access Control Services (ACS).
+ With some AMD CPUs and BIOSes, ACS might be grouped under Advanced Error Reporting (AER).
+ Enabling these features is typically performed by configuring the host BIOS.
+
+* Your hosts are configured to support IOMMU.
+
+ If the output from running ``ls /sys/kernel/iommu_groups`` includes ``0``, ``1``, and so on,
+ then your host is configured for IOMMU.
+
+ If a host is not configured or you are unsure, add the ``intel_iommu=on`` (or ``amd_iommu=on`` for AMD CPUs) Linux kernel command-line argument.
+ For most Linux distributions, you add the argument to the ``/etc/default/grub`` file:
+
+ .. code-block:: text
+
+ ...
+ GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on modprobe.blacklist=nouveau"
+ ...
+
+ On Ubuntu systems, run ``sudo update-grub`` after making the change to configure the bootloader.
+ On other systems, you might need to run ``sudo dracut`` after making the change.
+ Refer to the documentation for your operating system.
+ Reboot the host after configuring the bootloader.
+
+* You have cluster administrator privileges on the cluster you are installing the GPU Operator on.
+
+* The NVIDIA GPU runtime classes use VFIO cold-plug, which requires the Kata runtime to query the kubelet Pod Resources API to discover allocated GPU devices during sandbox creation.
+ For Kubernetes versions older than 1.34, you must explicitly enable the ``KubeletPodResourcesGet`` feature gate in your Kubelet configuration.
+ For Kubernetes 1.34 and later, the ``KubeletPodResourcesGet`` feature gate is enabled by default.
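+
+  As an example, the following ``KubeletConfiguration`` fragment enables the feature gate.
+  This is a sketch: the kubelet configuration file path (commonly ``/var/lib/kubelet/config.yaml``) varies by distribution.
+
+  .. code-block:: yaml
+
+     # Fragment to merge into your existing kubelet configuration.
+     apiVersion: kubelet.config.k8s.io/v1beta1
+     kind: KubeletConfiguration
+     featureGates:
+       KubeletPodResourcesGet: true
+
+  Restart the kubelet after changing its configuration.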
+
+
+.. _label-nodes-kata-containers:
+
+Label Nodes to Use Kata Containers
+==================================
+
+#. Label the nodes that you want to use with Kata Containers:
+
+ .. code-block:: console
+
+      $ kubectl label node <node-name> nvidia.com/gpu.workload.config=vm-passthrough
+
+   Replace ``<node-name>`` with the name of the worker node.
+   The GPU Operator uses this label to determine which software components to deploy to a node.
+   You can use this label on nodes for Kata workloads and run traditional container workloads with GPUs on other nodes in your cluster.
+
+ .. tip::
+
+ Skip this section if you plan to set ``sandboxWorkloads.defaultWorkload=vm-passthrough`` when you install the GPU Operator.
+
+Install the kata-deploy Helm Chart
+==================================
+
+Install the ``kata-deploy`` Helm chart.
+The minimum required version is 3.28.0.
+
+#. Set the chart location and the version to install:
+
+ .. code-block:: console
+
+ $ export VERSION="3.28.0"
+ $ export CHART="oci://ghcr.io/kata-containers/kata-deploy-charts/kata-deploy"
+
+
+#. Install the kata-deploy Helm chart:
+
+ .. code-block:: console
+
+ $ helm install kata-deploy "${CHART}" \
+ --namespace kata-system --create-namespace \
+ --set nfd.enabled=false \
+ --wait --timeout 10m \
+ --version "${VERSION}"
+
+
+#. Optional: Verify that the kata-deploy pod is running:
+
+ .. code-block:: console
+
+ $ kubectl get pods -n kata-system | grep kata-deploy
+
+ *Example Output*
+
+ .. code-block:: output
+
+ NAME READY STATUS RESTARTS AGE
+ kata-deploy-b2lzs 1/1 Running 0 6m37s
+
+#. Optional: Verify that the ``kata-qemu-nvidia-gpu`` runtime class is available:
+
+ .. code-block:: console
+
+ $ kubectl get runtimeclass | grep kata-qemu-nvidia-gpu
+
+ *Example Output*
+
+ .. code-block:: output
+
+ NAME HANDLER AGE
+ kata-qemu-nvidia-gpu kata-qemu-nvidia-gpu 53s
+
+ ``kata-deploy`` installs several runtime classes. The ``kata-qemu-nvidia-gpu`` runtime class is used with Kata Containers.
+ Other runtime classes like ``kata-qemu-nvidia-gpu-snp`` and ``kata-qemu-nvidia-gpu-tdx`` are used to deploy `Confidential Containers `_.
+
+
+Install the NVIDIA GPU Operator
+===============================
+
+Perform the following steps to install the Operator for use with Kata Containers:
+
+#. Add and update the NVIDIA Helm repository:
+
+ .. code-block:: console
+
+ $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
+ && helm repo update
+
+#. Install the GPU Operator.
+   The following command configures the GPU Operator to deploy the operands that are required for Kata Containers.
+ Refer to :ref:`Common Chart Customization Options ` for more details on the additional configuration options you can specify when installing the GPU Operator.
+
+ .. code-block:: console
+
+ $ helm install --wait --generate-name \
+ -n gpu-operator --create-namespace \
+ nvidia/gpu-operator \
+ --set sandboxWorkloads.enabled=true \
+ --set sandboxWorkloads.mode=kata \
+ --set nfd.enabled=true \
+ --set nfd.nodefeaturerules=true
+
+ .. tip::
+
+ Add ``--set sandboxWorkloads.defaultWorkload=vm-passthrough`` if every worker node should use Kata by default.
+
+ *Example Output*
+
+ .. code-block:: output
+
+ NAME: gpu-operator
+ LAST DEPLOYED: Tue Mar 10 17:58:12 2026
+ NAMESPACE: gpu-operator
+ STATUS: deployed
+ REVISION: 1
+ TEST SUITE: None
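+
+   The same options can be recorded in a values file instead of ``--set`` flags.
+   The following sketch assumes an arbitrary file name, ``kata-values.yaml``:
+
+   .. code-block:: yaml
+
+      # kata-values.yaml -- equivalent to the --set flags shown above.
+      sandboxWorkloads:
+        enabled: true
+        mode: kata
+      nfd:
+        enabled: true
+        nodefeaturerules: true
+
+   Pass the file to Helm with ``helm install -f kata-values.yaml ...``.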
+
+Verification
+============
+
+#. Verify that the Kata and VFIO Manager operands are running:
+
+ .. code-block:: console
+
+ $ kubectl get pods -n gpu-operator
+
+ *Example Output*
+
+ .. code-block:: output
+
+ NAME READY STATUS RESTARTS AGE
+ gpu-operator-5b69cf449c-mjmhv 1/1 Running 0 78s
+ gpu-operator-v26-1773935562-node-feature-discovery-gc-95b4pnpbh 1/1 Running 0 78s
+ gpu-operator-v26-1773935562-node-feature-discovery-master-kxzxg 1/1 Running 0 78s
+ gpu-operator-v26-1773935562-node-feature-discovery-worker-8bx68 1/1 Running 0 78s
+ nvidia-cc-manager-bnmlh 1/1 Running 0 62s
+ nvidia-kata-sandbox-device-plugin-daemonset-df7jt 1/1 Running 0 63s
+ nvidia-sandbox-validator-4bxgl 1/1 Running 0 53s
+ nvidia-vfio-manager-cxlz5 1/1 Running 0 63s
+
+
+ .. note::
+
+      The NVIDIA Confidential Computing Manager for Kubernetes (``nvidia-cc-manager``) is deployed to all nodes :ref:`configured to run Kata containers `, even if you are not planning to run Confidential Containers.
+ This manager sets the confidential computing (CC) mode on the NVIDIA GPUs.
+ Refer to `Confidential Containers `_ for more details.
+
+
+#. Verify that the ``kata-qemu-nvidia-gpu`` runtime class is available:
+
+ .. code-block:: console
+
+ $ kubectl get runtimeclass | grep kata-qemu-nvidia-gpu
+
+ *Example Output*
+
+ .. code-block:: output
+
+ NAME HANDLER AGE
+ kata-qemu-nvidia-gpu kata-qemu-nvidia-gpu 53s
+
+#. Optional: If you have host access to the worker node, you can perform the following steps:
+
+ #. Confirm that the host uses the ``vfio-pci`` device driver for GPUs:
+
+ .. code-block:: console
+
+ $ lspci -nnk -d 10de:
+
+ *Example Output*
+
+ .. code-block:: output
+ :emphasize-lines: 3
+
+ 65:00.0 3D controller [0302]: NVIDIA Corporation GA102GL [A10] [10de:2236] (rev a1)
+ Subsystem: NVIDIA Corporation GA102GL [A10] [10de:1482]
+ Kernel driver in use: vfio-pci
+ Kernel modules: nvidiafb, nouveau
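+
+   #. Optional: Confirm the IOMMU group assignment for the GPU.
+      This sketch uses the PCI address from the preceding example output; substitute the address of your GPU:
+
+      .. code-block:: console
+
+         $ readlink /sys/bus/pci/devices/0000:65:00.0/iommu_group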
+
+*********************
+Run a Sample Workload
+*********************
+
+A pod specification for a Kata container requires the following:
+
+* A Kata runtime class.
+
+* A passthrough GPU resource.
+
+
+#. Create a file, such as ``cuda-vectoradd-kata.yaml``, with the following content:
+
+ .. code-block:: yaml
+ :emphasize-lines: 6,13
+
+ apiVersion: v1
+ kind: Pod
+ metadata:
+ name: cuda-vectoradd-kata
+ spec:
+ runtimeClassName: kata-qemu-nvidia-gpu
+ restartPolicy: OnFailure
+ containers:
+ - name: cuda-vectoradd
+ image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0-ubuntu22.04"
+ resources:
+ limits:
+ nvidia.com/pgpu: "1"
+ memory: 16Gi
+
+#. Create the pod:
+
+ .. code-block:: console
+
+ $ kubectl apply -f cuda-vectoradd-kata.yaml
+
+#. View the pod logs:
+
+ .. code-block:: console
+
+ $ kubectl logs -n default cuda-vectoradd-kata
+
+ *Example Output*
+
+ .. code-block:: output
+
+ [Vector addition of 50000 elements]
+ Copy input data from the host memory to the CUDA device
+ CUDA kernel launch with 196 blocks of 256 threads
+ Copy output data from the CUDA device to the host memory
+ Test PASSED
+ Done
+
+#. Delete the pod:
+
+ .. code-block:: console
+
+ $ kubectl delete -f cuda-vectoradd-kata.yaml
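+
+.. tip::
+
+   The sample pod exits when the computation completes, so ``kubectl logs`` might return no output while the workload is still running.
+   As a sketch, you can wait for the pod to succeed before viewing the logs:
+
+   .. code-block:: console
+
+      $ kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/cuda-vectoradd-kata --timeout=5m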
+
+
+Troubleshooting Workloads
+=========================
+
+If the sample workload does not run, confirm that you labeled the worker nodes for VM passthrough workloads:
+
+.. code-block:: console
+
+ $ kubectl get nodes -l nvidia.com/gpu.workload.config=vm-passthrough
+
+*Example Output*
+
+.. code-block:: output
+
+ NAME STATUS ROLES AGE VERSION
+ kata-worker-1 Ready 10d v1.35.3
+ kata-worker-2 Ready 10d v1.35.3
+ kata-worker-3 Ready 10d v1.35.3
+
+Alternatively, you might have configured ``vm-passthrough`` as the default sandbox workload in the ClusterPolicy resource; that setting applies cluster-wide, so node labels are not required.
+In either case, confirm that ``sandboxWorkloads`` in the ClusterPolicy is configured for Kata, as shown in the following example:
+
+.. code-block:: console
+
+ $ kubectl describe clusterpolicy | grep sandboxWorkloads
+
+*Example Output*
+
+.. code-block:: output
+
+ sandboxWorkloads:
+ enabled: true
+ defaultWorkload: vm-passthrough
+ mode: kata
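+
+If ``mode`` is not set to ``kata``, you can update the ClusterPolicy in place.
+The following command is a sketch; it assumes the ClusterPolicy resource is named ``cluster-policy``, the name the Helm chart creates by default:
+
+.. code-block:: console
+
+   $ kubectl patch clusterpolicy cluster-policy --type merge \
+       -p '{"spec": {"sandboxWorkloads": {"mode": "kata"}}}'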
+
diff --git a/gpu-operator/getting-started.rst b/gpu-operator/getting-started.rst
index f15420d96..bfd581042 100644
--- a/gpu-operator/getting-started.rst
+++ b/gpu-operator/getting-started.rst
@@ -313,16 +313,21 @@ To view all the options, run ``helm show values nvidia/gpu-operator``.
- The GPU Operator deploys ``PodSecurityPolicies`` if enabled.
- ``false``
+ * - ``sandboxWorkloads.enabled``
+ - Specifies if sandbox containers are enabled.
+ - ``false``
+
* - ``sandboxWorkloads.defaultWorkload``
- Specifies the default type of workload for the cluster, one of ``container``, ``vm-passthrough``, or ``vm-vgpu``.
Setting ``vm-passthrough`` or ``vm-vgpu`` can be helpful if you plan to run all or mostly virtual machines in your cluster.
+       Refer to :doc:`KubeVirt ` or :doc:`Kata Containers ` for more details on deploying these workload types.
- ``container``
* - ``sandboxWorkloads.mode``
- Specifies the sandbox mode to use when deploying sandbox workloads.
Accepted values are ``kubevirt`` (default) and ``kata``.
- Refer to the :doc:`KubeVirt ` page for more information on using KubeVirt based workloads.
+ Refer to the :doc:`KubeVirt ` or the :doc:`Kata Containers ` pages for more information on using KubeVirt or Kata based workloads.
- ``kubevirt``
* - ``toolkit.enabled``
- By default, the Operator deploys the NVIDIA Container Toolkit (``nvidia-docker2`` stack)
diff --git a/gpu-operator/index.rst b/gpu-operator/index.rst
index b3d2546c5..a9643fbd5 100644
--- a/gpu-operator/index.rst
+++ b/gpu-operator/index.rst
@@ -56,6 +56,7 @@
:hidden:
KubeVirt
+ Kata Containers
.. toctree::
:caption: Specialized Networks
From eb78d443849f08a475f361d3908489a8f33ebad8 Mon Sep 17 00:00:00 2001
From: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>
Date: Wed, 25 Mar 2026 12:58:48 -0400
Subject: [PATCH 2/2] Update release notes for kata
Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>
---
gpu-operator/release-notes.rst | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/gpu-operator/release-notes.rst b/gpu-operator/release-notes.rst
index f1247e2b2..64584a940 100644
--- a/gpu-operator/release-notes.rst
+++ b/gpu-operator/release-notes.rst
@@ -134,6 +134,16 @@ Improvements
* Driver validation now waits for all enabled additional drivers (such as GDS and GDRCopy) to be installed before proceeding, and each node records a node-local view of enabled features when using multiple NVIDIADriver CRs or optional components. (`PR #2014 `_)
+* Improved support for Kata Containers.
+ Changes in this release include:
+
+  * Deprecating the NVIDIA Kata Manager.
+    You now use ``kata-deploy`` to install Kata Containers and the Kata runtime classes.
+  * Adding support for the NVIDIA Kata Sandbox Device Plugin.
+  * Supporting ``sandboxWorkloads.mode=kata``, which you set during installation or in the ClusterPolicy to enable Kata Containers.
+
+ Refer to the :doc:`Kata Containers documentation ` for full details on configuring the GPU Operator to use Kata Containers.
+
Fixed Issues
------------
@@ -157,6 +167,8 @@ Removals and Deprecations
-------------------------
* Marked unused field ``defaultRuntime`` as optional in the ClusterPolicy. (`PR #2000 `_)
+* The NVIDIA Kata Manager is now deprecated.
+  Refer to the :doc:`Kata Containers documentation ` for more information on using Kata Containers without this component.