Update docs for coco GA release#365

Open
a-mccarthy wants to merge 9 commits into NVIDIA:main from a-mccarthy:coco-26.3.0

Conversation

@a-mccarthy
Collaborator

No description provided.

@github-actions

Documentation preview

https://nvidia.github.io/cloud-native-docs/review/pr-365

#. Specify at least the following options when you install the Operator.
If you want to run Kata Containers by default on all worker nodes, also specify ``--set sandboxWorkloads.defaultWorkload=vm-passthrough``.

.. code-block:: console
Collaborator Author

@a-mccarthy a-mccarthy Mar 17, 2026

The upstream doc calls out enabling NFD in the install command (and also disabling it in the kata-deploy install). Is that needed? Can you elaborate on why users should include those?

Contributor

@jojimt - can you help here? see https://github.com/kata-containers/kata-containers/pull/12651/changes on what we currently suggest in the Kata docs

Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>
Prerequisites
=============

* Use a supported platform for Confidential Containers.
Collaborator Author

In terms of other services needed, should we call out that folks need to have a secure container registry, or any of the other services mentioned in the architecture image, https://nvidia.github.io/cloud-native-docs/review/pr-365/confidential-containers/latest/overview.html#architecture-overview? We talk about hardware, Kata, and the GPU Operator, but don't have as much detail about additional services and setup. @Hema-Bontha-NV @manuelh-dev

Contributor

I defer to @Hema-Bontha-NV here. This is a good question. Ideally they would sign their container images or use a registry they trust with signed images, and ideally they'd have a trusted environment in which they are running Trustee. This is, however, more for the production end-to-end scenario. Since this is our general deployment guide, we don't explain this in detail. Referring to such aspects can make sense, though.

Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>

.. _coco-supported-platforms:

Limitations and Restrictions
Collaborator Author

@Hema-Bontha-NV @manuelh-dev are there any more limitations we need to call out? Also, we don't currently mention anything for OpenShift.

Contributor

Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>

During attestation, the GPU will be set to ready. As such, when running a workload that does attestation, it is not necessary to set the ``nvrc.smi.srs=1`` and ``RUST_LOG=debug`` kernel parameters.

If attestation does not succeed, debugging is best done through the Trustee log. Debug mode can be enabled by setting the ``nvrc.smi.srs=1`` and ``RUST_LOG=debug`` kernel parameters in the Trustee environment.
Contributor

Nit: I would not mention nvrc.smi.srs=1 here.

This parameter transitions the GPU into the ready state. This is done automatically during attestation. I don't think we need to set this to debug attestation failures.

Yes, this has nothing to do with debugging, but note that we do need this to be set in general now.


.. code-block:: console

$ export VERSION="3.29.0"
Contributor

Did we intentionally decide against using the command from https://github.com/kata-containers/kata-containers/blob/main/docs/use-cases/NVIDIA-GPU-passthrough-and-Kata-QEMU.md#kata-containers?

export VERSION=$(curl -sSL https://api.github.com/repos/kata-containers/kata-containers/releases/latest | jq .tag_name | tr -d '"')

That command uses the GitHub API to determine the latest version. If newer versions are released, we either need to update the version here or rely on users not to use this outdated version in a few months.
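
One possible middle ground, shown only as a minimal sketch: keep the pinned version from the snippet as the documented default, but let users override it. The helper name `resolve_kata_version` and the `KATA_VERSION_OVERRIDE` variable are hypothetical and are not part of the docs or of Kata Containers.

```shell
#!/bin/sh
# Hypothetical helper, not part of the docs under review.
# PINNED_VERSION mirrors the value in the snippet above.
PINNED_VERSION="3.29.0"

resolve_kata_version() {
    # $1 is an optional override, for example a newer release tag.
    override="${1:-}"
    if [ -n "$override" ]; then
        printf '%s\n' "$override"
    else
        printf '%s\n' "$PINNED_VERSION"
    fi
}

# KATA_VERSION_OVERRIDE is an assumed variable name used for illustration.
VERSION="$(resolve_kata_version "${KATA_VERSION_OVERRIDE:-}")"
export VERSION
echo "Using Kata Containers ${VERSION}"
```

With this shape, the value of `KATA_VERSION_OVERRIDE` could come from the GitHub API command quoted above, while the default stays at a version the guide was written against.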

Next Steps
==========

* Refer to the :doc:`Attestation <attestation>` page for more information on configuringattestation.
Contributor

configuringattestation - missing whitespace

Contributor

Adding the pod security policy to protect the shim-to-agent interface using the genpolicy tool is related to attestation. Where we reference attestation, we could at least mention something like "and pod security policies" and refer to the relevant documentation in the kata-containers repository: https://github.com/kata-containers/kata-containers/blob/main/docs/how-to/how-to-use-the-kata-agent-policy.md


During attestation, the GPU will be set to ready. As such, when running a workload that does attestation, it is not necessary to set the ``nvrc.smi.srs=1`` and ``RUST_LOG=debug`` kernel parameters.

If attestation does not succeed, debugging is best done through the Trustee log. Debug mode can be enabled by setting the ``nvrc.smi.srs=1`` and ``RUST_LOG=debug`` kernel parameters in the Trustee environment.
I think nvrc.smi.srs only applies to the pod / CoCo UVM. It sets the ready state to true for the GPU.
And the Rust log level would be for Trustee.

Yes. Also, RUST_LOG=debug is not a kernel parameter. It's an environment variable. There is some info about enabling debug here.
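
To make the distinction concrete, here is a minimal sketch. It assumes the guest kernel command line can be read as a plain string (inside a guest it would come from `/proc/cmdline`); `has_kernel_param` and `SAMPLE_CMDLINE` are made-up names and values used for illustration only.

```shell
#!/bin/sh
# Returns 0 if the kernel command line in $1 contains parameter $2
# as a whole, space-delimited word.
has_kernel_param() {
    cmdline="$1"
    param="$2"
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Made-up example value; a real guest would read /proc/cmdline instead.
SAMPLE_CMDLINE="console=ttyS0 nvrc.smi.srs=1 quiet"

if has_kernel_param "$SAMPLE_CMDLINE" "nvrc.smi.srs=1"; then
    echo "kernel parameter present"
fi

# RUST_LOG is an environment variable: it is set per process,
# not on the kernel command line.
RUST_LOG=debug sh -c 'echo "RUST_LOG=${RUST_LOG}"'
```

The first check looks at the kernel command line, which is where `nvrc.smi.srs=1` belongs. The last line shows `RUST_LOG=debug` being passed through the environment of a single process, which is how the Trustee log level would be set.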

Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>
.. image:: graphics/CoCo-Sample-Workflow.png
:alt: Sample Workflow for Securing Model IP on Untrusted Infrastructure with CoCo

*Sample Workflow for Securing Model IP on Untrusted Infrastructure with CoCo*
Collaborator Author

@Hema-Bontha-NV can you share more about the workflow in this diagram? It shows steps 1-3, but we don't describe them in much detail.

Configure Image Pull Timeouts
-----------------------------

Using the guest-pull mechanism to securly manage images in your deployment scenarios means that pulling large images may take a significant amount of time and may delay container start.
Collaborator Author

What is the guest-pull mechanism? We reference it, but don't really explain it that well.


@fitzthum fitzthum left a comment

A few comments on the attestation stuff.

To enable the remote verifier, add the following lines to the Trustee configuration file::

[attestation_service.verifier_config.nvidia_verifier]
type = "Remote"
This is no longer needed. Remote verifier is set by default for docker compose.


Now, the guest can be used with attestation. For more information on how to provision Trustee with resources and policies, refer to the `Trustee documentation <https://confidentialcontainers.org/docs/attestation/>`_.

During attestation, the GPU will be set to ready. As such, when running a workload that does attestation, it is not necessary to set the ``nvrc.smi.srs=1`` kernel parameters.
This is no longer true. You need to set nvrc.smi.srs=1 for the GPU to be set to ready.


6 participants