Nvidia container on Proxmox LXC

Rationale

Before the nvidia-container-toolkit (or without it), GPU passthrough to containers, and to LXC in particular, required a rather convoluted setup: essentially a set of device bind mounts plus matching cgroup allow rules.

For example, a working LXC config looked like this:

lxc.cgroup2.devices.allow: c 195:* rwm
lxc.cgroup2.devices.allow: c 509:* rwm
lxc.cgroup2.devices.allow: c 234:* rwm
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-modeset dev/nvidia-modeset none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-caps/nvidia-cap1 dev/nvidia-caps/nvidia-cap1 none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-caps/nvidia-cap2 dev/nvidia-caps/nvidia-cap2 none bind,optional,create=file
lxc.mount.entry: /dev/dri/renderD128 dev/dri/renderD128 none bind,optional,create=file
lxc.cgroup2.devices.allow: c 226:128 rwm
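
The majors referenced by the cgroup2 rules come straight from the device nodes on the Proxmox host: 195 (nvidia0/nvidiactl) and 226 (DRM) are fixed, while the other two (509 and 234 above, typically nvidia-uvm and nvidia-caps) are allocated dynamically and differ between hosts and driver versions. A quick way to check them on your own host:

ls -l /dev/nvidia* /dev/nvidia-caps/ /dev/dri/renderD128
# the major is the first number after the owner/group columns, e.g.
# crw-rw-rw- 1 root root 195, 0 ... /dev/nvidia0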

With the toolkit, this whole block is replaced by the following, device-independent configuration:

lxc.hook.pre-start: sh -c '[ ! -f /dev/nvidia0 ] && /usr/bin/nvidia-modprobe -c0 -u'
lxc.environment: NVIDIA_VISIBLE_DEVICES=all
lxc.environment: NVIDIA_DRIVER_CAPABILITIES=compute,utility,video
lxc.hook.mount: /usr/share/lxc/hooks/nvidia
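
The pre-start hook only makes sure the device nodes exist before the container starts (right after boot the module may not have created them yet). You can run the same command by hand on the host and verify the effect:

/usr/bin/nvidia-modprobe -c0 -u
ls -l /dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm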

The new nvidia-container-toolkit approach


The NVIDIA Container Toolkit is a collection of libraries and utilities enabling users to build and run GPU-accelerated containers. It currently includes:

  • The NVIDIA Container Runtime (nvidia-container-runtime)
  • The NVIDIA Container Toolkit CLI (nvidia-ctk)
  • The NVIDIA CDI Hooks (nvidia-cdi-hook)
  • The NVIDIA Container Runtime Hook (nvidia-container-runtime-hook)
  • The NVIDIA Container CLI (nvidia-container-cli)
  • The NVIDIA Container Library (libnvidia-container1)

The NVIDIA container stack is architected so that it can be targeted to support any container runtime in the ecosystem.

How these components are used depends on the container runtime. For docker or containerd, the NVIDIA Container Runtime (nvidia-container-runtime) is configured as an OCI-compliant runtime, and the flow through the various components is shown in the following diagram:

[diagram: flow through the toolkit components with docker/containerd]
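
For reference, on the docker side the documented sample workload exercises exactly this path (runtime, runtime hook, container CLI, library):

docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi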

The flow through the components for cri-o and lxc is shown in the following diagram. Note that in this case the NVIDIA Container Runtime component is not required.

[diagram: flow through the toolkit components with cri-o/lxc]

The lxc backend relies on a hook script shipped with LXC:

https://github.com/lxc/lxc/blob/main/hooks/nvidia

The hook reads its configuration from the container's environment variables and ultimately invokes nvidia-container-cli configure:

Usage: nvidia-container-cli configure [-?cDgnuvV] [-d ID] [-l PATH] [-p PID]
            [-r EXPR] [--help] [--compat32] [--cuda-compat-mode=MODE]
            [--compute] [--device=ID] [--display] [--graphics]
            [--imex-channel=CHANNEL] [--ldconfig=PATH] [--mig-config=ID]
            [--mig-monitor=ID] [--no-cgroups] [--no-cntlibs] [--no-devbind]
            [--no-fabricmanager] [--no-gsp-firmware] [--no-persistenced]
            [--ngx] [--pid=PID] [--require=EXPR] [--usage] [--utility]
            [--video] [--version] ROOTFS
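
For an unprivileged container with the environment shown above, the command assembled by the hook looks roughly like the sketch below. This is illustrative only: the real flags (device selection, cgroup handling, rootfs path) are derived by the hook from NVIDIA_VISIBLE_DEVICES, NVIDIA_DRIVER_CAPABILITIES and the LXC runtime state.

# illustrative sketch, not the literal command built by the hook
nvidia-container-cli --load-kmods configure \
        --no-cgroups \
        --compute --utility --video \
        /var/lib/lxc/mycontainer/rootfs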

Install the toolkit on the host:

  1. Install the prerequisites for the instructions below:

sudo apt-get update && sudo apt-get install -y --no-install-recommends \
    curl \
    gnupg2

  2. Configure the production repository:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

Optionally, configure the repository to use experimental packages:

sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list

  3. Update the packages list from the repository:

sudo apt-get update

  4. Install the NVIDIA Container Toolkit packages:

export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.18.1-1
sudo apt-get install -y \
    nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}
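
Pinning NVIDIA_CONTAINER_TOOLKIT_VERSION as above assumes that exact version is still published; you can list what the repository actually offers and confirm the CLI afterwards:

apt-cache madison nvidia-container-toolkit
nvidia-container-cli --version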

Verify the installation with:

nvidia-container-cli list
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia-modeset
/dev/nvidia0
/usr/lib/nvidia/current/nvidia-smi
/usr/lib/nvidia/current/nvidia-debugdump
/usr/bin/nvidia-persistenced
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.535.183.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-cfg.so.535.183.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.535.183.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.535.183.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-allocator.so.535.183.01
/usr/lib/x86_64-linux-gnu/libnvidia-pkcs11-openssl3.so.535.183.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-encode.so.535.183.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvcuvid.so.535.183.01
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.535.183.01
/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.535.183.01
/usr/lib/x86_64-linux-gnu/libnvidia-tls.so.535.183.01
/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.535.183.01
/usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.535.183.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libGLX_nvidia.so.535.183.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libEGL_nvidia.so.535.183.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libGLESv2_nvidia.so.535.183.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libGLESv1_CM_nvidia.so.535.183.01
/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.535.183.01
/lib/firmware/nvidia/535.183.01/gsp_ga10x.bin
/lib/firmware/nvidia/535.183.01/gsp_tu10x.bin
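
Besides list, the CLI can also summarise what it detects on the host (driver version, CUDA version, GPU model), which is a useful sanity check before touching any container configuration:

nvidia-container-cli info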

Prepare the container

  • the container should be unprivileged
  • activate the features fuse=1, keyctl=1, mknod=1, nesting=1 (see the pct example below)
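
On Proxmox the features can be set with pct; the container ID 120 used below is taken from the rootfs line in the config that follows, while unprivileged has to be chosen when the container is created:

pct set 120 --features fuse=1,keyctl=1,mknod=1,nesting=1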

Then add the toolkit hook and environment to the container configuration:

arch: amd64
cores: 52
features: fuse=1,keyctl=1,mknod=1,nesting=1
hostname: mycontainer
memory: 250000
net0: name=eth0,bridge=vmbr0,firewall=1,hwaddr=BC:24:11:D0:78:01,ip=dhcp,type=veth
ostype: debian
rootfs: nfsha01:120/vm-120-disk-0.raw,size=48G
swap: 0
unprivileged: 1
lxc.mount.entry: /mnt/vivado01 mnt/vivado01 none bind,rw,create=dir 0 0
lxc.hook.pre-start: sh -c '[ ! -f /dev/nvidia0 ] && /usr/bin/nvidia-modprobe -c0 -u'
lxc.environment: NVIDIA_VISIBLE_DEVICES=all
lxc.environment: NVIDIA_DRIVER_CAPABILITIES=compute,utility,video
lxc.hook.mount: /usr/share/lxc/hooks/nvidia
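
With these lines in /etc/pve/lxc/120.conf, restart the container and check that the hook mapped the device nodes and the driver user-space libraries into it:

pct stop 120 && pct start 120
pct exec 120 -- nvidia-smi
pct exec 120 -- ls -l /dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm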