NVIDIA GB200 NVL72 is pushing AI infrastructure to new limits, enabling large-scale language model training and low-latency inference workloads. As Kubernetes plays an increasingly central role in deploying and scaling these workloads, rapidly evolving AI workloads, infrastructure requirements, and new hardware architectures bring new challenges for Kubernetes orchestration and resource management.
In this post, we take a close look at how to enable Multi-Node NVLink (MNNVL) on the GB200 platform with Kubernetes DRA (Dynamic Resource Allocation) and the NVIDIA DRA Driver, providing high-bandwidth GPU-to-GPU communication across nodes.
Core Concepts
Introduction to Kubernetes DRA (Dynamic Resource Allocation)
DRA (Dynamic Resource Allocation) was introduced in Kubernetes v1.30 to overcome the limitations of the traditional Device Plugin framework when dealing with complex, heterogeneous hardware.
The very first version of DRA (KEP-3063) appeared in v1.26 and was withdrawn in v1.32 because of usability problems.
The current DRA is the second iteration (KEP-4381), introduced in v1.30.
Why DRA?
The traditional Device Plugin framework abstracts hardware resources as simple integer counters, which cannot express:
- device attributes or device-specific configuration parameters
- topology relationships between devices
- sharing a device between workloads
- resource allocation that spans multiple nodes
DRA Core Design
DRA follows Kubernetes' declarative principles and makes hardware resource characteristics first-class citizens of the API (a minimal sketch follows the list below):
- DeviceClass (ResourceClass in the original DRA design): defines a hardware resource type and its characteristics (similar to StorageClass)
- ResourceClaim: a workload's declarative request for resources (similar to a PVC)
- ResourceClaimTemplate: a template for creating multiple similar ResourceClaims
- DRA Driver: the resource allocation logic implemented by the hardware vendor
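As a minimal sketch of how these objects fit together (not taken from the GB200 setup below; the resource.k8s.io/v1beta1 API is assumed, and the CEL attribute name and value are purely illustrative):
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gb200-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com        # DeviceClass published by the NVIDIA DRA driver
        selectors:
        - cel:
            expression: "device.attributes['gpu.nvidia.com'].productName == 'NVIDIA GB200'"  # illustrative attribute
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-example
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["sleep", "infinity"]
    resources:
      claims:
      - name: gpu                              # consume the claim by name
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gb200-gpu
Instead of asking for an opaque nvidia.com/gpu count, the Pod is bound to a node that can satisfy the claim, and the matching device is allocated to it.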
Advantages of DRA
- Declarative management: complex resource requirements are declared through standard APIs
- Parameterized configuration: supports device-specific parameters and resource sharing
- Topology awareness: scheduling decisions natively account for hardware topology
- Cross-node allocation: supports distributed resource allocation scenarios
The Value of GB200 MNNVL (Multi-Node NVLink)
GB200 NVL72 introduces Multi-Node NVLink (MNNVL), extending GPU performance beyond the limits of a single machine to the entire rack and bringing a step change for distributed AI workloads.
Traditional single-node DGX systems are bound by the physical limits of one machine; MNNVL changes this:
- Full NVLink bandwidth across nodes: GPU-to-GPU communication at full NVLink bandwidth via NVIDIA NVLink Switch
- Seamless scaling: the entire rack behaves as one unified GPU fabric
- Multiplied performance: enables extremely fast distributed training and inference
ComputeDomains: Connecting the Underlying Hardware to Kubernetes
IMEX (Internode Memory Exchange)
The NVIDIA Internode Memory Exchange Service (IMEX) is GPU-driver-level software that allows GPUs to communicate across nodes. IMEX provides fine-grained access control over each individual GPU memory export/import operation and operates within groups of nodes called IMEX domains.
Core Concepts of ComputeDomains
ComputeDomains, shipped as part of the NVIDIA DRA Driver for GPUs, connect the underlying GPU fabric (NVIDIA NVLink and NVIDIA IMEX) with modern Kubernetes-native scheduling concepts (Dynamic Resource Allocation, DRA), providing the foundation needed to run distributed multi-node workloads on modern GPU hardware.
Without ComputeDomains, multi-node NVLink setups would have to be defined manually and pinned in place, which limits the flexibility Kubernetes is meant to provide and comes at the cost of security isolation, fault isolation, and cost efficiency.
ComputeDomains work as follows:
- Dynamic IMEX domain creation: IMEX domains are formed automatically based on workload scheduling
- Security isolation: each workload gets its own dedicated, isolated communication environment
- Automatic cleanup: resources are released automatically once the workload completes
- Topology awareness: GPU connectivity is understood and exploited
With ComputeDomains, running distributed training or inference across complex NVLink-connected GPU fabrics becomes as simple as deploying a standard Kubernetes workload (see the sketch after this list).
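As a preview (a minimal sketch only; the full, tested manifests appear in the deployment section below), a workload declares a ComputeDomain and consumes the IMEX channel it provides through an ordinary resource claim:
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: demo-domain
spec:
  numNodes: 2                     # number of nodes expected to join the IMEX domain
  channel:
    resourceClaimTemplate:
      name: demo-domain-channel   # claim template that injects the IMEX channel into consuming pods
Pods that reference the generated ResourceClaimTemplate in their resourceClaims are placed into the same IMEX domain and see an IMEX channel device under /dev/nvidia-caps-imex-channels.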
Environment Deployment
Version Requirements
Software:
- Kubernetes: 1.32 or later, 1.34 recommended (on 1.32/1.33 the DRA feature gates and APIs must be enabled explicitly; see the sketch after this list)
- Containerd: a 1.7.x release with DRA support, 1.7.29 recommended
- GPU Operator: 25.3.x or later, 25.10.0 recommended
- DRA Driver: the latest 25.8.0 release is recommended
- NVIDIA GPU Driver: 565 or newer; with DRA Driver 25.8.0, driver version >= 570.158.1 is required
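On Kubernetes 1.32/1.33 the DRA APIs are still beta and have to be switched on explicitly (in 1.34 DRA is GA and on by default). A minimal sketch, assuming a kubeadm-managed cluster; adjust to however your control plane is provisioned:
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
apiServer:
  extraArgs:
  - name: feature-gates
    value: "DynamicResourceAllocation=true"
  - name: runtime-config
    value: "resource.k8s.io/v1beta1=true"   # DRA API group/version used by 1.32
controllerManager:
  extraArgs:
  - name: feature-gates
    value: "DynamicResourceAllocation=true"
scheduler:
  extraArgs:
  - name: feature-gates
    value: "DynamicResourceAllocation=true"
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  DynamicResourceAllocation: true
In addition, containerd 1.7.x needs CDI enabled (enable_cdi = true in its CRI plugin config) so that DRA-allocated devices can actually be injected into containers.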
Hardware:
- System: GB200 NVL72 (a 2-node subset of a single rack)
- Nodes: 2
- Per node: 4 GB200 GPUs (8 GPUs total in this test), 192 GB of memory per GPU (189471 MiB), 2 Grace CPUs
Basic GPU information:
root@GB200-POD2-F06-Node05:~/lixd/nccl-demo# nvidia-smi
Wed Dec 10 10:24:44 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GB200 On | 00000008:01:00.0 Off | 0 |
| N/A 37C P0 169W / 1200W | 0MiB / 189471MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GB200 On | 00000009:01:00.0 Off | 0 |
| N/A 37C P0 157W / 1200W | 0MiB / 189471MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GB200 On | 00000018:01:00.0 Off | 0 |
| N/A 38C P0 158W / 1200W | 0MiB / 189471MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA GB200 On | 00000019:01:00.0 Off | 0 |
| N/A 37C P0 166W / 1200W | 0MiB / 189471MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
GPU topology:
root@GB200-POD2-F06-Node05:~/lixd/nccl-demo# nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 SYS SYS SYS SYS SYS SYS SYS SYS 0-71 0,2-17 N/A
GPU1 NV18 X NV18 NV18 SYS SYS SYS SYS SYS SYS SYS SYS 0-71 0,2-17 N/A
GPU2 NV18 NV18 X NV18 SYS SYS SYS SYS SYS SYS SYS SYS 72-143 1,18-33 N/A
GPU3 NV18 NV18 NV18 X SYS SYS SYS SYS SYS SYS SYS SYS 72-143 1,18-33 N/A
NIC0 SYS SYS SYS SYS X SYS SYS SYS SYS SYS SYS SYS
NIC1 SYS SYS SYS SYS SYS X SYS SYS SYS SYS SYS SYS
NIC2 SYS SYS SYS SYS SYS SYS X PIX SYS SYS SYS SYS
NIC3 SYS SYS SYS SYS SYS SYS PIX X SYS SYS SYS SYS
NIC4 SYS SYS SYS SYS SYS SYS SYS SYS X SYS SYS SYS
NIC5 SYS SYS SYS SYS SYS SYS SYS SYS SYS X SYS SYS
NIC6 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS X PIX
NIC7 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
Environment Preparation
GPU Operator
Deployment
# Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
# Deploy GPU Operator
helm install --wait gpu-operator \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --version=v25.10.0 \
  --set driver.enabled=true \
  --set dcgmExporter.serviceMonitor.enabled=true \
  --set dcgm.enabled=true
Tip: deployment can take 5-10 minutes, so be patient. You can monitor progress with kubectl get pods -n gpu-operator.
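The same settings can be kept in a values file instead of repeated --set flags; a small sketch (the keys simply mirror the flags above, and the file name is arbitrary):
# gpu-operator-values.yaml
driver:
  enabled: true            # let the operator install the GPU driver on the nodes
dcgm:
  enabled: true
dcgmExporter:
  serviceMonitor:
    enabled: true          # requires the Prometheus Operator ServiceMonitor CRD
Pass it with helm install ... -f gpu-operator-values.yaml in place of the --set arguments.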
Verification
If the deployment succeeded, the nvidia.com/gpu resource is visible on the nodes:
root@GB200-POD2-F06-Node05:~/lixd# kubectl describe node gb200-pod2-f06-node05 | grep Capacity -C 7
Addresses:
InternalIP: 10.0.6.41
Hostname: gb200-pod2-f06-node05
Capacity:
cpu: 144
ephemeral-storage: 1840577300Ki
hugepages-16Gi: 0
hugepages-2Mi: 0
hugepages-512Mi: 0
memory: 1002717120Ki
nvidia.com/gpu: 4
NVIDIA DRA Driver
Deployment
# Deploy the NVIDIA DRA Driver
helm upgrade --install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
  --version="25.8.0" \
  --create-namespace \
  --namespace nvidia-dra-driver-gpu \
  --set resources.gpus.enabled=false \
  --set nvidiaDriverRoot=/run/nvidia/driver
Key parameters:
- resources.gpus.enabled=false: disables the DRA driver's own GPU resource management, which is left to the GPU Operator
- nvidiaDriverRoot=/run/nvidia/driver: specifies the NVIDIA driver path (the GPU Operator installs the driver under /run/nvidia/driver)
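Equivalently, these options can live in a small values file (a sketch; the keys mirror the --set flags above and the file name is arbitrary):
# dra-driver-values.yaml
resources:
  gpus:
    enabled: false                        # plain GPU allocation stays with the GPU Operator's device plugin
nvidiaDriverRoot: /run/nvidia/driver      # use "/" if the NVIDIA driver is installed directly on the host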
Verification
If everything is working, the nvidia.com/gpu.clique label shows up on the nodes. Its value has the form <ClusterUUID>.<CliqueID>; nodes sharing the same value belong to the same NVLink partition.
root@GB200-POD2-F06-Node05:~/lixd# ( echo -e "NODE\tLABEL\tCLIQUE" ; kubectl get nodes -o json | \
/usr/bin/jq -r '.items[] | [.metadata.name, "nvidia.com/gpu.clique", .metadata.labels["nvidia.com/gpu.clique"]] | @tsv' ) | \
column -t
NODE LABEL CLIQUE
gb200-pod2-f06-node05 nvidia.com/gpu.clique 69a19a31-f41c-45a5-8245-579b6bce5bdd.32766
gb200-pod2-f06-node06 nvidia.com/gpu.clique 69a19a31-f41c-45a5-8245-579b6bce5bdd.32766
Creating an IMEX Workload
cat <<EOF > imex-channel-injection.yaml
---
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: imex-channel-injection
spec:
  numNodes: 1                 # a single-node domain is enough to verify channel injection
  channel:
    resourceClaimTemplate:
      name: imex-channel-0
---
apiVersion: v1
kind: Pod
metadata:
  name: imex-channel-injection
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.clique
            operator: Exists
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["ls -la /dev/nvidia-caps-imex-channels; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: imex-channel-0
  resourceClaims:
  - name: imex-channel-0                       # claim generated from the ComputeDomain's template
    resourceClaimTemplateName: imex-channel-0
EOF
Apply the manifest and check the logs; you should see the injected IMEX channel:
root@GB200-POD2-F06-Node05:~/lixd/demo# kubectl apply -f imex-channel-injection.yaml
computedomain.resource.nvidia.com/imex-channel-injection created
pod/imex-channel-injection created
root@GB200-POD2-F06-Node05:~/lixd/demo# kubectl get pods
NAME READY STATUS RESTARTS AGE
imex-channel-injection 1/1 Running 0 5s
root@GB200-POD2-F06-Node05:~/lixd/demo# kubectl logs imex-channel-injection
total 0
drwxr-xr-x 2 root root 60 Jan 5 08:31 .
drwxr-xr-x 6 root root 380 Jan 5 08:31 ..
crw-rw-rw- 1 root root 501, 0 Jan 5 08:31 channel0
This confirms that the nvidia-dra-driver has been deployed successfully.
Validation Tests
Installing the MPI Operator
First install the MPI Operator, which is used to run multi-node MPI jobs:
kubectl create -f https://github.com/kubeflow/mpi-operator/releases/download/v0.6.0/mpi-operator.yaml
nvbandwidth Test
cat <<EOF > nvbandwidth-test-job.yaml
---
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: nvbandwidth-test-compute-domain
spec:
  numNodes: 2                  # the domain spans both GB200 nodes
  channel:
    resourceClaimTemplate:
      name: nvbandwidth-test-compute-domain-channel
---
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nvbandwidth-test
spec:
  slotsPerWorker: 4            # one slot per GPU on each worker node
  launcherCreationPolicy: WaitForWorkersReady
  runPolicy:
    cleanPodPolicy: Running
  sshAuthMountPath: /home/mpiuser/.ssh
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        metadata:
          labels:
            nvbandwidth-test-replica: mpi-launcher
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: node-role.kubernetes.io/control-plane
                    operator: Exists
          containers:
          - image: ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163
            name: mpi-launcher
            securityContext:
              runAsUser: 1000
            command:
            - mpirun
            args:
            - --bind-to
            - core
            - --map-by
            - ppr:4:node
            - -np
            - "8"
            - --report-bindings
            - -q
            - nvbandwidth
            - -t
            - multinode_device_to_device_memcpy_read_ce
    Worker:
      replicas: 2
      template:
        metadata:
          labels:
            nvbandwidth-test-replica: mpi-worker
        spec:
          affinity:
            podAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: nvbandwidth-test-replica
                    operator: In
                    values:
                    - mpi-worker
                topologyKey: nvidia.com/gpu.clique   # co-locate workers within one NVLink clique
          containers:
          - image: ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163
            name: mpi-worker
            securityContext:
              runAsUser: 1000
            env:
            command:
            - /usr/sbin/sshd
            args:
            - -De
            - -f
            - /home/mpiuser/.sshd_config
            resources:
              limits:
                nvidia.com/gpu: 4
              claims:
              - name: compute-domain-channel
          resourceClaims:
          - name: compute-domain-channel
            resourceClaimTemplateName: nvbandwidth-test-compute-domain-channel
EOF
Apply
root@GB200-POD2-F06-Node05:~/lixd/demo# kubectl apply -f nvbandwidth-test-job.yaml
computedomain.resource.nvidia.com/nvbandwidth-test-compute-domain created
mpijob.kubeflow.org/nvbandwidth-test created
Pods are launched automatically to run the test:
root@GB200-POD2-F06-Node05:~/lixd/demo# k get po
NAME READY STATUS RESTARTS AGE
nvbandwidth-test-launcher-xl87m 1/1 Running 0 26s
nvbandwidth-test-worker-0 1/1 Running 0 7m41s
nvbandwidth-test-worker-1 1/1 Running 0 7m41s
Check the logs:
kubectl logs --tail=-1 -l job-name=nvbandwidth-test-launcher
The test results are as follows:
nvbandwidth Version: v0.7
Built from Git version: v0.7
MPI version: Open MPI v4.1.4, package: Debian OpenMPI, ident: 4.1.4, repo rev: v4.1.4, May 26, 2022
CUDA Runtime Version: 13000
CUDA Driver Version: 13000
Driver Version: 580.95.05
Process 0 (nvbandwidth-test-worker-0): device 0: NVIDIA GB200 (00000008:01:00)
Process 1 (nvbandwidth-test-worker-0): device 1: NVIDIA GB200 (00000009:01:00)
Process 2 (nvbandwidth-test-worker-0): device 2: NVIDIA GB200 (00000018:01:00)
Process 3 (nvbandwidth-test-worker-0): device 3: NVIDIA GB200 (00000019:01:00)
Process 4 (nvbandwidth-test-worker-1): device 0: NVIDIA GB200 (00000008:01:00)
Process 5 (nvbandwidth-test-worker-1): device 1: NVIDIA GB200 (00000009:01:00)
Process 6 (nvbandwidth-test-worker-1): device 2: NVIDIA GB200 (00000018:01:00)
Process 7 (nvbandwidth-test-worker-1): device 3: NVIDIA GB200 (00000019:01:00)
Running multinode_device_to_device_memcpy_read_ce.
memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 N/A 822.16 821.69 821.92 821.45 821.30 821.53 821.69
1 820.90 N/A 821.92 821.61 821.30 821.06 822.00 821.69
2 820.59 821.77 N/A 821.69 821.45 821.06 821.37 821.30
3 820.51 821.77 821.61 N/A 821.37 821.22 821.30 821.92
4 820.75 821.53 821.45 821.85 N/A 821.37 821.61 821.85
5 820.51 821.53 821.22 821.69 821.69 N/A 821.61 821.77
6 820.35 821.30 821.53 821.37 821.30 820.90 N/A 821.14
7 820.59 821.69 820.98 821.37 821.37 821.14 821.30 N/A
SUM multinode_device_to_device_memcpy_read_ce 45997.93
NOTE: The reported results may not reflect the full capabilities of the platform.
Performance can vary with software drivers, hardware clocks, and system topology.
The results show cross-node GPU-to-GPU bandwidth holding steady at around 820 GB/s per GPU pair, far beyond what network interconnects such as InfiniBand deliver (a 400 Gb/s NDR link is roughly 50 GB/s), giving large-scale distributed AI training a strong communication foundation.
Summary
This hands-on guide walked through deploying and configuring the NVIDIA DRA Driver on the GB200 platform to enable Multi-Node NVLink (MNNVL). The main takeaways:
- Core concepts: how DRA, IMEX, and ComputeDomains work
- Environment deployment: GPU Operator and the DRA Driver deployed successfully
- Functional validation: cross-node GPU communication confirmed with the nvbandwidth test
ComputeDomains abstract the complex underlying GPU hardware into Kubernetes-native resources, making multi-node distributed AI workloads simple and efficient to manage. As support for more NVIDIA architectures lands, this technology will play an increasingly important role in AI infrastructure.
References