⚡ An AI Cluster Communication Revolution: ~800 GB/s Cross-Node Communication with GB200 MNNVL and Kubernetes DRA


NVIDIA GB200 NVL72 is pushing AI infrastructure to new limits, enabling large-scale language model training and low-latency inference workloads. As Kubernetes takes on an increasingly central role in deploying and scaling these workloads, fast-evolving AI workloads, infrastructure requirements, and new hardware architectures bring new challenges to Kubernetes orchestration and resource management.

In this post we take a close look at how to enable Multi-Node NVLink (MNNVL) on the GB200 platform using Kubernetes DRA (Dynamic Resource Allocation) and the NVIDIA DRA Driver, delivering high-bandwidth GPU-to-GPU communication across nodes.

Core Concepts

Introduction to Kubernetes DRA (Dynamic Resource Allocation)

DRA (Dynamic Resource Allocation) is a landmark feature introduced in Kubernetes v1.30 to address the limitations of the traditional Device Plugin framework when handling complex, heterogeneous hardware.

The original DRA design (KEP-3063) was introduced in v1.26 and withdrawn in v1.32 because of usability problems. The current DRA is the second iteration ("structured parameters", KEP-4381), introduced in v1.30 and graduated to GA in v1.34.

Why DRA?

The traditional Device Plugin abstracts hardware resources as simple integer counters and cannot express:

  • Device-specific attributes (GPU memory, compute capability, topology connectivity)

    • every GPU, regardless of model, is exposed simply as nvidia.com/gpu
  • Resource-sharing requirements (a device cannot be shared between containers)

  • Hardware topology relationships (NVLink connectivity, PCIe affinity)

DRA Core Design

DRA follows Kubernetes' declarative principles and makes hardware characteristics first-class citizens of the API (a minimal sketch follows the list):

  • DeviceClass (ResourceClass in the original proposal): defines a hardware resource type and its attributes (similar to StorageClass)
  • ResourceClaim: a workload's declarative resource request (similar to a PVC)
  • ResourceClaimTemplate: a template for generating multiple similar ResourceClaims
  • DRA Driver: the resource-allocation logic implemented by the hardware vendor
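
To make these objects concrete, here is a minimal sketch of how they fit together. It assumes the resource.k8s.io/v1beta1 API (the DRA beta shipped with Kubernetes 1.32) and a hypothetical device class named gpu.example.com; the real class name is published by whichever DRA driver you install.

# Minimal DRA sketch (not tied to any specific driver): a ResourceClaimTemplate
# requesting one device from a placeholder DeviceClass, consumed by a Pod.
cat <<EOF > dra-sketch.yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.example.com   # placeholder; depends on the installed DRA driver
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-sketch
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c", "sleep infinity"]
    resources:
      claims:
      - name: gpu            # references the entry in spec.resourceClaims below
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
EOF
kubectl apply -f dra-sketch.yaml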

Advantages of DRA

  • Declarative management: complex resource requirements are expressed through standard APIs
  • Parameterized configuration: device-specific parameters and resource sharing are supported
  • Topology awareness: scheduling decisions natively take hardware topology into account
  • Cross-node allocation: distributed resource-allocation scenarios are supported

GB200 NVL72 and Multi-Node NVLink (MNNVL)

GB200 NVL72 introduces Multi-Node NVLink (MNNVL), extending GPU connectivity beyond the limits of a single machine to the entire rack and bringing a step-change improvement for distributed AI workloads.

Traditional single-node DGX systems are bounded by the physical limits of one chassis; MNNVL changes that (a way to inspect the NVLink fabric from a node is sketched after this list):

  • Full NVLink bandwidth across nodes: GPU-to-GPU traffic between nodes runs over NVIDIA NVLink Switches at full NVLink bandwidth
  • Seamless scaling: the entire rack becomes one unified GPU fabric
  • Performance multiplication: enables very fast distributed training and inference
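
To see what this looks like from a node, NVLink link state and the fabric registration (cluster UUID and clique ID) can be inspected with nvidia-smi. This is a sketch; the exact output and the field names of the Fabric section vary across driver versions.

# Per-GPU NVLink link status
nvidia-smi nvlink --status
# On NVLink-fabric systems the per-GPU query output includes a Fabric section
# with the fabric state, cluster UUID and clique ID (names vary by driver).
nvidia-smi -q -i 0 | grep -i -A 6 "Fabric"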

ComputeDomains: Connecting the Underlying Hardware to Kubernetes

IMEX (Internode Memory Exchange)

The NVIDIA Internode Memory Exchange Service (IMEX) is GPU-driver-level software that allows GPUs to communicate across nodes. IMEX applies fine-grained access control to each individual GPU memory export/import operation and operates within groups of nodes called IMEX domains.

ComputeDomains Core Concepts

ComputeDomains, shipped as part of the NVIDIA DRA Driver for GPUs, connect the underlying GPU fabric constructs (NVIDIA NVLink and NVIDIA IMEX) with modern Kubernetes-native scheduling concepts (Dynamic Resource Allocation, DRA), providing the foundation required to run distributed multi-node workloads on modern GPU hardware.

Without ComputeDomains, multi-node NVLink setups would have to be defined manually and pinned in place, which limits the flexibility Kubernetes is meant to provide and sacrifices security isolation, fault isolation, and cost efficiency.

ComputeDomains work as follows:

  • Dynamic IMEX domain creation: IMEX domains are formed automatically based on where the workload is scheduled
  • Security isolation: each workload gets its own dedicated, isolated communication environment
  • Automatic cleanup: resources are released automatically when the workload finishes
  • Topology awareness: the driver understands GPU connectivity and optimizes for it

With ComputeDomains, running distributed training or inference across a complex NVLink-connected GPU fabric becomes as simple as deploying a standard Kubernetes workload.
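
Concretely, once the NVIDIA DRA driver is installed (covered below), a ComputeDomain is just a namespaced custom resource: you declare how many nodes the workload will span and the name of the ResourceClaimTemplate the driver should generate for the IMEX channel. A minimal sketch (all names here are placeholders; the field layout mirrors the manifests used later in this post):

cat <<EOF | kubectl apply -f -
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: my-compute-domain
spec:
  numNodes: 2                      # nodes the distributed workload will span
  channel:
    resourceClaimTemplate:
      name: my-imex-channel        # ResourceClaimTemplate generated by the driver
EOF

# ComputeDomains are regular custom resources and can be inspected with kubectl
kubectl get computedomains.resource.nvidia.com -A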

Environment Deployment

Version Requirements

Software (spot-check commands are sketched after the hardware list):

  • Kubernetes: 1.32 or later, 1.34 recommended
  • containerd: a DRA-capable 1.7.x release, 1.7.29 recommended
  • GPU Operator: 25.3.x or later, 25.10.0 recommended
  • DRA Driver: the latest 25.8.0 recommended
  • NVIDIA GPU Driver: 565 or newer
    • with DRA Driver 25.8.0, the driver must be >= 570.158.1

Hardware:

  • System: GB200 NVL72 (a 2-node subset of one rack)
  • Node count: 2 nodes
  • Per-node configuration
    • GPU: 4 GB200 GPUs (8 GPUs total in this test)
    • GPU memory: 192 GB per GPU (189471 MiB)
    • CPU: 2 Grace CPUs
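
A few of the software prerequisites can be spot-checked directly on a node before installing anything (a sketch; adjust to your environment):

# Spot-check the software prerequisites (sketch)
kubectl version                        # Kubernetes >= 1.32, 1.34 recommended
containerd --version                   # a DRA-capable 1.7.x release
nvidia-smi --query-gpu=driver_version --format=csv,noheader   # >= 570.158.1 for DRA Driver 25.8.0
kubectl api-resources --api-group=resource.k8s.io             # the DRA API group should be served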

GPU basic information:

root@GB200-POD2-F06-Node05:~/lixd/nccl-demo# nvidia-smi
Wed Dec 10 10:24:44 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB200                   On  |   00000008:01:00.0 Off |                    0 |
| N/A   37C    P0            169W / 1200W |       0MiB / 189471MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GB200                   On  |   00000009:01:00.0 Off |                    0 |
| N/A   37C    P0            157W / 1200W |       0MiB / 189471MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GB200                   On  |   00000018:01:00.0 Off |                    0 |
| N/A   38C    P0            158W / 1200W |       0MiB / 189471MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GB200                   On  |   00000019:01:00.0 Off |                    0 |
| N/A   37C    P0            166W / 1200W |       0MiB / 189471MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

GPU topology:

root@GB200-POD2-F06-Node05:~/lixd/nccl-demo# nvidia-smi topo -m
	GPU0	GPU1	GPU2	GPU3	NIC0	NIC1	NIC2	NIC3	NIC4	NIC5	NIC6	NIC7	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV18	NV18	NV18	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	0-71	0,2-17		N/A
GPU1	NV18	 X 	NV18	NV18	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	0-71	0,2-17		N/A
GPU2	NV18	NV18	 X 	NV18	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	72-143	1,18-33		N/A
GPU3	NV18	NV18	NV18	 X 	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	72-143	1,18-33		N/A
NIC0	SYS	SYS	SYS	SYS	 X 	SYS	SYS	SYS	SYS	SYS	SYS	SYS
NIC1	SYS	SYS	SYS	SYS	SYS	 X 	SYS	SYS	SYS	SYS	SYS	SYS
NIC2	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PIX	SYS	SYS	SYS	SYS
NIC3	SYS	SYS	SYS	SYS	SYS	SYS	PIX	 X 	SYS	SYS	SYS	SYS
NIC4	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	SYS	SYS	SYS
NIC5	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	SYS	SYS
NIC6	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PIX
NIC7	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PIX	 X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7

Environment Preparation

GPU Operator

Deployment

# Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

# Deploy the GPU Operator
helm install --wait gpu-operator \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator \
     --version=v25.10.0 \
     --set driver.enabled=true \
     --set dcgmExporter.serviceMonitor.enabled=true \
     --set dcgm.enabled=true

Tip: the deployment can take 5-10 minutes, so be patient. You can monitor progress with kubectl get pods -n gpu-operator.
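
To avoid polling by hand, you can also wait for the operator's validation pods to become Ready. This is a sketch; the app=nvidia-operator-validator label is an assumption and may differ between GPU Operator releases.

# Watch overall progress
kubectl get pods -n gpu-operator
# Optionally block until the validator pods are Ready (label is an assumption)
kubectl -n gpu-operator wait pod -l app=nvidia-operator-validator \
  --for=condition=Ready --timeout=15m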

Verification

If the deployment succeeded, the nvidia.com/gpu resource appears on the node:

root@GB200-POD2-F06-Node05:~/lixd# kubectl describe node gb200-pod2-f06-node05|grep Capacity -C 7
Addresses:
  InternalIP:  10.0.6.41
  Hostname:    gb200-pod2-f06-node05
Capacity:
  cpu:                        144
  ephemeral-storage:          1840577300Ki
  hugepages-16Gi:             0
  hugepages-2Mi:              0
  hugepages-512Mi:            0
  memory:                     1002717120Ki
  nvidia.com/gpu:             4

NVIDIA DRA Driver

Deployment

# Deploy the NVIDIA DRA Driver
helm upgrade --install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
    --version="25.8.0" \
    --create-namespace \
    --namespace nvidia-dra-driver-gpu \
    --set resources.gpus.enabled=false \
    --set nvidiaDriverRoot=/run/nvidia/driver

Key parameters (a quick status check is sketched after this list)

  • resources.gpus.enabled=false: disables the driver's own GPU resource management, which is handled by the GPU Operator here
  • nvidiaDriverRoot=/run/nvidia/driver: path to the NVIDIA driver; this value matches a GPU Operator-managed driver container, whereas a driver installed directly on the host typically uses /
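
A quick way to confirm the driver components came up before moving on (a sketch; the exact pod names depend on the chart version):

# The chart typically runs a controller plus per-node kubelet-plugin pods;
# all of them should be Running.
kubectl get pods -n nvidia-dra-driver-gpu -o wide
# ResourceSlices are the DRA inventory objects; depending on configuration this
# list may still be empty at this point.
kubectl get resourceslices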

Verification

If everything is working, the nodes carry a nvidia.com/gpu.clique label; nodes on the same NVLink fabric share the same value (formatted as <cluster UUID>.<clique ID>):

root@GB200-POD2-F06-Node05:~/lixd# (echo -e "NODE\tLABEL\tCLIQUE"; kubectl get nodes -o json | \
    /usr/bin/jq -r '.items[] | [.metadata.name, "nvidia.com/gpu.clique", .metadata.labels["nvidia.com/gpu.clique"]] | @tsv') | \
    column -t
NODE                   LABEL                  CLIQUE
gb200-pod2-f06-node05  nvidia.com/gpu.clique  69a19a31-f41c-45a5-8245-579b6bce5bdd.32766
gb200-pod2-f06-node06  nvidia.com/gpu.clique  69a19a31-f41c-45a5-8245-579b6bce5bdd.32766

Create an IMEX Workload

cat <<EOF > imex-channel-injection.yaml
---
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: imex-channel-injection
spec:
  numNodes: 1
  channel:
    resourceClaimTemplate:
      name: imex-channel-0
---
apiVersion: v1
kind: Pod
metadata:
  name: imex-channel-injection
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.clique
            operator: Exists
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["ls -la /dev/nvidia-caps-imex-channels; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: imex-channel-0
  resourceClaims:
  - name: imex-channel-0
    resourceClaimTemplateName: imex-channel-0
EOF

Check the logs; you should see the injected IMEX channel:

root@GB200-POD2-F06-Node05:~/lixd/demo# kubectl apply -f imex-channel-injection.yaml
computedomain.resource.nvidia.com/imex-channel-injection created
pod/imex-channel-injection created
root@GB200-POD2-F06-Node05:~/lixd/demo# kubectl get pods
NAME                         READY   STATUS    RESTARTS   AGE
imex-channel-injection       1/1     Running   0          5s
root@GB200-POD2-F06-Node05:~/lixd/demo# kubectl logs imex-channel-injection
total 0
drwxr-xr-x 2 root root     60 Jan  5 08:31 .
drwxr-xr-x 6 root root    380 Jan  5 08:31 ..
crw-rw-rw- 1 root root 501, 0 Jan  5 08:31 channel0

This confirms that the nvidia-dra-driver is deployed and working.
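
The smoke-test resources (the ComputeDomain and the Pod) can then be removed:

kubectl delete -f imex-channel-injection.yaml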

Validation Testing

Install the MPI Operator

First install the MPI Operator, which is used to run multi-node MPI jobs:

kubectl create -f https://github.com/kubeflow/mpi-operator/releases/download/v0.6.0/mpi-operator.yaml
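
A quick check that the operator is running (a sketch; the mpi-operator namespace comes from the upstream manifest and may differ if you customize the install):

kubectl get pods -n mpi-operator
kubectl get crd mpijobs.kubeflow.org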

nvbandwidth Test

cat <<EOF > nvbandwidth-test-job.yaml
---
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: nvbandwidth-test-compute-domain
spec:
  numNodes: 2
  channel:
    resourceClaimTemplate:
      name: nvbandwidth-test-compute-domain-channel

---
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nvbandwidth-test
spec:
  slotsPerWorker: 4
  launcherCreationPolicy: WaitForWorkersReady
  runPolicy:
    cleanPodPolicy: Running
  sshAuthMountPath: /home/mpiuser/.ssh
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        metadata:
          labels:
            nvbandwidth-test-replica: mpi-launcher
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: node-role.kubernetes.io/control-plane
                    operator: Exists
          containers:
          - image: ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163
            name: mpi-launcher
            securityContext:
              runAsUser: 1000
            command:
            - mpirun
            args:
            - --bind-to
            - core
            - --map-by
            - ppr:4:node
            - -np
            - "8"
            - --report-bindings
            - -q
            - nvbandwidth
            - -t
            - multinode_device_to_device_memcpy_read_ce
    Worker:
      replicas: 2
      template:
        metadata:
          labels:
            nvbandwidth-test-replica: mpi-worker
        spec:
          affinity:
            podAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: nvbandwidth-test-replica
                    operator: In
                    values:
                    - mpi-worker
                topologyKey: nvidia.com/gpu.clique
          containers:
          - image: ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163
            name: mpi-worker
            securityContext:
              runAsUser: 1000
            command:
            - /usr/sbin/sshd
            args:
            - -De
            - -f
            - /home/mpiuser/.sshd_config
            resources:
              limits:
                nvidia.com/gpu: 4
              claims:
              - name: compute-domain-channel
          resourceClaims:
          - name: compute-domain-channel
            resourceClaimTemplateName: nvbandwidth-test-compute-domain-channel
EOF

Apply

root@GB200-POD2-F06-Node05:~/lixd/demo# kubectl apply -f nvbandwidth-test-job.yaml
computedomain.resource.nvidia.com/nvbandwidth-test-compute-domain created
mpijob.kubeflow.org/nvbandwidth-test created

Pods are started automatically to run the test:

root@GB200-POD2-F06-Node05:~/lixd/demo# k get po
NAME                              READY   STATUS    RESTARTS   AGE
nvbandwidth-test-launcher-xl87m   1/1     Running   0          26s
nvbandwidth-test-worker-0         1/1     Running   0          7m41s
nvbandwidth-test-worker-1         1/1     Running   0          7m41s
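
Besides the Pod list, the MPIJob object itself reports progress and conditions (a sketch):

kubectl get mpijob nvbandwidth-test
kubectl describe mpijob nvbandwidth-test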

Check the logs:

 kubectl logs --tail=-1 -l job-name=nvbandwidth-test-launcher

The results:

nvbandwidth Version: v0.7
Built from Git version: v0.7

MPI version: Open MPI v4.1.4, package: Debian OpenMPI, ident: 4.1.4, repo rev: v4.1.4, May 26, 2022
CUDA Runtime Version: 13000
CUDA Driver Version: 13000
Driver Version: 580.95.05

Process 0 (nvbandwidth-test-worker-0): device 0: NVIDIA GB200 (00000008:01:00)
Process 1 (nvbandwidth-test-worker-0): device 1: NVIDIA GB200 (00000009:01:00)
Process 2 (nvbandwidth-test-worker-0): device 2: NVIDIA GB200 (00000018:01:00)
Process 3 (nvbandwidth-test-worker-0): device 3: NVIDIA GB200 (00000019:01:00)
Process 4 (nvbandwidth-test-worker-1): device 0: NVIDIA GB200 (00000008:01:00)
Process 5 (nvbandwidth-test-worker-1): device 1: NVIDIA GB200 (00000009:01:00)
Process 6 (nvbandwidth-test-worker-1): device 2: NVIDIA GB200 (00000018:01:00)
Process 7 (nvbandwidth-test-worker-1): device 3: NVIDIA GB200 (00000019:01:00)

Running multinode_device_to_device_memcpy_read_ce.
memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
           0         1         2         3         4         5         6         7
 0       N/A    822.16    821.69    821.92    821.45    821.30    821.53    821.69
 1    820.90       N/A    821.92    821.61    821.30    821.06    822.00    821.69
 2    820.59    821.77       N/A    821.69    821.45    821.06    821.37    821.30
 3    820.51    821.77    821.61       N/A    821.37    821.22    821.30    821.92
 4    820.75    821.53    821.45    821.85       N/A    821.37    821.61    821.85
 5    820.51    821.53    821.22    821.69    821.69       N/A    821.61    821.77
 6    820.35    821.30    821.53    821.37    821.30    820.90       N/A    821.14
 7    820.59    821.69    820.98    821.37    821.37    821.14    821.30       N/A

SUM multinode_device_to_device_memcpy_read_ce 45997.93

NOTE: The reported results may not reflect the full capabilities of the platform.
Performance can vary with software drivers, hardware clocks, and system topology.

The results show cross-node GPU-to-GPU bandwidth holding steady at roughly 820 GB/s per GPU pair, well beyond traditional network interconnects such as InfiniBand (a single 400 Gb/s NDR link tops out at about 50 GB/s), providing a strong communication foundation for large-scale distributed AI training.
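
When finished, both the ComputeDomain and the MPIJob can be removed with the manifest used to create them:

kubectl delete -f nvbandwidth-test-job.yaml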

Summary

Through this hands-on guide you have learned how to deploy and configure the NVIDIA DRA Driver on the GB200 platform to enable Multi-Node NVLink (MNNVL). The key takeaways:

  • Understand the core concepts: how DRA, IMEX, and ComputeDomains work
  • Deploy the environment: install the GPU Operator and the DRA Driver
  • Verify the setup: confirm cross-node GPU communication with the nvbandwidth test

ComputeDomains abstract the complexity of the underlying GPU hardware into Kubernetes-native resources, making multi-node distributed AI workloads simple and efficient to manage. As more NVIDIA architectures gain support, this technology will play an increasingly important role in AI infrastructure.
