DRA P1：DRA 能解决什么问题？从部署到使用的完整体验

2026-05-06 20:00:00 约 2800 字预计阅读 6 分钟 - 次阅读

在 Kubernetes 里用 GPU 这类设备，大家习惯走 DevicePlugin。但 AI workload 越来越复杂，DevicePlugin 的短板越来越明显——没法描述设备属性，调度器不参与分配，Pod 经常调度到节点后才发现资源不够。

DRA（Dynamic Resource Allocation，动态资源分配）就是 Kubernetes 针对这些问题推出的新框架。

DRA 核心功能在 Kubernetes 1.34 已 GA，可以放心用在生产环境。

1. DRA 解决什么问题？

DevicePlugin 的局限性

只能上报"数量"，无法描述设备属性
调度器不参与分配，导致调度冲突
不支持资源预留

DRA 的核心改进

设备信息通过 ResourceSlice 进入 API Server，调度器可见
精确的设备选择（通过 CEL 表达式）
支持资源预留和共享

例如：

DevicePlugin：用户申请 1 个 GPU → 调度器随机选节点 → Kubelet 本地分配 -> Pod 启动后发现显存不够 OOM（比如你想要 A100 80GB，但被调度到了只有 V100 16GB 的节点，因为调度器无法区分 GPU 型号）

DRA：用户显式申请一个"显存>40Gi 的 GPU" → 调度器精确匹配 → 预分配成功 → Pod 正常启动

DevicePlugin vs DRA 对比

2. 部署环境

2.1 前置条件

K8s 相关：
- Kubernetes v1.32 or newer.
- DRA and corresponding API groups must be enabled.
Container Runtime 相关
- CDI must be enabled in the underlying container runtime (such as containerd or CRI-O).

ps：K8s 最好 1.34+，Containerd 则是 v1.7.x 及以上。

v1.32+ 已内置（beta），v1.34 GA

2.2 GPU Operator 安装

NVIDIA 官方在推进将 DRA 也集成到 GPU Operator，不过目前还没有完成，需要分别安装。

安装 GPU Operator 时需要指定不安装 DevicePlugin，否则会和后续安装的 DRA GPU Plugin 冲突。

安装命令如下：

1
2
3
4
5
6
7
8
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
 
helm upgrade --install --wait gpu-operator \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator \
     --version=v26.3.1 \
     --set driver.enabled=true \
     --set devicePlugin.enabled=false

核心参数

1
2
# 关闭 DevicePlugin
--set devicePlugin.enabled=false

2.3 DRA 安装

NVIDIA 提供的 DRA Driver：dra-driver-nvidia-gpu ，包含了下面两个 plugin：

GPU Plugin：用于替代之前的 DevicePlugin，即：https://github.com/nvidia/k8s-device-plugin
ComputeDomain Plugin：此插件专门用于支持 Multi-Node NVLink (MNNVL)，这是 NVIDIA GB200 NVL72 等新一代系统实现节点间 GPU 高速直连的关键技术。它引入 ComputeDomain 这一抽象概念，用于在多个 Pod（可能跨节点）之间自动建立并保障一个安全、隔离的 NVLink 通信域。
- 大部分主流的 GPU 都是用不上这个功能的，默认关闭即可

安装命令如下：

1
2
3
4
5
6
7
helm upgrade --install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
    --version="25.12.0" \
    --create-namespace \
    --namespace nvidia-dra-driver-gpu \
    --set resources.gpus.enabled=true \
    --set resources.computeDomains.enabled=false \
    --set nvidiaDriverRoot=/run/nvidia/driver

注意：在 KEP 5004 正式发布之前，resources.gpus.enabled=true的 DRA 驱动无法与标准的 NVIDIA GPU DevicePlugin 同时安装。两者会因管理相同的扩展资源（nvidia.com/gpu）而发生冲突。

可以通过 --set 'gpuResourcesEnabledOverride=true' 安装。

核心参数：

1
2
3
4
5
6
# 开启 GPU 插件，关闭 computeDomains 插件
    --set resources.gpus.enabled=true \
    --set resources.computeDomains.enabled=false \
# 对于使用 GPU Operator 安装的驱动，需要通过 nvidiaDriverRoot 参数指定下目录
# /run/nvidia/driver 正好是 GPU Operator 安装的位置
    --set nvidiaDriverRoot=/run/nvidia/driver 

2.4 验证

查看 ResourceSlice

对于 DevicePlugin 来说是通过直接在 node 的 capacity 字段上展示nvidia.com/gpu 来表示该节点上有多少张 GPU。

1
2
3
4
5
6
7
8
9
root@GB200-POD2-F06-Node05:~# kubectl get node gb200-pod2-f06-node05 -oyaml|grep "capacity:" -A 7
  capacity:
    cpu: "144"
    ephemeral-storage: 1840577300Ki
    hugepages-2Mi: "0"
    hugepages-16Gi: "0"
    hugepages-512Mi: "0"
    memory: 1002716864Ki
    nvidia.com/gpu: "4"

比如上面这个节点 capacity 信息，只能看出该节点有 4 张 GPU，但是其他的：型号、显存、拓扑等信息完全无法得知。

DRA 则是有单独的 ResourceSlice 对象记录详细信息

暂时忽略这里面的 compute-domain 资源

1
2
3
4
5
6
root@GB200-POD2-F06-Node05:~# kubectl get ResourceSlice
NAME                                                    NODE                    DRIVER                      POOL                    AGE
gb200-pod2-f06-node05-compute-domain.nvidia.com-ftn4j   gb200-pod2-f06-node05   compute-domain.nvidia.com   gb200-pod2-f06-node05   18s
gb200-pod2-f06-node05-gpu.nvidia.com-j7ggv              gb200-pod2-f06-node05   gpu.nvidia.com              gb200-pod2-f06-node05   18s
gb200-pod2-f06-node06-compute-domain.nvidia.com-bjwf4   gb200-pod2-f06-node06   compute-domain.nvidia.com   gb200-pod2-f06-node06   18s
gb200-pod2-f06-node06-gpu.nvidia.com-bbv4b              gb200-pod2-f06-node06   gpu.nvidia.com              gb200-pod2-f06-node06   18s

以 gb200-pod2-f06-node05-gpu.nvidia.com-j7ggv 为例，查看详情：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
root@GB200-POD2-F06-Node05:~/lixd/dra/demo# kubectl get ResourceSlice gb200-pod2-f06-node05-gpu.nvidia.com-j7ggv -oyaml
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: gb200-pod2-f06-node05-gpu.nvidia.com-j7ggv
  ownerReferences:
  - apiVersion: v1
    controller: true
    kind: Node
    name: gb200-pod2-f06-node05
spec:
  devices:
  - attributes:
      addressingMode:
        string: ATS
      architecture:
        string: Blackwell
      brand:
        string: Nvidia
      cudaComputeCapability:
        version: 10.0.0
      driverVersion:
        version: 580.126.20
      productName:
        string: NVIDIA GB200
      type:
        string: gpu
      uuid:
        string: GPU-137d7996-cb8e-7683-e47d-9c99e6f49eb5
    capacity:
      memory:
        value: 189471Mi
    name: gpu-0
  # ... 省略 gpu-1, gpu-2, gpu-3 的类似内容
  driver: gpu.nvidia.com
  nodeName: gb200-pod2-f06-node05
  pool:
    name: gb200-pod2-f06-node05

ResourceSlice 库存表

里面记录了节点上所有 GPU 的详细信息，关键属性包括：

type: string: gpu(设备类型)
architecture: string: Blackwell(GPU架构)
brand: string: Nvidia(品牌)
productName: string: NVIDIA GB200(产品名)
cudaComputeCapability: version: 10.0.0(计算能力)
driverVersion: version: 580.126.20(驱动版本)
resource.kubernetes.io/pciBusID(PCI总线地址)

容量 (capacity)：目前只显示了 memory: 189471Mi。

查看 DeviceClass

在 ResourceSlice展示了资源的‘库存清单’后，DeviceClass则定义了如何使用这些资源的‘筛选规则’。

1
2
3
4
5
6
7
root@GB200-POD2-F06-Node05:~# kubectl get deviceclass
NAME                                        AGE
compute-domain-daemon.nvidia.com            122m
compute-domain-default-channel.nvidia.com   122m
gpu.nvidia.com                              122m
mig.nvidia.com                              122m
vfio.gpu.nvidia.com                         122m

例如 gpu.nvidia.com deviceclass 内容如下：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
root@GB200-POD2-F06-Node05:~# kubectl get DeviceClass gpu.nvidia.com -oyaml
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  annotations:
    meta.helm.sh/release-name: nvidia-dra-driver-gpu
    meta.helm.sh/release-namespace: nvidia-dra-driver-gpu
  creationTimestamp: "2026-04-18T11:10:48Z"
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
  name: gpu.nvidia.com
  resourceVersion: "51521963"
  uid: c5143f4a-0dec-4eae-9bbc-a00c4d806159
spec:
  selectors:
  - cel:
      expression: device.driver == 'gpu.nvidia.com' && device.attributes['gpu.nvidia.com'].type
        == 'gpu'

DeviceClass 中通过 CEL 定义规则

1
2
3
cel:
      expression: device.driver == 'gpu.nvidia.com' && device.attributes['gpu.nvidia.com'].type
        == 'gpu'

含义如下：

设备的驱动必须是 gpu.nvidia.com
设备的类型必须是 gpu

满足这两个条件的设备就属于这个 DeviceClass。

3. 运行第一个 DRA GPU Pod

3.1 ResourceClaimTemplate

创建 ResourceClaimTemplate 申请一个 GPU

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
  namespace: default
spec:
  spec:
    devices:
      requests:
      - name: gpu        
        exactly:
          deviceClassName: gpu.nvidia.com # 匹配前面的 DeviceClass
          allocationMode: ExactCount
          count: 1 # 数量

3.2 Pod 申请 resourceClaim

然后创建 Pod 指定使用这个 resourceClaimTemplate，通过 resourceClaimTemplateName 指定名称进行关联。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
# gpu-test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-pod
  namespace: default
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:12.1.0-base-ubuntu22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L && echo 'GPU allocation successful.' && sleep 3600"]
    # 在容器中引用Pod级别声明的资源
    resources:
      claims:
      - name: gpu-claim
  # 声明Pod需要资源，并指定使用哪个模板
  resourceClaims:
  - name: gpu-claim
    resourceClaimTemplateName: single-gpu

3.3 验证

验证调度结果

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
# 查看Pod状态和调度节点
kubectl get pod gpu-test-pod -o wide

# 查看Pod的详细事件，确认资源分配过程
kubectl describe pod gpu-test-pod

# 查看自动生成的ResourceClaim及其状态
kubectl get resourceclaim
kubectl describe resourceclaim <自动生成的claim名称>

# 查看Pod日志，确认GPU信息被正确识别
kubectl logs gpu-test-pod

Pod 状态和调度节点

1
2
3
root@GB200-POD2-F06-Node05:~/lixd/dra/demo# kubectl get pod gpu-test-pod -o wide
NAME           READY   STATUS    RESTARTS   AGE   IP              NODE                    NOMINATED NODE   READINESS GATES
gpu-test-pod   1/1     Running   0          95s   172.25.114.35   gb200-pod2-f06-node05   <none>           <none>

Pod 成功启动，并被调度器精确地分配到了拥有符合条件 GPU 的 gb200-pod2-f06-node05 节点。

resourceclaim 详情

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
root@GB200-POD2-F06-Node05:~/lixd/dra/demo# kubectl get resourceclaim
NAME                           STATE                AGE
gpu-test-pod-gpu-claim-7rqsj   allocated,reserved   37s
root@GB200-POD2-F06-Node05:~/lixd/dra/demo# kubectl describe resourceclaim gpu-test-pod-gpu-claim-7rqsj
Name:         gpu-test-pod-gpu-claim-7rqsj
Namespace:    default
Labels:       <none>
Annotations:  resource.kubernetes.io/pod-claim-name: gpu-claim
API Version:  resource.k8s.io/v1
Kind:         ResourceClaim
Metadata:
  Creation Timestamp:  2026-04-19T06:26:52Z
  Finalizers:
    resource.kubernetes.io/delete-protection
  Generate Name:  gpu-test-pod-gpu-claim-
  Owner References:
    API Version:           v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Pod
    Name:                  gpu-test-pod
    UID:                   f92cb808-a813-419c-b111-7d9e4e402f59
  Resource Version:        51788271
  UID:                     86e7f3f0-052b-4a61-8e4f-36e2020dd2fc
Spec:
  Devices:
    Requests:
      Exactly:
        Allocation Mode:    ExactCount
        Count:              1
        Device Class Name:  gpu.nvidia.com
      Name:                 gpu
Status:
  Allocation:
    Devices:
      Results:
        Device:   gpu-0
        Driver:   gpu.nvidia.com
        Pool:     gb200-pod2-f06-node05
        Request:  gpu
    Node Selector:
      Node Selector Terms:
        Match Fields:
          Key:       metadata.name
          Operator:  In
          Values:
            gb200-pod2-f06-node05
  Reserved For:
    Name:      gpu-test-pod
    Resource:  pods
    UID:       f92cb808-a813-419c-b111-7d9e4e402f59
Events:        <none>

查看 Pod 日志

查看 Pod 日志，成功分配了一个 GPU：

1
2
3
root@GB200-POD2-F06-Node05:~/lixd/dra/demo# kubectl logs gpu-test-pod
GPU 0: NVIDIA GB200 (UUID: GPU-137d7996-cb8e-7683-e47d-9c99e6f49eb5)
GPU allocation successful.

4. 小结

这篇文章带大家从零搭了一套 DRA 环境，跑通了第一个 GPU Pod。几个关键点：

ResourceSlice 是设备的"库存表"，把 GPU 型号、显存、驱动版本都暴露给调度器
DeviceClass 用 CEL 表达式定义筛选规则，决定哪些设备归哪一类
ResourceClaim / ResourceClaimTemplate 是用户侧的"申购单"，Pod 通过它拿到具体设备

和 DevicePlugin 最大的区别：调度器在做调度决策时就能看到设备的全部信息，不再盲选，可以做更精细的调度。

DRA（Dynamic Resource Allocation）的引入，标志着 Kubernetes 在 资源管理 方向的重大演进。它不仅解决了现有资源管理模型的痛点，还为复杂的异构硬件管理提供了原生支持，例如 GPU、FPGA、高性能 NIC 等等。