
Volcano vGPU in Practice: GPU Sharing and Isolation on Kubernetes Without Hardware Dependencies

https://img.lixueduan.com/kubernetes/cover/volcano-vgpu.png

In the previous post, Volcano First Look: Cloud-Native Practice with a Batch Scheduling Engine, we deployed a Volcano cluster with Helm and ran our first test job, verifying its basic scheduling capability. This post goes a step further into Volcano's GPU virtualization, focusing on how HAMi vGPU enables fine-grained GPU sharing with hard isolation.

The batch scheduling engine Volcano supports GPU virtualization; the capability is provided mainly by HAMi.

HAMi vGPU offers two GPU virtualization modes: HAMi-Core and Dynamic MIG:

| Mode | Isolation | MIG GPU Required | Annotation | Core/Memory Control | Recommended For |
|------|-----------|------------------|------------|---------------------|-----------------|
| HAMi-Core | Software (VCUDA) | No | No | Yes | General workloads |
| Dynamic MIG | Hardware | Yes | Yes | MIG-controlled | Performance-sensitive jobs |

If the hardware supports MIG and the workload is performance-sensitive, the Dynamic MIG mode is recommended; without MIG support, the more general HAMi-Core mode, which has no hardware requirements, still works.

This post uses HAMi-Core to demonstrate how HAMi vGPU integrates with Volcano.

Workflow:

  • 1) Create a cluster

  • 2) Install GPU-Operator, but without its DevicePlugin

  • 3) Install Volcano and enable the vGPU plugin

  • 4) Install volcano-vgpu-device-plugin

  • 5) Verify

1. Environment Setup

1.1 Create a Cluster

Use KubeClipper to deploy a cluster for the experiments.

Kubernetes Tutorial (11): Quickly Create a k8s Cluster with One Command Using KubeClipper

1.2 GPU-Operator

Following the earlier article GPU Environment Setup Guide: Accelerating Kubernetes GPU Environment Setup with GPU Operator, deploy the environment with GPU Operator.
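
Note that the Volcano DevicePlugin installed in step 4 replaces NVIDIA's own, so the default device plugin should be left out when installing GPU Operator. A minimal sketch, assuming the nvidia chart repo and the gpu-operator chart's devicePlugin.enabled toggle:

# Install GPU Operator with its bundled device plugin disabled
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm upgrade --install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set devicePlugin.enabled=false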

1.3 Volcano

Deploy Volcano

Install Volcano. Mind the version compatibility between Volcano and Kubernetes; see the Kubernetes compatibility section of the official README.

Here we deploy v1.12.0:

# Add the repo
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts

helm repo update

# Deploy
helm upgrade --install volcano volcano-sh/volcano --version 1.12.0 -n volcano-system --create-namespace

After deployment the Pod list looks like this:

root@node5-3:~# kubectl -n volcano-system get po
NAME                                  READY   STATUS    RESTARTS   AGE
volcano-admission-6444dd4fb7-8s8d9    1/1     Running   0          3m
volcano-controllers-75d5b78c7-llcrz   1/1     Running   0          3m
volcano-scheduler-7d46c5b5db-t2k42    1/1     Running   0          3m

Modify the Scheduler Configuration: Enable the deviceshare Plugin

Once Volcano is deployed, edit the scheduler configuration to enable the deviceshare plugin.

kubectl edit cm -n volcano-system volcano-scheduler-configmap

The full content is as follows:

kind: ConfigMap
apiVersion: v1
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: deviceshare
        arguments:
          deviceshare.VGPUEnable: true # enable vgpu
          deviceshare.SchedulePolicy: binpack  # scheduling policy. binpack / spread
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack    

The key part:

      - name: deviceshare
        arguments:
          deviceshare.VGPUEnable: true # enable vgpu
          deviceshare.SchedulePolicy: binpack  # scheduling policy. binpack / spread

We enable vGPU and choose binpack as the scheduling policy.

For HAMi's scheduling policies, see: HAMi vGPU Internals Part 4: Implementing the Spread & Binpack Advanced Scheduling Policies.

After the change no restart is needed: Volcano watches the config file and reloads it automatically on change.

// When a scheduler config is specified, watch its directory for changes.
if opt.SchedulerConf != "" {
    var err error
    path := filepath.Dir(opt.SchedulerConf)
    watcher, err = filewatcher.NewFileWatcher(path)
    if err != nil {
       return nil, fmt.Errorf("failed creating filewatcher for %s: %v", opt.SchedulerConf, err)
    }
}

// watchSchedulerConf reloads the config whenever the file is written or re-created.
func (pc *Scheduler) watchSchedulerConf(stopCh <-chan struct{}) {
    if pc.fileWatcher == nil {
       return
    }
    eventCh := pc.fileWatcher.Events()
    errCh := pc.fileWatcher.Errors()
    for {
       select {
       case event, ok := <-eventCh:
          if !ok {
             return
          }
          klog.V(4).Infof("watch %s event: %v", pc.schedulerConf, event)
          if event.Op&fsnotify.Write == fsnotify.Write || event.Op&fsnotify.Create == fsnotify.Create {
             pc.loadSchedulerConf()
             pc.cache.SetMetricsConf(pc.metricsConf)
          }
       case err, ok := <-errCh:
          if !ok {
             return
          }
          klog.Infof("watch %s error: %v", pc.schedulerConf, err)
       case <-stopCh:
          return
       }
    }
}

That said, Kubernetes propagates ConfigMap updates into Pods with some delay; if you would rather not wait, restart the scheduler manually:

kubectl -n volcano-system rollout restart deploy volcano-scheduler

1.4 volcano-vgpu-device-plugin

Next, deploy the DevicePlugin used for the Volcano integration: volcano-vgpu-device-plugin.

For DevicePlugin internals, see: HAMi vGPU Internals Part 1: The hami-device-plugin-nvidia Implementation; the overall logic is much the same.

Deploy the DevicePlugin

Fetch volcano-vgpu-device-plugin.yml from the root of the volcano-vgpu-device-plugin project:

wget https://raw.githubusercontent.com/Project-HAMi/volcano-vgpu-device-plugin/main/volcano-vgpu-device-plugin.yml

Deploy it:

kubectl apply -f volcano-vgpu-device-plugin.yml

Check the Pod list:

root@node5-3:~# kubectl -n kube-system get po
volcano-device-plugin-xkwzd                2/2     Running   0          10m

Verify Node Resources

Check the resource information on the Node:

root@node5-3:~/lixd# k describe node node5-3 |grep Cap -A 10
Capacity:
  cpu:                     160
  ephemeral-storage:       3750157048Ki
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  2113442544Ki
  nvidia.com/gpu:          0
  pods:                    110
  volcano.sh/vgpu-cores:   800
  volcano.sh/vgpu-memory:  39312
  volcano.sh/vgpu-number:  80

Volcano adds the following three resources:

  volcano.sh/vgpu-cores:   800
  volcano.sh/vgpu-memory:  39312
  volcano.sh/vgpu-number:  80

  • volcano.sh/vgpu-cores: 800: each GPU provides 100 cores, and 8 cards give exactly 800.

  • volcano.sh/vgpu-memory: 39312: since factor=10 is set, this actually represents 39312 * 10 = 393120 MB of total GPU memory.

    • The environment is 8x L40S with 49140 MB per card; 49140 * 8 = 393120, an exact match, so everything is in order.

  • volcano.sh/vgpu-number: 80: the default --device-split-count=10 multiplies the GPU count by 10.

This confirms the plugin deployed successfully.
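
To pull just these vGPU values for a quick check:

# Allocatable vGPU resources at a glance
kubectl describe node node5-3 | grep volcano.sh/vgpu

# Or a single value via jsonpath (dots in the key escaped)
kubectl get node node5-3 -o jsonpath='{.status.capacity.volcano\.sh/vgpu-memory}'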

2. Basic Usage

Start a Pod

First start a simple Pod:

apiVersion: v1
kind: Pod
metadata:
  name: test1
spec:
  restartPolicy: OnFailure
  schedulerName: volcano
  containers:
  - image: ubuntu:24.04
    name: pod1-ctr
    command: ["sleep"]
    args: ["100000"]
    resources:
      limits:
        volcano.sh/vgpu-memory: 1024
        volcano.sh/vgpu-number: 1

Check the effect: vgpu-memory was requested as 1024, but because factor=10 the actual limit is 10240 MB, as shown below:

root@node5-3:~/lixd# k exec -it test1 -- nvidia-smi
[HAMI-core Msg(16:140249737447232:libvgpu.c:838)]: Initializing.....
Tue Jul 22 13:52:58 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L40S                    On  | 00000000:91:00.0 Off |                  Off |
| N/A   28C    P8              34W / 350W |      0MiB / 10240MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
[HAMI-core Msg(16:140249737447232:multiprocess_memory_limit.c:499)]: Calling exit handler 16
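
Compute can be capped as well, via volcano.sh/vgpu-cores. A minimal sketch with assumed names and values (as the source analysis in section 4 shows, the cores request reaches the container as CUDA_DEVICE_SM_LIMIT):

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: test2
spec:
  restartPolicy: OnFailure
  schedulerName: volcano
  containers:
  - image: ubuntu:24.04
    name: pod2-ctr
    command: ["sleep"]
    args: ["100000"]
    resources:
      limits:
        volcano.sh/vgpu-number: 1
        volcano.sh/vgpu-memory: 1024
        volcano.sh/vgpu-cores: 50   # cap at 50% of one card's compute
EOF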

Start a Volcano Job

Try a simple Volcano Job:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: simple-vgpu-training
spec:
  schedulerName: volcano
  minAvailable: 3  # gang scheduling: ensure all 3 Pods start together
  
  tasks:
    - name: worker
      replicas: 3  # start 3 workers
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: python-trainer
            image: python:3.9-slim
            command: ["python", "-c"]
            args:
              - |
                # simplified training code
                import os
                import time
                
                worker_id = os.getenv("VC_TASK_INDEX", "0")
                print(f"Worker {worker_id} started with vGPU")
                
                # simulate the training loop
                for epoch in range(1, 10):
                    time.sleep(6)
                    print(f"Worker {worker_id} completed epoch {epoch}")
                
                print(f"Worker {worker_id} finished training!")                
            env:
              # task index (0,1,...)
              - name: VC_TASK_INDEX
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.annotations['volcano.sh/task-index']
            resources:
              limits:
                volcano.sh/vgpu-memory: 1024  # 1024 MB of GPU memory per worker
                volcano.sh/vgpu-number: 1     # one vGPU per worker
                cpu: "1"
                memory: "1Gi"

Everything runs fine:

root@node5-3:~/lixd# k get po -w
NAME                            READY   STATUS      RESTARTS   AGE
simple-vgpu-training-worker-0   1/1     Running     0          5s
simple-vgpu-training-worker-1   1/1     Running     0          5s
simple-vgpu-training-worker-2   1/1     Running     0          5s
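
Gang scheduling itself can be confirmed through the PodGroup Volcano creates for the Job (a quick check; names and phases will vary with your cluster):

kubectl get podgroup
# the PodGroup derived from the Job should report minMember=3 and phase Running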

Check the GPU info inside a Pod:

root@node5-3:~/lixd# k exec -it simple-vgpu-training-worker-0 -- nvidia-smi
[HAMI-core Msg(7:140498435086144:libvgpu.c:838)]: Initializing.....
Tue Jul 22 15:02:36 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L40S                    On  | 00000000:91:00.0 Off |                  Off |
| N/A   28C    P8              34W / 350W |      0MiB / 10240MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
[HAMI-core Msg(7:140498435086144:multiprocess_memory_limit.c:499)]: Calling exit handler 7

Logs:

root@node5-3:~/lixd# k logs -f simple-vgpu-training-worker-2
Worker 2 started with vGPU
Worker 2 completed epoch 1
Worker 2 completed epoch 2
Worker 2 completed epoch 3
Worker 2 completed epoch 4
Worker 2 completed epoch 5
Worker 2 completed epoch 6
Worker 2 completed epoch 7
Worker 2 completed epoch 8
Worker 2 completed epoch 9
Worker 2 finished training!

3. Monitoring

Scheduler Metrics

curl {volcano scheduler cluster ip}:8080/metrics
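
If the cluster IP is not at hand, port-forwarding to the scheduler works just as well (Deployment name as seen in section 1.3):

kubectl -n volcano-system port-forward deploy/volcano-scheduler 8080:8080
curl -s http://127.0.0.1:8080/metrics | grep volcano_vgpu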

The metrics include GPU core and memory allocations per card, plus the per-Pod breakdown, for example:

# HELP volcano_vgpu_device_allocated_cores The percentage of gpu compute cores allocated in this card
# TYPE volcano_vgpu_device_allocated_cores gauge
volcano_vgpu_device_allocated_cores{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad"} 50

# HELP volcano_vgpu_device_allocated_memory The number of vgpu memory allocated in this card
# TYPE volcano_vgpu_device_allocated_memory gauge
volcano_vgpu_device_allocated_memory{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad"} 2048

# HELP volcano_vgpu_device_core_allocation_for_a_certain_pod The vgpu device core allocated for a certain pod
# TYPE volcano_vgpu_device_core_allocation_for_a_certain_pod gauge
volcano_vgpu_device_core_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-0"} 10
volcano_vgpu_device_core_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-1"} 10
volcano_vgpu_device_core_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-2"} 10
volcano_vgpu_device_core_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-3"} 10
volcano_vgpu_device_core_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-4"} 10

# HELP volcano_vgpu_device_memory_allocation_for_a_certain_pod The vgpu device memory allocated for a certain pod
# TYPE volcano_vgpu_device_memory_allocation_for_a_certain_pod gauge
volcano_vgpu_device_memory_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-0"} 128
volcano_vgpu_device_memory_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-1"} 128
volcano_vgpu_device_memory_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-2"} 128
volcano_vgpu_device_memory_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-3"} 128
volcano_vgpu_device_memory_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-4"} 128

# HELP volcano_vgpu_device_memory_limit The number of total device memory in this card
# TYPE volcano_vgpu_device_memory_limit gauge
volcano_vgpu_device_memory_limit{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad"} 4914

# HELP volcano_vgpu_device_shared_number The number of vgpu tasks sharing this card
# TYPE volcano_vgpu_device_shared_number gauge
volcano_vgpu_device_shared_number{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad"} 5

Device Plugin Metrics

Query port 9394 of the DevicePlugin Pod directly for its metrics:

curl http://<plugin-pod-ip>:9394/metrics
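
The Pod IP can be read from the wide Pod listing (matching the Pod name from section 1.4):

kubectl -n kube-system get po -o wide | grep volcano-device-plugin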

This shows GPU usage on the node; the metrics look like this:

# HELP Device_last_kernel_of_container Container device last kernel description
# TYPE Device_last_kernel_of_container gauge
Device_last_kernel_of_container{ctrname="pod1-ctr",deviceuuid="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podname="test1",podnamespace="default",vdeviceid="0",zone="vGPU"} 257114
# HELP Device_memory_desc_of_container Container device meory description
# TYPE Device_memory_desc_of_container counter
Device_memory_desc_of_container{context="0",ctrname="pod1-ctr",data="0",deviceuuid="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",module="0",offset="0",podname="test1",podnamespace="default",vdeviceid="0",zone="vGPU"} 0
# HELP Device_utilization_desc_of_container Container device utilization description
# TYPE Device_utilization_desc_of_container gauge
Device_utilization_desc_of_container{ctrname="pod1-ctr",deviceuuid="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podname="test1",podnamespace="default",vdeviceid="0",zone="vGPU"} 0
# HELP HostCoreUtilization GPU core utilization
# TYPE HostCoreUtilization gauge
HostCoreUtilization{deviceidx="0",deviceuuid="GPU-a11fe6d9-3dbe-8a24-34e9-535b2629babd",zone="vGPU"} 0
HostCoreUtilization{deviceidx="1",deviceuuid="GPU-b82090de-5250-44e2-a5ed-b0efc5763f8f",zone="vGPU"} 0
HostCoreUtilization{deviceidx="2",deviceuuid="GPU-8f563a66-d507-583f-59f1-46c2e97a393c",zone="vGPU"} 0
HostCoreUtilization{deviceidx="3",deviceuuid="GPU-1e5a0632-4332-f4d0-adf2-80ebfed56684",zone="vGPU"} 0
HostCoreUtilization{deviceidx="4",deviceuuid="GPU-384027fd-54f2-638b-cdfe-0d5f3b6630f5",zone="vGPU"} 0
HostCoreUtilization{deviceidx="5",deviceuuid="GPU-dbb95093-0147-7b3a-f468-8a3575a8dd4e",zone="vGPU"} 0
HostCoreUtilization{deviceidx="6",deviceuuid="GPU-f3eb6e71-e90a-bfc9-de06-dff90c3093b9",zone="vGPU"} 0
HostCoreUtilization{deviceidx="7",deviceuuid="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",zone="vGPU"} 0
# HELP HostGPUMemoryUsage GPU device memory usage
# TYPE HostGPUMemoryUsage gauge
HostGPUMemoryUsage{deviceidx="0",deviceuuid="GPU-a11fe6d9-3dbe-8a24-34e9-535b2629babd",zone="vGPU"} 5.14326528e+08
HostGPUMemoryUsage{deviceidx="1",deviceuuid="GPU-b82090de-5250-44e2-a5ed-b0efc5763f8f",zone="vGPU"} 5.14326528e+08
HostGPUMemoryUsage{deviceidx="2",deviceuuid="GPU-8f563a66-d507-583f-59f1-46c2e97a393c",zone="vGPU"} 5.14326528e+08
HostGPUMemoryUsage{deviceidx="3",deviceuuid="GPU-1e5a0632-4332-f4d0-adf2-80ebfed56684",zone="vGPU"} 5.14326528e+08
HostGPUMemoryUsage{deviceidx="4",deviceuuid="GPU-384027fd-54f2-638b-cdfe-0d5f3b6630f5",zone="vGPU"} 5.14326528e+08
HostGPUMemoryUsage{deviceidx="5",deviceuuid="GPU-dbb95093-0147-7b3a-f468-8a3575a8dd4e",zone="vGPU"} 5.14326528e+08
HostGPUMemoryUsage{deviceidx="6",deviceuuid="GPU-f3eb6e71-e90a-bfc9-de06-dff90c3093b9",zone="vGPU"} 5.14326528e+08
HostGPUMemoryUsage{deviceidx="7",deviceuuid="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",zone="vGPU"} 5.14326528e+08
# HELP vGPU_device_memory_limit_in_bytes vGPU device limit
# TYPE vGPU_device_memory_limit_in_bytes gauge
vGPU_device_memory_limit_in_bytes{ctrname="pod1-ctr",deviceuuid="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podname="test1",podnamespace="default",vdeviceid="0",zone="vGPU"} 1.073741824e+10
# HELP vGPU_device_memory_usage_in_bytes vGPU device usage
# TYPE vGPU_device_memory_usage_in_bytes gauge
vGPU_device_memory_usage_in_bytes{ctrname="pod1-ctr",deviceuuid="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podname="test1",podnamespace="default",vdeviceid="0",zone="vGPU"} 0

4. Source Code Analysis

4.1 DevicePlugin

https://github.com/Project-HAMi/volcano-vgpu-device-plugin

Resource Registration

A DevicePlugin normally registers exactly one resource, yet the Volcano DevicePlugin registers three. How?

root@node5-3:~/lixd# k describe node node5-3 |grep Cap -A 10
Capacity:
  cpu:                     160
  ephemeral-storage:       3750157048Ki
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  2113442544Ki
  nvidia.com/gpu:          0
  pods:                    110
  volcano.sh/vgpu-cores:   800
  volcano.sh/vgpu-memory:  39312
  volcano.sh/vgpu-number:  80

Internally, the Volcano DevicePlugin starts three DevicePlugins, one per ResourceName:

  • volcano.sh/vgpu-number

  • volcano.sh/vgpu-memory

  • volcano.sh/vgpu-cores

func (s *migStrategyNone) GetPlugins(cfg *config.NvidiaConfig, cache *DeviceCache) []*NvidiaDevicePlugin {
    return []*NvidiaDevicePlugin{
       NewNvidiaDevicePlugin(
          //"nvidia.com/gpu",
          util.ResourceName,
          cache,
          gpuallocator.NewBestEffortPolicy(),
          pluginapi.DevicePluginPath+"nvidia-gpu.sock",
          cfg),
       NewNvidiaDevicePlugin(
          util.ResourceMem,
          cache,
          gpuallocator.NewBestEffortPolicy(),
          pluginapi.DevicePluginPath+"nvidia-gpu-memory.sock",
          cfg),
       NewNvidiaDevicePlugin(
          util.ResourceCores,
          cache,
          gpuallocator.NewBestEffortPolicy(),
          pluginapi.DevicePluginPath+"nvidia-gpu-cores.sock",
          cfg),
    }
}

The corresponding sock files:

root@node5-3:/var/lib/kubelet/device-plugins# ls /var/lib/kubelet/device-plugins
DEPRECATION  kubelet.sock  kubelet_internal_checkpoint  nvidia-gpu-cores.sock  nvidia-gpu-memory.sock  nvidia-gpu.sock

Device enumeration also branches on the ResourceName:

func (m *NvidiaDevicePlugin) apiDevices() []*pluginapi.Device {
    if strings.Compare(m.migStrategy, "mixed") == 0 {
       var pdevs []*pluginapi.Device
       for _, d := range m.cachedDevices {
          pdevs = append(pdevs, &d.Device)
       }
       return pdevs
    }
    devices := m.Devices()
    var res []*pluginapi.Device

    if strings.Compare(m.resourceName, util.ResourceMem) == 0 {
       for _, dev := range devices {
          i := 0
          klog.Infoln("memory=", dev.Memory, "id=", dev.ID)
          for i < int(32767) {
             res = append(res, &pluginapi.Device{
                ID:       fmt.Sprintf("%v-memory-%v", dev.ID, i),
                Health:   dev.Health,
                Topology: nil,
             })
             i++
          }
       }
       klog.Infoln("res length=", len(res))
       return res
    }
    if strings.Compare(m.resourceName, util.ResourceCores) == 0 {
       for _, dev := range devices {
          i := 0
          for i < 100 {
             res = append(res, &pluginapi.Device{
                ID:       fmt.Sprintf("%v-core-%v", dev.ID, i),
                Health:   dev.Health,
                Topology: nil,
             })
             i++
          }
       }
       return res
    }

    for _, dev := range devices {
       for i := uint(0); i < config.DeviceSplitCount; i++ {
          id := fmt.Sprintf("%v-%v", dev.ID, i)
          res = append(res, &pluginapi.Device{
             ID:       id,
             Health:   dev.Health,
             Topology: nil,
          })
       }
    }
    return res
}
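
Note how the memory resource is advertised: one pseudo-device per memory unit, bounded by the 32767 cap in the loop above. This is exactly where the FAQ's --gpu-memory-factor comes in: at factor 1 each entry represents 1 MB, so a 49140 MB L40S cannot be represented within the bound, while at factor 10 the same card needs only 4914 entries.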

Allocate

The actual device allocation logic:

// Allocate which return list of devices.
func (m *NvidiaDevicePlugin) Allocate(ctx context.Context, reqs *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
    if len(reqs.ContainerRequests) > 1 {
       return &pluginapi.AllocateResponse{}, errors.New("multiple Container Requests not supported")
    }
    if strings.Compare(m.migStrategy, "mixed") == 0 {
       return m.MIGAllocate(ctx, reqs)
    }
    responses := pluginapi.AllocateResponse{}

    if strings.Compare(m.resourceName, util.ResourceMem) == 0 || strings.Compare(m.resourceName, util.ResourceCores) == 0 {
       for range reqs.ContainerRequests {
          responses.ContainerResponses = append(responses.ContainerResponses, &pluginapi.ContainerAllocateResponse{})
       }
       return &responses, nil
    }
    nodename := os.Getenv("NODE_NAME")

    current, err := util.GetPendingPod(nodename)
    if err != nil {
       lock.ReleaseNodeLock(nodename, util.VGPUDeviceName)
       return &pluginapi.AllocateResponse{}, err
    }
    if current == nil {
       klog.Errorf("no pending pod found on node %s", nodename)
       lock.ReleaseNodeLock(nodename, util.VGPUDeviceName)
       return &pluginapi.AllocateResponse{}, errors.New("no pending pod found on node")
    }

    for idx := range reqs.ContainerRequests {
       currentCtr, devreq, err := util.GetNextDeviceRequest(util.NvidiaGPUDevice, *current)
       klog.Infoln("deviceAllocateFromAnnotation=", devreq)
       if err != nil {
          klog.Errorln("get device from annotation failed", err.Error())
          util.PodAllocationFailed(nodename, current)
          return &pluginapi.AllocateResponse{}, err
       }
       if len(devreq) != len(reqs.ContainerRequests[idx].DevicesIDs) {
          klog.Errorln("device number not matched", devreq, reqs.ContainerRequests[idx].DevicesIDs)
          util.PodAllocationFailed(nodename, current)
          return &pluginapi.AllocateResponse{}, errors.New("device number not matched")
       }

       response := pluginapi.ContainerAllocateResponse{}
       response.Envs = make(map[string]string)
       response.Envs["NVIDIA_VISIBLE_DEVICES"] = strings.Join(m.GetContainerDeviceStrArray(devreq), ",")

       err = util.EraseNextDeviceTypeFromAnnotation(util.NvidiaGPUDevice, *current)
       if err != nil {
          klog.Errorln("Erase annotation failed", err.Error())
          util.PodAllocationFailed(nodename, current)
          return &pluginapi.AllocateResponse{}, err
       }

       if m.operatingMode != "mig" {

          for i, dev := range devreq {
             limitKey := fmt.Sprintf("CUDA_DEVICE_MEMORY_LIMIT_%v", i)
             response.Envs[limitKey] = fmt.Sprintf("%vm", dev.Usedmem*int32(config.GPUMemoryFactor))
          }
          response.Envs["CUDA_DEVICE_SM_LIMIT"] = fmt.Sprint(devreq[0].Usedcores)
          response.Envs["CUDA_DEVICE_MEMORY_SHARED_CACHE"] = fmt.Sprintf("/tmp/vgpu/%v.cache", uuid.NewUUID())

          cacheFileHostDirectory := "/tmp/vgpu/containers/" + string(current.UID) + "_" + currentCtr.Name
          os.MkdirAll(cacheFileHostDirectory, 0777)
          os.Chmod(cacheFileHostDirectory, 0777)
          os.MkdirAll("/tmp/vgpulock", 0777)
          os.Chmod("/tmp/vgpulock", 0777)
          hostHookPath := os.Getenv("HOOK_PATH")

          response.Mounts = append(response.Mounts,
             &pluginapi.Mount{ContainerPath: "/usr/local/vgpu/libvgpu.so",
                HostPath: hostHookPath + "/libvgpu.so",
                ReadOnly: true},
             &pluginapi.Mount{ContainerPath: "/tmp/vgpu",
                HostPath: cacheFileHostDirectory,
                ReadOnly: false},
             &pluginapi.Mount{ContainerPath: "/tmp/vgpulock",
                HostPath: "/tmp/vgpulock",
                ReadOnly: false},
          )
          found := false
          for _, val := range currentCtr.Env {
             if strings.Compare(val.Name, "CUDA_DISABLE_CONTROL") == 0 {
                found = true
                break
             }
          }
          if !found {
             response.Mounts = append(response.Mounts, &pluginapi.Mount{ContainerPath: "/etc/ld.so.preload",
                HostPath: hostHookPath + "/ld.so.preload",
                ReadOnly: true},
             )
          }
       }
       responses.ContainerResponses = append(responses.ContainerResponses, &response)
    }
    klog.Infoln("Allocate Response", responses.ContainerResponses)
    util.PodAllocationTrySuccess(nodename, current)
    return &responses, nil
}

The core part sets the environment variables and mounts libvgpu.so, among other things:

for i, dev := range devreq {
    limitKey := fmt.Sprintf("CUDA_DEVICE_MEMORY_LIMIT_%v", i)
    response.Envs[limitKey] = fmt.Sprintf("%vm", dev.Usedmem*int32(config.GPUMemoryFactor))
}
response.Envs["CUDA_DEVICE_SM_LIMIT"] = fmt.Sprint(devreq[0].Usedcores)
response.Envs["CUDA_DEVICE_MEMORY_SHARED_CACHE"] = fmt.Sprintf("/tmp/vgpu/%v.cache", uuid.NewUUID())

response.Mounts = append(response.Mounts,
    &pluginapi.Mount{ContainerPath: "/usr/local/vgpu/libvgpu.so",
       HostPath: hostHookPath + "/libvgpu.so",
       ReadOnly: true},
    &pluginapi.Mount{ContainerPath: "/tmp/vgpu",
       HostPath: cacheFileHostDirectory,
       ReadOnly: false},
    &pluginapi.Mount{ContainerPath: "/tmp/vgpulock",
       HostPath: "/tmp/vgpulock",
       ReadOnly: false},
)
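
With the test1 Pod from section 2 still running, these settings can be verified inside the container (given the earlier request of vgpu-memory: 1024 and factor=10, the memory limit should read 10240m):

kubectl exec -it test1 -- env | grep CUDA_DEVICE
kubectl exec -it test1 -- ls -l /usr/local/vgpu/libvgpu.so /etc/ld.so.preload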

Since three DevicePlugins are running, Allocate guards against duplicate work by checking the ResourceName: only the volcano.sh/vgpu-number plugin performs the real allocation.

if strings.Compare(m.resourceName, util.ResourceMem) == 0 || strings.Compare(m.resourceName, util.ResourceCores) == 0 {
    for range reqs.ContainerRequests {
       responses.ContainerResponses = append(responses.ContainerResponses, &pluginapi.ContainerAllocateResponse{})
    }
    return &responses, nil
}

4.2 The deviceshare Plugin

https://github.com/volcano-sh/volcano/blob/master/pkg/scheduler/plugins/deviceshare/deviceshare.go

A quick look at the deviceshare plugin in Volcano.

This part is essentially the same as the HAMi implementation; the HAMi vGPU internals articles referenced earlier cover the same ground.

Every plugin must implement the Plugin interface defined by Volcano:

type Plugin interface {
    // The unique name of Plugin.
    Name() string

    OnSessionOpen(ssn *Session)
    OnSessionClose(ssn *Session)
}

The core code lives in the OnSessionOpen implementation, which registers two scheduling callbacks:

  • Predicate

  • NodeOrder

func (dp *deviceSharePlugin) OnSessionOpen(ssn *framework.Session) {
    // Register event handlers to update task info in PodLister & nodeMap
    ssn.AddPredicateFn(dp.Name(), func(task *api.TaskInfo, node *api.NodeInfo) error {
       predicateStatus := make([]*api.Status, 0)
       // Check PredicateWithCache
       for _, val := range api.RegisteredDevices {
          if dev, ok := node.Others[val].(api.Devices); ok {
             if reflect.ValueOf(dev).IsNil() {
                // TODO When a pod requests a device of the current type, but the current node does not have such a device, an error is thrown
                if dev == nil || dev.HasDeviceRequest(task.Pod) {
                   predicateStatus = append(predicateStatus, &api.Status{
                      Code:   devices.Unschedulable,
                      Reason: "node not initialized with device" + val,
                      Plugin: PluginName,
                   })
                   return api.NewFitErrWithStatus(task, node, predicateStatus...)
                }
                klog.V(4).Infof("pod %s/%s did not request device %s on %s, skipping it", task.Pod.Namespace, task.Pod.Name, val, node.Name)
                continue
             }
             code, msg, err := dev.FilterNode(task.Pod, dp.schedulePolicy)
             if err != nil {
                predicateStatus = append(predicateStatus, createStatus(code, msg))
                return api.NewFitErrWithStatus(task, node, predicateStatus...)
             }
             filterNodeStatus := createStatus(code, msg)
             if filterNodeStatus.Code != api.Success {
                predicateStatus = append(predicateStatus, filterNodeStatus)
                return api.NewFitErrWithStatus(task, node, predicateStatus...)
             }
          } else {
             klog.Warningf("Devices %s assertion conversion failed, skip", val)
          }
       }

       klog.V(4).Infof("checkDevices predicates Task <%s/%s> on Node <%s>: fit ",
          task.Namespace, task.Name, node.Name)

       return nil
    })

    ssn.AddNodeOrderFn(dp.Name(), func(task *api.TaskInfo, node *api.NodeInfo) (float64, error) {
       // DeviceScore
       nodeScore := float64(0)
       if dp.scheduleWeight > 0 {
          score, status := getDeviceScore(context.TODO(), task.Pod, node, dp.schedulePolicy)
          if !status.IsSuccess() {
             klog.Warningf("Node: %s, Calculate Device Score Failed because of Error: %v", node.Name, status.AsError())
             return 0, status.AsError()
          }

          // TODO: we should use a seperate plugin for devices, and seperate them from predicates and nodeOrder plugin.
          nodeScore = float64(score) * float64(dp.scheduleWeight)
          klog.V(5).Infof("Node: %s, task<%s/%s> Device Score weight %d, score: %f", node.Name, task.Namespace, task.Name, dp.scheduleWeight, nodeScore)
       }
       return nodeScore, nil
    })
}

It implements two pieces of scheduling logic: node filtering and node scoring.

Predicate

Filter out nodes that cannot satisfy the device request:

ssn.AddPredicateFn(dp.Name(), func(task *api.TaskInfo, node *api.NodeInfo) error {
    predicateStatus := make([]*api.Status, 0)
    // Check PredicateWithCache
    for _, val := range api.RegisteredDevices {
       if dev, ok := node.Others[val].(api.Devices); ok {
          if reflect.ValueOf(dev).IsNil() {
             // TODO When a pod requests a device of the current type, but the current node does not have such a device, an error is thrown
             if dev == nil || dev.HasDeviceRequest(task.Pod) {
                predicateStatus = append(predicateStatus, &api.Status{
                   Code:   devices.Unschedulable,
                   Reason: "node not initialized with device" + val,
                   Plugin: PluginName,
                })
                return api.NewFitErrWithStatus(task, node, predicateStatus...)
             }
             klog.V(4).Infof("pod %s/%s did not request device %s on %s, skipping it", task.Pod.Namespace, task.Pod.Name, val, node.Name)
             continue
          }
          code, msg, err := dev.FilterNode(task.Pod, dp.schedulePolicy)
          if err != nil {
             predicateStatus = append(predicateStatus, createStatus(code, msg))
             return api.NewFitErrWithStatus(task, node, predicateStatus...)
          }
          filterNodeStatus := createStatus(code, msg)
          if filterNodeStatus.Code != api.Success {
             predicateStatus = append(predicateStatus, filterNodeStatus)
             return api.NewFitErrWithStatus(task, node, predicateStatus...)
          }
       } else {
          klog.Warningf("Devices %s assertion conversion failed, skip", val)
       }
    }

    klog.V(4).Infof("checkDevices predicates Task <%s/%s> on Node <%s>: fit ",
       task.Namespace, task.Name, node.Name)

    return nil
})

The core logic is in the FilterNode method:

code, msg, err := dev.FilterNode(task.Pod, dp.schedulePolicy)
if err != nil {
    predicateStatus = append(predicateStatus, createStatus(code, msg))
    return api.NewFitErrWithStatus(task, node, predicateStatus...)
}

func (gs *GPUDevices) FilterNode(pod *v1.Pod, schedulePolicy string) (int, string, error) {
    if VGPUEnable {
       klog.V(4).Infoln("hami-vgpu DeviceSharing starts filtering pods", pod.Name)
       fit, _, score, err := checkNodeGPUSharingPredicateAndScore(pod, gs, true, schedulePolicy)
       if err != nil || !fit {
          klog.ErrorS(err, "Failed to fitler node to vgpu task", "pod", pod.Name)
          return devices.Unschedulable, "hami-vgpuDeviceSharing error", err
       }
       gs.Score = score
       klog.V(4).Infoln("hami-vgpu DeviceSharing successfully filters pods")
    }
    return devices.Success, "", nil
}

Nodes that do not qualify are filtered out, and the remaining ones are scored.

Node Filtering

The node is checked for sufficient core and memory resources; nodes that fall short are filtered out.

ctrdevs := []ContainerDevices{}
for _, val := range ctrReq {
    devs := []ContainerDevice{}
    if int(val.Nums) > len(gs.Device) {
       return false, []ContainerDevices{}, 0, fmt.Errorf("no enough gpu cards on node %s", gs.Name)
    }
    klog.V(3).InfoS("Allocating device for container", "request", val)

    for i := len(gs.Device) - 1; i >= 0; i-- {
       klog.V(3).InfoS("Scoring pod request", "memReq", val.Memreq, "memPercentageReq", val.MemPercentagereq, "coresReq", val.Coresreq, "Nums", val.Nums, "Index", i, "ID", gs.Device[i].ID)
       klog.V(3).InfoS("Current Device", "Index", i, "TotalMemory", gs.Device[i].Memory, "UsedMemory", gs.Device[i].UsedMem, "UsedCores", gs.Device[i].UsedCore, "replicate", replicate)
       if gs.Device[i].Number <= uint(gs.Device[i].UsedNum) {
          continue
       }
       if val.MemPercentagereq != 101 && val.Memreq == 0 {
          val.Memreq = gs.Device[i].Memory * uint(val.MemPercentagereq/100)
       }
       if int(gs.Device[i].Memory)-int(gs.Device[i].UsedMem) < int(val.Memreq) {
          continue
       }
       if gs.Device[i].UsedCore+val.Coresreq > 100 {
          continue
       }
       // Coresreq=100 indicates it want this card exclusively
       if val.Coresreq == 100 && gs.Device[i].UsedNum > 0 {
          continue
       }
       // You can't allocate core=0 job to an already full GPU
       if gs.Device[i].UsedCore == 100 && val.Coresreq == 0 {
          continue
       }
       if !checkType(pod.Annotations, *gs.Device[i], val) {
          klog.Errorln("failed checktype", gs.Device[i].Type, val.Type)
          continue
       }
       fit, uuid := gs.Sharing.TryAddPod(gs.Device[i], uint(val.Memreq), uint(val.Coresreq))
       if !fit {
          klog.V(3).Info(gs.Device[i].ID, "not fit")
          continue
       }
       //total += gs.Devices[i].Count
       //free += node.Devices[i].Count - node.Devices[i].Used
       if val.Nums > 0 {
          val.Nums--
          klog.V(3).Info("fitted uuid: ", uuid)
          devs = append(devs, ContainerDevice{
             UUID:      uuid,
             Type:      val.Type,
             Usedmem:   val.Memreq,
             Usedcores: val.Coresreq,
          })
          score += GPUScore(schedulePolicy, gs.Device[i])
       }
       if val.Nums == 0 {
          break
       }
    }
    if val.Nums > 0 {
       return false, []ContainerDevices{}, 0, fmt.Errorf("not enough gpu fitted on this node")
    }
    ctrdevs = append(ctrdevs, devs)
}

Node Scoring

Scoring follows the configured scheduling policy.

const (
    binpackMultiplier = 100
    spreadMultiplier  = 100
)

func GPUScore(schedulePolicy string, device *GPUDevice) float64 {
    var score float64
    switch schedulePolicy {
    case binpackPolicy:
       score = binpackMultiplier * (float64(device.UsedMem) / float64(device.Memory))
    case spreadPolicy:
       if device.UsedNum == 1 {
          score = spreadMultiplier
       }
    default:
       score = float64(0)
    }
    return score
}

The logic is straightforward (a worked example follows the list):

  • Binpack: the higher the device's memory utilization, the higher its score.

  • Spread: a device already in shared use scores 100, otherwise 0.
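
A quick worked example for binpack: a card with 10240 of 49140 MB allocated scores 100 * 10240 / 49140 ≈ 20.8, while an idle card scores 0, so new pods keep packing onto the already-used card.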

NodeOrder

The previous step already scored each node; here the score is simply weighted and returned for node ordering.

    ssn.AddNodeOrderFn(dp.Name(), func(task *api.TaskInfo, node *api.NodeInfo) (float64, error) {
       // DeviceScore
       nodeScore := float64(0)
       if dp.scheduleWeight > 0 {
          score, status := getDeviceScore(context.TODO(), task.Pod, node, dp.schedulePolicy)
          if !status.IsSuccess() {
             klog.Warningf("Node: %s, Calculate Device Score Failed because of Error: %v", node.Name, status.AsError())
             return 0, status.AsError()
          }

          // TODO: we should use a seperate plugin for devices, and seperate them from predicates and nodeOrder plugin.
          nodeScore = float64(score) * float64(dp.scheduleWeight)
          klog.V(5).Infof("Node: %s, task<%s/%s> Device Score weight %d, score: %f", node.Name, task.Namespace, task.Name, dp.scheduleWeight, nodeScore)
       }
       return nodeScore, nil
    })
}

The key line:

nodeScore = float64(score) * float64(dp.scheduleWeight)

The final score is the node score multiplied by the plugin weight.

5. FAQ

gpu-memory shows 0

Symptom: after the device plugin is deployed, gpu-memory shows 0, like this:

root@node5-3:~/lixd# k describe node node5-3 |grep Cap -A 10
Capacity:
  cpu:                     160
  ephemeral-storage:       3750157048Ki
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  2113442544Ki
  nvidia.com/gpu:          0
  pods:                    110
  volcano.sh/vgpu-cores:   800
  volcano.sh/vgpu-memory:  0
  volcano.sh/vgpu-number:  80

Root cause: https://github.com/volcano-sh/devices/issues/19

Relevant description from the issue:

the size of device list exceeds the bound, and ListAndWatch failed as a result.

In short, once the GPU memory exceeds the threshold the device list grows too large, ListAndWatch fails, the DevicePlugin cannot report properly, and the value shows as 0.

Fix: start the plugin with --gpu-memory-factor=10, changing the smallest memory block from the default 1 MB to 10 MB, like this:

      containers:
      - image: docker.io/projecthami/volcano-vgpu-device-plugin:v1.10.0
        args: ["--device-split-count=10","--gpu-memory-factor=10"]

This scales the largest representable value by 10x, avoiding the problem.

The result:

root@node5-3:~/lixd# k describe node node5-3 |grep Cap -A 10
Capacity:
  cpu:                     160
  ephemeral-storage:       3750157048Ki
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  2113442544Ki
  nvidia.com/gpu:          0
  pods:                    110
  volcano.sh/vgpu-cores:   800
  volcano.sh/vgpu-memory:  39312
  volcano.sh/vgpu-number:  80

volcano.sh/vgpu-memory: 39312: since factor=10 is set, this actually represents 39312 * 10 = 393120 MB of total GPU memory.

The environment is 8x L40S with 49140 MB per card; 49140 * 8 = 393120, an exact match, so everything checks out.

6. Summary

This post walked through how Volcano integrates HAMi vGPU to provide GPU virtualization on Kubernetes, validating the complete workflow of the HAMi-Core mode.

To answer the earlier question: how does the Volcano DevicePlugin register three resources at once?

It starts three DevicePlugins, registering volcano.sh/vgpu-number, volcano.sh/vgpu-memory, and volcano.sh/vgpu-cores respectively.
