
Volcano vGPU in Practice: GPU Sharing and Isolation on Kubernetes Without Hardware Dependencies

https://img.lixueduan.com/kubernetes/cover/volcano-vgpu.png

In the previous post, Volcano First Look: Cloud-Native Practice with a Batch Scheduling Engine, we deployed a Volcano cluster with Helm and ran our first test job, verifying its basic scheduling capability. This post goes a step further into Volcano's GPU virtualization, focusing on how HAMi vGPU enables fine-grained GPU sharing with hard isolation.

The batch scheduling engine Volcano supports GPU virtualization; the capability is provided mainly by HAMi.

HAMi vGPU offers two GPU virtualization modes: HAMi-Core and Dynamic MIG:

| Mode | Isolation | MIG GPU Required | Annotation | Core/Memory Control | Recommended For |
|------|-----------|------------------|------------|---------------------|-----------------|
| HAMi-Core | Software (VCUDA) | No | No | Yes | General workloads |
| Dynamic MIG | Hardware | Yes | Yes | MIG-controlled | Performance-sensitive jobs |

If the hardware supports MIG and the workload is performance-sensitive, the Dynamic MIG mode is recommended; without MIG support, the more general HAMi-Core mode, which has no hardware requirements, still works.

This post uses HAMi-Core to demonstrate how HAMi vGPU integrates with Volcano.

Workflow:

  • 1) Create a cluster

  • 2) Install GPU-Operator, but without its DevicePlugin

  • 3) Install Volcano and enable the vGPU plugin

  • 4) Install volcano-vgpu-device-plugin

  • 5) Verify

1. Environment Setup

1.1 Create a Cluster

Use KubeClipper to deploy a cluster for the experiments.

Kubernetes Tutorial (11): Quickly Create a k8s Cluster with One Command Using KubeClipper

1.2 GPU-Operator

Following the earlier article GPU Environment Setup Guide: Accelerating Kubernetes GPU Environment Setup with GPU Operator, deploy the environment with GPU Operator.
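
Note that the Volcano DevicePlugin installed in step 4 replaces NVIDIA's own, so the default device plugin should be left out when installing GPU Operator. A minimal sketch, assuming the nvidia chart repo and the gpu-operator chart's devicePlugin.enabled toggle:

# Install GPU Operator with its bundled device plugin disabled
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm upgrade --install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set devicePlugin.enabled=false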

1.3 Volcano

Deploy Volcano

Install Volcano. Mind the version compatibility between Volcano and Kubernetes; see the Kubernetes compatibility section of the official README.

Here we deploy v1.12.0:

# Add the repo
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts

helm repo update

# Deploy
helm upgrade --install volcano volcano-sh/volcano --version 1.12.0 -n volcano-system --create-namespace

After deployment the Pod list looks like this:

root@node5-3:~# kubectl -n volcano-system get po
NAME                                  READY   STATUS    RESTARTS   AGE
volcano-admission-6444dd4fb7-8s8d9    1/1     Running   0          3m
volcano-controllers-75d5b78c7-llcrz   1/1     Running   0          3m
volcano-scheduler-7d46c5b5db-t2k42    1/1     Running   0          3m

Modify the Scheduler Configuration: Enable the deviceshare Plugin

Once Volcano is deployed, edit the scheduler configuration to enable the deviceshare plugin.

kubectl edit cm -n volcano-system volcano-scheduler-configmap

The full content is as follows:

kind: ConfigMap
apiVersion: v1
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: deviceshare
        arguments:
          deviceshare.VGPUEnable: true # enable vgpu
          deviceshare.SchedulePolicy: binpack  # scheduling policy. binpack / spread
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack    

The key part:

      - name: deviceshare
        arguments:
          deviceshare.VGPUEnable: true # enable vgpu
          deviceshare.SchedulePolicy: binpack  # scheduling policy. binpack / spread

We enable vGPU and choose binpack as the scheduling policy.

For HAMi's scheduling policies, see: HAMi vGPU Internals Part 4: Implementing the Spread & Binpack Advanced Scheduling Policies.

After the change no restart is needed: Volcano watches the config file and reloads it automatically on change.

// When a scheduler config is specified, watch its directory for changes.
if opt.SchedulerConf != "" {
    var err error
    path := filepath.Dir(opt.SchedulerConf)
    watcher, err = filewatcher.NewFileWatcher(path)
    if err != nil {
       return nil, fmt.Errorf("failed creating filewatcher for %s: %v", opt.SchedulerConf, err)
    }
}

// watchSchedulerConf reloads the config whenever the file is written or re-created.
func (pc *Scheduler) watchSchedulerConf(stopCh <-chan struct{}) {
    if pc.fileWatcher == nil {
       return
    }
    eventCh := pc.fileWatcher.Events()
    errCh := pc.fileWatcher.Errors()
    for {
       select {
       case event, ok := <-eventCh:
          if !ok {
             return
          }
          klog.V(4).Infof("watch %s event: %v", pc.schedulerConf, event)
          if event.Op&fsnotify.Write == fsnotify.Write || event.Op&fsnotify.Create == fsnotify.Create {
             pc.loadSchedulerConf()
             pc.cache.SetMetricsConf(pc.metricsConf)
          }
       case err, ok := <-errCh:
          if !ok {
             return
          }
          klog.Infof("watch %s error: %v", pc.schedulerConf, err)
       case <-stopCh:
          return
       }
    }
}

That said, Kubernetes propagates ConfigMap updates into Pods with some delay; if you would rather not wait, restart the scheduler manually:

kubectl -n volcano-system rollout restart deploy volcano-scheduler

1.4 volcano-vgpu-device-plugin

Next, deploy the DevicePlugin used for the Volcano integration: volcano-vgpu-device-plugin.

For DevicePlugin internals, see: HAMi vGPU Internals Part 1: The hami-device-plugin-nvidia Implementation; the overall logic is much the same.

Deploy the DevicePlugin

Fetch volcano-vgpu-device-plugin.yml from the root of the volcano-vgpu-device-plugin project:

wget https://raw.githubusercontent.com/Project-HAMi/volcano-vgpu-device-plugin/main/volcano-vgpu-device-plugin.yml

Deploy it:

kubectl apply -f volcano-vgpu-device-plugin.yml

Check the Pod list:

root@node5-3:~# kubectl -n kube-system get po
volcano-device-plugin-xkwzd                2/2     Running   0          10m

Verify Node Resources

Check the resource information on the Node:

root@node5-3:~/lixd# k describe node node5-3 |grep Cap -A 10
Capacity:
  cpu:                     160
  ephemeral-storage:       3750157048Ki
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  2113442544Ki
  nvidia.com/gpu:          0
  pods:                    110
  volcano.sh/vgpu-cores:   800
  volcano.sh/vgpu-memory:  39312
  volcano.sh/vgpu-number:  80

Volcano adds the following three resources:

  volcano.sh/vgpu-cores:   800
  volcano.sh/vgpu-memory:  39312
  volcano.sh/vgpu-number:  80

  • volcano.sh/vgpu-cores: 800: each GPU provides 100 cores, and 8 cards give exactly 800.

  • volcano.sh/vgpu-memory: 39312: since factor=10 is set, this actually represents 39312 * 10 = 393120 MB of total GPU memory.

    • The environment is 8x L40S with 49140 MB per card; 49140 * 8 = 393120, an exact match, so everything is in order.

  • volcano.sh/vgpu-number: 80: the default --device-split-count=10 multiplies the GPU count by 10.

This confirms the plugin deployed successfully.
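
To pull just these vGPU values for a quick check:

# Allocatable vGPU resources at a glance
kubectl describe node node5-3 | grep volcano.sh/vgpu

# Or a single value via jsonpath (dots in the key escaped)
kubectl get node node5-3 -o jsonpath='{.status.capacity.volcano\.sh/vgpu-memory}'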

2. Basic Usage

Start a Pod

First start a simple Pod:

apiVersion: v1
kind: Pod
metadata:
  name: test1
spec:
  restartPolicy: OnFailure
  schedulerName: volcano
  containers:
  - image: ubuntu:24.04
    name: pod1-ctr
    command: ["sleep"]
    args: ["100000"]
    resources:
      limits:
        volcano.sh/vgpu-memory: 1024
        volcano.sh/vgpu-number: 1

Check the effect: vgpu-memory was requested as 1024, but because factor=10 the actual limit is 10240 MB, as shown below:

root@node5-3:~/lixd# k exec -it test1 -- nvidia-smi
[HAMI-core Msg(16:140249737447232:libvgpu.c:838)]: Initializing.....
Tue Jul 22 13:52:58 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L40S                    On  | 00000000:91:00.0 Off |                  Off |
| N/A   28C    P8              34W / 350W |      0MiB / 10240MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
[HAMI-core Msg(16:140249737447232:multiprocess_memory_limit.c:499)]: Calling exit handler 16
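
Compute can be capped as well, via volcano.sh/vgpu-cores. A minimal sketch with assumed names and values (as the source analysis in section 4 shows, the cores request reaches the container as CUDA_DEVICE_SM_LIMIT):

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: test2
spec:
  restartPolicy: OnFailure
  schedulerName: volcano
  containers:
  - image: ubuntu:24.04
    name: pod2-ctr
    command: ["sleep"]
    args: ["100000"]
    resources:
      limits:
        volcano.sh/vgpu-number: 1
        volcano.sh/vgpu-memory: 1024
        volcano.sh/vgpu-cores: 50   # cap at 50% of one card's compute
EOF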

Start a Volcano Job

Try a simple Volcano Job:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: simple-vgpu-training
spec:
  schedulerName: volcano
  minAvailable: 3  # gang scheduling: ensure all 3 Pods start together
  
  tasks:
    - name: worker
      replicas: 3  # start 3 workers
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: python-trainer
            image: python:3.9-slim
            command: ["python", "-c"]
            args:
              - |
                # simplified training code
                import os
                import time
                
                worker_id = os.getenv("VC_TASK_INDEX", "0")
                print(f"Worker {worker_id} started with vGPU")
                
                # simulate the training loop
                for epoch in range(1, 10):
                    time.sleep(6)
                    print(f"Worker {worker_id} completed epoch {epoch}")
                
                print(f"Worker {worker_id} finished training!")                
            env:
              # task index (0,1,...)
              - name: VC_TASK_INDEX
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.annotations['volcano.sh/task-index']
            resources:
              limits:
                volcano.sh/vgpu-memory: 1024  # 1024 MB of GPU memory per worker
                volcano.sh/vgpu-number: 1     # one vGPU per worker
                cpu: "1"
                memory: "1Gi"

Everything runs fine:

root@node5-3:~/lixd# k get po -w
NAME                            READY   STATUS      RESTARTS   AGE
simple-vgpu-training-worker-0   1/1     Running     0          5s
simple-vgpu-training-worker-1   1/1     Running     0          5s
simple-vgpu-training-worker-2   1/1     Running     0          5s
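
Gang scheduling itself can be confirmed through the PodGroup Volcano creates for the Job (a quick check; names and phases will vary with your cluster):

kubectl get podgroup
# the PodGroup derived from the Job should report minMember=3 and phase Running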

Check the GPU info inside a Pod:

root@node5-3:~/lixd# k exec -it simple-vgpu-training-worker-0 -- nvidia-smi
[HAMI-core Msg(7:140498435086144:libvgpu.c:838)]: Initializing.....
Tue Jul 22 15:02:36 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L40S                    On  | 00000000:91:00.0 Off |                  Off |
| N/A   28C    P8              34W / 350W |      0MiB / 10240MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
[HAMI-core Msg(7:140498435086144:multiprocess_memory_limit.c:499)]: Calling exit handler 7

Logs:

root@node5-3:~/lixd# k logs -f simple-vgpu-training-worker-2
Worker 2 started with vGPU
Worker 2 completed epoch 1
Worker 2 completed epoch 2
Worker 2 completed epoch 3
Worker 2 completed epoch 4
Worker 2 completed epoch 5
Worker 2 completed epoch 6
Worker 2 completed epoch 7
Worker 2 completed epoch 8
Worker 2 completed epoch 9
Worker 2 finished training!

3. Monitoring

Scheduler Metrics

curl {volcano scheduler cluster ip}:8080/metrics
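
If the cluster IP is not at hand, port-forwarding to the scheduler works just as well (Deployment name as seen in section 1.3):

kubectl -n volcano-system port-forward deploy/volcano-scheduler 8080:8080
curl -s http://127.0.0.1:8080/metrics | grep volcano_vgpu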

The metrics include GPU core and memory allocations per card, plus the per-Pod breakdown, for example:

# HELP volcano_vgpu_device_allocated_cores The percentage of gpu compute cores allocated in this card
# TYPE volcano_vgpu_device_allocated_cores gauge
volcano_vgpu_device_allocated_cores{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad"} 50

# HELP volcano_vgpu_device_allocated_memory The number of vgpu memory allocated in this card
# TYPE volcano_vgpu_device_allocated_memory gauge
volcano_vgpu_device_allocated_memory{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad"} 2048

# HELP volcano_vgpu_device_core_allocation_for_a_certain_pod The vgpu device core allocated for a certain pod
# TYPE volcano_vgpu_device_core_allocation_for_a_certain_pod gauge
volcano_vgpu_device_core_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-0"} 10
volcano_vgpu_device_core_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-1"} 10
volcano_vgpu_device_core_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-2"} 10
volcano_vgpu_device_core_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-3"} 10
volcano_vgpu_device_core_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-4"} 10

# HELP volcano_vgpu_device_memory_allocation_for_a_certain_pod The vgpu device memory allocated for a certain pod
# TYPE volcano_vgpu_device_memory_allocation_for_a_certain_pod gauge
volcano_vgpu_device_memory_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-0"} 128
volcano_vgpu_device_memory_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-1"} 128
volcano_vgpu_device_memory_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-2"} 128
volcano_vgpu_device_memory_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-3"} 128
volcano_vgpu_device_memory_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-4"} 128

# HELP volcano_vgpu_device_memory_limit The number of total device memory in this card
# TYPE volcano_vgpu_device_memory_limit gauge
volcano_vgpu_device_memory_limit{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad"} 4914

# HELP volcano_vgpu_device_shared_number The number of vgpu tasks sharing this card
# TYPE volcano_vgpu_device_shared_number gauge
volcano_vgpu_device_shared_number{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad"} 5

Device Plugin Metrics

Query port 9394 of the DevicePlugin Pod directly for its metrics:

curl http://<plugin-pod-ip>:9394/metrics
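
The Pod IP can be read from the wide Pod listing (matching the Pod name from section 1.4):

kubectl -n kube-system get po -o wide | grep volcano-device-plugin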

This shows GPU usage on the node; the metrics look like this:

# HELP Device_last_kernel_of_container Container device last kernel description
# TYPE Device_last_kernel_of_container gauge
Device_last_kernel_of_container{ctrname="pod1-ctr",deviceuuid="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podname="test1",podnamespace="default",vdeviceid="0",zone="vGPU"} 257114
# HELP Device_memory_desc_of_container Container device meory description
# TYPE Device_memory_desc_of_container counter
Device_memory_desc_of_container{context="0",ctrname="pod1-ctr",data="0",deviceuuid="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",module="0",offset="0",podname="test1",podnamespace="default",vdeviceid="0",zone="vGPU"} 0
# HELP Device_utilization_desc_of_container Container device utilization description
# TYPE Device_utilization_desc_of_container gauge
Device_utilization_desc_of_container{ctrname="pod1-ctr",deviceuuid="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podname="test1",podnamespace="default",vdeviceid="0",zone="vGPU"} 0
# HELP HostCoreUtilization GPU core utilization
# TYPE HostCoreUtilization gauge
HostCoreUtilization{deviceidx="0",deviceuuid="GPU-a11fe6d9-3dbe-8a24-34e9-535b2629babd",zone="vGPU"} 0
HostCoreUtilization{deviceidx="1",deviceuuid="GPU-b82090de-5250-44e2-a5ed-b0efc5763f8f",zone="vGPU"} 0
HostCoreUtilization{deviceidx="2",deviceuuid="GPU-8f563a66-d507-583f-59f1-46c2e97a393c",zone="vGPU"} 0
HostCoreUtilization{deviceidx="3",deviceuuid="GPU-1e5a0632-4332-f4d0-adf2-80ebfed56684",zone="vGPU"} 0
HostCoreUtilization{deviceidx="4",deviceuuid="GPU-384027fd-54f2-638b-cdfe-0d5f3b6630f5",zone="vGPU"} 0
HostCoreUtilization{deviceidx="5",deviceuuid="GPU-dbb95093-0147-7b3a-f468-8a3575a8dd4e",zone="vGPU"} 0
HostCoreUtilization{deviceidx="6",deviceuuid="GPU-f3eb6e71-e90a-bfc9-de06-dff90c3093b9",zone="vGPU"} 0
HostCoreUtilization{deviceidx="7",deviceuuid="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",zone="vGPU"} 0
# HELP HostGPUMemoryUsage GPU device memory usage
# TYPE HostGPUMemoryUsage gauge
HostGPUMemoryUsage{deviceidx="0",deviceuuid="GPU-a11fe6d9-3dbe-8a24-34e9-535b2629babd",zone="vGPU"} 5.14326528e+08
HostGPUMemoryUsage{deviceidx="1",deviceuuid="GPU-b82090de-5250-44e2-a5ed-b0efc5763f8f",zone="vGPU"} 5.14326528e+08
HostGPUMemoryUsage{deviceidx="2",deviceuuid="GPU-8f563a66-d507-583f-59f1-46c2e97a393c",zone="vGPU"} 5.14326528e+08
HostGPUMemoryUsage{deviceidx="3",deviceuuid="GPU-1e5a0632-4332-f4d0-adf2-80ebfed56684",zone="vGPU"} 5.14326528e+08
HostGPUMemoryUsage{deviceidx="4",deviceuuid="GPU-384027fd-54f2-638b-cdfe-0d5f3b6630f5",zone="vGPU"} 5.14326528e+08
HostGPUMemoryUsage{deviceidx="5",deviceuuid="GPU-dbb95093-0147-7b3a-f468-8a3575a8dd4e",zone="vGPU"} 5.14326528e+08
HostGPUMemoryUsage{deviceidx="6",deviceuuid="GPU-f3eb6e71-e90a-bfc9-de06-dff90c3093b9",zone="vGPU"} 5.14326528e+08
HostGPUMemoryUsage{deviceidx="7",deviceuuid="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",zone="vGPU"} 5.14326528e+08
# HELP vGPU_device_memory_limit_in_bytes vGPU device limit
# TYPE vGPU_device_memory_limit_in_bytes gauge
vGPU_device_memory_limit_in_bytes{ctrname="pod1-ctr",deviceuuid="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podname="test1",podnamespace="default",vdeviceid="0",zone="vGPU"} 1.073741824e+10
# HELP vGPU_device_memory_usage_in_bytes vGPU device usage
# TYPE vGPU_device_memory_usage_in_bytes gauge
vGPU_device_memory_usage_in_bytes{ctrname="pod1-ctr",deviceuuid="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podname="test1",podnamespace="default",vdeviceid="0",zone="vGPU"} 0

4. Source Code Analysis

4.1 DevicePlugin

https://github.com/Project-HAMi/volcano-vgpu-device-plugin

Resource Registration

A DevicePlugin normally registers exactly one resource, yet the Volcano DevicePlugin registers three. How?

root@node5-3:~/lixd# k describe node node5-3 |grep Cap -A 10
Capacity:
  cpu:                     160
  ephemeral-storage:       3750157048Ki
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  2113442544Ki
  nvidia.com/gpu:          0
  pods:                    110
  volcano.sh/vgpu-cores:   800
  volcano.sh/vgpu-memory:  39312
  volcano.sh/vgpu-number:  80

Internally, the Volcano DevicePlugin starts three DevicePlugins, one per ResourceName:

  • volcano.sh/vgpu-number

  • volcano.sh/vgpu-memory

  • volcano.sh/vgpu-cores

func (s *migStrategyNone) GetPlugins(cfg *config.NvidiaConfig, cache *DeviceCache) []*NvidiaDevicePlugin {
    return []*NvidiaDevicePlugin{
       NewNvidiaDevicePlugin(
          //"nvidia.com/gpu",
          util.ResourceName,
          cache,
          gpuallocator.NewBestEffortPolicy(),
          pluginapi.DevicePluginPath+"nvidia-gpu.sock",
          cfg),
       NewNvidiaDevicePlugin(
          util.ResourceMem,
          cache,
          gpuallocator.NewBestEffortPolicy(),
          pluginapi.DevicePluginPath+"nvidia-gpu-memory.sock",
          cfg),
       NewNvidiaDevicePlugin(
          util.ResourceCores,
          cache,
          gpuallocator.NewBestEffortPolicy(),
          pluginapi.DevicePluginPath+"nvidia-gpu-cores.sock",
          cfg),
    }
}

The corresponding sock files:

root@node5-3:/var/lib/kubelet/device-plugins# ls /var/lib/kubelet/device-plugins
DEPRECATION  kubelet.sock  kubelet_internal_checkpoint  nvidia-gpu-cores.sock  nvidia-gpu-memory.sock  nvidia-gpu.sock

Device enumeration also branches on the ResourceName:

func (m *NvidiaDevicePlugin) apiDevices() []*pluginapi.Device {
    if strings.Compare(m.migStrategy, "mixed") == 0 {
       var pdevs []*pluginapi.Device
       for _, d := range m.cachedDevices {
          pdevs = append(pdevs, &d.Device)
       }
       return pdevs
    }
    devices := m.Devices()
    var res []*pluginapi.Device

    if strings.Compare(m.resourceName, util.ResourceMem) == 0 {
       for _, dev := range devices {
          i := 0
          klog.Infoln("memory=", dev.Memory, "id=", dev.ID)
          for i < int(32767) {
             res = append(res, &pluginapi.Device{
                ID:       fmt.Sprintf("%v-memory-%v", dev.ID, i),
                Health:   dev.Health,
                Topology: nil,
             })
             i++
          }
       }
       klog.Infoln("res length=", len(res))
       return res
    }
    if strings.Compare(m.resourceName, util.ResourceCores) == 0 {
       for _, dev := range devices {
          i := 0
          for i < 100 {
             res = append(res, &pluginapi.Device{
                ID:       fmt.Sprintf("%v-core-%v", dev.ID, i),
                Health:   dev.Health,
                Topology: nil,
             })
             i++
          }
       }
       return res
    }

    for _, dev := range devices {
       for i := uint(0); i < config.DeviceSplitCount; i++ {
          id := fmt.Sprintf("%v-%v", dev.ID, i)
          res = append(res, &pluginapi.Device{
             ID:       id,
             Health:   dev.Health,
             Topology: nil,
          })
       }
    }
    return res
}
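
Note how the memory resource is advertised: one pseudo-device per memory unit, bounded by the 32767 cap in the loop above. This is exactly where the FAQ's --gpu-memory-factor comes in: at factor 1 each entry represents 1 MB, so a 49140 MB L40S cannot be represented within the bound, while at factor 10 the same card needs only 4914 entries.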

Allocate

The actual device allocation logic:

// Allocate which return list of devices.
func (m *NvidiaDevicePlugin) Allocate(ctx context.Context, reqs *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
    if len(reqs.ContainerRequests) > 1 {
       return &pluginapi.AllocateResponse{}, errors.New("multiple Container Requests not supported")
    }
    if strings.Compare(m.migStrategy, "mixed") == 0 {
       return m.MIGAllocate(ctx, reqs)
    }
    responses := pluginapi.AllocateResponse{}

    if strings.Compare(m.resourceName, util.ResourceMem) == 0 || strings.Compare(m.resourceName, util.ResourceCores) == 0 {
       for range reqs.ContainerRequests {
          responses.ContainerResponses = append(responses.ContainerResponses, &pluginapi.ContainerAllocateResponse{})
       }
       return &responses, nil
    }
    nodename := os.Getenv("NODE_NAME")

    current, err := util.GetPendingPod(nodename)
    if err != nil {
       lock.ReleaseNodeLock(nodename, util.VGPUDeviceName)
       return &pluginapi.AllocateResponse{}, err
    }
    if current == nil {
       klog.Errorf("no pending pod found on node %s", nodename)
       lock.ReleaseNodeLock(nodename, util.VGPUDeviceName)
       return &pluginapi.AllocateResponse{}, errors.New("no pending pod found on node")
    }

    for idx := range reqs.ContainerRequests {
       currentCtr, devreq, err := util.GetNextDeviceRequest(util.NvidiaGPUDevice, *current)
       klog.Infoln("deviceAllocateFromAnnotation=", devreq)
       if err != nil {
          klog.Errorln("get device from annotation failed", err.Error())
          util.PodAllocationFailed(nodename, current)
          return &pluginapi.AllocateResponse{}, err
       }
       if len(devreq) != len(reqs.ContainerRequests[idx].DevicesIDs) {
          klog.Errorln("device number not matched", devreq, reqs.ContainerRequests[idx].DevicesIDs)
          util.PodAllocationFailed(nodename, current)
          return &pluginapi.AllocateResponse{}, errors.New("device number not matched")
       }

       response := pluginapi.ContainerAllocateResponse{}
       response.Envs = make(map[string]string)
       response.Envs["NVIDIA_VISIBLE_DEVICES"] = strings.Join(m.GetContainerDeviceStrArray(devreq), ",")

       err = util.EraseNextDeviceTypeFromAnnotation(util.NvidiaGPUDevice, *current)
       if err != nil {
          klog.Errorln("Erase annotation failed", err.Error())
          util.PodAllocationFailed(nodename, current)
          return &pluginapi.AllocateResponse{}, err
       }

       if m.operatingMode != "mig" {

          for i, dev := range devreq {
             limitKey := fmt.Sprintf("CUDA_DEVICE_MEMORY_LIMIT_%v", i)
             response.Envs[limitKey] = fmt.Sprintf("%vm", dev.Usedmem*int32(config.GPUMemoryFactor))
          }
          response.Envs["CUDA_DEVICE_SM_LIMIT"] = fmt.Sprint(devreq[0].Usedcores)
          response.Envs["CUDA_DEVICE_MEMORY_SHARED_CACHE"] = fmt.Sprintf("/tmp/vgpu/%v.cache", uuid.NewUUID())

          cacheFileHostDirectory := "/tmp/vgpu/containers/" + string(current.UID) + "_" + currentCtr.Name
          os.MkdirAll(cacheFileHostDirectory, 0777)
          os.Chmod(cacheFileHostDirectory, 0777)
          os.MkdirAll("/tmp/vgpulock", 0777)
          os.Chmod("/tmp/vgpulock", 0777)
          hostHookPath := os.Getenv("HOOK_PATH")

          response.Mounts = append(response.Mounts,
             &pluginapi.Mount{ContainerPath: "/usr/local/vgpu/libvgpu.so",
                HostPath: hostHookPath + "/libvgpu.so",
                ReadOnly: true},
             &pluginapi.Mount{ContainerPath: "/tmp/vgpu",
                HostPath: cacheFileHostDirectory,
                ReadOnly: false},
             &pluginapi.Mount{ContainerPath: "/tmp/vgpulock",
                HostPath: "/tmp/vgpulock",
                ReadOnly: false},
          )
          found := false
          for _, val := range currentCtr.Env {
             if strings.Compare(val.Name, "CUDA_DISABLE_CONTROL") == 0 {
                found = true
                break
             }
          }
          if !found {
             response.Mounts = append(response.Mounts, &pluginapi.Mount{ContainerPath: "/etc/ld.so.preload",
                HostPath: hostHookPath + "/ld.so.preload",
                ReadOnly: true},
             )
          }
       }
       responses.ContainerResponses = append(responses.ContainerResponses, &response)
    }
    klog.Infoln("Allocate Response", responses.ContainerResponses)
    util.PodAllocationTrySuccess(nodename, current)
    return &responses, nil
}

The core part sets the environment variables and mounts libvgpu.so, among other things:

for i, dev := range devreq {
    limitKey := fmt.Sprintf("CUDA_DEVICE_MEMORY_LIMIT_%v", i)
    response.Envs[limitKey] = fmt.Sprintf("%vm", dev.Usedmem*int32(config.GPUMemoryFactor))
}
response.Envs["CUDA_DEVICE_SM_LIMIT"] = fmt.Sprint(devreq[0].Usedcores)
response.Envs["CUDA_DEVICE_MEMORY_SHARED_CACHE"] = fmt.Sprintf("/tmp/vgpu/%v.cache", uuid.NewUUID())

response.Mounts = append(response.Mounts,
    &pluginapi.Mount{ContainerPath: "/usr/local/vgpu/libvgpu.so",
       HostPath: hostHookPath + "/libvgpu.so",
       ReadOnly: true},
    &pluginapi.Mount{ContainerPath: "/tmp/vgpu",
       HostPath: cacheFileHostDirectory,
       ReadOnly: false},
    &pluginapi.Mount{ContainerPath: "/tmp/vgpulock",
       HostPath: "/tmp/vgpulock",
       ReadOnly: false},
)
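
With the test1 Pod from section 2 still running, these settings can be verified inside the container (given the earlier request of vgpu-memory: 1024 and factor=10, the memory limit should read 10240m):

kubectl exec -it test1 -- env | grep CUDA_DEVICE
kubectl exec -it test1 -- ls -l /usr/local/vgpu/libvgpu.so /etc/ld.so.preload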

Since three DevicePlugins are running, Allocate guards against duplicate work by checking the ResourceName: only the volcano.sh/vgpu-number plugin performs the real allocation.

if strings.Compare(m.resourceName, util.ResourceMem) == 0 || strings.Compare(m.resourceName, util.ResourceCores) == 0 {
    for range reqs.ContainerRequests {
       responses.ContainerResponses = append(responses.ContainerResponses, &pluginapi.ContainerAllocateResponse{})
    }
    return &responses, nil
}

4.2 The deviceshare Plugin

https://github.com/volcano-sh/volcano/blob/master/pkg/scheduler/plugins/deviceshare/deviceshare.go

A quick look at the deviceshare plugin in Volcano.

This part is essentially the same as the HAMi implementation; the HAMi vGPU internals articles referenced earlier cover the same ground.

Every plugin must implement the Plugin interface defined by Volcano:

type Plugin interface {
    // The unique name of Plugin.
    Name() string

    OnSessionOpen(ssn *Session)
    OnSessionClose(ssn *Session)
}

The core code lives in the OnSessionOpen implementation, which registers two scheduling callbacks:

  • Predicate

  • NodeOrder

func (dp *deviceSharePlugin) OnSessionOpen(ssn *framework.Session) {
    // Register event handlers to update task info in PodLister & nodeMap
    ssn.AddPredicateFn(dp.Name(), func(task *api.TaskInfo, node *api.NodeInfo) error {
       predicateStatus := make([]*api.Status, 0)
       // Check PredicateWithCache
       for _, val := range api.RegisteredDevices {
          if dev, ok := node.Others[val].(api.Devices); ok {
             if reflect.ValueOf(dev).IsNil() {
                // TODO When a pod requests a device of the current type, but the current node does not have such a device, an error is thrown
                if dev == nil || dev.HasDeviceRequest(task.Pod) {
                   predicateStatus = append(predicateStatus, &api.Status{
                      Code:   devices.Unschedulable,
                      Reason: "node not initialized with device" + val,
                      Plugin: PluginName,
                   })
                   return api.NewFitErrWithStatus(task, node, predicateStatus...)
                }
                klog.V(4).Infof("pod %s/%s did not request device %s on %s, skipping it", task.Pod.Namespace, task.Pod.Name, val, node.Name)
                continue
             }
             code, msg, err := dev.FilterNode(task.Pod, dp.schedulePolicy)
             if err != nil {
                predicateStatus = append(predicateStatus, createStatus(code, msg))
                return api.NewFitErrWithStatus(task, node, predicateStatus...)
             }
             filterNodeStatus := createStatus(code, msg)
             if filterNodeStatus.Code != api.Success {
                predicateStatus = append(predicateStatus, filterNodeStatus)
                return api.NewFitErrWithStatus(task, node, predicateStatus...)
             }
          } else {
             klog.Warningf("Devices %s assertion conversion failed, skip", val)
          }
       }

       klog.V(4).Infof("checkDevices predicates Task <%s/%s> on Node <%s>: fit ",
          task.Namespace, task.Name, node.Name)

       return nil
    })

    ssn.AddNodeOrderFn(dp.Name(), func(task *api.TaskInfo, node *api.NodeInfo) (float64, error) {
       // DeviceScore
       nodeScore := float64(0)
       if dp.scheduleWeight > 0 {
          score, status := getDeviceScore(context.TODO(), task.Pod, node, dp.schedulePolicy)
          if !status.IsSuccess() {
             klog.Warningf("Node: %s, Calculate Device Score Failed because of Error: %v", node.Name, status.AsError())
             return 0, status.AsError()
          }

          // TODO: we should use a seperate plugin for devices, and seperate them from predicates and nodeOrder plugin.
          nodeScore = float64(score) * float64(dp.scheduleWeight)
          klog.V(5).Infof("Node: %s, task<%s/%s> Device Score weight %d, score: %f", node.Name, task.Namespace, task.Name, dp.scheduleWeight, nodeScore)
       }
       return nodeScore, nil
    })
}

It implements two pieces of scheduling logic: node filtering and node scoring.

Predicate

Filter out nodes that cannot satisfy the device request:

ssn.AddPredicateFn(dp.Name(), func(task *api.TaskInfo, node *api.NodeInfo) error {
    predicateStatus := make([]*api.Status, 0)
    // Check PredicateWithCache
    for _, val := range api.RegisteredDevices {
       if dev, ok := node.Others[val].(api.Devices); ok {
          if reflect.ValueOf(dev).IsNil() {
             // TODO When a pod requests a device of the current type, but the current node does not have such a device, an error is thrown
             if dev == nil || dev.HasDeviceRequest(task.Pod) {
                predicateStatus = append(predicateStatus, &api.Status{
                   Code:   devices.Unschedulable,
                   Reason: "node not initialized with device" + val,
                   Plugin: PluginName,
                })
                return api.NewFitErrWithStatus(task, node, predicateStatus...)
             }
             klog.V(4).Infof("pod %s/%s did not request device %s on %s, skipping it", task.Pod.Namespace, task.Pod.Name, val, node.Name)
             continue
          }
          code, msg, err := dev.FilterNode(task.Pod, dp.schedulePolicy)
          if err != nil {
             predicateStatus = append(predicateStatus, createStatus(code, msg))
             return api.NewFitErrWithStatus(task, node, predicateStatus...)
          }
          filterNodeStatus := createStatus(code, msg)
          if filterNodeStatus.Code != api.Success {
             predicateStatus = append(predicateStatus, filterNodeStatus)
             return api.NewFitErrWithStatus(task, node, predicateStatus...)
          }
       } else {
          klog.Warningf("Devices %s assertion conversion failed, skip", val)
       }
    }

    klog.V(4).Infof("checkDevices predicates Task <%s/%s> on Node <%s>: fit ",
       task.Namespace, task.Name, node.Name)

    return nil
})

The core logic is in the FilterNode method:

code, msg, err := dev.FilterNode(task.Pod, dp.schedulePolicy)
if err != nil {
    predicateStatus = append(predicateStatus, createStatus(code, msg))
    return api.NewFitErrWithStatus(task, node, predicateStatus...)
}

func (gs *GPUDevices) FilterNode(pod *v1.Pod, schedulePolicy string) (int, string, error) {
    if VGPUEnable {
       klog.V(4).Infoln("hami-vgpu DeviceSharing starts filtering pods", pod.Name)
       fit, _, score, err := checkNodeGPUSharingPredicateAndScore(pod, gs, true, schedulePolicy)
       if err != nil || !fit {
          klog.ErrorS(err, "Failed to fitler node to vgpu task", "pod", pod.Name)
          return devices.Unschedulable, "hami-vgpuDeviceSharing error", err
       }
       gs.Score = score
       klog.V(4).Infoln("hami-vgpu DeviceSharing successfully filters pods")
    }
    return devices.Success, "", nil
}

Nodes that do not qualify are filtered out, and the remaining ones are scored.

Node Filtering

The node is checked for sufficient core and memory resources; nodes that fall short are filtered out.

ctrdevs := []ContainerDevices{}
for _, val := range ctrReq {
    devs := []ContainerDevice{}
    if int(val.Nums) > len(gs.Device) {
       return false, []ContainerDevices{}, 0, fmt.Errorf("no enough gpu cards on node %s", gs.Name)
    }
    klog.V(3).InfoS("Allocating device for container", "request", val)

    for i := len(gs.Device) - 1; i >= 0; i-- {
       klog.V(3).InfoS("Scoring pod request", "memReq", val.Memreq, "memPercentageReq", val.MemPercentagereq, "coresReq", val.Coresreq, "Nums", val.Nums, "Index", i, "ID", gs.Device[i].ID)
       klog.V(3).InfoS("Current Device", "Index", i, "TotalMemory", gs.Device[i].Memory, "UsedMemory", gs.Device[i].UsedMem, "UsedCores", gs.Device[i].UsedCore, "replicate", replicate)
       if gs.Device[i].Number <= uint(gs.Device[i].UsedNum) {
          continue
       }
       if val.MemPercentagereq != 101 && val.Memreq == 0 {
          val.Memreq = gs.Device[i].Memory * uint(val.MemPercentagereq/100)
       }
       if int(gs.Device[i].Memory)-int(gs.Device[i].UsedMem) < int(val.Memreq) {
          continue
       }
       if gs.Device[i].UsedCore+val.Coresreq > 100 {
          continue
       }
       // Coresreq=100 indicates it want this card exclusively
       if val.Coresreq == 100 && gs.Device[i].UsedNum > 0 {
          continue
       }
       // You can't allocate core=0 job to an already full GPU
       if gs.Device[i].UsedCore == 100 && val.Coresreq == 0 {
          continue
       }
       if !checkType(pod.Annotations, *gs.Device[i], val) {
          klog.Errorln("failed checktype", gs.Device[i].Type, val.Type)
          continue
       }
       fit, uuid := gs.Sharing.TryAddPod(gs.Device[i], uint(val.Memreq), uint(val.Coresreq))
       if !fit {
          klog.V(3).Info(gs.Device[i].ID, "not fit")
          continue
       }
       //total += gs.Devices[i].Count
       //free += node.Devices[i].Count - node.Devices[i].Used
       if val.Nums > 0 {
          val.Nums--
          klog.V(3).Info("fitted uuid: ", uuid)
          devs = append(devs, ContainerDevice{
             UUID:      uuid,
             Type:      val.Type,
             Usedmem:   val.Memreq,
             Usedcores: val.Coresreq,
          })
          score += GPUScore(schedulePolicy, gs.Device[i])
       }
       if val.Nums == 0 {
          break
       }
    }
    if val.Nums > 0 {
       return false, []ContainerDevices{}, 0, fmt.Errorf("not enough gpu fitted on this node")
    }
    ctrdevs = append(ctrdevs, devs)
}

Node Scoring

Scoring follows the configured scheduling policy.

const (
    binpackMultiplier = 100
    spreadMultiplier  = 100
)

func GPUScore(schedulePolicy string, device *GPUDevice) float64 {
    var score float64
    switch schedulePolicy {
    case binpackPolicy:
       score = binpackMultiplier * (float64(device.UsedMem) / float64(device.Memory))
    case spreadPolicy:
       if device.UsedNum == 1 {
          score = spreadMultiplier
       }
    default:
       score = float64(0)
    }
    return score
}

The logic is straightforward (a worked example follows the list):

  • Binpack: the higher the device's memory utilization, the higher its score.

  • Spread: a device already in shared use scores 100, otherwise 0.
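
A quick worked example for binpack: a card with 10240 of 49140 MB allocated scores 100 * 10240 / 49140 ≈ 20.8, while an idle card scores 0, so new pods keep packing onto the already-used card.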

NodeOrder

The previous step already scored each node; here the score is simply weighted and returned for node ordering.

    ssn.AddNodeOrderFn(dp.Name(), func(task *api.TaskInfo, node *api.NodeInfo) (float64, error) {
       // DeviceScore
       nodeScore := float64(0)
       if dp.scheduleWeight > 0 {
          score, status := getDeviceScore(context.TODO(), task.Pod, node, dp.schedulePolicy)
          if !status.IsSuccess() {
             klog.Warningf("Node: %s, Calculate Device Score Failed because of Error: %v", node.Name, status.AsError())
             return 0, status.AsError()
          }

          // TODO: we should use a seperate plugin for devices, and seperate them from predicates and nodeOrder plugin.
          nodeScore = float64(score) * float64(dp.scheduleWeight)
          klog.V(5).Infof("Node: %s, task<%s/%s> Device Score weight %d, score: %f", node.Name, task.Namespace, task.Name, dp.scheduleWeight, nodeScore)
       }
       return nodeScore, nil
    })
}

The key line:

nodeScore = float64(score) * float64(dp.scheduleWeight)

The final score is the node score multiplied by the plugin weight.

5. FAQ

gpu-memory shows 0

Symptom: after the device plugin is deployed, gpu-memory shows 0, like this:

root@node5-3:~/lixd# k describe node node5-3 |grep Cap -A 10
Capacity:
  cpu:                     160
  ephemeral-storage:       3750157048Ki
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  2113442544Ki
  nvidia.com/gpu:          0
  pods:                    110
  volcano.sh/vgpu-cores:   800
  volcano.sh/vgpu-memory:  0
  volcano.sh/vgpu-number:  80

Root cause: https://github.com/volcano-sh/devices/issues/19

Relevant description from the issue:

the size of device list exceeds the bound, and ListAndWatch failed as a result.

In short, once the GPU memory exceeds the threshold the device list grows too large, ListAndWatch fails, the DevicePlugin cannot report properly, and the value shows as 0.

Fix: start the plugin with --gpu-memory-factor=10, changing the smallest memory block from the default 1 MB to 10 MB, like this:

      containers:
      - image: docker.io/projecthami/volcano-vgpu-device-plugin:v1.10.0
        args: ["--device-split-count=10","--gpu-memory-factor=10"]

This scales the largest representable value by 10x, avoiding the problem.

The result:

root@node5-3:~/lixd# k describe node node5-3 |grep Cap -A 10
Capacity:
  cpu:                     160
  ephemeral-storage:       3750157048Ki
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  2113442544Ki
  nvidia.com/gpu:          0
  pods:                    110
  volcano.sh/vgpu-cores:   800
  volcano.sh/vgpu-memory:  39312
  volcano.sh/vgpu-number:  80

volcano.sh/vgpu-memory: 39312: since factor=10 is set, this actually represents 39312 * 10 = 393120 MB of total GPU memory.

The environment is 8x L40S with 49140 MB per card; 49140 * 8 = 393120, an exact match, so everything checks out.

6. Summary

This post walked through how Volcano integrates HAMi vGPU to provide GPU virtualization on Kubernetes, validating the complete workflow of the HAMi-Core mode.

To answer the earlier question: how does the Volcano DevicePlugin register three resources at once?

It starts three DevicePlugins, registering volcano.sh/vgpu-number, volcano.sh/vgpu-memory, and volcano.sh/vgpu-cores respectively.
