# Volcano vGPU in Practice: Hardware-Independent GPU Sharing and Isolation on Kubernetes


![volcano-vgpu.png](https://img.lixueduan.com/kubernetes/cover/volcano-vgpu.png)

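In the previous article, *A First Look at Volcano: A Batch Scheduling Engine in Cloud-Native Practice*, we deployed Volcano with Helm and ran a first test job to verify its basic scheduling capability. This article goes a step further and explores Volcano's **GPU virtualization**, focusing on how **HAMi vGPU** provides fine-grained sharing and hard isolation of GPU resources.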

<!--more-->

Volcano, the batch scheduling engine, supports GPU virtualization. The capability is mainly provided by HAMi.

HAMi vGPU offers two GPU virtualization modes, HAMi-Core and Dynamic MIG:

| Mode        | Isolation        | MIG GPU Required | Annotation | Core/Memory Control | Recommended For            |
| ----------- | ---------------- | ---------------- | ---------- | ------------------- | -------------------------- |
| HAMi-core   | Software (VCUDA) | No               | No         | Yes                 | General workloads          |
| Dynamic MIG | Hardware         | Yes              | Yes        | MIG-controlled      | Performance-sensitive jobs |

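If the hardware supports MIG and the workload is performance-sensitive, Dynamic MIG mode is recommended. Without MIG support, you can still use the more general HAMi-Core mode, which has no hardware requirements.

This article uses HAMi-Core to demonstrate how HAMi vGPU integrates with Volcano.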



The workflow:

* 1) Create a cluster

* 2) Install GPU-Operator, but without its DevicePlugin

* 3) Install Volcano and enable the vGPU plugin in its configuration

* 4) Install volcano-vgpu-device-plugin

* 5) Verify



## 1. Environment Setup

### 1.1 Create the Cluster

Deploy a cluster with KubeClipper for verification.

[Kubernetes Tutorial (11): Creating a k8s Cluster with a Single Command Using KubeClipper](https://www.lixueduan.com/posts/kubernetes/11-install-by-kubeclipper/)

### 1.2 GPU-Operator

Follow the earlier article [GPU Environment Setup Guide: Using GPU Operator to Speed Up Kubernetes GPU Environment Setup](https://www.lixueduan.com/posts/ai/02-gpu-operator/) and deploy the environment with GPU Operator.

### 1.3 Volcano

#### Deploy Volcano

Install Volcano. When deploying, **pay attention to version compatibility between Volcano and Kubernetes**; see the official README: [Kubernetes compatibility](https://github.com/volcano-sh/volcano?tab=readme-ov-file#kubernetes-compatibility)

This article deploys v1.12.0:

```bash
# Add the Helm repository
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts

helm repo update

# Deploy
helm upgrade --install volcano volcano-sh/volcano --version 1.12.0 -n volcano-system --create-namespace
```



After deployment, the Pod list looks like this:

```bash
root@node5-3:~# kubectl -n volcano-system get po
NAME                                  READY   STATUS    RESTARTS   AGE
volcano-admission-6444dd4fb7-8s8d9    1/1     Running   0          3m
volcano-controllers-75d5b78c7-llcrz   1/1     Running   0          3m
volcano-scheduler-7d46c5b5db-t2k42    1/1     Running   0          3m
```



#### Modify the Scheduler Configuration: Enable the deviceshare Plugin

Once Volcano is deployed, edit the scheduler configuration to enable the deviceshare plugin.

```bash
kubectl edit cm -n volcano-system volcano-scheduler-configmap
```

The full content is as follows:

```yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: deviceshare
        arguments:
          deviceshare.VGPUEnable: true # enable vgpu
          deviceshare.SchedulePolicy: binpack  # scheduling policy. binpack / spread
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
```

The key part is:

```yaml
      - name: deviceshare
        arguments:
          deviceshare.VGPUEnable: true # enable vgpu
          deviceshare.SchedulePolicy: binpack  # scheduling policy. binpack / spread
```

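This enables vGPU and selects binpack as the scheduling policy.

> For details on HAMi scheduling policies, see: [HAMi vGPU Internals Part 4: Implementing the Spread & Binpack Advanced Scheduling Policies](https://www.lixueduan.com/posts/kubernetes/33-hami-analyze-4-scheduler-policy/)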
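
To make the difference between the two `deviceshare.SchedulePolicy` values concrete, here is a minimal Go sketch. It is not Volcano's or HAMi's actual scoring code: the `card` type and `pickCard` function are hypothetical, and the only assumption is the usual meaning of the terms, where binpack prefers the card that is already most heavily used and spread prefers the least used one.

```go
package main

import "fmt"

// card is a hypothetical view of one physical GPU (not a Volcano type).
type card struct {
	ID      string
	UsedMem int // memory already allocated on this card
	Total   int // total memory of this card
}

// pickCard returns the index of the card chosen for a request of reqMem,
// or -1 if no card has enough free memory.
// "binpack" prefers the fullest card that still fits; any other policy
// value is treated as "spread" and prefers the emptiest card.
func pickCard(cards []card, reqMem int, policy string) int {
	best := -1
	for i, c := range cards {
		if c.Total-c.UsedMem < reqMem {
			continue // not enough free memory on this card
		}
		if best == -1 {
			best = i
			continue
		}
		if policy == "binpack" && c.UsedMem > cards[best].UsedMem {
			best = i
		}
		if policy != "binpack" && c.UsedMem < cards[best].UsedMem {
			best = i
		}
	}
	return best
}

func main() {
	cards := []card{
		{ID: "gpu-0", UsedMem: 3000, Total: 4914},
		{ID: "gpu-1", UsedMem: 0, Total: 4914},
	}
	fmt.Println("binpack ->", cards[pickCard(cards, 1024, "binpack")].ID)
	fmt.Println("spread  ->", cards[pickCard(cards, 1024, "spread")].ID)
}
```

With these sample values, binpack keeps filling `gpu-0` and leaves `gpu-1` free for larger requests, while spread places the new workload on the idle card.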



No restart is needed after the change: Volcano watches the configuration file and reloads it automatically when it changes.

```go
if opt.SchedulerConf != "" {
    var err error
    path := filepath.Dir(opt.SchedulerConf)
    watcher, err = filewatcher.NewFileWatcher(path)
    if err != nil {
       return nil, fmt.Errorf("failed creating filewatcher for %s: %v", opt.SchedulerConf, err)
    }
}

func (pc *Scheduler) watchSchedulerConf(stopCh <-chan struct{}) {
    if pc.fileWatcher == nil {
       return
    }
    eventCh := pc.fileWatcher.Events()
    errCh := pc.fileWatcher.Errors()
    for {
       select {
       case event, ok := <-eventCh:
          if !ok {
             return
          }
          klog.V(4).Infof("watch %s event: %v", pc.schedulerConf, event)
          if event.Op&fsnotify.Write == fsnotify.Write || event.Op&fsnotify.Create == fsnotify.Create {
             pc.loadSchedulerConf()
             pc.cache.SetMetricsConf(pc.metricsConf)
          }
       case err, ok := <-errCh:
          if !ok {
             return
          }
          klog.Infof("watch %s error: %v", pc.schedulerConf, err)
       case <-stopCh:
          return
       }
    }
}
```
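
The reload condition in `watchSchedulerConf` is a bitmask test on the event's operation. The sketch below isolates that check using a locally defined `Op` type that only imitates the style of fsnotify's flags. The constant values and the `shouldReload` helper are assumptions for illustration, not fsnotify's or Volcano's API.

```go
package main

import "fmt"

// Op is a hypothetical bitmask of file operations, modeled on the style of
// fsnotify.Op. The values here are illustrative only.
type Op uint32

const (
	Create Op = 1 << iota
	Write
	Remove
	Rename
	Chmod
)

// shouldReload mirrors the condition in watchSchedulerConf: reload when the
// event carries the Write bit or the Create bit.
func shouldReload(op Op) bool {
	return op&Write == Write || op&Create == Create
}

func main() {
	fmt.Println(shouldReload(Write))         // true
	fmt.Println(shouldReload(Create | Chmod)) // true
	fmt.Println(shouldReload(Chmod))         // false
	fmt.Println(shouldReload(Remove | Rename)) // false
}
```

One plausible reason for reacting to Create as well as Write: kubelet updates a mounted ConfigMap by swapping a symlink rather than writing the file in place, so the change may surface as a create event.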

However, Kubernetes syncs a ConfigMap into the Pod with some delay. If you don't want to wait, you can restart the scheduler manually.

```bash
kubectl -n volcano-system rollout restart deploy volcano-scheduler
```



### 1.4 volcano-vgpu-device-plugin

Next, deploy the DevicePlugin used for the Volcano integration: `volcano-vgpu-device-plugin`.

For how a DevicePlugin works, see [HAMi vGPU Internals Part 1: The hami-device-plugin-nvidia Implementation](https://www.lixueduan.com/posts/kubernetes/29-hami-analyze-1-device-plugin-nvidia/). The overall logic is the same.



#### Deploy the DevicePlugin

Get the file `volcano-vgpu-device-plugin.yml` from the root of the [volcano-vgpu-device-plugin](https://github.com/Project-HAMi/volcano-vgpu-device-plugin) project:

```bash
wget https://raw.githubusercontent.com/Project-HAMi/volcano-vgpu-device-plugin/main/volcano-vgpu-device-plugin.yml
```

Deploy it:

```bash
kubectl apply -f volcano-vgpu-device-plugin.yml
```



Check the Pod list:

```bash
root@node5-3:~# kubectl -n kube-system get po
volcano-device-plugin-xkwzd                2/2     Running   0          10m
```



#### Verify Node Resources

Check the resource information on the Node:

```bash
root@node5-3:~/lixd# k describe node node5-3 |grep Cap -A 10
Capacity:
  cpu:                     160
  ephemeral-storage:       3750157048Ki
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  2113442544Ki
  nvidia.com/gpu:          0
  pods:                    110
  volcano.sh/vgpu-cores:   800
  volcano.sh/vgpu-memory:  39312
  volcano.sh/vgpu-number:  80
```

Volcano adds the following three resources:

```bash
  volcano.sh/vgpu-cores:   800
  volcano.sh/vgpu-memory:  39312
  volcano.sh/vgpu-number:  80
```

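* `volcano.sh/vgpu-cores:   800`: each GPU counts as 100 cores, so 8 cards give exactly 800 cores.

* `volcano.sh/vgpu-memory:  39312`: because factor=10 is configured, this represents a total of 39312 \* 10 = 393120 MB of GPU memory.

  * The environment is 8 × L40S with 49140 MB per card, and 49140 \* 8 = 393120, which matches. Everything is working as expected.

* `volcano.sh/vgpu-number:  80`: the default `--device-split-count=10` multiplies the GPU count by 10.

This shows the plugin was deployed successfully.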
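
The arithmetic behind the three values can be reproduced in a few lines of Go. This is only a worked example of the numbers above: the `capacity` helper and the `nodeCapacity` type are hypothetical and are not taken from the plugin's source.

```go
package main

import "fmt"

// nodeCapacity holds the three extended resources reported on the Node.
type nodeCapacity struct {
	Cores  int // volcano.sh/vgpu-cores
	Memory int // volcano.sh/vgpu-memory, in units of factor MB
	Number int // volcano.sh/vgpu-number
}

// capacity is a hypothetical helper that reproduces the arithmetic above.
func capacity(gpus, memPerGPUMB, factor, splitCount int) nodeCapacity {
	return nodeCapacity{
		Cores:  gpus * 100,                  // 100 cores per physical GPU
		Memory: gpus * memPerGPUMB / factor, // memory scaled down by factor
		Number: gpus * splitCount,           // each GPU split into splitCount vGPUs
	}
}

func main() {
	// 8 x L40S, 49140 MB each, factor=10, --device-split-count=10
	c := capacity(8, 49140, 10, 10)
	fmt.Printf("vgpu-cores=%d vgpu-memory=%d vgpu-number=%d\n", c.Cores, c.Memory, c.Number)
	fmt.Printf("total memory represented: %d MB\n", c.Memory*10)
}
```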



## 2. Basic Usage

### Start a Pod

First, start a simple Pod:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test1
spec:
  restartPolicy: OnFailure
  schedulerName: volcano
  containers:
  - image: ubuntu:24.04
    name: pod1-ctr
    command: ["sleep"]
    args: ["100000"]
    resources:
      limits:
        volcano.sh/vgpu-memory: 1024
        volcano.sh/vgpu-number: 1
```

Check the result. The Pod requested 1024 for vgpu-memory:

> Because factor=10, the actual limit is 10240 MB.

```bash
root@node5-3:~/lixd# k exec -it test1 -- nvidia-smi
[HAMI-core Msg(16:140249737447232:libvgpu.c:838)]: Initializing.....
Tue Jul 22 13:52:58 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L40S                    On  | 00000000:91:00.0 Off |                  Off |
| N/A   28C    P8              34W / 350W |      0MiB / 10240MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
[HAMI-core Msg(16:140249737447232:multiprocess_memory_limit.c:499)]: Calling exit handler 16
```



### Start a Volcano Job

Try starting a simple Volcano Job:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: simple-vgpu-training
spec:
  schedulerName: volcano
  minAvailable: 3  # Gang scheduling: ensure all 3 Pods start together
  
  tasks:
    - name: worker
      replicas: 3  # start 3 workers
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: python-trainer
            image: python:3.9-slim
            command: ["python", "-c"]
            args:
              - |
                # Simplified training code
                import os
                import time
                
                worker_id = os.getenv("VC_TASK_INDEX", "0")
                print(f"Worker {worker_id} started with vGPU")
                
                # Simulate the training loop
                for epoch in range(1, 10):
                    time.sleep(6)
                    print(f"Worker {worker_id} completed epoch {epoch}")
                
                print(f"Worker {worker_id} finished training!")
            env:
              # Task index (0, 1, ...)
              - name: VC_TASK_INDEX
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.annotations['volcano.sh/task-index']
            resources:
              limits:
                volcano.sh/vgpu-memory: 1024  # 1024 units of vGPU memory per worker (10240 MB with factor=10)
                volcano.sh/vgpu-number: 1     # 1 vGPU per worker
                cpu: "1"
                memory: "1Gi"
```

Everything works as expected:

```bash
root@node5-3:~/lixd# k get po -w
NAME                            READY   STATUS      RESTARTS   AGE
simple-vgpu-training-worker-0   1/1     Running     0          5s
simple-vgpu-training-worker-1   1/1     Running     0          5s
simple-vgpu-training-worker-2   1/1     Running     0          5s
```

Check the GPU information inside the Pod:

```bash
root@node5-3:~/lixd# k exec -it simple-vgpu-training-worker-0 -- nvidia-smi
[HAMI-core Msg(7:140498435086144:libvgpu.c:838)]: Initializing.....
Tue Jul 22 15:02:36 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L40S                    On  | 00000000:91:00.0 Off |                  Off |
| N/A   28C    P8              34W / 350W |      0MiB / 10240MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
[HAMI-core Msg(7:140498435086144:multiprocess_memory_limit.c:499)]: Calling exit handler 7
```



Logs:

```bash
root@node5-3:~/lixd# k logs -f simple-vgpu-training-worker-2
Worker 2 started with vGPU
Worker 2 completed epoch 1
Worker 2 completed epoch 2
Worker 2 completed epoch 3
Worker 2 completed epoch 4
Worker 2 completed epoch 5
Worker 2 completed epoch 6
Worker 2 completed epoch 7
Worker 2 completed epoch 8
Worker 2 completed epoch 9
Worker 2 finished training!
```



## 3. Monitoring

### Scheduler Metrics

```bash
curl {volcano scheduler cluster ip}:8080/metrics
```

The output includes GPU core and memory allocation, along with the corresponding Pod information, for example:

```bash
# HELP volcano_vgpu_device_allocated_cores The percentage of gpu compute cores allocated in this card
# TYPE volcano_vgpu_device_allocated_cores gauge
volcano_vgpu_device_allocated_cores{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad"} 50

# HELP volcano_vgpu_device_allocated_memory The number of vgpu memory allocated in this card
# TYPE volcano_vgpu_device_allocated_memory gauge
volcano_vgpu_device_allocated_memory{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad"} 2048

# HELP volcano_vgpu_device_core_allocation_for_a_certain_pod The vgpu device core allocated for a certain pod
# TYPE volcano_vgpu_device_core_allocation_for_a_certain_pod gauge
volcano_vgpu_device_core_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-0"} 10
volcano_vgpu_device_core_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-1"} 10
volcano_vgpu_device_core_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-2"} 10
volcano_vgpu_device_core_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-3"} 10
volcano_vgpu_device_core_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-4"} 10

# HELP volcano_vgpu_device_memory_allocation_for_a_certain_pod The vgpu device memory allocated for a certain pod
# TYPE volcano_vgpu_device_memory_allocation_for_a_certain_pod gauge
volcano_vgpu_device_memory_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-0"} 128
volcano_vgpu_device_memory_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-1"} 128
volcano_vgpu_device_memory_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-2"} 128
volcano_vgpu_device_memory_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-3"} 128
volcano_vgpu_device_memory_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-4"} 128

# HELP volcano_vgpu_device_memory_limit The number of total device memory in this card
# TYPE volcano_vgpu_device_memory_limit gauge
volcano_vgpu_device_memory_limit{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad"} 4914

# HELP volcano_vgpu_device_shared_number The number of vgpu tasks sharing this card
# TYPE volcano_vgpu_device_shared_number gauge
volcano_vgpu_device_shared_number{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad"} 5
```
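
If you want to consume these values from a script rather than through Prometheus, a single sample line is easy to parse. The following Go sketch uses only the standard library. The `sample` type and `parseSample` function are hypothetical helpers, and the parser is deliberately simplified: it assumes label values contain no commas, quotes, or braces, which holds for the output above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sample is one parsed line of Prometheus text exposition output.
type sample struct {
	Name   string
	Labels map[string]string
	Value  float64
}

// parseSample parses a line of the form name{k="v",k2="v2"} value.
func parseSample(line string) (sample, error) {
	open := strings.Index(line, "{")
	end := strings.LastIndex(line, "}")
	if open < 0 || end < open {
		return sample{}, fmt.Errorf("no label set in %q", line)
	}
	s := sample{Name: line[:open], Labels: map[string]string{}}
	for _, pair := range strings.Split(line[open+1:end], ",") {
		k, v, ok := strings.Cut(pair, "=")
		if !ok {
			continue // skip empty or malformed pairs
		}
		s.Labels[k] = strings.Trim(v, `"`)
	}
	val, err := strconv.ParseFloat(strings.TrimSpace(line[end+1:]), 64)
	if err != nil {
		return sample{}, err
	}
	s.Value = val
	return s, nil
}

func main() {
	line := `volcano_vgpu_device_memory_limit{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad"} 4914`
	s, err := parseSample(line)
	if err != nil {
		panic(err)
	}
	fmt.Println(s.Name, s.Labels["NodeName"], s.Labels["devID"], s.Value)
}
```

The value 4914 for a 49140 MB card suggests that this metric is reported in the same factor=10 units as the Node capacity, so multiply by the factor to get MB.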



### Device Plugin Metrics

Access port 9394 of the DevicePlugin Pod directly to get metrics:

```bash
curl http://<plugin-pod-ip>:9394/metrics
```

This shows GPU usage on the node. The metrics look like this:

```bash
# HELP Device_last_kernel_of_container Container device last kernel description
# TYPE Device_last_kernel_of_container gauge
Device_last_kernel_of_container{ctrname="pod1-ctr",deviceuuid="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podname="test1",podnamespace="default",vdeviceid="0",zone="vGPU"} 257114
# HELP Device_memory_desc_of_container Container device meory description
# TYPE Device_memory_desc_of_container counter
Device_memory_desc_of_container{context="0",ctrname="pod1-ctr",data="0",deviceuuid="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",module="0",offset="0",podname="test1",podnamespace="default",vdeviceid="0",zone="vGPU"} 0
# HELP Device_utilization_desc_of_container Container device utilization description
# TYPE Device_utilization_desc_of_container gauge
Device_utilization_desc_of_container{ctrname="pod1-ctr",deviceuuid="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podname="test1",podnamespace="default",vdeviceid="0",zone="vGPU"} 0
# HELP HostCoreUtilization GPU core utilization
# TYPE HostCoreUtilization gauge
HostCoreUtilization{deviceidx="0",deviceuuid="GPU-a11fe6d9-3dbe-8a24-34e9-535b2629babd",zone="vGPU"} 0
HostCoreUtilization{deviceidx="1",deviceuuid="GPU-b82090de-5250-44e2-a5ed-b0efc5763f8f",zone="vGPU"} 0
HostCoreUtilization{deviceidx="2",deviceuuid="GPU-8f563a66-d507-583f-59f1-46c2e97a393c",zone="vGPU"} 0
HostCoreUtilization{deviceidx="3",deviceuuid="GPU-1e5a0632-4332-f4d0-adf2-80ebfed56684",zone="vGPU"} 0
HostCoreUtilization{deviceidx="4",deviceuuid="GPU-384027fd-54f2-638b-cdfe-0d5f3b6630f5",zone="vGPU"} 0
HostCoreUtilization{deviceidx="5",deviceuuid="GPU-dbb95093-0147-7b3a-f468-8a3575a8dd4e",zone="vGPU"} 0
HostCoreUtilization{deviceidx="6",deviceuuid="GPU-f3eb6e71-e90a-bfc9-de06-dff90c3093b9",zone="vGPU"} 0
HostCoreUtilization{deviceidx="7",deviceuuid="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",zone="vGPU"} 0
# HELP HostGPUMemoryUsage GPU device memory usage
# TYPE HostGPUMemoryUsage gauge
HostGPUMemoryUsage{deviceidx="0",deviceuuid="GPU-a11fe6d9-3dbe-8a24-34e9-535b2629babd",zone="vGPU"} 5.14326528e+08
HostGPUMemoryUsage{deviceidx="1",deviceuuid="GPU-b82090de-5250-44e2-a5ed-b0efc5763f8f",zone="vGPU"} 5.14326528e+08
HostGPUMemoryUsage{deviceidx="2",deviceuuid="GPU-8f563a66-d507-583f-59f1-46c2e97a393c",zone="vGPU"} 5.14326528e+08
HostGPUMemoryUsage{deviceidx="3",deviceuuid="GPU-1e5a0632-4332-f4d0-adf2-80ebfed56684",zone="vGPU"} 5.14326528e+08
HostGPUMemoryUsage{deviceidx="4",deviceuuid="GPU-384027fd-54f2-638b-cdfe-0d5f3b6630f5",zone="vGPU"} 5.14326528e+08
HostGPUMemoryUsage{deviceidx="5",deviceuuid="GPU-dbb95093-0147-7b3a-f468-8a3575a8dd4e",zone="vGPU"} 5.14326528e+08
HostGPUMemoryUsage{deviceidx="6",deviceuuid="GPU-f3eb6e71-e90a-bfc9-de06-dff90c3093b9",zone="vGPU"} 5.14326528e+08
HostGPUMemoryUsage{deviceidx="7",deviceuuid="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",zone="vGPU"} 5.14326528e+08
# HELP vGPU_device_memory_limit_in_bytes vGPU device limit
# TYPE vGPU_device_memory_limit_in_bytes gauge
vGPU_device_memory_limit_in_bytes{ctrname="pod1-ctr",deviceuuid="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podname="test1",podnamespace="default",vdeviceid="0",zone="vGPU"} 1.073741824e+10
# HELP vGPU_device_memory_usage_in_bytes vGPU device usage
# TYPE vGPU_device_memory_usage_in_bytes gauge
vGPU_device_memory_usage_in_bytes{ctrname="pod1-ctr",deviceuuid="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podname="test1",podnamespace="default",vdeviceid="0",zone="vGPU"} 0
```



## 4. Source Code Analysis

### 4.1 DevicePlugin

> https://github.com/Project-HAMi/volcano-vgpu-device-plugin



#### Resource Registration

***A DevicePlugin normally registers only one resource, yet the Volcano DevicePlugin registers three. How does it do that?***

```bash
root@node5-3:~/lixd# k describe node node5-3 |grep Cap -A 10
Capacity:
  cpu:                     160
  ephemeral-storage:       3750157048Ki
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  2113442544Ki
  nvidia.com/gpu:          0
  pods:                    110
  volcano.sh/vgpu-cores:   800
  volcano.sh/vgpu-memory:  39312
  volcano.sh/vgpu-number:  80
```

In fact, the Volcano DevicePlugin starts three DevicePlugins internally, each with its own ResourceName:

* volcano.sh/vgpu-number

* volcano.sh/vgpu-memory

* volcano.sh/vgpu-cores



```go
func (s *migStrategyNone) GetPlugins(cfg *config.NvidiaConfig, cache *DeviceCache) []*NvidiaDevicePlugin {
    return []*NvidiaDevicePlugin{
       NewNvidiaDevicePlugin(
          //"nvidia.com/gpu",
          util.ResourceName,
          cache,
          gpuallocator.NewBestEffortPolicy(),
          pluginapi.DevicePluginPath+"nvidia-gpu.sock",
          cfg),
       NewNvidiaDevicePlugin(
          util.ResourceMem,
          cache,
          gpuallocator.NewBestEffortPolicy(),
          pluginapi.DevicePluginPath+"nvidia-gpu-memory.sock",
          cfg),
       NewNvidiaDevicePlugin(
          util.ResourceCores,
          cache,
          gpuallocator.NewBestEffortPolicy(),
          pluginapi.DevicePluginPath+"nvidia-gpu-cores.sock",
          cfg),
    }
}
```

The corresponding sock files are:

```bash
root@node5-3:/var/lib/kubelet/device-plugins# ls /var/lib/kubelet/device-plugins
DEPRECATION  kubelet.sock  kubelet_internal_checkpoint  nvidia-gpu-cores.sock  nvidia-gpu-memory.sock  nvidia-gpu.sock
```



When listing devices, the plugin also behaves differently depending on the ResourceName:

```go
func (m *NvidiaDevicePlugin) apiDevices() []*pluginapi.Device {
    if strings.Compare(m.migStrategy, "mixed") == 0 {
       var pdevs []*pluginapi.Device
       for _, d := range m.cachedDevices {
          pdevs = append(pdevs, &d.Device)
       }
       return pdevs
    }
    devices := m.Devices()
    var res []*pluginapi.Device

    if strings.Compare(m.resourceName, util.ResourceMem) == 0 {
       for _, dev := range devices {
          i := 0
          klog.Infoln("memory=", dev.Memory, "id=", dev.ID)
          for i < int(32767) {
             res = append(res, &pluginapi.Device{
                ID:       fmt.Sprintf("%v-memory-%v", dev.ID, i),
                Health:   dev.Health,
                Topology: nil,
             })
             i++
          }
       }
       klog.Infoln("res length=", len(res))
       return res
    }
    if strings.Compare(m.resourceName, util.ResourceCores) == 0 {
       for _, dev := range devices {
          i := 0
          for i < 100 {
             res = append(res, &pluginapi.Device{
                ID:       fmt.Sprintf("%v-core-%v", dev.ID, i),
                Health:   dev.Health,
                Topology: nil,
             })
             i++
          }
       }
       return res
    }

    for _, dev := range devices {
       for i := uint(0); i < config.DeviceSplitCount; i++ {
          id := fmt.Sprintf("%v-%v", dev.ID, i)
          res = append(res, &pluginapi.Device{
             ID:       id,
             Health:   dev.Health,
             Topology: nil,
          })
       }
    }
    return res
}
```
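
The pattern in `apiDevices` is that each physical GPU is expanded into many device entries whose IDs carry the GPU ID, an optional infix, and an index. The Go sketch below reproduces only that ID scheme. `virtualIDs` is a hypothetical helper, and the counts are passed in as parameters rather than taken from the plugin.

```go
package main

import "fmt"

// virtualIDs expands one physical GPU into count virtual device IDs.
// infix is inserted between the GPU ID and the index ("memory", "core"),
// or left empty for the vgpu-number resource.
func virtualIDs(gpuID, infix string, count int) []string {
	ids := make([]string, 0, count)
	for i := 0; i < count; i++ {
		if infix == "" {
			ids = append(ids, fmt.Sprintf("%v-%v", gpuID, i))
		} else {
			ids = append(ids, fmt.Sprintf("%v-%v-%v", gpuID, infix, i))
		}
	}
	return ids
}

func main() {
	gpu := "GPU-542efc47" // shortened, illustrative UUID
	cores := virtualIDs(gpu, "core", 100)
	number := virtualIDs(gpu, "", 10) // --device-split-count=10
	fmt.Println(len(cores), cores[0], cores[99])
	fmt.Println(len(number), number[0], number[9])
}
```

With 100 core entries and 10 split entries per card, 8 cards add up to the 800 `vgpu-cores` and 80 `vgpu-number` seen on the Node.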



#### Allocate

The device allocation logic:

```go
// Allocate which return list of devices.
func (m *NvidiaDevicePlugin) Allocate(ctx context.Context, reqs *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
    if len(reqs.ContainerRequests) > 1 {
       return &pluginapi.AllocateResponse{}, errors.New("multiple Container Requests not supported")
    }
    if strings.Compare(m.migStrategy, "mixed") == 0 {
       return m.MIGAllocate(ctx, reqs)
    }
    responses := pluginapi.AllocateResponse{}

    if strings.Compare(m.resourceName, util.ResourceMem) == 0 || strings.Compare(m.resourceName, util.ResourceCores) == 0 {
       for range reqs.ContainerRequests {
          responses.ContainerResponses = append(responses.ContainerResponses, &pluginapi.ContainerAllocateResponse{})
       }
       return &responses, nil
    }
    nodename := os.Getenv("NODE_NAME")

    current, err := util.GetPendingPod(nodename)
    if err != nil {
       lock.ReleaseNodeLock(nodename, util.VGPUDeviceName)
       return &pluginapi.AllocateResponse{}, err
    }
    if current == nil {
       klog.Errorf("no pending pod found on node %s", nodename)
       lock.ReleaseNodeLock(nodename, util.VGPUDeviceName)
       return &pluginapi.AllocateResponse{}, errors.New("no pending pod found on node")
    }

    for idx := range reqs.ContainerRequests {
       currentCtr, devreq, err := util.GetNextDeviceRequest(util.NvidiaGPUDevice, *current)
       klog.Infoln("deviceAllocateFromAnnotation=", devreq)
       if err != nil {
          klog.Errorln("get device from annotation failed", err.Error())
          util.PodAllocationFailed(nodename, current)
          return &pluginapi.AllocateResponse{}, err
       }
       if len(devreq) != len(reqs.ContainerRequests[idx].DevicesIDs) {
          klog.Errorln("device number not matched", devreq, reqs.ContainerRequests[idx].DevicesIDs)
          util.PodAllocationFailed(nodename, current)
          return &pluginapi.AllocateResponse{}, errors.New("device number not matched")
       }

       response := pluginapi.ContainerAllocateResponse{}
       response.Envs = make(map[string]string)
       response.Envs["NVIDIA_VISIBLE_DEVICES"] = strings.Join(m.GetContainerDeviceStrArray(devreq), ",")

       err = util.EraseNextDeviceTypeFromAnnotation(util.NvidiaGPUDevice, *current)
       if err != nil {
          klog.Errorln("Erase annotation failed", err.Error())
          util.PodAllocationFailed(nodename, current)
          return &pluginapi.AllocateResponse{}, err
       }

       if m.operatingMode != "mig" {

          for i, dev := range devreq {
             limitKey := fmt.Sprintf("CUDA_DEVICE_MEMORY_LIMIT_%v", i)
             response.Envs[limitKey] = fmt.Sprintf("%vm", dev.Usedmem*int32(config.GPUMemoryFactor))
          }
          response.Envs["CUDA_DEVICE_SM_LIMIT"] = fmt.Sprint(devreq[0].Usedcores)
          response.Envs["CUDA_DEVICE_MEMORY_SHARED_CACHE"] = fmt.Sprintf("/tmp/vgpu/%v.cache", uuid.NewUUID())

          cacheFileHostDirectory := "/tmp/vgpu/containers/" + string(current.UID) + "_" + currentCtr.Name
          os.MkdirAll(cacheFileHostDirectory, 0777)
          os.Chmod(cacheFileHostDirectory, 0777)
          os.MkdirAll("/tmp/vgpulock", 0777)
          os.Chmod("/tmp/vgpulock", 0777)
          hostHookPath := os.Getenv("HOOK_PATH")

          response.Mounts = append(response.Mounts,
             &pluginapi.Mount{ContainerPath: "/usr/local/vgpu/libvgpu.so",
                HostPath: hostHookPath + "/libvgpu.so",
                ReadOnly: true},
             &pluginapi.Mount{ContainerPath: "/tmp/vgpu",
                HostPath: cacheFileHostDirectory,
                ReadOnly: false},
             &pluginapi.Mount{ContainerPath: "/tmp/vgpulock",
                HostPath: "/tmp/vgpulock",
                ReadOnly: false},
          )
          found := false
          for _, val := range currentCtr.Env {
             if strings.Compare(val.Name, "CUDA_DISABLE_CONTROL") == 0 {
                found = true
                break
             }
          }
          if !found {
             response.Mounts = append(response.Mounts, &pluginapi.Mount{ContainerPath: "/etc/ld.so.preload",
                HostPath: hostHookPath + "/ld.so.preload",
                ReadOnly: true},
             )
          }
       }
       responses.ContainerResponses = append(responses.ContainerResponses, &response)
    }
    klog.Infoln("Allocate Response", responses.ContainerResponses)
    util.PodAllocationTrySuccess(nodename, current)
    return &responses, nil
}
```



The core part:

It sets the environment variables and mounts libvgpu.so, among other things.

```go
for i, dev := range devreq {
    limitKey := fmt.Sprintf("CUDA_DEVICE_MEMORY_LIMIT_%v", i)
    response.Envs[limitKey] = fmt.Sprintf("%vm", dev.Usedmem*int32(config.GPUMemoryFactor))
}
response.Envs["CUDA_DEVICE_SM_LIMIT"] = fmt.Sprint(devreq[0].Usedcores)
response.Envs["CUDA_DEVICE_MEMORY_SHARED_CACHE"] = fmt.Sprintf("/tmp/vgpu/%v.cache", uuid.NewUUID())

response.Mounts = append(response.Mounts,
    &pluginapi.Mount{ContainerPath: "/usr/local/vgpu/libvgpu.so",
       HostPath: hostHookPath + "/libvgpu.so",
       ReadOnly: true},
    &pluginapi.Mount{ContainerPath: "/tmp/vgpu",
       HostPath: cacheFileHostDirectory,
       ReadOnly: false},
    &pluginapi.Mount{ContainerPath: "/tmp/vgpulock",
       HostPath: "/tmp/vgpulock",
       ReadOnly: false},
)
```
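
To see how a request of `volcano.sh/vgpu-memory: 1024` ends up as the 10240 MiB limit reported by `nvidia-smi`, the environment-variable part can be isolated into a small runnable Go sketch. `deviceRequest` and `limitEnvs` are hypothetical stand-ins for the plugin's own types, and the `Usedcores` value in `main` is made up for illustration.

```go
package main

import "fmt"

// deviceRequest is a hypothetical stand-in for one entry of devreq.
type deviceRequest struct {
	UUID      string
	Usedmem   int32 // requested vgpu-memory, in factor units
	Usedcores int32 // requested vgpu-cores
}

// limitEnvs builds the limit-related environment variables the same way the
// excerpt above does, given the memory factor.
func limitEnvs(devreq []deviceRequest, factor int32) map[string]string {
	envs := make(map[string]string)
	for i, dev := range devreq {
		key := fmt.Sprintf("CUDA_DEVICE_MEMORY_LIMIT_%v", i)
		envs[key] = fmt.Sprintf("%vm", dev.Usedmem*factor)
	}
	if len(devreq) > 0 {
		// The SM limit is taken from the first device only.
		envs["CUDA_DEVICE_SM_LIMIT"] = fmt.Sprint(devreq[0].Usedcores)
	}
	return envs
}

func main() {
	req := []deviceRequest{{UUID: "GPU-x", Usedmem: 1024, Usedcores: 50}}
	envs := limitEnvs(req, 10)
	fmt.Println(envs["CUDA_DEVICE_MEMORY_LIMIT_0"]) // 10240m
	fmt.Println(envs["CUDA_DEVICE_SM_LIMIT"])       // 50
}
```

These variables are what the mounted `libvgpu.so` reads inside the container to enforce the limits.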

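Also, because three DevicePlugins are started, Allocate checks the ResourceName to avoid running the allocation more than once: only the `volcano.sh/vgpu-number` plugin actually executes the allocation logic.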

```go
if strings.Compare(m.resourceName, util.ResourceMem) == 0 || strings.Compare(m.resourceName, util.ResourceCores) == 0 {
    for range reqs.ContainerRequests {
       responses.ContainerResponses = append(responses.ContainerResponses, &pluginapi.ContainerAllocateResponse{})
    }
    return &responses, nil
}
```
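
The effect of this short-circuit can be shown with a reduced Go model. The `response` type and `allocate` function are hypothetical, and the placeholder value for `NVIDIA_VISIBLE_DEVICES` is illustrative. Only the resource names come from the Node output earlier in this article.

```go
package main

import "fmt"

const (
	resourceNumber = "volcano.sh/vgpu-number"
	resourceMem    = "volcano.sh/vgpu-memory"
	resourceCores  = "volcano.sh/vgpu-cores"
)

// response is a hypothetical, reduced container allocate response.
type response struct {
	Envs map[string]string
}

// allocate returns one response per container. Only the vgpu-number plugin
// fills it in; the memory and cores plugins return empty responses.
func allocate(resourceName string, containers int) []response {
	out := make([]response, 0, containers)
	for i := 0; i < containers; i++ {
		if resourceName == resourceMem || resourceName == resourceCores {
			out = append(out, response{})
			continue
		}
		out = append(out, response{
			Envs: map[string]string{"NVIDIA_VISIBLE_DEVICES": "GPU-x"}, // placeholder
		})
	}
	return out
}

func main() {
	for _, name := range []string{resourceNumber, resourceMem, resourceCores} {
		resp := allocate(name, 1)
		fmt.Printf("%-24s envs=%d\n", name, len(resp[0].Envs))
	}
}
```

kubelet still receives a response per container from all three plugins, but only one of them contributes environment variables and mounts.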



### 4.2 deviceshare Plugin Analysis

> https://github.com/volcano-sh/volcano/blob/master/pkg/scheduler/plugins/deviceshare/deviceshare.go

A brief look at the deviceshare plugin in Volcano.



This part is essentially the same as HAMi's implementation. See these two articles:

* [HAMi vGPU Internals Part 3: The hami-scheduler Workflow](https://www.lixueduan.com/posts/kubernetes/32-hami-analyze-3-scheduler/)

* [HAMi vGPU Internals Part 4: Implementing the Spread & Binpack Advanced Scheduling Policies](https://www.lixueduan.com/posts/kubernetes/33-hami-analyze-4-scheduler-policy/)



Every plugin must implement the Plugin interface defined by Volcano:

```go
type Plugin interface {
    // The unique name of Plugin.
    Name() string

    OnSessionOpen(ssn *Session)
    OnSessionClose(ssn *Session)
}
```
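
As a rough illustration of what implementing this interface involves, here is a self-contained Go sketch. The `Session` type is a heavily reduced, hypothetical stand-in for Volcano's `framework.Session`, and `gpuOnlyPlugin` is an invented example plugin. Only the three method names come from the interface above.

```go
package main

import "fmt"

// Session is a hypothetical, reduced stand-in for framework.Session.
type Session struct {
	predicates map[string]func(node string) error
}

// AddPredicateFn registers a predicate callback under the plugin's name.
func (s *Session) AddPredicateFn(name string, fn func(node string) error) {
	if s.predicates == nil {
		s.predicates = map[string]func(node string) error{}
	}
	s.predicates[name] = fn
}

// Plugin repeats the interface shown above.
type Plugin interface {
	Name() string
	OnSessionOpen(ssn *Session)
	OnSessionClose(ssn *Session)
}

// gpuOnlyPlugin is an illustrative plugin that accepts only nodes in gpuNodes.
type gpuOnlyPlugin struct {
	gpuNodes map[string]bool
}

func (p *gpuOnlyPlugin) Name() string { return "gpu-only" }

// OnSessionOpen registers the plugin's callbacks for this scheduling session.
func (p *gpuOnlyPlugin) OnSessionOpen(ssn *Session) {
	ssn.AddPredicateFn(p.Name(), func(node string) error {
		if !p.gpuNodes[node] {
			return fmt.Errorf("node %s has no GPU", node)
		}
		return nil
	})
}

// OnSessionClose has nothing to clean up in this sketch.
func (p *gpuOnlyPlugin) OnSessionClose(ssn *Session) {}

func main() {
	var p Plugin = &gpuOnlyPlugin{gpuNodes: map[string]bool{"node5-3": true}}
	ssn := &Session{}
	p.OnSessionOpen(ssn)
	fmt.Println(ssn.predicates[p.Name()]("node5-3"))
	fmt.Println(ssn.predicates[p.Name()]("node-cpu"))
	p.OnSessionClose(ssn)
}
```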



The core code lives in the OnSessionOpen implementation, which registers the two scheduling callbacks:

* Predicate

* NodeOrder

```go
func (dp *deviceSharePlugin) OnSessionOpen(ssn *framework.Session) {
    // Register event handlers to update task info in PodLister & nodeMap
    ssn.AddPredicateFn(dp.Name(), func(task *api.TaskInfo, node *api.NodeInfo) error {
       predicateStatus := make([]*api.Status, 0)
       // Check PredicateWithCache
       for _, val := range api.RegisteredDevices {
          if dev, ok := node.Others[val].(api.Devices); ok {
             if reflect.ValueOf(dev).IsNil() {
                // TODO When a pod requests a device of the current type, but the current node does not have such a device, an error is thrown
                if dev == nil || dev.HasDeviceRequest(task.Pod) {
                   predicateStatus = append(predicateStatus, &api.Status{
                      Code:   devices.Unschedulable,
                      Reason: "node not initialized with device" + val,
                      Plugin: PluginName,
                   })
                   return api.NewFitErrWithStatus(task, node, predicateStatus...)
                }
                klog.V(4).Infof("pod %s/%s did not request device %s on %s, skipping it", task.Pod.Namespace, task.Pod.Name, val, node.Name)
                continue
             }
             code, msg, err := dev.FilterNode(task.Pod, dp.schedulePolicy)
             if err != nil {
                predicateStatus = append(predicateStatus, createStatus(code, msg))
                return api.NewFitErrWithStatus(task, node, predicateStatus...)
             }
             filterNodeStatus := createStatus(code, msg)
             if filterNodeStatus.Code != api.Success {
                predicateStatus = append(predicateStatus, filterNodeStatus)
                return api.NewFitErrWithStatus(task, node, predicateStatus...)
             }
          } else {
             klog.Warningf("Devices %s assertion conversion failed, skip", val)
          }
       }

       klog.V(4).Infof("checkDevices predicates Task <%s/%s> on Node <%s>: fit ",
          task.Namespace, task.Name, node.Name)

       return nil
    })

    ssn.AddNodeOrderFn(dp.Name(), func(task *api.TaskInfo, node *api.NodeInfo) (float64, error) {
       // DeviceScore
       nodeScore := float64(0)
       if dp.scheduleWeight > 0 {
          score, status := getDeviceScore(context.TODO(), task.Pod, node, dp.schedulePolicy)
          if !status.IsSuccess() {
             klog.Warningf("Node: %s, Calculate Device Score Failed because of Error: %v", node.Name, status.AsError())
             return 0, status.AsError()
          }

          // TODO: we should use a seperate plugin for devices, and seperate them from predicates and nodeOrder plugin.
          nodeScore = float64(score) * float64(dp.scheduleWeight)
          klog.V(5).Infof("Node: %s, task<%s/%s> Device Score weight %d, score: %f", node.Name, task.Namespace, task.Name, dp.scheduleWeight, nodeScore)
       }
       return nodeScore, nil
    })
}
```

The plugin mainly implements two pieces of scheduling logic: node filtering and node scoring.



#### Predicate

Filter out nodes that cannot satisfy the device request.

```go
ssn.AddPredicateFn(dp.Name(), func(task *api.TaskInfo, node *api.NodeInfo) error {
    predicateStatus := make([]*api.Status, 0)
    // Check PredicateWithCache
    for _, val := range api.RegisteredDevices {
       if dev, ok := node.Others[val].(api.Devices); ok {
          if reflect.ValueOf(dev).IsNil() {
             // TODO When a pod requests a device of the current type, but the current node does not have such a device, an error is thrown
             if dev == nil || dev.HasDeviceRequest(task.Pod) {
                predicateStatus = append(predicateStatus, &api.Status{
                   Code:   devices.Unschedulable,
                   Reason: "node not initialized with device" + val,
                   Plugin: PluginName,
                })
                return api.NewFitErrWithStatus(task, node, predicateStatus...)
             }
             klog.V(4).Infof("pod %s/%s did not request device %s on %s, skipping it", task.Pod.Namespace, task.Pod.Name, val, node.Name)
             continue
          }
          code, msg, err := dev.FilterNode(task.Pod, dp.schedulePolicy)
          if err != nil {
             predicateStatus = append(predicateStatus, createStatus(code, msg))
             return api.NewFitErrWithStatus(task, node, predicateStatus...)
          }
          filterNodeStatus := createStatus(code, msg)
          if filterNodeStatus.Code != api.Success {
             predicateStatus = append(predicateStatus, filterNodeStatus)
             return api.NewFitErrWithStatus(task, node, predicateStatus...)
          }
       } else {
          klog.Warningf("Devices %s assertion conversion failed, skip", val)
       }
    }

    klog.V(4).Infof("checkDevices predicates Task <%s/%s> on Node <%s>: fit ",
       task.Namespace, task.Name, node.Name)

    return nil
})
```



The core logic lives in the `FilterNode` method:

```go
// Call site inside the predicate function:
code, msg, err := dev.FilterNode(task.Pod, dp.schedulePolicy)
if err != nil {
    predicateStatus = append(predicateStatus, createStatus(code, msg))
    return api.NewFitErrWithStatus(task, node, predicateStatus...)
}

// Implementation for the vGPU device type:
func (gs *GPUDevices) FilterNode(pod *v1.Pod, schedulePolicy string) (int, string, error) {
    if VGPUEnable {
       klog.V(4).Infoln("hami-vgpu DeviceSharing starts filtering pods", pod.Name)
       fit, _, score, err := checkNodeGPUSharingPredicateAndScore(pod, gs, true, schedulePolicy)
       if err != nil || !fit {
          klog.ErrorS(err, "Failed to fitler node to vgpu task", "pod", pod.Name)
          return devices.Unschedulable, "hami-vgpuDeviceSharing error", err
       }
       gs.Score = score
       klog.V(4).Infoln("hami-vgpu DeviceSharing successfully filters pods")
    }
    return devices.Success, "", nil
}
```

It filters out nodes that do not meet the requirements and scores the remaining ones.



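##### Node filtering

The check looks at cores, memory, and related limits to decide whether the Node has enough resources; nodes that fall short are filtered out.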

```go
ctrdevs := []ContainerDevices{}
for _, val := range ctrReq {
    devs := []ContainerDevice{}
    if int(val.Nums) > len(gs.Device) {
       return false, []ContainerDevices{}, 0, fmt.Errorf("no enough gpu cards on node %s", gs.Name)
    }
    klog.V(3).InfoS("Allocating device for container", "request", val)

    for i := len(gs.Device) - 1; i >= 0; i-- {
       klog.V(3).InfoS("Scoring pod request", "memReq", val.Memreq, "memPercentageReq", val.MemPercentagereq, "coresReq", val.Coresreq, "Nums", val.Nums, "Index", i, "ID", gs.Device[i].ID)
       klog.V(3).InfoS("Current Device", "Index", i, "TotalMemory", gs.Device[i].Memory, "UsedMemory", gs.Device[i].UsedMem, "UsedCores", gs.Device[i].UsedCore, "replicate", replicate)
       if gs.Device[i].Number <= uint(gs.Device[i].UsedNum) {
          continue
       }
       if val.MemPercentagereq != 101 && val.Memreq == 0 {
          val.Memreq = gs.Device[i].Memory * uint(val.MemPercentagereq/100)
       }
       if int(gs.Device[i].Memory)-int(gs.Device[i].UsedMem) < int(val.Memreq) {
          continue
       }
       if gs.Device[i].UsedCore+val.Coresreq > 100 {
          continue
       }
       // Coresreq=100 indicates it want this card exclusively
       if val.Coresreq == 100 && gs.Device[i].UsedNum > 0 {
          continue
       }
       // You can't allocate core=0 job to an already full GPU
       if gs.Device[i].UsedCore == 100 && val.Coresreq == 0 {
          continue
       }
       if !checkType(pod.Annotations, *gs.Device[i], val) {
          klog.Errorln("failed checktype", gs.Device[i].Type, val.Type)
          continue
       }
       fit, uuid := gs.Sharing.TryAddPod(gs.Device[i], uint(val.Memreq), uint(val.Coresreq))
       if !fit {
          klog.V(3).Info(gs.Device[i].ID, "not fit")
          continue
       }
       //total += gs.Devices[i].Count
       //free += node.Devices[i].Count - node.Devices[i].Used
       if val.Nums > 0 {
          val.Nums--
          klog.V(3).Info("fitted uuid: ", uuid)
          devs = append(devs, ContainerDevice{
             UUID:      uuid,
             Type:      val.Type,
             Usedmem:   val.Memreq,
             Usedcores: val.Coresreq,
          })
          score += GPUScore(schedulePolicy, gs.Device[i])
       }
       if val.Nums == 0 {
          break
       }
    }
    if val.Nums > 0 {
       return false, []ContainerDevices{}, 0, fmt.Errorf("not enough gpu fitted on this node")
    }
    ctrdevs = append(ctrdevs, devs)
}
```
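The per-device checks in the loop above can be condensed into a small standalone sketch. The `gpuDevice` and `gpuRequest` types and the `fits` helper are hypothetical simplifications written for illustration, not Volcano's actual types; the percentage-based memory request and the device type check are left out.

```go
package main

import "fmt"

// gpuDevice and gpuRequest are hypothetical, trimmed-down stand-ins for the
// device and request structures used by the plugin.
type gpuDevice struct {
	Memory   uint // total device memory
	UsedMem  uint // memory already allocated
	UsedCore uint // core percentage already allocated (0-100)
	UsedNum  uint // number of tasks currently sharing the device
	Number   uint // maximum number of tasks that may share the device
}

type gpuRequest struct {
	Memreq   uint // requested memory
	Coresreq uint // requested core percentage (100 means exclusive use)
}

// fits reports whether req can be placed on dev and, if not, why.
func fits(dev gpuDevice, req gpuRequest) (bool, string) {
	switch {
	case dev.UsedNum >= dev.Number:
		return false, "no free sharing slot"
	case dev.UsedMem+req.Memreq > dev.Memory:
		return false, "not enough memory"
	case dev.UsedCore+req.Coresreq > 100:
		return false, "not enough cores"
	case req.Coresreq == 100 && dev.UsedNum > 0:
		return false, "exclusive request on a shared device"
	case dev.UsedCore == 100 && req.Coresreq == 0:
		return false, "device cores fully allocated"
	}
	return true, ""
}

func main() {
	dev := gpuDevice{Memory: 49140, UsedMem: 40000, UsedCore: 50, UsedNum: 1, Number: 10}
	requests := []gpuRequest{
		{Memreq: 4000, Coresreq: 30},  // fits
		{Memreq: 10000, Coresreq: 30}, // rejected: memory
		{Memreq: 4000, Coresreq: 60},  // rejected: cores
	}
	for _, req := range requests {
		ok, reason := fits(dev, req)
		fmt.Printf("request %+v fits=%v %s\n", req, ok, reason)
	}
}
```

A device that fails any single check is skipped, and the node is rejected only when fewer devices fit than the container requested.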



##### Node scoring

Devices are scored according to the configured scheduling policy.

```go
const (
    binpackMultiplier = 100
    spreadMultiplier  = 100
)

func GPUScore(schedulePolicy string, device *GPUDevice) float64 {
    var score float64
    switch schedulePolicy {
    case binpackPolicy:
       score = binpackMultiplier * (float64(device.UsedMem) / float64(device.Memory))
    case spreadPolicy:
       if device.UsedNum == 1 {
          score = spreadMultiplier
       }
    default:
       score = float64(0)
    }
    return score
}
```

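The logic is fairly simple:

* Binpack: the higher the device's memory utilization, the higher the score.

* Spread: a device scores 100 when exactly one task is using it (`UsedNum == 1`), and 0 otherwise.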
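To make the two policies concrete, here is a minimal runnable sketch that reuses the scoring logic with a trimmed-down `GPUDevice` struct. The policy string values `"binpack"` and `"spread"` and the sample memory figures are assumptions made for this example.

```go
package main

import "fmt"

const (
	binpackPolicy     = "binpack" // assumed policy name
	spreadPolicy      = "spread"  // assumed policy name
	binpackMultiplier = 100
	spreadMultiplier  = 100
)

// GPUDevice is a reduced stand-in holding only the fields the score needs.
type GPUDevice struct {
	Memory  uint // total memory
	UsedMem uint // allocated memory
	UsedNum uint // number of tasks on the device
}

// GPUScore follows the same branches as the snippet shown earlier.
func GPUScore(schedulePolicy string, device *GPUDevice) float64 {
	var score float64
	switch schedulePolicy {
	case binpackPolicy:
		score = binpackMultiplier * (float64(device.UsedMem) / float64(device.Memory))
	case spreadPolicy:
		if device.UsedNum == 1 {
			score = spreadMultiplier
		}
	default:
		score = 0
	}
	return score
}

func main() {
	idle := &GPUDevice{Memory: 49140}
	half := &GPUDevice{Memory: 49140, UsedMem: 24570, UsedNum: 1}

	fmt.Println(GPUScore(binpackPolicy, idle)) // 0
	fmt.Println(GPUScore(binpackPolicy, half)) // 50
	fmt.Println(GPUScore(spreadPolicy, idle))  // 0
	fmt.Println(GPUScore(spreadPolicy, half))  // 100
}
```

Under Binpack the half-full card is preferred over the idle one, which packs tasks onto fewer cards.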



#### NodeOrder

The nodes were already scored in the previous step, so this function only has to return that score so the nodes can be ranked.

```go
ssn.AddNodeOrderFn(dp.Name(), func(task *api.TaskInfo, node *api.NodeInfo) (float64, error) {
    // DeviceScore
    nodeScore := float64(0)
    if dp.scheduleWeight > 0 {
       score, status := getDeviceScore(context.TODO(), task.Pod, node, dp.schedulePolicy)
       if !status.IsSuccess() {
          klog.Warningf("Node: %s, Calculate Device Score Failed because of Error: %v", node.Name, status.AsError())
          return 0, status.AsError()
       }

       // TODO: we should use a seperate plugin for devices, and seperate them from predicates and nodeOrder plugin.
       nodeScore = float64(score) * float64(dp.scheduleWeight)
       klog.V(5).Infof("Node: %s, task<%s/%s> Device Score weight %d, score: %f", node.Name, task.Namespace, task.Name, dp.scheduleWeight, nodeScore)
    }
    return nodeScore, nil
})
```

The key line:

```go
nodeScore = float64(score) * float64(dp.scheduleWeight)
```

The final score is the node's device score multiplied by the plugin weight.
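As a quick worked example of the weighting step, the sketch below applies a weight to two device scores. The `weightedNodeScore` helper, the node names, and the weight of 10 are all hypothetical values chosen for illustration.

```go
package main

import "fmt"

// weightedNodeScore mirrors the final step: device score times plugin weight.
// A non-positive weight disables device scoring, matching the
// `dp.scheduleWeight > 0` guard in the callback.
func weightedNodeScore(deviceScore float64, scheduleWeight int) float64 {
	if scheduleWeight <= 0 {
		return 0
	}
	return deviceScore * float64(scheduleWeight)
}

func main() {
	const weight = 10 // hypothetical plugin weight
	nodes := []struct {
		name  string
		score float64
	}{
		{"node-a", 50},
		{"node-b", 100},
	}
	for _, n := range nodes {
		fmt.Printf("%s: %.0f\n", n.name, weightedNodeScore(n.score, weight))
	}
}
```

A larger weight makes the device score count for more relative to the scores contributed by other scheduler plugins.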



## 5. FAQ

### gpu-memory shows 0

**Symptom**: after the `device-plugin` is deployed, `gpu-memory` (the `volcano.sh/vgpu-memory` resource) shows 0, like this:

```bash
root@node5-3:~/lixd# k describe node node5-3 |grep Cap -A 10
Capacity:
  cpu:                     160
  ephemeral-storage:       3750157048Ki
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  2113442544Ki
  nvidia.com/gpu:          0
  pods:                    110
  volcano.sh/vgpu-cores:   800
  volcano.sh/vgpu-memory:  0
  volcano.sh/vgpu-number:  80
```

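**Root cause**: [https://github.com/volcano-sh/devices/issues/19](https://github.com/volcano-sh/devices/issues/19)

The relevant description:

> the size of device list exceeds the bound, and ListAndWatch failed as a result.

In short, once the amount of GPU memory to report exceeds a threshold, an error occurs and the DevicePlugin can no longer report the resource correctly, so the value shows 0.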



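**Solution**: set the `--gpu-memory-factor=10` flag at startup, which raises the smallest memory chunk from the default 1MB to 10MB, like this:

```yaml
      containers:
      - image: docker.io/projecthami/volcano-vgpu-device-plugin:v1.10.0
        args: ["--device-split-count=10","--gpu-memory-factor=10"]
```

Each reported unit now stands for 10MB, so the largest amount of memory that can be reported grows tenfold, which avoids the problem.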



**Result**:

```bash
root@node5-3:~/lixd# k describe node node5-3 |grep Cap -A 10
Capacity:
  cpu:                     160
  ephemeral-storage:       3750157048Ki
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  2113442544Ki
  nvidia.com/gpu:          0
  pods:                    110
  volcano.sh/vgpu-cores:   800
  volcano.sh/vgpu-memory:  39312
  volcano.sh/vgpu-number:  80
```

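`volcano.sh/vgpu-memory:  39312`: because factor=10 is set, the actual total GPU memory is 39312 \* 10 = 393120 MB.

The environment here is 8 L40S cards with 49140 MB of memory each, and 49140 \* 8 = 393120, which matches exactly, so everything is working as expected.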
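The arithmetic can be checked with a short sketch. `reportedVGPUMemory` is a hypothetical helper that assumes each card's memory in MB is divided by the factor and then summed; it illustrates the numbers above and is not the plugin's actual code.

```go
package main

import "fmt"

// reportedVGPUMemory returns the value a node would advertise for
// volcano.sh/vgpu-memory, assuming each card's memory (in MB) is divided
// by the memory factor before being summed.
func reportedVGPUMemory(perCardMB []uint, factor uint) uint {
	if factor == 0 {
		factor = 1 // treat an unset factor as 1MB chunks
	}
	var total uint
	for _, mb := range perCardMB {
		total += mb / factor
	}
	return total
}

func main() {
	// 8 L40S cards with 49140 MB each, as in the environment above.
	cards := make([]uint, 8)
	for i := range cards {
		cards[i] = 49140
	}
	fmt.Println(reportedVGPUMemory(cards, 1))  // 393120
	fmt.Println(reportedVGPUMemory(cards, 10)) // 39312
}
```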



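## 6. Summary

This article verified how Volcano integrates **HAMi vGPU** to provide GPU virtualization on Kubernetes, with a focus on the complete workflow of the **HAMi-Core mode**.



To answer the earlier question: **how does the Volcano DevicePlugin register three resources at the same time?**

It starts three DevicePlugins, one for each of the `volcano.sh/vgpu-number`, `volcano.sh/vgpu-memory`, and `volcano.sh/vgpu-cores` resources.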



Recommended reading:

* [HAMi vGPU Internals Part 1: hami-device-plugin-nvidia implementation](https://www.lixueduan.com/posts/kubernetes/29-hami-analyze-1-device-plugin-nvidia/)

* [HAMi vGPU Internals Part 3: hami-scheduler workflow analysis](https://www.lixueduan.com/posts/kubernetes/32-hami-analyze-3-scheduler/)

* [HAMi vGPU Internals Part 4: implementing the Spread\&Binpack advanced scheduling policies](https://www.lixueduan.com/posts/kubernetes/33-hami-analyze-4-scheduler-policy/)



---

> Author: [意琦行](https://github.com/lixd)  
> URL: https://www.lixueduan.com/posts/kubernetes/45-volcano-vgpu/  

