Volcano vGPU in Practice: GPU Sharing and Isolation on Kubernetes Without Special Hardware
In the previous post, "Getting Started with Volcano: Cloud-Native Practice with the Batch Scheduling Engine", we deployed a Volcano cluster with Helm and ran a first test job to verify its basic scheduling capability. This post goes a step further into Volcano's GPU virtualization feature, focusing on how HAMi vGPU technology enables fine-grained GPU sharing and isolation.
The batch scheduling engine Volcano supports GPU virtualization; the capability itself is provided primarily by HAMi.
HAMi vGPU offers two virtualization modes, HAMi-Core and Dynamic MIG:
Mode | Isolation | MIG GPU Required | Annotation | Core/Memory Control | Recommended For |
---|---|---|---|---|---|
HAMI-core | Software (VCUDA) | No | No | Yes | General workloads |
Dynamic MIG | Hardware | Yes | Yes | MIG-controlled | Performance-sensitive jobs |
If your hardware supports MIG and the workloads are performance-sensitive, the Dynamic MIG mode is recommended; if MIG is not available, you can still use the more general HAMi-Core mode, which has no special hardware requirements.
This post uses HAMi-Core to demonstrate how HAMi vGPU integrates with Volcano.
Workflow:
1) Create a cluster
2) Install GPU-Operator, but skip its DevicePlugin
3) Install Volcano and enable the vGPU plugin in its configuration
4) Install volcano-vgpu-device-plugin
5) Verify
1. Environment Setup
1.1 Create a Cluster
Use KubeClipper to deploy a cluster for this walkthrough.
Kubernetes Tutorial (11): Quickly Create a Kubernetes Cluster with One Command Using KubeClipper
1.2 GPU-Operator
Follow the earlier article "GPU Environment Setup Guide: Accelerating Kubernetes GPU Setup with GPU Operator" and deploy the GPU environment with GPU Operator.
1.3 Volcano
Deploy Volcano
Install Volcano. Mind the version compatibility between Volcano and Kubernetes; see the "Kubernetes compatibility" section of the official README.
Version v1.12.0 is deployed here.
# Add the Helm repo
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm repo update
# Install
helm upgrade --install volcano volcano-sh/volcano --version 1.12.0 -n volcano-system --create-namespace
After installation, the Pod list looks like this:
root@node5-3:~# kubectl -n volcano-system get po
NAME READY STATUS RESTARTS AGE
volcano-admission-6444dd4fb7-8s8d9 1/1 Running 0 3m
volcano-controllers-75d5b78c7-llcrz 1/1 Running 0 3m
volcano-scheduler-7d46c5b5db-t2k42 1/1 Running 0 3m
Update the scheduler configuration: enable the deviceshare plugin
Once Volcano is deployed, edit the scheduler configuration to enable the deviceshare plugin.
kubectl edit cm -n volcano-system volcano-scheduler-configmap
The full content looks like this:
kind: ConfigMap
apiVersion: v1
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: deviceshare
        arguments:
          deviceshare.VGPUEnable: true # enable vgpu
          deviceshare.SchedulePolicy: binpack # scheduling policy. binpack / spread
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
The key part:
- name: deviceshare
  arguments:
    deviceshare.VGPUEnable: true # enable vgpu
    deviceshare.SchedulePolicy: binpack # scheduling policy. binpack / spread
This enables vGPU support and selects binpack as the scheduling policy.
For details on the HAMi scheduling policies, see "HAMi vGPU Internals Part 4: Implementation of the Spread & Binpack Scheduling Policies".
After the change there is no need to restart: Volcano watches the config file and reloads it automatically when it changes.
if opt.SchedulerConf != "" {
var err error
path := filepath.Dir(opt.SchedulerConf)
watcher, err = filewatcher.NewFileWatcher(path)
if err != nil {
return nil, fmt.Errorf("failed creating filewatcher for %s: %v", opt.SchedulerConf, err)
}
}
func (pc *Scheduler) watchSchedulerConf(stopCh <-chan struct{}) {
if pc.fileWatcher == nil {
return
}
eventCh := pc.fileWatcher.Events()
errCh := pc.fileWatcher.Errors()
for {
select {
case event, ok := <-eventCh:
if !ok {
return
}
klog.V(4).Infof("watch %s event: %v", pc.schedulerConf, event)
if event.Op&fsnotify.Write == fsnotify.Write || event.Op&fsnotify.Create == fsnotify.Create {
pc.loadSchedulerConf()
pc.cache.SetMetricsConf(pc.metricsConf)
}
case err, ok := <-errCh:
if !ok {
return
}
klog.Infof("watch %s error: %v", pc.schedulerConf, err)
case <-stopCh:
return
}
}
}
Note that Kubernetes syncs ConfigMap updates into the Pod with some delay, so if you don't want to wait you can restart the scheduler manually.
kubectl -n volcano-system rollout restart deploy volcano-scheduler
1.4 volcano-vgpu-device-plugin
Next, deploy the DevicePlugin used for the Volcano integration: volcano-vgpu-device-plugin.
For how such a DevicePlugin works, see "HAMi vGPU Internals Part 1: The hami-device-plugin-nvidia Implementation"; the overall logic here is much the same.
Deploy the DevicePlugin
Fetch volcano-vgpu-device-plugin.yml from the root of the volcano-vgpu-device-plugin project:
wget https://raw.githubusercontent.com/Project-HAMi/volcano-vgpu-device-plugin/main/volcano-vgpu-device-plugin.yml
Apply it:
kubectl apply -f volcano-vgpu-device-plugin.yml
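The manifest creates a DaemonSet; before checking the Pods you can wait for its rollout. A quick sketch, where the DaemonSet name volcano-device-plugin is inferred from the Pod name below and may need adjusting:
kubectl -n kube-system rollout status ds/volcano-device-plugin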
Check the Pod list:
root@node5-3:~# kubectl -n kube-system get po
volcano-device-plugin-xkwzd 2/2 Running 0 10m
Verify the node resources
Check the resources reported on the node:
root@node5-3:~/lixd# k describe node node5-3 |grep Cap -A 10
Capacity:
cpu: 160
ephemeral-storage: 3750157048Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 2113442544Ki
nvidia.com/gpu: 0
pods: 110
volcano.sh/vgpu-cores: 800
volcano.sh/vgpu-memory: 39312
volcano.sh/vgpu-number: 80
Volcano adds the following three resources:
- volcano.sh/vgpu-cores: 800 — each GPU exposes 100 cores, so 8 cards give exactly 800.
- volcano.sh/vgpu-memory: 39312 — with the memory factor set to 10, this represents 39312 * 10 = 393120 MB of total memory. The node has 8x L40S with 49140 MB per card, and 49140 * 8 = 393120, which matches exactly.
- volcano.sh/vgpu-number: 80 — the default --device-split-count=10 multiplies the number of GPUs by 10.
This confirms the plugin is deployed and working.
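You can also read the allocatable values directly from the node object; a minimal check, assuming jq is installed:
kubectl get node node5-3 -o json | jq '.status.allocatable | with_entries(select(.key | startswith("volcano.sh/")))'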
2. Basic Usage
Start a Pod
First, start a simple Pod:
apiVersion: v1
kind: Pod
metadata:
  name: test1
spec:
  restartPolicy: OnFailure
  schedulerName: volcano
  containers:
    - image: ubuntu:24.04
      name: pod1-ctr
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          volcano.sh/vgpu-memory: 1024
          volcano.sh/vgpu-number: 1
Check the result: the Pod requested vgpu-memory: 1024, and since the factor is 10 this actually means 10240 MB, as nvidia-smi shows:
root@node5-3:~/lixd# k exec -it test1 -- nvidia-smi
[HAMI-core Msg(16:140249737447232:libvgpu.c:838)]: Initializing.....
Tue Jul 22 13:52:58 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08 Driver Version: 535.161.08 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA L40S On | 00000000:91:00.0 Off | Off |
| N/A 28C P8 34W / 350W | 0MiB / 10240MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
[HAMI-core Msg(16:140249737447232:multiprocess_memory_limit.c:499)]: Calling exit handler 16
Start a Volcano Job
Now try a simple Volcano Job:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: simple-vgpu-training
spec:
  schedulerName: volcano
  minAvailable: 3 # gang scheduling: all 3 Pods must start together
  tasks:
    - name: worker
      replicas: 3 # start 3 workers
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: python-trainer
              image: python:3.9-slim
              command: ["python", "-c"]
              args:
                - |
                  # simplified training code
                  import os
                  import time
                  worker_id = os.getenv("VC_TASK_INDEX", "0")
                  print(f"Worker {worker_id} started with vGPU")
                  # simulate the training loop
                  for epoch in range(1, 10):
                      time.sleep(6)
                      print(f"Worker {worker_id} completed epoch {epoch}")
                  print(f"Worker {worker_id} finished training!")
              env:
                # task index (0, 1, ...)
                - name: VC_TASK_INDEX
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['volcano.sh/task-index']
              resources:
                limits:
                  volcano.sh/vgpu-memory: 1024 # 1024 vGPU memory units per worker (10240 MB with factor=10)
                  volcano.sh/vgpu-number: 1 # one vGPU per worker
                  cpu: "1"
                  memory: "1Gi"
Everything works as expected:
root@node5-3:~/lixd# k get po -w
NAME READY STATUS RESTARTS AGE
simple-vgpu-training-worker-0 1/1 Running 0 5s
simple-vgpu-training-worker-1 1/1 Running 0 5s
simple-vgpu-training-worker-2 1/1 Running 0 5s
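Besides the Pods, you can also check the Volcano Job and its PodGroup to confirm that gang scheduling placed all three workers together:
kubectl get vcjob simple-vgpu-training
kubectl get podgroup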
Check the GPU info inside one of the Pods:
root@node5-3:~/lixd# k exec -it simple-vgpu-training-worker-0 -- nvidia-smi
[HAMI-core Msg(7:140498435086144:libvgpu.c:838)]: Initializing.....
Tue Jul 22 15:02:36 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08 Driver Version: 535.161.08 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA L40S On | 00000000:91:00.0 Off | Off |
| N/A 28C P8 34W / 350W | 0MiB / 10240MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
[HAMI-core Msg(7:140498435086144:multiprocess_memory_limit.c:499)]: Calling exit handler 7
Logs:
root@node5-3:~/lixd# k logs -f simple-vgpu-training-worker-2
Worker 2 started with vGPU
Worker 2 completed epoch 1
Worker 2 completed epoch 2
Worker 2 completed epoch 3
Worker 2 completed epoch 4
Worker 2 completed epoch 5
Worker 2 completed epoch 6
Worker 2 completed epoch 7
Worker 2 completed epoch 8
Worker 2 completed epoch 9
Worker 2 finished training!
3. Monitoring
Scheduler metrics
curl {volcano scheduler cluster ip}:8080/metrics
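To fill in the cluster IP, query the scheduler's metrics Service; a sketch assuming the Helm chart created a Service named volcano-scheduler-service (check with kubectl -n volcano-system get svc if yours differs):
SCHEDULER_IP=$(kubectl -n volcano-system get svc volcano-scheduler-service -o jsonpath='{.spec.clusterIP}')
curl http://${SCHEDULER_IP}:8080/metrics | grep volcano_vgpu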
They include per-card GPU core and memory allocations as well as per-Pod allocations, for example:
# HELP volcano_vgpu_device_allocated_cores The percentage of gpu compute cores allocated in this card
# TYPE volcano_vgpu_device_allocated_cores gauge
volcano_vgpu_device_allocated_cores{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad"} 50
# HELP volcano_vgpu_device_allocated_memory The number of vgpu memory allocated in this card
# TYPE volcano_vgpu_device_allocated_memory gauge
volcano_vgpu_device_allocated_memory{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad"} 2048
# HELP volcano_vgpu_device_core_allocation_for_a_certain_pod The vgpu device core allocated for a certain pod
# TYPE volcano_vgpu_device_core_allocation_for_a_certain_pod gauge
volcano_vgpu_device_core_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-0"} 10
volcano_vgpu_device_core_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-1"} 10
volcano_vgpu_device_core_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-2"} 10
volcano_vgpu_device_core_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-3"} 10
volcano_vgpu_device_core_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-4"} 10
# HELP volcano_vgpu_device_memory_allocation_for_a_certain_pod The vgpu device memory allocated for a certain pod
# TYPE volcano_vgpu_device_memory_allocation_for_a_certain_pod gauge
volcano_vgpu_device_memory_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-0"} 128
volcano_vgpu_device_memory_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-1"} 128
volcano_vgpu_device_memory_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-2"} 128
volcano_vgpu_device_memory_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-3"} 128
volcano_vgpu_device_memory_allocation_for_a_certain_pod{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podName="simple-vgpu-training-worker-4"} 128
# HELP volcano_vgpu_device_memory_limit The number of total device memory in this card
# TYPE volcano_vgpu_device_memory_limit gauge
volcano_vgpu_device_memory_limit{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad"} 4914
# HELP volcano_vgpu_device_shared_number The number of vgpu tasks sharing this card
# TYPE volcano_vgpu_device_shared_number gauge
volcano_vgpu_device_shared_number{NodeName="node5-3",devID="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad"} 5
Device plugin metrics
Query port 9394 of the DevicePlugin Pod directly for its metrics:
curl http://<plugin-pod-ip>:9394/metrics
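The Pod IP can be looked up with kubectl; a small sketch, where the grep pattern assumes the plugin Pods are named volcano-device-plugin-* as in the Pod list shown earlier:
PLUGIN_POD=$(kubectl -n kube-system get po -o name | grep volcano-device-plugin | head -1)
PLUGIN_POD_IP=$(kubectl -n kube-system get $PLUGIN_POD -o jsonpath='{.status.podIP}')
curl http://${PLUGIN_POD_IP}:9394/metrics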
These show the GPU usage on that node, for example:
# HELP Device_last_kernel_of_container Container device last kernel description
# TYPE Device_last_kernel_of_container gauge
Device_last_kernel_of_container{ctrname="pod1-ctr",deviceuuid="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podname="test1",podnamespace="default",vdeviceid="0",zone="vGPU"} 257114
# HELP Device_memory_desc_of_container Container device meory description
# TYPE Device_memory_desc_of_container counter
Device_memory_desc_of_container{context="0",ctrname="pod1-ctr",data="0",deviceuuid="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",module="0",offset="0",podname="test1",podnamespace="default",vdeviceid="0",zone="vGPU"} 0
# HELP Device_utilization_desc_of_container Container device utilization description
# TYPE Device_utilization_desc_of_container gauge
Device_utilization_desc_of_container{ctrname="pod1-ctr",deviceuuid="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podname="test1",podnamespace="default",vdeviceid="0",zone="vGPU"} 0
# HELP HostCoreUtilization GPU core utilization
# TYPE HostCoreUtilization gauge
HostCoreUtilization{deviceidx="0",deviceuuid="GPU-a11fe6d9-3dbe-8a24-34e9-535b2629babd",zone="vGPU"} 0
HostCoreUtilization{deviceidx="1",deviceuuid="GPU-b82090de-5250-44e2-a5ed-b0efc5763f8f",zone="vGPU"} 0
HostCoreUtilization{deviceidx="2",deviceuuid="GPU-8f563a66-d507-583f-59f1-46c2e97a393c",zone="vGPU"} 0
HostCoreUtilization{deviceidx="3",deviceuuid="GPU-1e5a0632-4332-f4d0-adf2-80ebfed56684",zone="vGPU"} 0
HostCoreUtilization{deviceidx="4",deviceuuid="GPU-384027fd-54f2-638b-cdfe-0d5f3b6630f5",zone="vGPU"} 0
HostCoreUtilization{deviceidx="5",deviceuuid="GPU-dbb95093-0147-7b3a-f468-8a3575a8dd4e",zone="vGPU"} 0
HostCoreUtilization{deviceidx="6",deviceuuid="GPU-f3eb6e71-e90a-bfc9-de06-dff90c3093b9",zone="vGPU"} 0
HostCoreUtilization{deviceidx="7",deviceuuid="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",zone="vGPU"} 0
# HELP HostGPUMemoryUsage GPU device memory usage
# TYPE HostGPUMemoryUsage gauge
HostGPUMemoryUsage{deviceidx="0",deviceuuid="GPU-a11fe6d9-3dbe-8a24-34e9-535b2629babd",zone="vGPU"} 5.14326528e+08
HostGPUMemoryUsage{deviceidx="1",deviceuuid="GPU-b82090de-5250-44e2-a5ed-b0efc5763f8f",zone="vGPU"} 5.14326528e+08
HostGPUMemoryUsage{deviceidx="2",deviceuuid="GPU-8f563a66-d507-583f-59f1-46c2e97a393c",zone="vGPU"} 5.14326528e+08
HostGPUMemoryUsage{deviceidx="3",deviceuuid="GPU-1e5a0632-4332-f4d0-adf2-80ebfed56684",zone="vGPU"} 5.14326528e+08
HostGPUMemoryUsage{deviceidx="4",deviceuuid="GPU-384027fd-54f2-638b-cdfe-0d5f3b6630f5",zone="vGPU"} 5.14326528e+08
HostGPUMemoryUsage{deviceidx="5",deviceuuid="GPU-dbb95093-0147-7b3a-f468-8a3575a8dd4e",zone="vGPU"} 5.14326528e+08
HostGPUMemoryUsage{deviceidx="6",deviceuuid="GPU-f3eb6e71-e90a-bfc9-de06-dff90c3093b9",zone="vGPU"} 5.14326528e+08
HostGPUMemoryUsage{deviceidx="7",deviceuuid="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",zone="vGPU"} 5.14326528e+08
# HELP vGPU_device_memory_limit_in_bytes vGPU device limit
# TYPE vGPU_device_memory_limit_in_bytes gauge
vGPU_device_memory_limit_in_bytes{ctrname="pod1-ctr",deviceuuid="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podname="test1",podnamespace="default",vdeviceid="0",zone="vGPU"} 1.073741824e+10
# HELP vGPU_device_memory_usage_in_bytes vGPU device usage
# TYPE vGPU_device_memory_usage_in_bytes gauge
vGPU_device_memory_usage_in_bytes{ctrname="pod1-ctr",deviceuuid="GPU-542efc47-39a1-9669-3d17-3b7dec8251ad",podname="test1",podnamespace="default",vdeviceid="0",zone="vGPU"} 0
4. Source Code Analysis
4.1 DevicePlugin
Resource registration
A DevicePlugin normally registers a single resource, yet the Volcano DevicePlugin registers three. How does it do that?
root@node5-3:~/lixd# k describe node node5-3 |grep Cap -A 10
Capacity:
cpu: 160
ephemeral-storage: 3750157048Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 2113442544Ki
nvidia.com/gpu: 0
pods: 110
volcano.sh/vgpu-cores: 800
volcano.sh/vgpu-memory: 39312
volcano.sh/vgpu-number: 80
Internally, the Volcano DevicePlugin starts three device plugins, one for each resource name:
volcano.sh/vgpu-number
volcano.sh/vgpu-memory
volcano.sh/vgpu-cores
func (s *migStrategyNone) GetPlugins(cfg *config.NvidiaConfig, cache *DeviceCache) []*NvidiaDevicePlugin {
return []*NvidiaDevicePlugin{
NewNvidiaDevicePlugin(
//"nvidia.com/gpu",
util.ResourceName,
cache,
gpuallocator.NewBestEffortPolicy(),
pluginapi.DevicePluginPath+"nvidia-gpu.sock",
cfg),
NewNvidiaDevicePlugin(
util.ResourceMem,
cache,
gpuallocator.NewBestEffortPolicy(),
pluginapi.DevicePluginPath+"nvidia-gpu-memory.sock",
cfg),
NewNvidiaDevicePlugin(
util.ResourceCores,
cache,
gpuallocator.NewBestEffortPolicy(),
pluginapi.DevicePluginPath+"nvidia-gpu-cores.sock",
cfg),
}
}
The corresponding socket files:
root@node5-3:/var/lib/kubelet/device-plugins# ls /var/lib/kubelet/device-plugins
DEPRECATION kubelet.sock kubelet_internal_checkpoint nvidia-gpu-cores.sock nvidia-gpu-memory.sock nvidia-gpu.sock
When listing devices, the implementation also differs depending on the resource name:
func (m *NvidiaDevicePlugin) apiDevices() []*pluginapi.Device {
if strings.Compare(m.migStrategy, "mixed") == 0 {
var pdevs []*pluginapi.Device
for _, d := range m.cachedDevices {
pdevs = append(pdevs, &d.Device)
}
return pdevs
}
devices := m.Devices()
var res []*pluginapi.Device
if strings.Compare(m.resourceName, util.ResourceMem) == 0 {
for _, dev := range devices {
i := 0
klog.Infoln("memory=", dev.Memory, "id=", dev.ID)
for i < int(32767) {
res = append(res, &pluginapi.Device{
ID: fmt.Sprintf("%v-memory-%v", dev.ID, i),
Health: dev.Health,
Topology: nil,
})
i++
}
}
klog.Infoln("res length=", len(res))
return res
}
if strings.Compare(m.resourceName, util.ResourceCores) == 0 {
for _, dev := range devices {
i := 0
for i < 100 {
res = append(res, &pluginapi.Device{
ID: fmt.Sprintf("%v-core-%v", dev.ID, i),
Health: dev.Health,
Topology: nil,
})
i++
}
}
return res
}
for _, dev := range devices {
for i := uint(0); i < config.DeviceSplitCount; i++ {
id := fmt.Sprintf("%v-%v", dev.ID, i)
res = append(res, &pluginapi.Device{
ID: id,
Health: dev.Health,
Topology: nil,
})
}
}
return res
}
Allocate
The actual device allocation logic:
// Allocate which return list of devices.
func (m *NvidiaDevicePlugin) Allocate(ctx context.Context, reqs *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
if len(reqs.ContainerRequests) > 1 {
return &pluginapi.AllocateResponse{}, errors.New("multiple Container Requests not supported")
}
if strings.Compare(m.migStrategy, "mixed") == 0 {
return m.MIGAllocate(ctx, reqs)
}
responses := pluginapi.AllocateResponse{}
if strings.Compare(m.resourceName, util.ResourceMem) == 0 || strings.Compare(m.resourceName, util.ResourceCores) == 0 {
for range reqs.ContainerRequests {
responses.ContainerResponses = append(responses.ContainerResponses, &pluginapi.ContainerAllocateResponse{})
}
return &responses, nil
}
nodename := os.Getenv("NODE_NAME")
current, err := util.GetPendingPod(nodename)
if err != nil {
lock.ReleaseNodeLock(nodename, util.VGPUDeviceName)
return &pluginapi.AllocateResponse{}, err
}
if current == nil {
klog.Errorf("no pending pod found on node %s", nodename)
lock.ReleaseNodeLock(nodename, util.VGPUDeviceName)
return &pluginapi.AllocateResponse{}, errors.New("no pending pod found on node")
}
for idx := range reqs.ContainerRequests {
currentCtr, devreq, err := util.GetNextDeviceRequest(util.NvidiaGPUDevice, *current)
klog.Infoln("deviceAllocateFromAnnotation=", devreq)
if err != nil {
klog.Errorln("get device from annotation failed", err.Error())
util.PodAllocationFailed(nodename, current)
return &pluginapi.AllocateResponse{}, err
}
if len(devreq) != len(reqs.ContainerRequests[idx].DevicesIDs) {
klog.Errorln("device number not matched", devreq, reqs.ContainerRequests[idx].DevicesIDs)
util.PodAllocationFailed(nodename, current)
return &pluginapi.AllocateResponse{}, errors.New("device number not matched")
}
response := pluginapi.ContainerAllocateResponse{}
response.Envs = make(map[string]string)
response.Envs["NVIDIA_VISIBLE_DEVICES"] = strings.Join(m.GetContainerDeviceStrArray(devreq), ",")
err = util.EraseNextDeviceTypeFromAnnotation(util.NvidiaGPUDevice, *current)
if err != nil {
klog.Errorln("Erase annotation failed", err.Error())
util.PodAllocationFailed(nodename, current)
return &pluginapi.AllocateResponse{}, err
}
if m.operatingMode != "mig" {
for i, dev := range devreq {
limitKey := fmt.Sprintf("CUDA_DEVICE_MEMORY_LIMIT_%v", i)
response.Envs[limitKey] = fmt.Sprintf("%vm", dev.Usedmem*int32(config.GPUMemoryFactor))
}
response.Envs["CUDA_DEVICE_SM_LIMIT"] = fmt.Sprint(devreq[0].Usedcores)
response.Envs["CUDA_DEVICE_MEMORY_SHARED_CACHE"] = fmt.Sprintf("/tmp/vgpu/%v.cache", uuid.NewUUID())
cacheFileHostDirectory := "/tmp/vgpu/containers/" + string(current.UID) + "_" + currentCtr.Name
os.MkdirAll(cacheFileHostDirectory, 0777)
os.Chmod(cacheFileHostDirectory, 0777)
os.MkdirAll("/tmp/vgpulock", 0777)
os.Chmod("/tmp/vgpulock", 0777)
hostHookPath := os.Getenv("HOOK_PATH")
response.Mounts = append(response.Mounts,
&pluginapi.Mount{ContainerPath: "/usr/local/vgpu/libvgpu.so",
HostPath: hostHookPath + "/libvgpu.so",
ReadOnly: true},
&pluginapi.Mount{ContainerPath: "/tmp/vgpu",
HostPath: cacheFileHostDirectory,
ReadOnly: false},
&pluginapi.Mount{ContainerPath: "/tmp/vgpulock",
HostPath: "/tmp/vgpulock",
ReadOnly: false},
)
found := false
for _, val := range currentCtr.Env {
if strings.Compare(val.Name, "CUDA_DISABLE_CONTROL") == 0 {
found = true
break
}
}
if !found {
response.Mounts = append(response.Mounts, &pluginapi.Mount{ContainerPath: "/etc/ld.so.preload",
HostPath: hostHookPath + "/ld.so.preload",
ReadOnly: true},
)
}
}
responses.ContainerResponses = append(responses.ContainerResponses, &response)
}
klog.Infoln("Allocate Response", responses.ContainerResponses)
util.PodAllocationTrySuccess(nodename, current)
return &responses, nil
}
The core part:
It sets the CUDA memory and SM limit environment variables and mounts libvgpu.so (plus the cache and lock directories) into the container.
for i, dev := range devreq {
limitKey := fmt.Sprintf("CUDA_DEVICE_MEMORY_LIMIT_%v", i)
response.Envs[limitKey] = fmt.Sprintf("%vm", dev.Usedmem*int32(config.GPUMemoryFactor))
}
response.Envs["CUDA_DEVICE_SM_LIMIT"] = fmt.Sprint(devreq[0].Usedcores)
response.Envs["CUDA_DEVICE_MEMORY_SHARED_CACHE"] = fmt.Sprintf("/tmp/vgpu/%v.cache", uuid.NewUUID())
response.Mounts = append(response.Mounts,
&pluginapi.Mount{ContainerPath: "/usr/local/vgpu/libvgpu.so",
HostPath: hostHookPath + "/libvgpu.so",
ReadOnly: true},
&pluginapi.Mount{ContainerPath: "/tmp/vgpu",
HostPath: cacheFileHostDirectory,
ReadOnly: false},
&pluginapi.Mount{ContainerPath: "/tmp/vgpulock",
HostPath: "/tmp/vgpulock",
ReadOnly: false},
)
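These variables and mounts can be observed from inside a running vGPU Pod. For example, for the test1 Pod created earlier (a rough check; exact values depend on the request and the memory factor):
kubectl exec test1 -- env | grep CUDA_DEVICE
# Expected output along the lines of:
# CUDA_DEVICE_MEMORY_LIMIT_0=10240m
# CUDA_DEVICE_SM_LIMIT=...
# CUDA_DEVICE_MEMORY_SHARED_CACHE=/tmp/vgpu/....cache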
Since three device plugins are running, Allocate checks the resource name to avoid doing the work three times: the volcano.sh/vgpu-memory and volcano.sh/vgpu-cores plugins return empty responses, and only the volcano.sh/vgpu-number plugin performs the real allocation (the kubelet issues a separate Allocate call for each extended resource).
if strings.Compare(m.resourceName, util.ResourceMem) == 0 || strings.Compare(m.resourceName, util.ResourceCores) == 0 {
for range reqs.ContainerRequests {
responses.ContainerResponses = append(responses.ContainerResponses, &pluginapi.ContainerAllocateResponse{})
}
return &responses, nil
}
4.2 The deviceshare Plugin
https://github.com/volcano-sh/volcano/blob/master/pkg/scheduler/plugins/deviceshare/deviceshare.go
A quick look at the deviceshare plugin in Volcano.
This part is essentially the same as the HAMi implementation; the HAMi internals articles referenced earlier cover the background.
Every plugin has to implement the Plugin interface defined by Volcano:
type Plugin interface {
// The unique name of Plugin.
Name() string
OnSessionOpen(ssn *Session)
OnSessionClose(ssn *Session)
}
The core code is in the OnSessionOpen implementation, which registers two scheduling callbacks:
Predicate
NodeOrder
func (dp *deviceSharePlugin) OnSessionOpen(ssn *framework.Session) {
// Register event handlers to update task info in PodLister & nodeMap
ssn.AddPredicateFn(dp.Name(), func(task *api.TaskInfo, node *api.NodeInfo) error {
predicateStatus := make([]*api.Status, 0)
// Check PredicateWithCache
for _, val := range api.RegisteredDevices {
if dev, ok := node.Others[val].(api.Devices); ok {
if reflect.ValueOf(dev).IsNil() {
// TODO When a pod requests a device of the current type, but the current node does not have such a device, an error is thrown
if dev == nil || dev.HasDeviceRequest(task.Pod) {
predicateStatus = append(predicateStatus, &api.Status{
Code: devices.Unschedulable,
Reason: "node not initialized with device" + val,
Plugin: PluginName,
})
return api.NewFitErrWithStatus(task, node, predicateStatus...)
}
klog.V(4).Infof("pod %s/%s did not request device %s on %s, skipping it", task.Pod.Namespace, task.Pod.Name, val, node.Name)
continue
}
code, msg, err := dev.FilterNode(task.Pod, dp.schedulePolicy)
if err != nil {
predicateStatus = append(predicateStatus, createStatus(code, msg))
return api.NewFitErrWithStatus(task, node, predicateStatus...)
}
filterNodeStatus := createStatus(code, msg)
if filterNodeStatus.Code != api.Success {
predicateStatus = append(predicateStatus, filterNodeStatus)
return api.NewFitErrWithStatus(task, node, predicateStatus...)
}
} else {
klog.Warningf("Devices %s assertion conversion failed, skip", val)
}
}
klog.V(4).Infof("checkDevices predicates Task <%s/%s> on Node <%s>: fit ",
task.Namespace, task.Name, node.Name)
return nil
})
ssn.AddNodeOrderFn(dp.Name(), func(task *api.TaskInfo, node *api.NodeInfo) (float64, error) {
// DeviceScore
nodeScore := float64(0)
if dp.scheduleWeight > 0 {
score, status := getDeviceScore(context.TODO(), task.Pod, node, dp.schedulePolicy)
if !status.IsSuccess() {
klog.Warningf("Node: %s, Calculate Device Score Failed because of Error: %v", node.Name, status.AsError())
return 0, status.AsError()
}
// TODO: we should use a seperate plugin for devices, and seperate them from predicates and nodeOrder plugin.
nodeScore = float64(score) * float64(dp.scheduleWeight)
klog.V(5).Infof("Node: %s, task<%s/%s> Device Score weight %d, score: %f", node.Name, task.Namespace, task.Name, dp.scheduleWeight, nodeScore)
}
return nodeScore, nil
})
}
It implements two parts of the scheduling flow: node filtering and node scoring.
Predicate
Filter out the nodes that cannot satisfy the device request:
ssn.AddPredicateFn(dp.Name(), func(task *api.TaskInfo, node *api.NodeInfo) error {
predicateStatus := make([]*api.Status, 0)
// Check PredicateWithCache
for _, val := range api.RegisteredDevices {
if dev, ok := node.Others[val].(api.Devices); ok {
if reflect.ValueOf(dev).IsNil() {
// TODO When a pod requests a device of the current type, but the current node does not have such a device, an error is thrown
if dev == nil || dev.HasDeviceRequest(task.Pod) {
predicateStatus = append(predicateStatus, &api.Status{
Code: devices.Unschedulable,
Reason: "node not initialized with device" + val,
Plugin: PluginName,
})
return api.NewFitErrWithStatus(task, node, predicateStatus...)
}
klog.V(4).Infof("pod %s/%s did not request device %s on %s, skipping it", task.Pod.Namespace, task.Pod.Name, val, node.Name)
continue
}
code, msg, err := dev.FilterNode(task.Pod, dp.schedulePolicy)
if err != nil {
predicateStatus = append(predicateStatus, createStatus(code, msg))
return api.NewFitErrWithStatus(task, node, predicateStatus...)
}
filterNodeStatus := createStatus(code, msg)
if filterNodeStatus.Code != api.Success {
predicateStatus = append(predicateStatus, filterNodeStatus)
return api.NewFitErrWithStatus(task, node, predicateStatus...)
}
} else {
klog.Warningf("Devices %s assertion conversion failed, skip", val)
}
}
klog.V(4).Infof("checkDevices predicates Task <%s/%s> on Node <%s>: fit ",
task.Namespace, task.Name, node.Name)
return nil
})
The core logic is in the FilterNode method:
code, msg, err := dev.FilterNode(task.Pod, dp.schedulePolicy)
if err != nil {
predicateStatus = append(predicateStatus, createStatus(code, msg))
return api.NewFitErrWithStatus(task, node, predicateStatus...)
}
func (gs *GPUDevices) FilterNode(pod *v1.Pod, schedulePolicy string) (int, string, error) {
if VGPUEnable {
klog.V(4).Infoln("hami-vgpu DeviceSharing starts filtering pods", pod.Name)
fit, _, score, err := checkNodeGPUSharingPredicateAndScore(pod, gs, true, schedulePolicy)
if err != nil || !fit {
klog.ErrorS(err, "Failed to fitler node to vgpu task", "pod", pod.Name)
return devices.Unschedulable, "hami-vgpuDeviceSharing error", err
}
gs.Score = score
klog.V(4).Infoln("hami-vgpu DeviceSharing successfully filters pods")
}
return devices.Success, "", nil
}
It filters out nodes that cannot satisfy the request and scores the remaining ones.
Node filtering
The node is checked for enough GPU cards, memory, and cores; nodes that fall short are filtered out.
ctrdevs := []ContainerDevices{}
for _, val := range ctrReq {
devs := []ContainerDevice{}
if int(val.Nums) > len(gs.Device) {
return false, []ContainerDevices{}, 0, fmt.Errorf("no enough gpu cards on node %s", gs.Name)
}
klog.V(3).InfoS("Allocating device for container", "request", val)
for i := len(gs.Device) - 1; i >= 0; i-- {
klog.V(3).InfoS("Scoring pod request", "memReq", val.Memreq, "memPercentageReq", val.MemPercentagereq, "coresReq", val.Coresreq, "Nums", val.Nums, "Index", i, "ID", gs.Device[i].ID)
klog.V(3).InfoS("Current Device", "Index", i, "TotalMemory", gs.Device[i].Memory, "UsedMemory", gs.Device[i].UsedMem, "UsedCores", gs.Device[i].UsedCore, "replicate", replicate)
if gs.Device[i].Number <= uint(gs.Device[i].UsedNum) {
continue
}
if val.MemPercentagereq != 101 && val.Memreq == 0 {
val.Memreq = gs.Device[i].Memory * uint(val.MemPercentagereq/100)
}
if int(gs.Device[i].Memory)-int(gs.Device[i].UsedMem) < int(val.Memreq) {
continue
}
if gs.Device[i].UsedCore+val.Coresreq > 100 {
continue
}
// Coresreq=100 indicates it want this card exclusively
if val.Coresreq == 100 && gs.Device[i].UsedNum > 0 {
continue
}
// You can't allocate core=0 job to an already full GPU
if gs.Device[i].UsedCore == 100 && val.Coresreq == 0 {
continue
}
if !checkType(pod.Annotations, *gs.Device[i], val) {
klog.Errorln("failed checktype", gs.Device[i].Type, val.Type)
continue
}
fit, uuid := gs.Sharing.TryAddPod(gs.Device[i], uint(val.Memreq), uint(val.Coresreq))
if !fit {
klog.V(3).Info(gs.Device[i].ID, "not fit")
continue
}
//total += gs.Devices[i].Count
//free += node.Devices[i].Count - node.Devices[i].Used
if val.Nums > 0 {
val.Nums--
klog.V(3).Info("fitted uuid: ", uuid)
devs = append(devs, ContainerDevice{
UUID: uuid,
Type: val.Type,
Usedmem: val.Memreq,
Usedcores: val.Coresreq,
})
score += GPUScore(schedulePolicy, gs.Device[i])
}
if val.Nums == 0 {
break
}
}
if val.Nums > 0 {
return false, []ContainerDevices{}, 0, fmt.Errorf("not enough gpu fitted on this node")
}
ctrdevs = append(ctrdevs, devs)
}
Node scoring
Nodes are scored according to the configured scheduling policy.
const (
binpackMultiplier = 100
spreadMultiplier = 100
)
func GPUScore(schedulePolicy string, device *GPUDevice) float64 {
var score float64
switch schedulePolicy {
case binpackPolicy:
score = binpackMultiplier * (float64(device.UsedMem) / float64(device.Memory))
case spreadPolicy:
if device.UsedNum == 1 {
score = spreadMultiplier
}
default:
score = float64(0)
}
return score
}
The logic is straightforward:
Binpack: the higher a device's memory utilization, the higher its score.
Spread: a device that already has exactly one task on it scores 100, otherwise 0.
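A quick worked example for binpack: on an L40S with 49140 MB of memory, a card that already has 10240 MB allocated scores 100 * 10240 / 49140 ≈ 20.8, while an idle card scores 0, so new vGPU Pods keep being packed onto the partially used card.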
NodeOrder
The device score was already computed in the previous step; NodeOrder only needs to return it so nodes can be ranked.
ssn.AddNodeOrderFn(dp.Name(), func(task *api.TaskInfo, node *api.NodeInfo) (float64, error) {
// DeviceScore
nodeScore := float64(0)
if dp.scheduleWeight > 0 {
score, status := getDeviceScore(context.TODO(), task.Pod, node, dp.schedulePolicy)
if !status.IsSuccess() {
klog.Warningf("Node: %s, Calculate Device Score Failed because of Error: %v", node.Name, status.AsError())
return 0, status.AsError()
}
// TODO: we should use a seperate plugin for devices, and seperate them from predicates and nodeOrder plugin.
nodeScore = float64(score) * float64(dp.scheduleWeight)
klog.V(5).Infof("Node: %s, task<%s/%s> Device Score weight %d, score: %f", node.Name, task.Namespace, task.Name, dp.scheduleWeight, nodeScore)
}
return nodeScore, nil
})
}
The key part:
nodeScore = float64(score) * float64(dp.scheduleWeight)
The device score multiplied by the plugin weight gives the final node score.
5. FAQ
gpu-memory shows as 0
Symptom: after deploying the device plugin, gpu-memory shows as 0, like this:
root@node5-3:~/lixd# k describe node node5-3 |grep Cap -A 10
Capacity:
cpu: 160
ephemeral-storage: 3750157048Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 2113442544Ki
nvidia.com/gpu: 0
pods: 110
volcano.sh/vgpu-cores: 800
volcano.sh/vgpu-memory: 0
volcano.sh/vgpu-number: 80
Root cause: https://github.com/volcano-sh/devices/issues/19
From the issue:
the size of device list exceeds the bound, and ListAndWatch failed as a result.
In short, once the reported memory pushes the device list past that bound, ListAndWatch fails and the DevicePlugin cannot report the resource, so it shows up as 0.
Solution: start the plugin with --gpu-memory-factor=10, which changes the smallest memory unit from the default 1 MB to 10 MB, like this:
containers:
  - image: docker.io/projecthami/volcano-vgpu-device-plugin:v1.10.0
    args: ["--device-split-count=10","--gpu-memory-factor=10"]
This raises the maximum amount of memory that can be represented tenfold, which avoids the problem.
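The new flags only take effect once the plugin Pods are recreated, so re-apply the edited manifest and, if needed, restart the DaemonSet (again assuming it is named volcano-device-plugin):
kubectl apply -f volcano-vgpu-device-plugin.yml
kubectl -n kube-system rollout restart ds/volcano-device-plugin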
The result:
root@node5-3:~/lixd# k describe node node5-3 |grep Cap -A 10
Capacity:
cpu: 160
ephemeral-storage: 3750157048Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 2113442544Ki
nvidia.com/gpu: 0
pods: 110
volcano.sh/vgpu-cores: 800
volcano.sh/vgpu-memory: 39312
volcano.sh/vgpu-number: 80
volcano.sh/vgpu-memory: 39312 — with factor=10 this represents 39312 * 10 = 393120 MB of total memory. The environment is 8x L40S with 49140 MB per card, and 49140 * 8 = 393120, which matches exactly, so everything is working.
6. Summary
This post verified how Volcano integrates HAMi vGPU to provide GPU virtualization on Kubernetes, walking through the complete workflow of the HAMi-Core mode.
To answer the earlier question of how the Volcano DevicePlugin registers three resources at once: it starts three device plugins internally, one each for volcano.sh/vgpu-number, volcano.sh/vgpu-memory, and volcano.sh/vgpu-cores.
Recommended reading: