root@hami-30:/mnt/b66582121706406e9797ffaf64a831b0# nvidia-smi
[HAMI-core Msg(68:139953433691968:libvgpu.c:836)]: Initializing.....
Mon Oct 14 13:14:23 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05    Driver Version: 525.147.05    CUDA Version: 12.0   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          Off  | 00000000:00:07.0 Off |                    0 |
|  0%   30C    P8    29W / 300W |      0MiB / 20000MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[HAMI-core Msg(68:139953433691968:multiprocess_memory_limit.c:468)]: Calling exit handler 68
Test script
Next, run a script to check whether allocating 20000 MB actually triggers an OOM.
import torch
import sys

def allocate_memory(memory_size_mb):
    # Convert MB to bytes and compute the number of float32 elements to allocate
    num_elements = memory_size_mb * 1024 * 1024 // 4  # 1 float32 = 4 bytes
    try:
        # Attempt the GPU allocation
        print(f"Attempting to allocate {memory_size_mb} MB on GPU...")
        x = torch.empty(num_elements, dtype=torch.float32, device='cuda')
        print(f"Successfully allocated {memory_size_mb} MB on GPU.")
    except RuntimeError as e:
        print(f"Failed to allocate {memory_size_mb} MB on GPU: OOM.")
        print(e)

if __name__ == "__main__":
    # Read the size from the command line, defaulting to 1024 MB
    memory_size_mb = int(sys.argv[1]) if len(sys.argv) > 1 else 1024
    allocate_memory(memory_size_mb)
Run it
root@hami-30:/mnt/b66582121706406e9797ffaf64a831b0/lixd/hami-test# python test_oom.py 20000
[HAMI-core Msg(1046:140457967137280:libvgpu.c:836)]: Initializing.....
Attempting to allocate 20000 MB on GPU...
[HAMI-core Warn(1046:140457967137280:utils.c:183)]: get default cuda from (null)
[HAMI-core Msg(1046:140457967137280:libvgpu.c:855)]: Initialized
[HAMI-core ERROR (pid:1046 thread=140457967137280 allocator.c:49)]: Device 0 OOM 21244149760 / 20971520000
Failed to allocate 20000 MB on GPU: OOM.
CUDA error: unrecognized error code
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[HAMI-core Msg(1046:140457967137280:multiprocess_memory_limit.c:468)]: Calling exit handler 1046
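The numbers in the OOM line explain what happened: the 20000 MiB limit equals 20000 × 1024 × 1024 = 20,971,520,000 bytes, while the allocator tried to reserve 21,244,149,760 bytes, i.e. the 20000 MiB tensor plus the CUDA context overhead. A quick arithmetic sketch (the overhead is derived from the two values in the log above, it is not a fixed constant):

```python
# Limit enforced by HAMi-core: 20000 MiB expressed in bytes
limit_bytes = 20000 * 1024 * 1024          # 20971520000

# Total the allocator tried to reserve, taken from the error log
requested_bytes = 21244149760

# The tensor itself accounts for the full 20000 MiB;
# the remainder is CUDA context overhead
tensor_bytes = 20000 * 1024 * 1024
overhead_mib = (requested_bytes - tensor_bytes) / 1024 / 1024

print(f"limit     : {limit_bytes} bytes")
print(f"requested : {requested_bytes} bytes")
print(f"overhead  : {overhead_mib:.0f} MiB")

# The request exceeds the limit, hence the OOM
assert requested_bytes > limit_bytes
```

So allocating the full 20000 MB can never succeed under a 20000 MiB limit: the context overhead alone pushes the total past it.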
It OOMs immediately, so 20000 MB is right at the edge. Let's try 19500 instead.
root@hami-30:/mnt/b66582121706406e9797ffaf64a831b0/lixd/hami-test# python test_oom.py 19500
[HAMI-core Msg(1259:140397947200000:libvgpu.c:836)]: Initializing.....
Attempting to allocate 19500 MB on GPU...
[HAMI-core Warn(1259:140397947200000:utils.c:183)]: get default cuda from (null)
[HAMI-core Msg(1259:140397947200000:libvgpu.c:855)]: Initialized
Successfully allocated 19500 MB on GPU.
[HAMI-core Msg(1259:140397947200000:multiprocess_memory_limit.c:468)]: Calling exit handler 1259