
Slurm jobs ignore GPU skipping in gres.conf

0 votes
0 answers
26 views
When I tell gres.conf to omit the first GPU, jobs in Slurm still run on it. If I let Slurm manage both GPUs, the second concurrently running job correctly lands on the second GPU. Why?

Context: I have a server running Ubuntu 20.04.2 LTS with Slurm 23.11.4 (from apt) installed. The server has 2 GPUs:

    Wed Apr  2 12:13:38 2025
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA GeForce RTX 4070        Off |   00000000:01:00.0 Off |                  N/A |
    | 30%   29C    P8              6W /  200W |   10468MiB /  12282MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   1  NVIDIA GeForce RTX 4070        Off |   00000000:03:00.0 Off |                  N/A |
    | 30%   29C    P8              3W /  200W |     158MiB /  12282MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+

I intend to take the first GPU out of local Slurm's control, but ran into an issue. According to the docs (https://slurm.schedmd.com/gres.conf.html), I should be able to simply omit that GPU in gres.conf, e.g. change this:

    NodeName=server2 Name=gpu Type=geforce_rtx_4070 File=/dev/nvidia[0-1]

to this:

    NodeName=server2 Name=gpu Type=geforce_rtx_4070 File=/dev/nvidia1

Since doing that alone causes the node to be DRAINED due to a GRES count mismatch, I also changed the compute node resources line in slurm.conf from

    NodeName=server2 (...) Gres=gpu:geforce_rtx_4070:2 (...)

to

    NodeName=server2 (...) Gres=gpu:geforce_rtx_4070:1 (...)

The problem is that after these changes, all queued jobs (submitted with sbatch --gres=gpu:1 ...) now run on the first GPU. I would suspect the processes are simply ignoring the boundaries set by Slurm, but with the old configuration, when I launch two jobs, the second one correctly sees and uses only the second GPU. What might be the reason?
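To check which device a job actually gets, a minimal test submission like the following can be used (a sketch; the job name and output file are only placeholders, and the --gres request matches the jobs above):

    #!/bin/bash
    #SBATCH --gres=gpu:1            # same GPU request as the jobs in question
    #SBATCH --job-name=gpu-check    # placeholder name
    #SBATCH --output=gpu-check.log  # placeholder output file

    # Show which device index Slurm exposes to the job ...
    echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"

    # ... and which GPUs are enumerable from inside the job.
    nvidia-smi -L

Submitting it with sbatch shows, per job, the index Slurm assigned and the devices visible from inside the job, which makes it easy to compare the old two-GPU configuration with the new one-GPU configuration.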
Asked by Bartłomiej Popielarz (101 rep)
Apr 2, 2025, 12:35 PM