
The strange power consumption behaviour of a Quadro card when `vfio-pci` has been removed and `nvidia` reattached

1 vote
0 answers
204 views
I have built a system with a GeForce GTX 960 and a Quadro M4000 graphics card. The Quadro is usually passed through to a virtual machine, while the GTX 960 is used only by the host. Normally, the Quadro is not available to the host, because the kernel driver vfio-pci prevents it from being used. However, when I am not using it in the virtual machine, I would like to have it accessible from the host, e.g. to do some computation. But there is this very strange behaviour in power consumption and fan speed...

How can I reduce the power consumption and fan speed without needing to have nvidia-settings open all the time?

From my notes:

## Reuse a Passed-through-ready Device on the Host

Suppose a secondary graphics card that has been prepared for passthrough to a guest should be used on the host instead. The device would normally not be usable on the host, since the wrong driver is bound to it. Here, the Quadro M4000 has the vfio-pci driver in use, but the nvidia driver should be used instead.
```shell
sudo lspci -nnk | egrep -A3 "VGA|Display|3D"
  # 0b:00.0 VGA compatible controller : NVIDIA Corporation GM206 [GeForce GTX 960] [10de:1401] (rev a1)
  # Subsystem: Gigabyte Technology Co., Ltd Device [1458:36ac]
  # Kernel driver in use: nvidia
  # Kernel modules: nouveau, nvidia_drm, nvidia
  # --
  # 0c:00.0 VGA compatible controller : NVIDIA Corporation GM204GL [Quadro M4000] [10de:13f1] (rev a1)
  # Subsystem: Hewlett-Packard Company Device [103c:1153]
  # Kernel driver in use: vfio-pci
  # Kernel modules: nouveau, nvidia_drm, nvidia
```
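As a cross-check that does not depend on parsing the lspci output, the bound driver can also be read from the device's `driver` symlink in sysfs. The following is a minimal sketch, not part of the original notes: `pci_driver_of` is a hypothetical helper name, and its optional second argument (an alternative sysfs root) is an assumption added only so the function can be tried against a fake directory tree.

```shell
#!/bin/sh
# Print the kernel driver bound to a PCI device, or "none" if no driver is bound.
# pci_driver_of is a hypothetical helper; the second argument (sysfs root)
# is an assumption for dry runs and defaults to the real /sys.
pci_driver_of() (
    addr="$1"
    root="${2:-/sys}"
    link="$root/bus/pci/devices/$addr/driver"
    if [ -L "$link" ]; then
        # The symlink points at .../bus/pci/drivers/<driver name>
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
)

# Usage with the addresses from the lspci output above:
#   pci_driver_of 0000:0b:00.0
#   pci_driver_of 0000:0c:00.0
```

Once a driver has been unbound and nothing else has claimed the device, the symlink is absent and the helper prints `none`, which corresponds to the missing `Kernel driver in use` line in lspci.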
Unload the vfio-pci driver and check the device status again. No kernel driver should be in use for the Quadro any more, so the line `Kernel driver in use: ...` is gone from its entry.
```shell
sudo modprobe -r vfio-pci
sudo lspci -nnk | egrep -A3 "VGA|Display|3D"
  # 0b:00.0 VGA compatible controller : NVIDIA Corporation GM206 [GeForce GTX 960] [10de:1401] (rev a1)
  # Subsystem: Gigabyte Technology Co., Ltd Device [1458:36ac]
  # Kernel driver in use: nvidia
  # Kernel modules: nouveau, nvidia_drm, nvidia
  # --
  # 0c:00.0 VGA compatible controller : NVIDIA Corporation GM204GL [Quadro M4000] [10de:13f1] (rev a1)
  # Subsystem: Hewlett-Packard Company Device [103c:1153]
  # Kernel modules: nouveau, nvidia_drm, nvidia
  # 0c:00.1 Audio device : NVIDIA Corporation GM204 High Definition Audio Controller [10de:0fbb] (rev a1)
```
Also check the output of the nvidia driver tool nvidia-smi. It should list only one graphics card (the not-passed-through GTX 960).
```shell
sudo nvidia-smi
  # Tue Sep 28 18:19:36 2021
  # +-----------------------------------------------------------------------------+
  # | NVIDIA-SMI 470.74       Driver Version: 470.74       CUDA Version: 11.4     |
  # |-------------------------------+----------------------+----------------------+
  # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
  # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
  # |                               |                      |               MIG M. |
  # |===============================+======================+======================|
  # |   0  NVIDIA GeForce ...  Off  | 00000000:0B:00.0  On |                  N/A |
  # |  0%   51C    P8    19W / 160W |    477MiB /  4040MiB |      0%      Default |
  # |                               |                      |                  N/A |
  # +-------------------------------+----------------------+----------------------+
  # ...
```
Remove all associated PCI devices from the system. In this case, those are 0c:00.0 and 0c:00.1. Then check that those are actually gone.
```shell
echo 1 | sudo tee /sys/bus/pci/devices/0000\:0c\:00.0/remove
echo 1 | sudo tee /sys/bus/pci/devices/0000\:0c\:00.1/remove
sudo ls /sys/bus/pci/devices/ | grep 0c:00.
  # nothing...
```
Then rescan the PCI bus and check that the devices are back and enabled. Also check which kernel driver is now in use and what nvidia-smi reports.
```shell
echo 1 | sudo tee /sys/bus/pci/rescan
sudo ls /sys/bus/pci/devices/ | grep 0c:00.
sudo cat /sys/bus/pci/devices/0000\:0c\:00.?/enable
  # 1
  # 1
sudo lspci -nnk | egrep -A3 "VGA|Display|3D"
  # 0b:00.0 VGA compatible controller : NVIDIA Corporation GM206 [GeForce GTX 960] [10de:1401] (rev a1)
  # Subsystem: Gigabyte Technology Co., Ltd Device [1458:36ac]
  # Kernel driver in use: nvidia
  # Kernel modules: nouveau, nvidia_drm, nvidia
  # --
  # 0c:00.0 VGA compatible controller : NVIDIA Corporation GM204GL [Quadro M4000] [10de:13f1] (rev a1)
  # Subsystem: Hewlett-Packard Company Device [103c:1153]
  # Kernel driver in use: nvidia      # <-- here!
  # Kernel modules: nouveau, nvidia_drm, nvidia
sudo nvidia-smi
  # Tue Sep 28 18:26:16 2021
  # +-----------------------------------------------------------------------------+
  # | NVIDIA-SMI 470.74       Driver Version: 470.74       CUDA Version: 11.4     |
  # |-------------------------------+----------------------+----------------------+
  # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
  # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
  # |                               |                      |               MIG M. |
  # |===============================+======================+======================|
  # |   0  NVIDIA GeForce ...  Off  | 00000000:0B:00.0  On |                  N/A |
  # |  0%   47C    P8    19W / 160W |    479MiB /  4040MiB |      0%      Default |
  # |                               |                      |                  N/A |
  # +-------------------------------+----------------------+----------------------+
  # |   1  Quadro M4000        Off  | 00000000:0C:00.0 Off |                  N/A |
  # | 45%   37C    P0    42W / 120W |      0MiB /  8127MiB |      2%      Default |
  # |                               |                      |                  N/A |
  # +-------------------------------+----------------------+----------------------+
  # ...
```
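The remove and rescan steps above could be wrapped into one function so they are not retyped each time. This is only a sketch of the same sysfs writes: `reattach_pci_devices` is a hypothetical name, and the `SYSFS_ROOT` variable is an assumption added so the logic can be dry-run against a fake directory tree. It does not unload vfio-pci, which still has to happen beforehand as in the notes.

```shell
#!/bin/sh
# reattach_pci_devices is a hypothetical helper that repeats the manual steps:
# write 1 to each device's "remove" file, then write 1 to the bus "rescan" file.
# SYSFS_ROOT is an assumption for dry runs; it defaults to the real /sys.
reattach_pci_devices() (
    root="${SYSFS_ROOT:-/sys}"
    for addr in "$@"; do
        dev="$root/bus/pci/devices/$addr"
        if [ -d "$dev" ]; then
            # Ask the kernel to drop this device
            echo 1 > "$dev/remove" || exit 1
        else
            echo "skipping $addr: not present" >&2
        fi
    done
    # Re-enumerate the bus so the removed devices are probed again
    echo 1 > "$root/bus/pci/rescan"
)

# Usage (must run as root, e.g. inside a root shell, because of the sysfs writes):
#   reattach_pci_devices 0000:0c:00.0 0000:0c:00.1
```

The function runs in a subshell, so its variables do not leak into the calling shell, and it stops at the first failed write.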
Funnily enough, the Quadro M4000 consumes about 42 watts under absolutely no load. I guess this is due to a driver problem... **However**, if the graphical nvidia-settings program is running, the power demand **drops** to about **12 watts**.
```shell
# Terminal A
watch -d -n 1 sudo nvidia-smi
# Terminal B
nvidia-settings
```
Watch nvidia-smi and listen to the fan noise when the magic happens...
```shell
watch -d -n 1 sudo nvidia-smi
  # ...
  # +-------------------------------+----------------------+----------------------+
  # |   1  Quadro M4000        Off  | 00000000:0C:00.0 Off |                  N/A |
  # | 46%   38C    P0    10W / 120W |      0MiB /  8127MiB |      0%      Default |
  # |                               |                      |                  N/A |
  # +-------------------------------+----------------------+----------------------+
  # ...
```
Best of all -- nvidia-settings does not even list my Quadro card...

[Screenshot: No Quadro card in nvidia-settings]
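To make the effect easier to spot than in the full nvidia-smi table, the relevant fields can be queried as CSV and filtered. The sketch below is illustrative only: `flag_idle_power` is a hypothetical filter, and its default thresholds (30 W, 5 % utilisation) are arbitrary assumptions, not values from the driver or from the notes above.

```shell
#!/bin/sh
# flag_idle_power is a hypothetical filter. It reads CSV rows of
#   index, name, pstate, power.draw, utilization.gpu
# (as printed by nvidia-smi with --format=csv,noheader,nounits) and prints
# the GPUs that draw more than max_watts while at or below max_util percent load.
# Both thresholds are assumptions: $1 = max_watts (default 30), $2 = max_util (default 5).
flag_idle_power() {
    awk -F', *' -v max_watts="${1:-30}" -v max_util="${2:-5}" '
        $5 + 0 <= max_util && $4 + 0 > max_watts {
            printf "GPU %s (%s): %s W in %s at %s%% load\n", $1, $2, $4, $3, $5
        }'
}

# Usage, one sample per invocation:
#   nvidia-smi --query-gpu=index,name,pstate,power.draw,utilization.gpu \
#              --format=csv,noheader,nounits | flag_idle_power
```

With the readings shown earlier, the Quadro at 42 W and 2 % load would be reported, while the same card at 10 W with nvidia-settings open would not.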
Asked by dani (102 rep)
Sep 28, 2021, 05:21 PM