
Unix & Linux Stack Exchange

Q&A for users of Linux, FreeBSD and other Unix-like operating systems

Latest Questions

3 votes
1 answer
190 views
nvidia-detect installed but the command not found
I was reinstalling the NVIDIA driver on my Debian 12 system, but couldn't get `nvidia-detect` working: it is installed and shows up in `apt list`, but the command is not found.

```
$ apt search nvidia-detect
nvidia-detect/unknown,now 570.172.08-1 amd64 [installed]
  Transitional dummy package

$ apt list --installed | ag nvidia
nvidia-detect/unknown,now 570.172.08-1 amd64 [installed]

$ nvidia-detect
bash: nvidia-detect: command not found
```

Why is nvidia-detect a **Transitional dummy package**, and why can't I use it? What is the problem here?
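A transitional dummy package typically ships no binaries of its own; it exists only so upgrades pull in its replacement. A small sketch of how to confirm that on a Debian box (`command -v` is the portable way to ask whether a binary is on PATH; the `dpkg`/`apt-cache` lines are the usual package-side checks and will of course vary per system):

```shell
# Check whether the binary actually exists anywhere on PATH:
if command -v nvidia-detect >/dev/null 2>&1; then
    echo "nvidia-detect binary found"
else
    echo "nvidia-detect: package may be installed, but no binary on PATH"
fi
# On the Debian system itself you would additionally inspect the package:
#   dpkg -L nvidia-detect            # files shipped (often just /usr/share/doc)
#   apt-cache depends nvidia-detect  # the real package it is meant to pull in
```

If `dpkg -L` shows only documentation, the command was never part of this package version and the depends output points at what replaced it.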
Rahn (289 rep)
Jul 18, 2025, 03:06 PM • Last activity: Jul 18, 2025, 05:45 PM
0 votes
1 answer
2873 views
How to install CUDA 9.2 on Linux Mint 19
I would like to install CUDA 9.2 on Linux Mint 19. It was possible to [install CUDA on Linux Mint 18.3](https://unix.stackexchange.com/questions/433743/how-to-install-cuda-9-1-on-mint-18-3/435095) even though Linux Mint is not an [officially supported distribution](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#system-requirements). However, Linux Mint 18 was based on Ubuntu 16.04, while the newer Linux Mint 19 is based on Ubuntu 18.04 ([wikipedia](https://en.wikipedia.org/wiki/Linux_Mint_version_history)). Ubuntu 18.04 seems to have a [super easy solution](https://askubuntu.com/questions/1028830/how-do-i-install-cuda-on-ubuntu-18-04?rq=1) (the most popular answer), which is tempting, but I would prefer not to switch distros if I can avoid it.
Sam Murphy (101 rep)
Aug 23, 2018, 10:32 AM • Last activity: Jul 12, 2025, 08:46 AM
0 votes
1 answer
1938 views
failure: repodata/repomd.xml from libnvidia-container: [Errno 256] No more mirrors to try
I followed the instructions from https://developer.nvidia.com/cuda-80-ga2-download-archive :

```
sudo rpm -i cuda-repo-rhel7-8-0-local-ga2-8.0.61-1.x86_64-rpm
sudo yum clean all
sudo yum install cuda
```

Now I get this error at the last step:

```
[jalal@goku GoodNews]$ sudo yum install cuda
[sudo] password for jalal:
Loaded plugins: aliases, changelog, copr, fastestmirror, kabi, langpacks, nvidia, priorities, product-id, search-disabled-repos, subscription-
            : manager, tmprepo, verify, versionlock
This system is not registered with an entitlement server. You can use subscription-manager to register.
Loading support for Red Hat kernel ABI
Loading mirror speeds from cached hostfile
 * centos-sclo-rh: centos.mirror.constant.com
 * centos-sclo-sclo: mirror.lug.udel.edu
 * remi-php70: repo1.ash.innoscale.net
 * remi-php71: repo1.ash.innoscale.net
 * remi-php73: repo1.ash.innoscale.net
 * remi-safe: repo1.ash.innoscale.net
 * webtatic: us-east.repo.webtatic.com
Atom/x86_64/signature                    |  833 B  00:00:00
Atom/x86_64/signature                    | 1.0 kB  00:00:00 !!!
WANdisco-git                             | 2.9 kB  00:00:00
adobe-linux-x86_64                       | 2.9 kB  00:00:00
base                                     | 3.6 kB  00:00:00
carlwgeorge-ripgrep                      | 3.3 kB  00:00:00
centos-sclo-rh                           | 3.0 kB  00:00:00
centos-sclo-sclo                         | 3.0 kB  00:00:00
code                                     | 3.0 kB  00:00:00
cs                                       | 2.9 kB  00:00:00
cuda-8-0-local-ga2                       | 2.5 kB  00:00:00
docker-ce-stable                         | 3.5 kB  00:00:00
epel                                     | 4.7 kB  00:00:00
extras                                   | 2.9 kB  00:00:00
google-chrome                            | 1.3 kB  00:00:00
ius                                      | 1.3 kB  00:00:00
jknife-ue4deps                           | 3.0 kB  00:00:00
libnvidia-container/x86_64/signature     |  488 B  00:00:00
Retrieving key from https://nvidia.github.io/libnvidia-container/gpgkey
libnvidia-container/x86_64/signature     | 2.1 kB  00:00:00 !!!
https://nvidia.github.io/libnvidia-container/centos7/x86_64/repodata/repomd.xml: [Errno -1] repomd.xml signature could not be verified for libnvidia-container
Trying other mirror.

 One of the configured repositories failed (libnvidia-container),
 and yum doesn't have enough cached data to continue. At this point the only
 safe thing yum can do is fail. There are a few ways to work "fix" this:

     1. Contact the upstream for the repository and get them to fix the problem.

     2. Reconfigure the baseurl/etc. for the repository, to point to a working
        upstream. This is most often useful if you are using a newer
        distribution release than is supported by the repository (and the
        packages for the previous distribution release still work).

     3. Run the command with the repository temporarily disabled
            yum --disablerepo=libnvidia-container ...

     4. Disable the repository permanently, so yum won't use it by default. Yum
        will then just ignore the repository until you permanently enable it
        again or use --enablerepo for temporary usage:

            yum-config-manager --disable libnvidia-container
        or
            subscription-manager repos --disable=libnvidia-container

     5. Configure the failing repository to be skipped, if it is unavailable.
        Note that yum will try to contact the repo. when it runs most commands,
        so will have to try and fail each time (and thus. yum will be be much
        slower). If it is a very temporary problem though, this is often a nice
        compromise:

            yum-config-manager --save --setopt=libnvidia-container.skip_if_unavailable=true

failure: repodata/repomd.xml from libnvidia-container: [Errno 256] No more mirrors to try.
https://nvidia.github.io/libnvidia-container/centos7/x86_64/repodata/repomd.xml: [Errno -1] repomd.xml signature could not be verified for libnvidia-container
```

I have the following:

```
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61

$ lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch
Distributor ID: CentOS
Description:    CentOS Linux release 7.8.2003 (Core)
Release:        7.8.2003
Codename:       Core
```

and two 1080 Ti GPUs.
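Note that the failure is a repo *signature* check, not a mirror outage, so besides the disable options yum itself lists, one hedged option is to re-import that repository's GPG key (the URL is the one yum printed) and retry:

```
$ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey -o /tmp/libnvidia-container.gpgkey
$ sudo rpm --import /tmp/libnvidia-container.gpgkey
$ sudo yum clean all
$ sudo yum install cuda
```

And since the package being installed comes from the local cuda-8-0-local-ga2 repo rather than libnvidia-container, simply disabling the failing repo for this one transaction (`sudo yum --disablerepo=libnvidia-container install cuda`, as yum's own advice suggests) should also work.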
Mona Jalal (119 rep)
Sep 19, 2020, 04:42 AM • Last activity: Jun 11, 2025, 01:03 AM
0 votes
0 answers
25 views
CUDA initialization: Unexpected error from cudaGetDeviceCount() error while trying to install Stable Diffusion on FreeBSD + nvidia driver 570.144
I'm trying to install Stable Diffusion on FreeBSD 14.2 + nvidia driver 570.144, following this tutorial: https://github.com/verm/freebsd-stable-diffusion

There Verm used nvidia driver 525 and FreeBSD 13 I presume, but now I'm running FreeBSD 14.2 and the nvidia driver 570.144. This is what I did:

```
# pkg install linux-miniconda-installer
# miniconda-installer
# bash
# source ${BASE_PATH}/etc/profile.d/conda.sh
# conda activate
(base) #
(base) # conda create --name pytorch python=3.10
(base) # conda activate pytorch
(pytorch) # pip install torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
(pytorch) # LD_PRELOAD="/compat/dummy-uvm.so" python3 -c 'import torch; print(torch.cuda.is_available())'
True
```

but with the driver 570.144 it does not work anymore:

```
(pytorch) marietto# nv-sglrun nvidia-smi
/usr/local/lib/libc6-shim/libc6.so: shim init
Wed May 21 09:41:54 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.144                Driver Version: 570.144        CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1060 3GB   Off  |   00000000:01:00.0  On |                  N/A |
| 56%   39C    P8              7W / 120W  |     279MiB /   3072MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
```

```
[marietto@marietto ~]==> source ./miniconda3//etc/profile.d/conda.sh
[marietto@marietto ~]==> conda activate
(base) # marietto# conda create --name pytorch python=3.10
/home/marietto/miniconda3/lib/python3.12/site-packages/archspec/cpu/detect.py:290: UserWarning: [Errno 2] No such file or directory: '/proc/cpuinfo'
  warnings.warn(str(exc))
Retrieving notices: done
Channels:
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/marietto/miniconda3/envs/pytorch

  added / updated specs:
    - python=3.10

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    python-3.10.16             |       he870216_1        26.9 MB
    setuptools-78.1.1          |  py310h06a4308_0         1.7 MB
    wheel-0.45.1               |  py310h06a4308_0         115 KB
    ------------------------------------------------------------
                                           Total:        28.7 MB

The following NEW packages will be INSTALLED:

  _libgcc_mutex      pkgs/main/linux-64::_libgcc_mutex-0.1-main
  _openmp_mutex      pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
  bzip2              pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6
  ca-certificates    pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0
  ld_impl_linux-64   pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0
  libffi             pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1
  libgcc-ng          pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
  libgomp            pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
  libstdcxx-ng       pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
  libuuid            pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0
  ncurses            pkgs/main/linux-64::ncurses-6.4-h6a678d5_0
  openssl            pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0
  pip                pkgs/main/noarch::pip-25.1-pyhc872135_2
  python             pkgs/main/linux-64::python-3.10.16-he870216_1
  readline           pkgs/main/linux-64::readline-8.2-h5eee18b_0
  setuptools         pkgs/main/linux-64::setuptools-78.1.1-py310h06a4308_0
  sqlite             pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0
  tk                 pkgs/main/linux-64::tk-8.6.14-h39e8969_0
  tzdata             pkgs/main/noarch::tzdata-2025b-h04d1e81_0
  wheel              pkgs/main/linux-64::wheel-0.45.1-py310h06a4308_0
  xz                 pkgs/main/linux-64::xz-5.6.4-h5eee18b_1
  zlib               pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1

Proceed ([y]/n)? Y

Downloading and Extracting Packages:
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate pytorch
#
# To deactivate an active environment, use
#
#     $ conda deactivate

marietto# exit
(base) [marietto@marietto ~]$ conda activate pytorch
Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu113
Collecting torch==1.12.1+cu113
  Downloading https://download.pytorch.org/whl/cu113/torch-1.12.1%2Bcu113-cp310-cp310-linux_x86_64.whl (1837.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 GB 4.5 MB/s eta 0:00:00
Collecting typing-extensions (from torch==1.12.1+cu113)
  Downloading typing_extensions-4.13.2-py3-none-any.whl.metadata (3.0 kB)
Downloading typing_extensions-4.13.2-py3-none-any.whl (45 kB)
Installing collected packages: typing-extensions, torch
  ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 1/2 [torch]
WARNING: The scripts convert-caffe2-to-onnx, convert-onnx-to-caffe2 and torchrun are installed in '/home/marietto/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed torch-1.12.1+cu113 typing-extensions-4.13.2

(pytorch) [marietto@marietto ~]$ export CUDA_VISIBLE_DEVICES=0,1;
(pytorch) [marietto@marietto ~]$ LD_PRELOAD="/mnt/da0p2/CG/Tools/Stable-Diffusion/dummy-uvm.so" python3 -c 'import torch; print(torch.cuda.is_available())'
/home/marietto/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:83: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 304: OS call failed or operation not supported on this OS (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
False
```
mister_smith (1 rep)
May 21, 2025, 03:18 PM
5 votes
3 answers
7346 views
How do I use Cuda toolkit nvcc 11.7.1 on Fedora 36?
As of Sept 2022, Nvidia still has not officially supported the CUDA toolkit on Fedora 36. The particular part missing is support for gcc 12, which Fedora 36 defaults to. One way to use nvcc on Fedora is to go to the [fedora mirrors](https://admin.fedoraproject.org/mirrormanager/) and download Fedora 35. However, I'd like to know **how to get nvcc working on Fedora 36.** [There's an RPM Fusion wiki page on CUDA](https://rpmfusion.org/Howto/CUDA), though some of the info is still somewhat difficult to find. The Fedora 35 cuda repo [is complete and has all the necessary files](https://developer.download.nvidia.com/compute/cuda/repos/fedora35/x86_64/), but (as of Sept 2022) the equivalent Fedora 36 nvidia cuda repo [exists but seems incomplete](https://developer.download.nvidia.com/compute/cuda/repos/fedora36/x86_64/); in particular, it's missing the rpm files that start with cuda-11....
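Separately from the repo situation, nvcc can be told to use an older host compiler, which sidesteps the gcc 12 incompatibility. This sketch assumes a gcc 11 binary is available somewhere (e.g. a local build or a compat package; the path below is illustrative):

```
$ nvcc -ccbin /usr/bin/gcc-11 -o vectoradd vectoradd.cu
```

`-ccbin` (long form `--compiler-bindir`) is a standard nvcc flag; recent toolkits also honor the `NVCC_PREPEND_FLAGS` environment variable, so the flag doesn't have to be repeated on every invocation.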
xdavidliu (393 rep)
Sep 5, 2022, 09:53 PM • Last activity: Apr 30, 2025, 02:05 AM
1 vote
3 answers
10166 views
Ubuntu 22.04, Dual graphic cards, Intel for display, NVidia for GPGPU, how to setup?
My laptop has two graphics cards, something very similar to: https://www.linuxbabe.com/desktop-linux/switch-intel-nvidia-graphics-card-ubuntu . I would like to use the **Intel** graphics card for display purposes only, and the **NVidia** graphics card for heavy computation (GPGPU). My questions:

1. Do I still need to install the NVidia driver? It seems the driver is **for display ONLY**? So, is it a must to install the NVidia driver if I do **NOT** expect to use the NVidia card for display?

2. Without the NVidia driver, will those 3rd-party libraries still be able to run? For instance, Tensorflow, etc.?

From https://docs.nvidia.com/deploy/cuda-compatibility/index.html , it's clearly written that:

> To build an application, a developer has to install only the CUDA Toolkit and necessary libraries required for linking.
>
> In order to run a CUDA application, the system should have a CUDA enabled GPU and an NVIDIA display driver that is compatible with the CUDA Toolkit that was used to build the application itself.

It looks to me:

- In order to run a CUDA application, I have to install the NVidia driver, which enables the NVidia graphics card.
- However, the NVidia driver is for display purposes. In order to use it, would I have to use the NVidia graphics card for display, rather than the Intel card???

Sorry for my naive question. I'm somewhat conceptually confused... Looking forward to the answer.
Pei JIA (111 rep)
Jun 21, 2022, 03:17 AM • Last activity: Apr 29, 2025, 11:10 AM
0 votes
0 answers
26 views
Slurm jobs ignore GPU skipping in gres.conf
When I specify in gres.conf to omit the first GPU, processes in Slurm still use the first one. If I allow Slurm to manage both, the second concurrent process properly goes onto the second GPU. Why?

Context: I have a server with Ubuntu 20.04.2 LTS with Slurm (23.11.4 from apt) installed. The server has 2 GPUs:

```
Wed Apr  2 12:13:38 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070        Off |   00000000:01:00.0 Off |                  N/A |
| 30%   29C    P8              6W /  200W |   10468MiB /  12282MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4070        Off |   00000000:03:00.0 Off |                  N/A |
| 30%   29C    P8              3W /  200W |     158MiB /  12282MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
```

I intend to take the first GPU out of local Slurm, but encountered an issue. According to the docs (https://slurm.schedmd.com/gres.conf.html), I should be able to just omit the GPU in gres.conf, e.g. from this:

```
NodeName=server2 Name=gpu Type=geforce_rtx_4070 File=/dev/nvidia[0-1]
```

to this:

```
NodeName=server2 Name=gpu Type=geforce_rtx_4070 File=/dev/nvidia1
```

Since doing that alone causes the node to be DRAINED due to a count mismatch, I also modified the compute node resources line in slurm.conf from

```
NodeName=server2 (...) Gres=gpu:geforce_rtx_4070:2 (...)
```

to

```
NodeName=server2 (...) Gres=gpu:geforce_rtx_4070:1 (...)
```

The problem is that after those changes, all queued processes (started by sbatch --gres=gpu:1 ...) now go onto the first GPU. I would assume it's some problem with the process completely ignoring the boundaries set by Slurm, but with the old configuration, when launching two processes, the second launched process properly sees and uses only the second GPU. What might be the reason?
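One detail worth checking that the question doesn't mention: Slurm only *enforces* which GPU devices a job can open when device constraining through cgroups is enabled; without it, Slurm merely sets CUDA_VISIBLE_DEVICES, and processes remain free to open any /dev/nvidia* node. A sketch of the relevant settings (verify against the slurm.conf and cgroup.conf man pages for your Slurm version):

```
# slurm.conf
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# cgroup.conf
ConstrainDevices=yes
```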
Bartłomiej Popielarz (101 rep)
Apr 2, 2025, 12:35 PM
0 votes
0 answers
196 views
Which NVIDIA CUDA version works with Linux Mint 21,22, which kernel and driver?
I want to try to use the GPU in my old laptop with Linux Mint 21 (kernel 5.15.0-41-generic, based on Ubuntu jammy). Trying to get the NVIDIA driver and CUDA working, I've found out already:

1) One can install with some script or from the repos. I've been installing from repos.
2) For the NVIDIA driver to work after boot, one can (needs to?) blacklist the nouveau driver - done via `module_blacklist=nouveau` in the boot command line.
3) "Open" drivers work for Turing and later, not my Maxwell. So I've installed the latest (not -open) from the LM 21 repos: `apt-get install nvidia-driver-550 nvidia-cuda-toolkit`, and apt downloaded 11.5 cuda files with it.
4) `nvidia-smi` shows the card and CUDA 12.4 - even though my debs were 11.5; why, by the way?

But the CUDA samples fail with an unknown error. A web search found it could be due to:

1) too new a kernel; solution - switch to LTS (https://bbs.archlinux.org/viewtopic.php?id=260036)
2) version (presumably kernel/driver/cuda) issues; solutions: downgrade the nvidia driver / cuda (https://discussion.fedoraproject.org/t/cuda-initialisation-errors/98985)

I'm already tired of installing and rebooting. Please tell me which modern versions are compatible/working, preferably for my preferred distro - Linux Mint.
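On the side question in point 4: the two tools report different things, so the 11.5-vs-12.4 mismatch is expected rather than a bug. nvidia-smi prints the highest CUDA *runtime* version the installed 550 driver can support, while nvcc reports the *toolkit* that apt installed:

```
$ nvidia-smi | grep 'CUDA Version'   # driver's supported ceiling (12.4 here)
$ nvcc --version | grep release      # installed toolkit (release 11.5)
```

A driver reporting 12.4 will run binaries built with an 11.5 toolkit; it is the reverse situation (toolkit newer than the driver supports) that fails.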
Martian2020 (1443 rep)
Apr 2, 2025, 01:36 AM • Last activity: Apr 2, 2025, 12:01 PM
1 vote
0 answers
331 views
How do I get rootless podman to work with nvidia gpu after reboot?
I have a RHEL9 system with an NVIDIA L40S, Driver Version 570.124.06, CUDA Version 12.8, installed as described here by (basically) running:

```
# dnf config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/rhel9/$(uname -i)/cuda-rhel9.repo
# dnf module install nvidia-driver:latest
# nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
```

In order to allow non-root users access to the GPU via podman I changed the selinux type on the /dev/nvidia* device objects like so:
```
# semanage fcontext -a -t container_file_t '/dev/nvidia(.*)?'
# restorecon -rv /dev/nvidia*
Relabeled /dev/nvidia0 from system_u:object_r:xserver_misc_device_t:s0 to system_u:object_r:container_file_t:s0
Relabeled /dev/nvidia-caps from unconfined_u:object_r:device_t:s0 to unconfined_u:object_r:container_file_t:s0
Relabeled /dev/nvidia-caps/nvidia-cap2 from unconfined_u:object_r:xserver_misc_device_t:s0 to unconfined_u:object_r:container_file_t:s0
Relabeled /dev/nvidia-caps/nvidia-cap1 from unconfined_u:object_r:xserver_misc_device_t:s0 to unconfined_u:object_r:container_file_t:s0
Relabeled /dev/nvidiactl from system_u:object_r:xserver_misc_device_t:s0 to system_u:object_r:container_file_t:s0
Relabeled /dev/nvidia-uvm from unconfined_u:object_r:xserver_misc_device_t:s0 to unconfined_u:object_r:container_file_t:s0
Relabeled /dev/nvidia-uvm-tools from unconfined_u:object_r:xserver_misc_device_t:s0 to unconfined_u:object_r:container_file_t:s0
```
After which a non-root user could run the following successfully:

```
$ podman run --rm --device nvidia.com/gpu=all nvidia/cuda:12.8.1-base-ubi9 nvidia-smi
```

Everything was done and looking good, I figured - until dnf-automatic rebooted the machine. When the machine comes up after a reboot, all the device files are naturally re-created with the old selinux labels, forcing root to re-run restorecon to make the devices available to the user. To make things worse, not all nvidia devices are created on boot; specifically the user runs into:
```
$ podman run --rm --device nvidia.com/gpu=all nvidia/cuda:12.8.1-base-ubi9 nvidia-smi
Error: setting up CDI devices: failed to inject devices: failed to stat CDI host device "/dev/nvidia-uvm": no such file or directory
```
If I run something like nvidia-smi, the driver "wakes up" and creates the uvm device, after which I can run restorecon, and then the user can do their thing. Am I holding it wrong? Should I really need to create a oneshot unit to massage these things into place? Or is there some way to teach selinux and the nvidia driver to come up in the desired state?
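For reference, a sketch of the oneshot unit the question contemplates, assuming nvidia-modprobe is installed (the unit name and device list are mine - adjust them to what actually exists on the system; `-u` asks nvidia-modprobe to load the UVM module and create /dev/nvidia-uvm*, after which restorecon reapplies the labels):

```
# /etc/systemd/system/nvidia-dev-labels.service (hypothetical name)
[Unit]
Description=Create NVIDIA device nodes and restore SELinux labels
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-modprobe -u -c0
ExecStart=/usr/sbin/restorecon -Rv /dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia-uvm-tools /dev/nvidia-caps

[Install]
WantedBy=multi-user.target
```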
azzid (1010 rep)
Mar 21, 2025, 07:33 AM
0 votes
0 answers
69 views
Error when trying to install up to date cuda environment on Ubuntu 22.04
I'm following NVidia's instructions here: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_local

I'm encountering several issues. First of all, after the cuda toolkit installation, I expected to have `nvcc` available, but it appears not to be there. I'm prompted to do `apt install nvidia-cuda-toolkit`, but that points to the default repository version, which is outdated. Anyway, I'm trying to get to the next step, which is updating the drivers.
The first given step is `apt install nvidia-open`, and it fails with:

```
The following packages have unmet dependencies:
 nvidia-kernel-common-535 : Conflicts with: nvidia-kernel-common
 nvidia-kernel-common-560 : Depends: nvidia-modprobe (>= 560.35.05-0ubuntu1) but 470.103.01-1 will be installed
                            Conflicts with: nvidia-kernel-common
E: Error, pkgProblem::Resolve generated breaks, this may be caused by held packages.
```

I tried also my original driver install procedure:
```
sudo ubuntu-drivers devices
sudo ubuntu-drivers install
```
The recommended version appears to be 560, but the installation fails with the same error as above. How do I fix this issue?
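The mix of 470/535/560 packages in the error suggests remnants of older driver installs are pinning the resolver (note the old nvidia-modprobe 470.103.01 being selected). A common, if blunt, way out is to purge all NVIDIA packages and reinstall from a clean slate; review the removal list apt shows before confirming:

```
$ sudo apt purge 'nvidia-*' 'libnvidia-*'
$ sudo apt autoremove
$ sudo apt update
$ sudo apt install nvidia-open        # or: sudo ubuntu-drivers install
```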
Oersted (101 rep)
Jan 9, 2025, 09:22 AM
0 votes
0 answers
17 views
How can I Determine CUDA update version based on installed toolkit files?
I've installed some version of the CUDA toolkit to /usr/local/cuda. Suppose I don't have access to any information about the system, like activity logs, package management state and such - I'm only inspecting the contents of files in /usr/local/cuda. I want to determine the CUDA release version of this directory, but I want the update number as well; e.g. I want to distinguish 12.6 update 2 from 12.6 update 0.

Now, if I could download and install back-versions of CUDA, I could achieve this by examining the `nvcc --version` output, since that usually has both a timestamp of the build and an X.Y.Z version string. Unfortunately, the third element of that string is a number different from the update number, e.g. V12.6.85 for CUDA 12.6 update 3 - how would I know that 85 corresponds to update 3?

What I tried:

* version.json has all sorts of version info, but those values again don't have a trivial correspondence to the update number.
* lib64/* are a bunch of library files, some with versions, but again - no clear correspondence to the update number.
* include/ - couldn't find something relevant inside the headers, but maybe I wasn't looking in the right place?
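For programmatic inspection, version.json is still the most convenient starting point. Below is a small hypothetical sketch that reads the "cuda" product entry from such a file; whether its third version field actually equals the update number is an assumption to verify against a known install (the question suggests the correspondence is not trivial for every entry in the file):

```python
import json

def cuda_release(version_json_text: str) -> str:
    """Extract 'X.Y update Z' from a CUDA version.json blob.

    Assumes (to be verified against a known install!) that the third
    field of the "cuda" product version is the update number.
    """
    data = json.loads(version_json_text)
    parts = data["cuda"]["version"].split(".")
    major, minor = parts[0], parts[1]
    update = parts[2] if len(parts) > 2 else "0"
    return f"{major}.{minor} update {update}"

# Synthetic example input (not taken from a real install):
sample = '{"cuda": {"name": "CUDA SDK Linux", "version": "12.6.2"}}'
print(cuda_release(sample))  # 12.6 update 2
```

If that assumption fails for some releases, the fallback is cross-referencing one of the library sonames (e.g. the libcudart version in lib64) against NVIDIA's release notes table.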
einpoklum (10753 rep)
Dec 19, 2024, 03:44 PM
0 votes
0 answers
34 views
Can I install multiple different versions of cuda on a single system?
I have a system with two RTX 3090s. Many ML projects require different CUDA versions. Is it possible to somehow install them side by side, or do I have to either install another version of the nvidia driver each time (or have multiple OSes)?
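CUDA toolkits are designed to coexist: each installs under its own /usr/local/cuda-X.Y prefix, and a single sufficiently new driver runs all of them (the driver is backward compatible with older toolkits). Selecting a toolkit per project is then just environment setup; the paths below are the conventional ones, adjust to your actual installs:

```shell
# Point this shell (or a project's activation script) at one toolkit:
CUDA_HOME=/usr/local/cuda-11.8
PATH="$CUDA_HOME/bin:$PATH"
LD_LIBRARY_PATH="$CUDA_HOME/lib64:${LD_LIBRARY_PATH:-}"
export CUDA_HOME PATH LD_LIBRARY_PATH
echo "using toolkit at $CUDA_HOME"
```

Many ML setups avoid touching /usr/local entirely by letting conda or pip-installed frameworks bundle their own CUDA runtime per environment.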
user2741831 (323 rep)
Nov 1, 2024, 04:04 PM
1 vote
0 answers
192 views
GUI Screen Issues on Rocky Linux 9.4 with NVIDIA Driver Installation
On Rocky Linux 9.4, there is an issue where the GUI screen does not appear after installing the NVIDIA driver. The server is equipped with four PCIe type A100 GPUs. The driver version is 555.42.06, and the installation method is as follows:

```
sudo dnf install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
sudo rpm --install cuda-repo-rhel9-12-5-local-12.5.1_555.42.06-1.x86_64.rpm
sudo dnf module install nvidia-driver:latest-dkms
```

Edit the following file:

```
sudo vim /etc/modprobe.d/blacklist-nouveau.conf
```

```
blacklist nouveau
options nouveau modeset=0
```

Then, run:

```
sudo dracut --force
sudo reboot
```

After rebooting, the screen remains black after boot. However, switching to CLI mode using Ctrl+Alt+F2 on the KVM provides a functional CLI screen. Running nvidia-smi indicates that the GPU driver is installed correctly. I am looking for a solution to get the GUI screen working on Rocky Linux.
Antenna_ (35 rep)
Jul 30, 2024, 12:52 PM
-1 votes
1 answer
308 views
Where is CUDA deviceQuery installed in Pop!OS?
I am using Pop!OS and would like to run the CUDA deviceQuery tool, which can apparently provide info about the number of blocks and threads available for CUDA on an NVidia GPU. However, I am not sure where deviceQuery is installed on the system. Can anyone tell me where I can find the command, or which package I need to install to use it?
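For what it's worth, on recent toolkits deviceQuery is not shipped as a prebuilt binary at all: it lives in NVIDIA's cuda-samples source tree and has to be compiled locally. A sketch, assuming git, make and a working nvcc are available:

```
$ git clone https://github.com/NVIDIA/cuda-samples.git
$ cd cuda-samples/Samples/1_Utilities/deviceQuery
$ make
$ ./deviceQuery
```

On older packaged toolkits the samples may instead live under /usr/local/cuda/samples.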
Time4Tea (2618 rep)
May 7, 2024, 02:08 PM • Last activity: May 8, 2024, 05:41 PM
0 votes
1 answer
851 views
Lots of errors trying to update nvidia drivers with dnf update --refresh on Fedora 39
So about a week or two ago I tried updating my kernel, which incidentally broke my NVIDIA drivers somehow. Every startup I'd get the "NVIDIA kernel module broken. Reverting to nouveau" message, or something along those lines. Looked it up and was told by multiple sources to run `sudo dnf update --refresh`, but doing so gave me:
```
Problem 1: package xorg-x11-drv-nvidia-power-3:550.67-1.fc39.x86_64 from rpmfusion-nonfree-nvidia-driver requires xorg-x11-drv-nvidia(x86-64) = 3:550.67, but none of the providers can be installed
  - cannot install the best update candidate for package xorg-x11-drv-nvidia-power-3:550.54.14-1.fc39.x86_64
  - package xorg-x11-drv-nvidia-3:550.67-1.fc39.x86_64 from rpmfusion-nonfree-nvidia-driver is filtered out by modular filtering
  - package xorg-x11-drv-nvidia-3:550.67-1.fc39.x86_64 from rpmfusion-nonfree-updates is filtered out by modular filtering
 Problem 2: package akmod-nvidia-3:550.67-1.fc39.x86_64 from rpmfusion-nonfree-nvidia-driver requires nvidia-kmod-common >= 3:550.67, but none of the providers can be installed
  - cannot install the best update candidate for package akmod-nvidia-3:535.129.03-1.fc39.x86_64
  - package xorg-x11-drv-nvidia-3:550.67-1.fc39.x86_64 from rpmfusion-nonfree-nvidia-driver is filtered out by modular filtering
  - package xorg-x11-drv-nvidia-3:550.67-1.fc39.x86_64 from rpmfusion-nonfree-updates is filtered out by modular filtering
 Problem 3: package nvidia-kmod-common-3:550.54.15-1.fc39.noarch from cuda-fedora39-x86_64 requires nvidia-kmod = 3:550.54.15, but none of the providers can be installed
  - package nvidia-driver-3:550.54.15-1.fc39.x86_64 from cuda-fedora39-x86_64 requires nvidia-kmod-common = 3:550.54.15, but none of the providers can be installed
  - package kmod-nvidia-open-dkms-3:550.54.14-1.fc39.x86_64 from @System conflicts with kmod-nvidia-latest-dkms provided by kmod-nvidia-latest-dkms-3:550.54.15-1.fc39.x86_64 from cuda-fedora39-x86_64
  - cannot install the best update candidate for package xorg-x11-drv-nvidia-3:550.54.14-1.fc39.x86_64
  - cannot install the best update candidate for package kmod-nvidia-open-dkms-3:550.54.14-1.fc39.x86_64
  - package kmod-nvidia-open-dkms-3:550.54.15-1.fc39.x86_64 from cuda-fedora39-x86_64 is filtered out by modular filtering
 Problem 4: package xorg-x11-drv-nvidia-power-3:550.54.14-1.fc39.x86_64 from @System requires xorg-x11-drv-nvidia(x86-64) = 3:550.54.14, but none of the providers can be installed
  - package xorg-x11-drv-nvidia-3:550.54.14-1.fc39.x86_64 from @System requires nvidia-modprobe(x86-64) = 3:550.54.14, but none of the providers can be installed
  - problem with installed package xorg-x11-drv-nvidia-power-3:550.54.14-1.fc39.x86_64
  - cannot install both nvidia-modprobe-3:550.54.15-1.fc39.x86_64 from cuda-fedora39-x86_64 and nvidia-modprobe-3:550.54.14-1.fc39.x86_64 from @System
  - cannot install both nvidia-modprobe-3:550.54.15-1.fc39.x86_64 from cuda-fedora39-x86_64 and nvidia-modprobe-3:550.54.14-1.fc39.x86_64 from cuda-fedora39-x86_64
  - package xorg-x11-drv-nvidia-power-3:550.67-1.fc39.x86_64 from rpmfusion-nonfree-nvidia-driver requires xorg-x11-drv-nvidia(x86-64) = 3:550.67, but none of the providers can be installed
  - package xorg-x11-drv-nvidia-power-3:550.67-1.fc39.x86_64 from rpmfusion-nonfree-updates requires xorg-x11-drv-nvidia(x86-64) = 3:550.67, but none of the providers can be installed
  - cannot install the best update candidate for package nvidia-modprobe-3:550.54.14-1.fc39.x86_64
  - package xorg-x11-drv-nvidia-3:550.67-1.fc39.x86_64 from rpmfusion-nonfree-nvidia-driver is filtered out by modular filtering
  - package xorg-x11-drv-nvidia-3:550.67-1.fc39.x86_64 from rpmfusion-nonfree-updates is filtered out by modular filtering
```
 Problem 5: problem with installed package akmod-nvidia-3:535.129.03-1.fc39.x86_64
  - package akmod-nvidia-3:535.129.03-1.fc39.x86_64 from @System requires xorg-x11-drv-nvidia-kmodsrc = 3:535.129.03, but none of the providers can be installed
  - package akmod-nvidia-3:535.129.03-1.fc39.x86_64 from rpmfusion-nonfree requires xorg-x11-drv-nvidia-kmodsrc = 3:535.129.03, but none of the providers can be installed
  - cannot install both xorg-x11-drv-nvidia-kmodsrc-3:550.67-1.fc39.x86_64 from rpmfusion-nonfree-nvidia-driver and xorg-x11-drv-nvidia-kmodsrc-3:535.129.03-2.fc39.x86_64 from @System
  - cannot install both xorg-x11-drv-nvidia-kmodsrc-3:550.67-1.fc39.x86_64 from rpmfusion-nonfree-nvidia-driver and xorg-x11-drv-nvidia-kmodsrc-3:535.129.03-2.fc39.x86_64 from rpmfusion-nonfree
  - package akmod-nvidia-3:550.67-1.fc39.x86_64 from rpmfusion-nonfree-nvidia-driver requires nvidia-kmod-common >= 3:550.67, but none of the providers can be installed
  - package akmod-nvidia-3:550.67-1.fc39.x86_64 from rpmfusion-nonfree-updates requires nvidia-kmod-common >= 3:550.67, but none of the providers can be installed
  - cannot install the best update candidate for package xorg-x11-drv-nvidia-kmodsrc-3:535.129.03-2.fc39.x86_64
  - package xorg-x11-drv-nvidia-3:550.67-1.fc39.x86_64 from rpmfusion-nonfree-nvidia-driver is filtered out by modular filtering
  - package xorg-x11-drv-nvidia-3:550.67-1.fc39.x86_64 from rpmfusion-nonfree-updates is filtered out by modular filtering
 Problem 6: problem with installed package kmod-nvidia-open-dkms-3:550.54.14-1.fc39.x86_64
  - package kmod-nvidia-open-dkms-3:550.54.14-1.fc39.x86_64 from @System requires nvidia-kmod-common = 3:550.54.14, but none of the providers can be installed
  - package kmod-nvidia-open-dkms-3:550.54.14-1.fc39.x86_64 from cuda-fedora39-x86_64 requires nvidia-kmod-common = 3:550.54.14, but none of the providers can be installed
  - package nvidia-kmod-common-3:550.54.14-1.fc39.noarch from cuda-fedora39-x86_64 requires nvidia-driver = 3:550.54.14, but none of the providers can be installed
  - cannot install both nvidia-driver-3:550.54.15-1.fc39.x86_64 from cuda-fedora39-x86_64 and nvidia-driver-3:550.54.14-1.fc39.x86_64 from cuda-fedora39-x86_64
  - package xorg-x11-drv-nvidia-3:550.54.14-1.fc39.x86_64 from @System requires nvidia-settings(x86-64) = 3:550.54.14, but none of the providers can be installed
  - package nvidia-settings-3:550.54.15-1.fc39.x86_64 from cuda-fedora39-x86_64 requires nvidia-driver(x86-64) = 3:550.54.15, but none of the providers can be installed
  - cannot install both nvidia-settings-3:550.54.15-1.fc39.x86_64 from cuda-fedora39-x86_64 and nvidia-settings-3:550.54.14-1.fc39.x86_64 from @System
  - cannot install both nvidia-settings-3:550.54.15-1.fc39.x86_64 from cuda-fedora39-x86_64 and nvidia-settings-3:550.54.14-1.fc39.x86_64 from cuda-fedora39-x86_64
  - cannot install the best update candidate for package nvidia-settings-3:550.54.14-1.fc39.x86_64
==============================================================================================================================================================================================================
 Package                                                 Architecture                       Version                                         Repository                                                   Size
==============================================================================================================================================================================================================
Skipping packages with conflicts:
(add '--best --allowerasing' to command line to force their upgrade):
 kmod-nvidia-latest-dkms                                 x86_64                             3:550.54.15-1.fc39                              cuda-fedora39-x86_64                                         40 M
 nvidia-driver                                           x86_64                             3:550.54.14-1.fc39                              cuda-fedora39-x86_64                                        126 M
 nvidia-driver                                           x86_64                             3:550.54.15-1.fc39                              cuda-fedora39-x86_64                                        126 M
 nvidia-modprobe                                         x86_64                             3:550.54.15-1.fc39                              cuda-fedora39-x86_64                                         30 k
 nvidia-settings                                         x86_64                             3:550.54.15-1.fc39                              cuda-fedora39-x86_64                                        822 k
 xorg-x11-drv-nvidia-kmodsrc                             x86_64                             3:550.67-1.fc39                                 rpmfusion-nonfree-nvidia-driver                              44 M
Skipping packages with broken dependencies:
 akmod-nvidia                                            x86_64                             3:550.67-1.fc39                                 rpmfusion-nonfree-updates                                    40 k
 nvidia-kmod-common                                      noarch                             3:550.54.14-1.fc39                              cuda-fedora39-x86_64                                         12 k
 nvidia-kmod-common                                      noarch                             3:550.54.15-1.fc39                              cuda-fedora39-x86_64                                         12 k
 xorg-x11-drv-nvidia-power                               x86_64                             3:550.67-1.fc39                                 rpmfusion-nonfree-nvidia-driver                             103 k

Transaction Summary
==============================================================================================================================================================================================================
Skip  10 Packages

Nothing to do.
Complete!
I did what it said and tried adding --best --allowerasing:
Problem 1: cannot install the best update candidate for package xorg-x11-drv-nvidia-power-3:550.54.14-1.fc39.x86_64
  - problem with installed package xorg-x11-drv-nvidia-power-3:550.54.14-1.fc39.x86_64
  - package xorg-x11-drv-nvidia-power-3:550.67-1.fc39.x86_64 from rpmfusion-nonfree-nvidia-driver requires xorg-x11-drv-nvidia(x86-64) = 3:550.67, but none of the providers can be installed
  - package xorg-x11-drv-nvidia-power-3:550.67-1.fc39.x86_64 from rpmfusion-nonfree-updates requires xorg-x11-drv-nvidia(x86-64) = 3:550.67, but none of the providers can be installed
  - package xorg-x11-drv-nvidia-3:550.67-1.fc39.x86_64 from rpmfusion-nonfree-nvidia-driver is filtered out by modular filtering
  - package xorg-x11-drv-nvidia-3:550.67-1.fc39.x86_64 from rpmfusion-nonfree-updates is filtered out by modular filtering
 Problem 2: problem with installed package akmod-nvidia-3:535.129.03-1.fc39.x86_64
  - cannot install the best update candidate for package akmod-nvidia-3:535.129.03-1.fc39.x86_64
  - package akmod-nvidia-3:550.67-1.fc39.x86_64 from rpmfusion-nonfree-nvidia-driver requires nvidia-kmod-common >= 3:550.67, but none of the providers can be installed
  - package akmod-nvidia-3:550.67-1.fc39.x86_64 from rpmfusion-nonfree-updates requires nvidia-kmod-common >= 3:550.67, but none of the providers can be installed
  - package xorg-x11-drv-nvidia-3:550.67-1.fc39.x86_64 from rpmfusion-nonfree-nvidia-driver is filtered out by modular filtering
  - package xorg-x11-drv-nvidia-3:550.67-1.fc39.x86_64 from rpmfusion-nonfree-updates is filtered out by modular filtering
(try to add '--skip-broken' to skip uninstallable packages)
Doing all this because ever since the kernel broke, I haven't been able to run any CUDA code. I tried running NVIDIA's Vector Addition sample and got this error:

Failed to allocate device vector A (error code system has unsupported display driver / cuda driver combination)!

If I try running my own vector output CUDA program, it just outputs 0. Needless to say, CUDA is not working. I checked my CUDA Toolkit and even reinstalled it, so that should be fine. For all I know I might not even be in the ballpark of what I should be doing to fix this. The ultimate goal is to get my CUDA code working again.

Things I tried:
- Reinstalling the CUDA Toolkit
- sudo dnf update --refresh
- sudo dnf update --refresh --best --allowerasing
- Signing the NVIDIA kernel module (can't remember where, but some place said to try it; I followed [this guide](https://blog.monosoul.dev/2022/05/17/automatically-sign-nvidia-kernel-module-in-fedora-36/))
- Completely reinstalling the drivers (through [this guide](https://www.tecmint.com/install-nvidia-drivers-in-linux/))
- Reverting to an older kernel (this worked for a while but eventually broke in the same way, plus I'd rather not be stuck on an older kernel if I can fix the issue)

Pretty much all of these had no effect.
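Not from the original post, but the repeated "filtered out by modular filtering" lines usually mean the NVIDIA CUDA repository's nvidia-driver module stream is pinned and fighting with the RPM Fusion packages. A hedged sketch of one commonly suggested recovery, keeping RPM Fusion as the single driver source (the repo id cuda-fedora39-x86_64 is taken from the error output above; adjust if yours differs):

```shell
# Unpin the module stream that triggers "filtered out by modular filtering"
sudo dnf module reset nvidia-driver

# Sync installed NVIDIA packages back to the RPM Fusion versions,
# ignoring the NVIDIA CUDA repo for this one transaction
sudo dnf --disablerepo=cuda-fedora39-x86_64 distro-sync "*nvidia*"

# Then retry the normal update
sudo dnf update --refresh
```

This is a sketch of repository/module housekeeping, not a guaranteed fix; mixing the NVIDIA CUDA repo with RPM Fusion is known to produce exactly this class of dependency deadlock, so picking one source and disabling the other is the usual first step.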
Lobster Roast (1 rep)
Apr 19, 2024, 07:04 PM • Last activity: Apr 20, 2024, 03:32 AM
0 votes
1 answers
84 views
GPU RTX 3090 keeps of going into ERR after some usage
I have been struggling with an issue with a GPU on my machine. Currently the GPU works fine for some training work, but it goes into ERR when I type nvidia-smi. What happens then is that I have a python process which I cannot kill, not even with sudo kill -9 PID. This is always accompanied by a core whose bar is 100% red in htop; I am not sure what that means. If I try to restart the GPU, it tells me it cannot because the GPU is being used by other processes, which I guess are the ones I cannot kill. This happens consistently: if I reboot, the problem seems solved, but after a couple of training runs the issue returns. The main issue is that most of the time I am connecting to my machine through ssh, so if I reboot I have to ask someone to turn my machine back on, or go myself. The OS on my machine is Manjaro, but I had similar issues with Ubuntu 22.04, where I got
CUDA error: unspecified launch failure
I don't think it can be hardware related, as the GPU is one year old, and again it is able to train once restarted. The specs of my machine are the following:

- CPU: Intel i9-13900K/KF 5.8GHz
- Motherboard: MSI PRO Z690-A DDR4
- RAM: 64GB DDR4 3200MHz 2x32GB
- Power supply: Corsair RM1000 80+ Gold Modular

The machine also has another GPU, an RTX 2080 Ti. Is there a fix to this problem? This is really worrying and problematic for my workflow, as you can imagine. Best, Luca
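One thing worth checking (my suggestion, not something from the post) is the kernel log around the time the GPU goes into ERR: `dmesg` or `journalctl -k` lines of the form `NVRM: Xid (...): <code>, ...` distinguish application-level faults from bus or power problems (for example Xid 79, "GPU has fallen off the bus", often points at power delivery, which matters on a single 1000W PSU driving a 3090, a 2080 Ti and an i9-13900K). A small sketch for scanning a saved log:

```python
import re

# Match lines like: "NVRM: Xid (PCI:0000:01:00): 79, pid=4242, ..."
XID_RE = re.compile(r"NVRM: Xid \(PCI:[0-9a-fA-F:.]+\): (\d+)")

def find_xids(log_text):
    """Return every NVIDIA Xid error code found in kernel log text."""
    return [int(m.group(1)) for m in XID_RE.finditer(log_text)]

sample = "[9021.3] NVRM: Xid (PCI:0000:01:00): 79, pid=4242, GPU has fallen off the bus."
print(find_xids(sample))  # → [79]
```

Looking up any codes it finds in NVIDIA's public Xid table tells you whether the fault is attributed to user code, the driver, or the hardware/bus.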
luchino_prince (101 rep)
Mar 15, 2024, 03:00 AM • Last activity: Mar 15, 2024, 04:14 PM
0 votes
0 answers
186 views
Multiple NVIDIA RTX GPU for Cuda (arch linux) with EGPU
I've got an **Arch Linux** machine with two GPUs in the laptop (**ThinkPad P14s Gen 4**) plus a new RTX 3090 attached via Thunderbolt 4 in a Cooler Master EG200 GPU enclosure:
❯ lspci -k | grep -A 2 -E "(VGA|3D)"
00:02.0 VGA compatible controller: Intel Corporation Raptor Lake-P [Iris Xe Graphics] (rev 04)
        Subsystem: Lenovo Raptor Lake-P [Iris Xe Graphics]
        Kernel driver in use: i915
--
03:00.0 3D controller: NVIDIA Corporation GA107GLM [RTX A500 Laptop GPU] (rev a1)
        Subsystem: Lenovo GA107GLM [RTX A500 Laptop GPU]
        Kernel driver in use: nvidia
--
22:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
        Subsystem: Gigabyte Technology Co., Ltd GA102 [GeForce RTX 3090]
        Kernel driver in use: nvidia
The thunderbolt connection to the RTX 3090 is authorized as you can see here:
❯ sudo boltctl info c4010000-0070-740e-0362-00168691c921
[sudo] password for aemonge: 
 ● Cooler Master Technology,Inc MasterCase EG200
   ├─ type:          peripheral
   ├─ name:          MasterCase EG200
   ├─ vendor:        Cooler Master Technology,Inc
   ├─ uuid:          c4010000-0070-740e-0362-00168691c921
   ├─ dbus path:     /org/freedesktop/bolt/devices/c4010000_0070_740e_0362_00168691c921
   ├─ generation:    Thunderbolt 3
   ├─ status:        authorized
   │  ├─ domain:     69078780-60ab-fe2a-ffff-ffffffffffff
   │  ├─ parent:     69078780-60ab-fe2a-ffff-ffffffffffff
   │  ├─ syspath:    /sys/devices/pci0000:00/0000:00:0d.2/domain0/0-0/0-1
   │  ├─ rx speed:   40 Gb/s = 2 lanes * 20 Gb/s
   │  ├─ tx speed:   40 Gb/s = 2 lanes * 20 Gb/s
   │  └─ authflags:  boot
   ├─ authorized:    Wed 24 Jan 2024 06:49:10 AM UTC
   ├─ connected:     Wed 24 Jan 2024 06:49:10 AM UTC
   └─ stored:        Tue 23 Jan 2024 03:50:50 PM UTC
      ├─ policy:     iommu
      └─ key:        no
I really don't care about graphics; I don't need the RTX 3090 loaded by Xorg or the graphical interface at all. I just want it used for compute-only workloads, and I have thoroughly followed this Arch wiki page: https://wiki.archlinux.org/title/External_GPU. But given that context, nvidia-smi can't seem to find the GPU:
❯ nvidia-smi -L
GPU 0: NVIDIA RTX A500 Laptop GPU (UUID: GPU-762410c2-1c0d-ef4a-89ac-91afd926381b)
Nor can a simple python script, **cuda-devics.py**:
❯ cat cuda-devics.py
import torch

# Check if CUDA is available
if torch.cuda.is_available():
    print("CUDA is available.")
    # Get the number of CUDA devices
    num_devices = torch.cuda.device_count()
    print(f"Number of CUDA devices: {num_devices}")
    # Get the name of each CUDA device
    for i in range(num_devices):
        print(f"Device {i} name: {torch.cuda.get_device_name(i)}")
else:
    print("CUDA is not available.")
❯ python cuda-devics.py
CUDA is available.
Number of CUDA devices: 1
Device 0 name: NVIDIA RTX A500 Laptop GPU

❯ CUDA_VISIBLE_DEVICES="0,1,2" python cuda-devics.py

CUDA is available.
Number of CUDA devices: 1
Device 0 name: NVIDIA RTX A500 Laptop GPU
I have also tried these three repositories to disable the internal GPUs (the A500 and the Iris Xe): https://github.com/ewagner12/all-ways-egpu, https://github.com/karli-sjoberg/gswitch and https://github.com/hertg/egpu-switcher, but the screen just goes black.

---

## Solved

Solved in the NVIDIA developer forums, https://forums.developer.nvidia.com/t/multiple-nvidia-rtx-gpu-for-cuda-arch-linux-with-egpu/280031/7, by

> generic Top Contributor 5h
> Please check for a bios update. If none is available, please use Software & Updates to switch to the "-open" driver version and set kernel parameter nvidia.NVreg_OpenRmEnableUnsupportedGpus=1

Which meant the following:
sudo pacman -S nvidia-open
**/boot/loader/entries/*_linux.conf**
# Created by: archinstall
# Created on: ***********
title   Arch Linux (linux)
linux   /vmlinuz-linux
initrd  /intel-ucode.img
initrd  /initramfs-linux.img
options root=PARTUUID=####-####-####### zswap.enabled=0 rw nvidia.NVreg_OpenRmEnableUnsupportedGpus=1 rootfstype=ext4
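A quick sanity check after rebooting (my addition, not part of the accepted answer) is to confirm the parameter actually reached the running kernel by inspecting /proc/cmdline:

```python
# Check whether the nvidia-open kernel parameter from the fix above is
# present in a kernel command line string.

def has_open_rm_flag(cmdline: str) -> bool:
    return "nvidia.NVreg_OpenRmEnableUnsupportedGpus=1" in cmdline.split()

# On a live system:
#   has_open_rm_flag(open("/proc/cmdline").read())
print(has_open_rm_flag("root=PARTUUID=x rw nvidia.NVreg_OpenRmEnableUnsupportedGpus=1 rootfstype=ext4"))  # → True
```

If the flag is missing, the bootloader entry was not regenerated or a different entry was booted.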
aemonge (101 rep)
Jan 24, 2024, 11:48 AM • Last activity: Jan 25, 2024, 02:04 PM
1 votes
0 answers
53 views
There are no scripts running, but the GPU memory is still allocated
I am accessing a remote Linux server from a local machine. There are no scripts running on the remote server, but the GPU memory is still allocated. P.S.: It might be attributable to some crash. The nvidia-smi shows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:31:00.0 Off |                    0 |
| N/A   34C    P0    42W / 250W |  19403MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  Off  | 00000000:4B:00.0 Off |                    0 |
| N/A   35C    P0    59W / 250W |  10886MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       583      C                                    1001MiB |
|    0   N/A  N/A     16158      C                                    5065MiB |
|    0   N/A  N/A     35103      C                                    1291MiB |
|    0   N/A  N/A     46387      C                                    1337MiB |
|    0   N/A  N/A     54860      C                                    1273MiB |
|    0   N/A  N/A     71766      C                                    2077MiB |
|    0   N/A  N/A     80967      C                                    4991MiB |
|    0   N/A  N/A     83598      C                                    1071MiB |
|    0   N/A  N/A     93077      C                                    1293MiB |
|    1   N/A  N/A       583      C                                     917MiB |
|    1   N/A  N/A     47859      C                                    1297MiB |
|    1   N/A  N/A     74282      C                                    1273MiB |
|    1   N/A  N/A     90599      C                                    7397MiB |
+-----------------------------------------------------------------------------+
An error "No such process" was raised when I tried to kill it:
>>> kill -9 16158
-bash: kill: (16158) - No such process
And ps -p PID also cannot find the process:
>>> ps -p 583
 PID TTY          TIME CMD
How can I release this memory? This problem has persisted for some weeks, and it caused an OOM issue today.
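A small sketch (mine, not from the post) for pulling the PID column out of the nvidia-smi "Processes" table so each PID can be checked against /proc on the host:

```python
import re

# Match process rows such as:
# "|    0   N/A  N/A       583      C                                    1001MiB |"
# Columns are: GPU index, GI ID, CI ID, PID, type, process name, memory.
PROC_ROW = re.compile(r"^\|\s+\d+\s+\S+\s+\S+\s+(\d+)\s+C\s", re.MULTILINE)

def gpu_pids(smi_output):
    """Return the sorted set of client PIDs listed by nvidia-smi."""
    return sorted({int(m.group(1)) for m in PROC_ROW.finditer(smi_output)})

row = "|    0   N/A  N/A       583      C                                    1001MiB |"
print(gpu_pids(row))  # → [583]
```

If a listed PID does not exist in /proc on the host, the client may be running inside a container (its PID is namespaced, so the host sees a different number), or the driver is holding memory for a context whose process already died. In the latter case, `nvidia-smi --gpu-reset -i <gpu>` (as root, with the GPU otherwise idle) or a reboot is usually the only way to reclaim the memory.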
lllIlllllIll (11 rep)
Nov 7, 2023, 11:28 AM • Last activity: Nov 7, 2023, 12:09 PM
1 votes
1 answers
2512 views
dpkg-shlibdeps fails with “no dependency information found”
I am building a custom debian package for tensorflow. At some point, when I run dpkg-buildpackage -us -uc I get:
dpkg-shlibdeps: error: no dependency information found for /usr/local/cuda-9.1/lib64/libcurand.so.9.1 (used by debian/libhal-tensorflow-cc/usr/lib/libtensorflow_framework.so)
Hint: check if the library actually comes from a package.
dh_shlibdeps: dpkg-shlibdeps -Tdebian/libhal-tensorflow-cc.substvars debian/libhal-tensorflow-cc/usr/lib/libtensorflow_cc.so debian/libhal-tensorflow-cc/usr/lib/libtensorflow_framework.so returned exit code 2
debian/rules:9: recipe for target 'binary' failed
I looked up this page: https://manpages.debian.org/jessie/dpkg-dev/dpkg-shlibdeps.1.en.html and tried to follow the steps performed by this tool to get the dependency information:
$ dpkg -S libcurand.so.9.1
cuda-curand-9-1: /usr/local/cuda-9.1/targets/x86_64-linux/lib/libcurand.so.9.1
cuda-curand-9-1: /usr/local/cuda-9.1/targets/x86_64-linux/lib/libcurand.so.9.1.85
Actually, there is a corresponding .shlibs file for this package:
$ cat /var/lib/dpkg/info/cuda-curand-9-1.shlibs
libcurand 9.1 cuda-curand-9-1
I checked if the package is actually installed, and it is:
$ apt list | grep cuda-curand-9-1

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

cuda-curand-9-1/unknown,now 9.1.85-1 amd64 [installed,automatic]
so I am out of ideas as to what it is complaining about.
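One detail in the output above may be the clue: the binary links libcurand via /usr/local/cuda-9.1/lib64/ (a symlink), while dpkg's database records the file under /usr/local/cuda-9.1/targets/x86_64-linux/lib/, so dpkg-shlibdeps cannot map the resolved path back to the package even though the .shlibs file exists. A hedged workaround sketch (mine, not a confirmed fix) is to relax the check in debian/rules and declare the CUDA dependency manually in debian/control:

```make
# Hypothetical debian/rules override: tolerate libraries whose linked
# path (the cuda-9.1/lib64 symlink) differs from the path dpkg knows.
# The CUDA runtime dependency then has to be listed by hand in Depends.
override_dh_shlibdeps:
	dh_shlibdeps --dpkg-shlibdeps-params=--ignore-missing-info
```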
Piotr G (121 rep)
Sep 3, 2020, 07:03 AM • Last activity: Jul 21, 2023, 03:14 AM
0 votes
0 answers
168 views
A process using CUDA gets stuck, then all others get stuck as well - what do I do?
I'm writing a program using CUDA 12.1, running on a Linux system (Devuan Daedalus, kernel version 6.1.27). For some reason (which may be a bug of mine, although I kind of doubt it), the process gets stuck at some point. Sending it SIGINT, SIGTERM or SIGKILL has no effect. The details of what this process does shouldn't really matter: it doesn't do file I/O, it doesn't use the network, it doesn't use any other peripherals. It just uses CUDA APIs (specifically, execution graphs), does some computation in-memory, and prints messages to its standard output. So, first part of the question: how can I kill such a process (other than by rebooting the machine)? Now, after this process gets stuck, any process using CUDA APIs also seems to get stuck (almost) immediately when it starts running. Thus, second part of the question: can I avoid other processes getting stuck as well?
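For the first part: a process that ignores SIGKILL is almost always blocked in uninterruptible sleep (state `D`) inside a kernel or driver call, and no signal is delivered until the driver returns. A small Linux-only sketch (my addition) for confirming that:

```python
import os

# Read the process state letter from /proc/<pid>/stat. "D" means
# uninterruptible sleep, i.e. stuck in a kernel/driver call and immune
# to signals until the call returns.

def proc_state(pid: int) -> str:
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The comm field may contain spaces/parens; state follows the last ')'.
    return data.rsplit(")", 1)[1].split()[0]

print(proc_state(os.getpid()))  # → 'R' (this process is running)
```

If the stuck CUDA process shows `D`, the practical options are limited to whatever unblocks the driver (for example resetting the GPU with `nvidia-smi --gpu-reset` while it is otherwise idle, or unloading and reloading the nvidia modules) or a reboot; this would also explain the second part, since subsequent CUDA processes block waiting on the same wedged driver state.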
einpoklum (10753 rep)
Jul 13, 2023, 12:00 PM