GPU RTX 3090 keeps going into ERR after some usage

0 votes
1 answer
84 views
I have been struggling with an issue with a GPU in my machine. The GPU works fine for some training work, but after some usage it shows ERR when I run nvidia-smi. When that happens, I am left with a python process that I cannot kill, not even with sudo kill -9 PID. This is always accompanied by a core whose bar is 100% red in htop; I am not sure what that means. If I try to reset the GPU, it tells me it cannot because the GPU is being used by other processes, which I guess are the ones I cannot kill.

This happens consistently. If I reboot, the problem seems solved, but after a couple of training runs the issue returns. The main problem is that most of the time I connect to my machine through ssh, so if I reboot I have to ask someone to turn my machine back on, or go myself.

The OS on my machine is Manjaro, but I had similar issues with Ubuntu 22.04, where I got
CUDA error: unspecified launch failure
I don't think it can be hardware related, as the GPU is one year old, and again it is able to train once the machine is restarted.

The specs of my machine are the following:
- CPU: Intel i9-13900K/KF 5.8GHz
- Motherboard: MSI PRO Z690-A DDR4
- RAM: 64GB DDR4 3200MHz 2x32GB
- Power supply: Corsair RM1000 80+ Gold Modular

The machine also has another GPU, an RTX 2080 Ti.

Is there a fix for this problem? This is really worrying and disruptive for my workflow, as you can imagine.

Best, Luca
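
For anyone trying to reproduce this: a process that ignores kill -9 is often stuck inside a kernel call, and on Linux its state can be read from /proc. The following is only a minimal sketch, assuming a Linux /proc filesystem; the helper names parse_state and process_state are hypothetical and not part of any tool mentioned above.

```python
import os
import sys


def parse_state(stat_text):
    """Return the one-letter state field from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so split after the last ')'.
    after_comm = stat_text.rsplit(")", 1)[1]
    return after_comm.split()[0]


def process_state(pid):
    """Read the state of a running process from /proc (Linux only)."""
    with open(os.path.join("/proc", str(pid), "stat")) as f:
        return parse_state(f.read())


if __name__ == "__main__" and len(sys.argv) > 1:
    pid = int(sys.argv[1])
    state = process_state(pid)
    print(f"PID {pid} is in state {state}")
    if state == "D":
        # D means uninterruptible sleep: the process will not act on
        # SIGKILL until the kernel call it is blocked in returns.
        print("Process is in uninterruptible sleep")
```

Saved under a name of your choice and run with the PID of the stuck python process as its only argument, it prints the state letter (R for running, S for sleeping, D for uninterruptible sleep, Z for zombie). That can be included in the question alongside the nvidia-smi output.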
Asked by luchino_prince (101 rep)
Mar 15, 2024, 03:00 AM
Last activity: Mar 15, 2024, 04:14 PM