- Check if there is any hardware failure (e.g. a lost GPU) by listing all GPUs:

  ```bash
  lspci | grep -i nvidia
  ```
- Purge CUDA & the NVIDIA driver:

  ```bash
  sudo apt --purge remove "*cublas*" "cuda*" "nsight*" "*cudnn*" "libnvidia*" -y
  sudo apt remove --purge '^nvidia-.*' -y
  # (Optional, preferred) Remove all CUDA files to avoid possible conflicts with the new driver
  sudo rm -rf /usr/local/cuda*
  sudo apt --purge autoremove -y
  sudo apt autoclean
  ```
- Check for any leftovers and remove them:

  ```bash
  dpkg -l | grep -i nvidia
  dpkg -l | grep nvidia-driver
  sudo apt --purge remove {some-pkg}
  ```
> [!TIP]
> Besides `/usr/local/cuda*`, make sure to check `$CUDA_HOME`, `$PATH` and `$LD_LIBRARY_PATH` for a non-default CUDA installation path.
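>
> A minimal way to spot a non-default CUDA location is to inspect those variables directly (a quick sketch, assuming a bash/zsh shell):
>
> ```bash
> # Show CUDA-related environment variables and path entries, if any
> echo "CUDA_HOME=${CUDA_HOME:-<unset>}"
> echo "$PATH" | tr ':' '\n' | grep -i cuda
> echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -i cuda
> ```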
> [!IMPORTANT]
> Read this official blog first to check which flavor of driver your GPU needs.
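>
> For a quick local hint, Ubuntu's own tooling can list the driver packages it recommends for the detected GPU (a sketch assuming the `ubuntu-drivers-common` package is available; it does not replace the official guidance above):
>
> ```bash
> # List detected GPUs and the driver packages Ubuntu recommends for them
> sudo apt install ubuntu-drivers-common -y
> ubuntu-drivers devices
> ```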
- Check the CUDA Toolkit Archive list to find the preferred version and follow the instructions.
  - If you want a version lower than 12.3, check this doc.
- Check the cuDNN Archive list to find the preferred version and follow the instructions.
- An example if you use `Ubuntu 24.04 (x86_64)` and `deb (network)` for the CUDA 12.5 Update 1 installation:
  - Base Installer:

    ```bash
    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
    sudo dpkg -i cuda-keyring_1.1-1_all.deb
    sudo apt update
    sudo apt install cuda-toolkit-12-5 -y
    ```
  - Driver Installer, open kernel module flavor:

    ```bash
    sudo apt install nvidia-driver-555-open -y
    sudo apt install cuda-drivers-555 -y
    ```
  - Driver Installer, legacy kernel module flavor:

    ```bash
    sudo apt install cuda-drivers -y
    ```
- Set the CUDA `PATH` and `LD_LIBRARY_PATH` in your `[ba/z]shrc`:

  ```bash
  # set cuda path if nvidia-smi works
  if command -v nvidia-smi &>/dev/null; then
      [[ ":$PATH:" == *":/usr/local/cuda/bin:"* ]] || export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
      [[ ":$LD_LIBRARY_PATH:" == *":/usr/local/cuda/lib64:"* ]] || export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
  fi
  ```
- Reload the shell configuration:

  ```bash
  source ~/.[ba/z]shrc
  ```
- Reboot and check GPU info with `nvidia-smi`. (preferred)
  - If you want to reload the GPU kernel modules while keeping the machine alive, use the following commands:

    ```bash
    # Change to CLI-only mode
    sudo systemctl isolate multi-user.target
    # Kill processes using nvidia devices, if any
    sudo lsof /dev/nvidia*
    sudo lsof -t /dev/nvidia* | xargs sudo kill -9
    # Remove the nvidia module
    # "rmmod" shows the dependencies; remove them recursively and manually with "sudo rmmod sth"
    sudo rmmod nvidia
    # Reload the nvidia module
    sudo modprobe nvidia
    # Set back to the default target
    sudo systemctl default
    ```
  - If `nvidia-modprobe` is broken or missing, fix it via the following commands:

    ```bash
    # Install nvidia-modprobe
    sudo apt install nvidia-modprobe
    # Check nvidia-modprobe version
    nvidia-modprobe -v
    ```
> [!CAUTION]
> If you manually reload the nvidia module, it may seriously slow down GPU processes until you reboot the machine.
> [!NOTE]
> The Persistence Daemon is preferred over Persistence Mode and is enabled by default since NVIDIA driver R319.
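>
> To confirm what the driver itself reports, `nvidia-smi` can query the persistence-mode flag per GPU (a quick check, assuming the driver is already installed; the daemon's service status is checked below):
>
> ```bash
> # Query the persistence mode flag reported by the driver for each GPU
> nvidia-smi --query-gpu=index,name,persistence_mode --format=csv
> ```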
> [!NOTE]
> You might need to install third-party libraries to compile cuda-samples; check here to install them.
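>
> As a rough sketch, the build tools and third-party libraries commonly needed by the graphical cuda-samples look like the following (package names are assumptions for Ubuntu; defer to the list linked above):
>
> ```bash
> # Common build tools plus libraries some cuda-samples link against (assumed package names)
> sudo apt install build-essential cmake \
>     libfreeimage-dev libglfw3-dev freeglut3-dev \
>     libxi-dev libxmu-dev libglu1-mesa-dev -y
> ```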
- Check if the Persistence Daemon is active:

  ```bash
  sudo systemctl status nvidia-persistenced.service
  ```

  If it's active, it should show:

  ```text
  ● nvidia-persistenced.service - NVIDIA Persistence Daemon
       Loaded: loaded (/usr/lib/systemd/system/nvidia-persistenced.service; static)
       Active: active (running) since Sun 2024-07-21 07:31:57 UTC; 5h 53min ago
     Main PID: 1331 (nvidia-persiste)
        Tasks: 1 (limit: 38220)
       Memory: 368.0K (peak: 844.0K)
          CPU: 1ms
       CGroup: /system.slice/nvidia-persistenced.service
               └─1331 /usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose

  Jul 21 07:31:57 {hostname} systemd[1]: Starting nvidia-persistenced.service - NVIDIA Persistence Daemon...
  Jul 21 07:31:57 {hostname} nvidia-persistenced[1331]: Verbose syslog connection opened
  Jul 21 07:31:57 {hostname} nvidia-persistenced[1331]: Now running with user ID 116 and group ID 120
  Jul 21 07:31:57 {hostname} nvidia-persistenced[1331]: Started (1331)
  Jul 21 07:31:57 {hostname} nvidia-persistenced[1331]: device 0000:01:00.0 - registered
  Jul 21 07:31:57 {hostname} nvidia-persistenced[1331]: device 0000:02:00.0 - registered
  Jul 21 07:31:57 {hostname} nvidia-persistenced[1331]: Local RPC services initialized
  Jul 21 07:31:57 {hostname} systemd[1]: Started nvidia-persistenced.service - NVIDIA Persistence Daemon.
  ```
  - If it's not active, enable the daemon via:

    ```bash
    sudo systemctl enable nvidia-persistenced.service
    sudo systemctl start nvidia-persistenced.service
    ```
- Verify the installation:
  - For the CUDA Toolkit, follow the steps for cuda-samples (a minimal sketch follows below).
  - For cuDNN, follow the steps in the doc.
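  As a rough sketch of the cuda-samples route (assuming a Makefile-based tag of the NVIDIA/cuda-samples repo matching your toolkit version, e.g. `v12.5`, and that `nvcc` is already on your `PATH`):

  ```bash
  # Confirm the toolkit compiler is visible
  nvcc --version

  # Build and run the deviceQuery sample (branch and path are assumptions; adjust to your repo layout)
  git clone --branch v12.5 https://github.com/NVIDIA/cuda-samples.git
  cd cuda-samples/Samples/1_Utilities/deviceQuery
  make
  ./deviceQuery
  ```

  A successful run lists each detected GPU and ends with `Result = PASS`.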
- CUDA FAQ
- Useful nvidia-smi Queries
- R367.38 nvidia-smi.txt
- nvtop - GPU & Accelerator process monitoring for AMD, Apple, Huawei, Intel, NVIDIA and Qualcomm
- nvitop - An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
- gpustat - A simple command-line utility for querying and monitoring GPU status
- nvidia-htop - A tool for enriching the output of nvidia-smi.
- CUDA Toolkit:
- cuDNN
- GitHub repo
- Docs and Blogs