nvidia

Introduction

The nodes have an Nvidia Tesla P4 gpu installed for gpu-enabled computing. Cuda 12.4 is available by default.

Using Python

Using Nvidia is a real pain. There are two options: When submitting jobs through slurm, you can use computecanada's prebuilt modules via cvmfs. When using jupyterhub locally, just pick the “local” kernel.

CVMFS with slurm

Do this by:

(enable modules if not yet enabled): `source /cvmfs/soft.computecanada.ca/config/profile/bash.sh`
`export CC_CLUSTER=cedar` (compute nodes most closely resemble those of cedar
Load the python module `module load python`
`pip install –no-index torch torchvision torchtext torchaudio` (after loading the module, we will be using computecanada's prebuilt torches).

Local Jupyterhub kernel

(If you're just submitting jobs you don't need to worry about this. This was just how I installed the kernel locally).

Install the Nvidia driver using the runfile from their website (after `systemctl isolate multi-user.target` to disable X!)
For ubuntu 22.04, make sure you symlink gcc to gcc12 (at least during the installation). The default kernel for ubuntu 22.04 is compiled with gcc12.

For some reason torch 2.4.1 does not package the correct version of libnccl2. To get around this, follow the instructions here, except: For reasons, torch is still using cuda12.4 but if you let nvidia auto install, you will end up with cuda 12.6, which is not supported by torch (as of september 2024). Specify:

Remove any installed torch: `pip uninstall -y torch torchvision torchaudio`
Install libnccl: `sudo apt install libnccl2=2.21.5-1+cuda12.4 libnccl-dev=2.21.5-1+cuda12.4`
Install cudnn: `sudo apt-get -y install cudnn`
Install torch `pip install torch torchvision torchaudio –index-url https://download.pytorch.org/whl/cu124
Remove the libnccl2 packaged with torch: `mv /path/to/venv/lib/python3.10/site-packages/nvidia/nccl/lib/libnccl.so.2 /path/to/venv/lib/python3.10/site-packages/nvidia/nccl/lib/libnccl.so.2.old`

By this point the libraries for the version of libnccl necessary should be picked up by torch so you wont have the error “undefined symbol: ncclCommRegister”

nvcc

Many versions of nvcc are installed for each version of cuda. Check `/usr/local/cuda*` for the correct one. You may need to symlink /usr/local/cuda/bin/ to point to the right version. Alternatively, change your PATH=/usr/local/cuda-VERSION/bin/:$PATH before you start compiling anything that needs the version of nvcc you need.