nvidia

Differences

This shows you the differences between two versions of the page.

Old: nvidia [2024/10/02 00:26] randon
New: nvidia [2025/04/26 04:23] (current) – external edit 127.0.0.1
Line 7: Line 7:
 Do this by:
   *(enable modules if not yet enabled): `source /cvmfs/soft.computecanada.ca/config/profile/bash.sh`
-  *`export CC_CLUSTER=cedar` (compute nodes most closely resemble those of [https://docs.alliancecan.ca/wiki/Cedar cedar]
+  *`export CC_CLUSTER=cedar` (compute nodes most closely resemble those of [[https://docs.alliancecan.ca/wiki/Cedar|cedar]]
   *Load the python module `module load python`
   *`pip install --no-index torch torchvision torchtext torchaudio` (after loading the module, we will be using computecanada's prebuilt torches).
Line 18: Line 18:
   *For Ubuntu 22.04, make sure you symlink gcc to gcc12 (at least during the installation). The default kernel for Ubuntu 22.04 is compiled with gcc12.
    
-For some reason torch 2.4.1 does not package the correct version of libnccl2. To get around this, follow the instructions [https://github.com/pytorch/pytorch/issues/119932#issuecomment-2024911522 here], except:
+For some reason torch 2.4.1 does not package the correct version of libnccl2. To get around this, follow the instructions [[https://github.com/pytorch/pytorch/issues/119932#issuecomment-2024911522|here]], except:
 torch is still using CUDA 12.4, but if you let nvidia auto-install, you will end up with CUDA 12.6, which is not supported by torch (as of September 2024). Specify:
   * Remove any installed torch: `pip uninstall -y torch torchvision torchaudio`
Line 27: Line 27:
  
 By this point torch should pick up the libraries for the required version of libnccl, so you won't get the error "undefined symbol: ncclCommRegister".
 +
 +====nvcc====
 +Several versions of nvcc may be installed, one for each installed version of CUDA. Check `/usr/local/cuda*` to find the one you need. You may need to symlink `/usr/local/cuda/bin/` to point to the right version. Alternatively, set `PATH=/usr/local/cuda-VERSION/bin/:$PATH` before you start compiling anything that needs that version of nvcc.
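
The PATH approach above can be wrapped in a small shell function. This is only a sketch: `use_cuda` and the `CUDA_ROOT` override are hypothetical names introduced here, and it assumes the toolkits are laid out as `/usr/local/cuda-VERSION/bin/nvcc`, as described above.

```shell
# Hypothetical helper (not part of CUDA or of this wiki's setup):
# put the nvcc of one specific CUDA toolkit first on PATH.
# Assumes toolkits live in $CUDA_ROOT/cuda-VERSION (default root: /usr/local).
use_cuda() {
    cuda_version="$1"
    cuda_bindir="${CUDA_ROOT:-/usr/local}/cuda-$cuda_version/bin"

    # Refuse to touch PATH if that toolkit has no executable nvcc.
    if [ ! -x "$cuda_bindir/nvcc" ]; then
        echo "no nvcc found in $cuda_bindir" >&2
        return 1
    fi

    # Prepend, so this nvcc shadows any other copy already on PATH.
    PATH="$cuda_bindir:$PATH"
    export PATH
}
```

For example, `ls -d /usr/local/cuda*` lists the installed toolkits, and `use_cuda 12.4 && nvcc --version` then confirms which compiler is active (12.4 is just the version mentioned earlier on this page). The change only affects the current shell session, which avoids editing the system-wide symlink.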
nvidia.1727828802.txt.gz · Last modified: 2025/04/26 04:23 (external edit)