So you wanna upgrade PyTorch to support a new CUDA? Follow these steps in order! They are adapted from previous CUDA upgrade processes.
Here is the supported matrix for CUDA and cuDNN:
CUDA | cuDNN | Additional details |
---|---|---|
11.6 | 8.3.2.44 | Stable CUDA Release |
11.7 | 8.5.0.96 | Latest CUDA Release |
Package availability to validate before starting the upgrade process:
- CUDA and cuDNN are available for Linux and Windows: https://developer.download.nvidia.com/compute/cuda/11.5.0/local_installers/cuda_11.5.0_495.29.05_linux.run and https://developer.download.nvidia.com/compute/redist/cudnn/v8.3.2/local_installers/11.5/
- CUDA is available on conda via the nvidia channel: https://anaconda.org/nvidia/cuda/files
- CudaToolkit is available on conda via the nvidia channel: https://anaconda.org/nvidia/cudatoolkit/files
- CUDA is available in Docker Hub images (https://hub.docker.com/r/nvidia/cuda). The following example is for CUDA 11.5: https://gitlab.com/nvidia/container-images/cuda/-/tree/master/dist/11.5.1/ubuntu2004/runtime (make sure to use the version without cuDNN; it should be installed separately by the install script)
- Validate new driver availability in the release notes (https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html). Check the following table: "Table 3. CUDA Toolkit and Corresponding Driver Versions"
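The availability checks above can be partly scripted. The following is a minimal sketch, not something that ships with the builder repo: `preflight_urls` and `is_reachable` are hypothetical helpers, and the two NVIDIA URL templates are assumptions generalized from the 11.5 example links in the list, so confirm them for the version you are adding.

```python
from urllib.request import Request, urlopen


def preflight_urls(cuda_full, driver, cudnn):
    """Build the URLs to look at before starting an upgrade.

    Hypothetical helper: the NVIDIA templates are inferred from the 11.5
    example links in this guide and may not hold for every release.
    """
    cuda_short = ".".join(cuda_full.split(".")[:2])  # "11.5.0" -> "11.5"
    base = "https://developer.download.nvidia.com/compute"
    return {
        "cuda_linux_installer": (
            f"{base}/cuda/{cuda_full}/local_installers/"
            f"cuda_{cuda_full}_{driver}_linux.run"
        ),
        "cudnn_installers": (
            f"{base}/redist/cudnn/v{cudnn}/local_installers/{cuda_short}/"
        ),
        "conda_cuda": "https://anaconda.org/nvidia/cuda/files",
        "conda_cudatoolkit": "https://anaconda.org/nvidia/cudatoolkit/files",
        "docker_hub": "https://hub.docker.com/r/nvidia/cuda",
    }


def is_reachable(url, timeout=10.0):
    """Return True if a HEAD request succeeds (requires network access)."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            return resp.status < 400
    except OSError:  # covers HTTPError, URLError, and socket timeouts
        return False
```

Calling `preflight_urls("11.5.0", "495.29.05", "8.3.2")` reproduces the two NVIDIA links quoted above. Some servers reject HEAD requests, so treat a `False` from `is_reachable` as a prompt to check the page by hand rather than as proof that the package is missing.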
Make an issue to track the progress, for example #56721: Support 11.3. This is especially important as many PyTorch external users are interested in CUDA upgrades.
There are three types of Docker containers we maintain in order to build Linux binaries: `conda`, `libtorch`, and `manywheel`. They all require installing CUDA and then updating code references in respective build scripts/Dockerfiles. This step is about `conda`.
- Follow this PR 992 for all steps in this section
- Find the CUDA install link here
- Get the cuDNN link from NVIDIA on the PyTorch Slack
- Modify `install_cuda.sh`
- Run the `install_116` chunk of code on your devbox to make sure it works.
- Check this link to see if you need to add/remove any architectures to the nvprune list.
- Go into your cuda-11.6 folder and make sure what you're pruning actually exists. Update versions as needed, especially the visual tools like `nsight-systems`.
- Add setup for our Docker `conda` scripts/Dockerfiles
- To test that your code works, from the root builder repo, run something similar to `export CUDA_VERSION=11.3 && ./conda/build_docker.sh` for the `conda` images.
- Validate conda-builder docker hub cuda11.6 to see that images have been built and correctly tagged. These images are used in the next step to build Magma for Linux.
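Several names in this guide are derived from the CUDA version (`install_116`, `cuda-11.6`, `magma-cuda116`, `cuda11.6`). As a rough sketch, the hypothetical helper below collects those derivations in one place so they can be generated consistently; `cuda_names` is not part of the builder repo, and the patterns are only the ones visible in this guide.

```python
def cuda_names(version):
    """Derive the version-specific names used in this guide.

    Hypothetical helper; the patterns mirror the 11.6 examples in the text.
    """
    parts = version.split(".")
    if len(parts) < 2 or not all(p.isdigit() for p in parts[:2]):
        raise ValueError(f"expected a version like '11.6', got {version!r}")
    major, minor = parts[:2]
    compact = f"{major}{minor}"  # "11.6" -> "116"
    return {
        # chunk of install_cuda.sh to run on a devbox
        "install_function": f"install_{compact}",
        # folder to inspect before pruning
        "toolkit_folder": f"cuda-{major}.{minor}",
        # conda package built in the Magma step
        "magma_package": f"magma-cuda{compact}",
        # fragment to look for in the conda-builder image tags
        "docker_tag_fragment": f"cuda{major}.{minor}",
        # local smoke test, run from the root of the builder repo
        "build_command": (
            f"export CUDA_VERSION={major}.{minor} && ./conda/build_docker.sh"
        ),
    }
```

For example, `cuda_names("11.6")["install_function"]` gives `install_116`, and the `build_command` entry is the same command shown above with the version substituted.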
Build Magma for Linux. Our Linux CUDA jobs use conda, so we need to build magma-cuda116 and push it to anaconda:
- Follow this PR 997 for all steps in this section
- Currently, this is mainly copy-paste in `magma/Makefile` if there are no major code API changes/deprecations to the CUDA version. Previously, we've needed to add patches to MAGMA, so this may be something to check with NVIDIA about.
- To push the package, please update build-magma-linux workflow PR 897.
- NOTE: This step relies on the conda-builder image (changes to `.github/workflows/build-conda-images.yml`), so make sure you have pushed the new conda-builder prior. Validate this step by logging into anaconda.org and seeing your package deployed, for example here
4. Modify scripts to install the new CUDA for Libtorch and Manywheel Docker Linux containers. Modify builder supporting scripts
There are three types of Docker containers we maintain in order to build Linux binaries: `conda`, `libtorch`, and `manywheel`. They all require installing CUDA and then updating code references in respective build scripts/Dockerfiles. This step is about `libtorch` and `manywheel` containers.
Add setup for our Docker `libtorch` and `manywheel`:
- Follow PR 1003 for all steps in this section
- For `libtorch`, the code changes are usually copy-paste. For `manywheel`, you should manually verify the versions of the shared libraries against the CUDA you downloaded before.
- This is a manual step: create a ticket for the PyTorch Dev Infra team to create a new repo to host manylinux-cuda images in Docker Hub, for example https://hub.docker.com/r/pytorch/manylinux-cuda115. This repo should have public visibility and read & write access for bots. This step can be removed once the following issue is addressed.
- Push the images to Docker Hub. This step should be automated with the help of GitHub Actions in the `pytorch/builder` repo. Make sure to update the `cuda_version` to the version you're adding in the respective YAMLs, such as `.github/workflows/build-manywheel-images.yml`, `.github/workflows/build-conda-images.yml`, and `.github/workflows/build-libtorch-images.yml`.
- Verify that each of the workflows that push the images succeeds by selecting and verifying them in the Actions page of pytorch/builder. Furthermore, check https://hub.docker.com/r/pytorch/manylinux-builder/tags and https://hub.docker.com/r/pytorch/libtorch-cxx11-builder/tags to verify that the right tags exist for the manylinux and libtorch types of images.
- Finally, before enabling nightly binaries and CI builds, make sure to post PRs similar to PR 1015, PR 1017, and this commit to enable the new CUDA build in wheels and conda.
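Since the `cuda_version` update has to land in three workflow files, a quick scan can catch a file that was forgotten. This is a minimal sketch under stated assumptions: `workflows_missing_version` is a hypothetical helper, it only looks for the version string anywhere in each file, and it does not parse the YAML or know how `cuda_version` is actually laid out there.

```python
import re
from pathlib import Path

# Workflow files named in this guide, relative to the builder repo root.
WORKFLOWS = (
    ".github/workflows/build-manywheel-images.yml",
    ".github/workflows/build-conda-images.yml",
    ".github/workflows/build-libtorch-images.yml",
)


def workflows_missing_version(repo_root, cuda_version, workflows=WORKFLOWS):
    """Return the workflow paths that are absent or never mention cuda_version."""
    # Reject matches that are part of a longer number, e.g. "111.6" or "11.60".
    pattern = re.compile(rf"(?<![\d.]){re.escape(cuda_version)}(?!\d)")
    missing = []
    for rel in workflows:
        path = Path(repo_root) / rel
        if not path.is_file() or not pattern.search(path.read_text(encoding="utf-8")):
            missing.append(rel)
    return missing
```

Running `workflows_missing_version(".", "11.6")` from the root of a builder checkout returns an empty list once every listed workflow mentions the new version. A non-empty result only says where to look; the workflow runs in the Actions page remain the real verification.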
- Follow PR 999 for all steps in this section
- To get the CUDA install link, just like with Linux, go here and upload that `.exe` file to our S3 bucket ossci-windows.
- Review "Table 3. Possible Subpackage Names" of the CUDA installation guide for Windows (link) to make sure the subpackage names have not changed. These are specified in the cuda_install.bat file.
- To get the cuDNN install link, you could ask NVIDIA, but you could also just sign up for an NVIDIA account and access the needed `.zip` file at this link. First click on `cuDNN Library for Windows (x86)` and then upload that zip file to our S3 bucket.
- NOTE: When you upload files to S3, make sure to make these objects publicly readable so that our CI can access them!
- Most times, you have to upgrade the driver install for newer versions, which would look like updating the `windows/internal/driver_update.bat` file.
  - Please check the CUDA Toolkit and Minimum Required Driver Version for CUDA minor version compatibility table in the release notes to see if a driver update is necessary.
- Compile MAGMA with the new CUDA version. Update `.github/workflows/build-magma-windows.yml` to include the new version.
- Validate the Magma builds by going to the S3 bucket ossci-windows and querying for `magma_`.

Please note: since this step currently requires access to corporate AWS, it should be performed by a Meta employee. To be removed once automated.
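The driver check mentioned above comes down to comparing the driver currently installed by `driver_update.bat` with the minimum from the release-notes table. A small sketch of that comparison follows; `needs_driver_update` is a hypothetical helper, and the version strings in the usage note are placeholders, not values taken from NVIDIA's table.

```python
def parse_version(text):
    """Turn a dotted version such as '495.29.05' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def needs_driver_update(installed, minimum_required):
    """Return True if the installed driver is older than the required minimum."""
    return parse_version(installed) < parse_version(minimum_required)
```

For instance, `needs_driver_update("495.29.05", "510.39.01")` is `True`, while equal versions return `False`. Comparing integer tuples rather than raw strings avoids mistakes such as treating "95.1" as newer than "495.1".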
- For Windows, you will need to rebuild the test AMI; please refer to this PR. After this is done, run the release of the Windows AMI using this procedure. As of this writing, these are manual steps performed on a dev machine. Please note that packer and the AWS CLI need to be installed and configured!
- After step 1 is complete and the new Windows AMI has been deployed to AWS, we need to deploy the new AMI to our canary environment (https://github.com/pytorch/pytorch-canary) through https://github.com/fairinternal/pytorch-gha-infra (example: PR). After this is completed, submit the code for all Windows workflows to https://github.com/pytorch/pytorch-canary and make sure all tests are passing for all CUDA versions.
- After that, we can deploy the Windows AMI out to prod using the same pytorch-gha-infra repository.
Adding the new version to nightlies allows PyTorch binaries compiled with the new CUDA version to be available to users through `conda` or `pip` or just raw `libtorch`.
- If the new CUDA version requires a new driver (see #1 sub-bullet), the CI and binaries would also need the new driver. Find the driver download here and update the link like so.
  - Please check the Driver Version table in the release notes to see if a driver update is necessary.
- Follow this PR 81095 for steps 2-4 in this section.
- Once PR 81095 is created, make sure to attach the ciflow/binaries, ciflow/binaries_conda, ciflow/binaries_wheel, and ciflow/nightly labels to this PR, and make sure all the new workflows with the new CUDA version terminate successfully.
- Testing nightly builds is done as follows:
  - Make sure your commit to master passed all the tests and there are no failures; otherwise the next step will not work.
  - Make sure your changes are promoted to the viable/strict branch (https://github.com/pytorch/pytorch/tree/viable/strict). Run the viable/strict promotion job to promote from master to viable/strict.
  - After your changes are promoted to viable/strict, run the nightly build job.
  - Make sure your changes made it to the nightly branch (https://github.com/pytorch/pytorch/tree/nightly).
- Make sure all nightly builds succeeded before continuing to Step #6
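Whether a commit has been promoted to viable/strict can also be checked from a local clone. The sketch below wraps `git merge-base --is-ancestor`, which exits with 0 when the first commit is an ancestor of the second and 1 when it is not. It assumes a clone with `origin/viable/strict` already fetched and that promotion moves viable/strict to a commit on master; `is_commit_on_branch` is a hypothetical helper. The same ancestry relation is not assumed for the nightly branch, so check that one on GitHub as described above.

```python
import subprocess


def is_commit_on_branch(commit, branch, repo_dir=".", run=subprocess.run):
    """Return True if `commit` is an ancestor of (or equal to) `branch`.

    `run` is injectable so the function can be exercised without a repo.
    """
    result = run(
        ["git", "merge-base", "--is-ancestor", commit, branch],
        cwd=repo_dir,
    )
    if result.returncode not in (0, 1):
        # Any other exit code means git itself failed (bad ref, not a repo).
        raise RuntimeError(f"git merge-base failed with code {result.returncode}")
    return result.returncode == 0
```

After `git fetch origin`, a call such as `is_commit_on_branch("<your commit sha>", "origin/viable/strict")` tells you whether it is time to run the nightly build job.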
Testing the new version in CI is crucial for finding regressions and should be done ASAP along with the next step (I am simply putting this one first as it is usually easier).
- The configuration files will be subject to change, but usually you just have to replace an older CUDA version with the new version you're adding. Code reference for 11.5: PR 68745 for Linux and PR 69377 for Windows, and code reference for 11.3 where we just replaced verbatim yaml and updated magma for conda for Linux: PR 57223 for Windows and PR 57222 for Linux
- IMPORTANT NOTE: the CI is not always automatically triggered when you edit the workflow files! Ensure that the new CI job for the new CUDA version is showing up in the PR signal box. If it is not there, make sure you add the correct ciflow label (ciflow/periodic, for example) to trigger the test. Just because the CI is green on your pull request does NOT mean the test has been run and is green.
- It is likely that there will be tests that no longer pass with the new CUDA version or GPU driver. Disable them for the time being, notify people who can help, and make issues to track them (like so).
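One generic way to disable a failing test only for the new CUDA version is a conditional skip. The sketch below uses the standard library's `unittest.skipIf`; it is not PyTorch's own test-disabling mechanism, and `cuda_at_least`, `CURRENT_CUDA`, and the test names are hypothetical. In a real test file the current version would come from `torch.version.cuda`, which is a string such as "11.7" or `None` on CPU-only builds.

```python
import unittest


def cuda_at_least(current, minimum):
    """Compare 'major.minor' CUDA version strings; None means no CUDA."""
    if current is None:
        return False

    def as_tuple(version):
        return tuple(int(part) for part in version.split(".")[:2])

    return as_tuple(current) >= as_tuple(minimum)


# Placeholder value; in a PyTorch test this would be torch.version.cuda.
CURRENT_CUDA = "11.7"


class ExampleTest(unittest.TestCase):
    @unittest.skipIf(
        cuda_at_least(CURRENT_CUDA, "11.7"),
        "Fails with CUDA 11.7; reference the tracking issue here",
    )
    def test_known_regression(self):
        self.fail("would fail on the new CUDA version")

    def test_unaffected(self):
        self.assertEqual(1 + 1, 2)
```

Putting the tracking issue in the skip reason keeps the disabled test discoverable, so it can be re-enabled once the regression is fixed.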
Torchvision and torchaudio are usually installed alongside PyTorch by most of our users. This is why it is important to also propagate the CI changes so that torchvision and torchaudio can be packaged for the new CUDA version as well.
- A code sample for torchvision: PR 4248
- A code sample for torchaudio: PR 2067
- Almost every change in the above sample is copy-pasted from either itself or other existing parts of code in the builder repo. The difficulty again is not changing the config but rather verifying and debugging any failing builds.
Congrats! PyTorch now has support for a new CUDA version and you made it happen!