Add context around CUDA driver vs kernel versions #2

matthewfeickert opened this issue Dec 2, 2022 · 3 comments

matthewfeickert commented Dec 2, 2022

I don't fully understand the subtleties of trying to match CUDA drivers on Ubuntu (https://github.com/matthewfeickert/nvidia-gpu-ml-library-test is basically just a record of the commands I typed that worked) and of getting those to match the kernel driver versions that the different cudnn-enabled wheels were built against.

In the https://github.com/CHTC/templates-GPUs examples they mention

We require a machine with a modern version of the CUDA driver. CUDA drivers are
usually backwards compatible. So a machine with CUDA Driver version 10.1 should
be able to run containers built with older versions of CUDA.

Requirements = (Target.CUDADriverVersion >= 10.1)

which is why in PR #1 I set

base_image = "nvidia/cuda:11.6.0-cudnn8-devel-ubuntu20.04"

and

# We require a machine with a modern version of the CUDA driver
Requirements = (Target.CUDADriverVersion >= 11.6)

as I could get the CUDA 11.6.0 image to run on my local machine for interactive testing:

$ nvidia-smi | head -n 4
Thu Dec  1 18:35:01 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+

and still assumed that CHTC would have a machine that supports it. I had originally tried with nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04, but that failed with

>>> from jax.lib import xla_bridge
>>> xla_backend = xla_bridge.get_backend()
2022-12-02 00:45:49.488387: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW
2022-12-02 00:45:49.488493: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:313] kernel version 510.85.2 does not match DSO version 520.61.5 -- cannot find working devices in this configuration
WARNING:jax._src.lib.xla_bridge:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)

which led me down a bit of a rabbit hole.

@kratsg I think you have a much better handle on how to try to match drivers and conditions. So maybe we could try to document an approach here or on https://github.com/pyhf/cuda-images about how to go about finding the right match of CUDA driver, nvidia/cuda Docker image, and software releases for the problems someone might want to solve. We could additionally think about building a battery of images against common CUDA versions if that might be helpful for running on more sites.
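
Something like the following hypothetical sketch (assuming nvidia-smi is on PATH and prints its usual banner) could be a starting point for the kind of check such documentation would describe:

import re
import subprocess


def host_cuda_version() -> tuple[int, int]:
    # nvidia-smi prints a banner line like:
    # | NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6 |
    banner = subprocess.run(
        ["nvidia-smi"], capture_output=True, text=True, check=True
    ).stdout
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", banner)
    if match is None:
        raise RuntimeError("could not parse a CUDA version out of nvidia-smi output")
    return int(match.group(1)), int(match.group(2))


def image_should_run(image_cuda_version: str) -> bool:
    # Backwards-compatibility assumption: the host driver's supported CUDA
    # version must be >= the image's CUDA runtime version.
    major, minor, *_ = (int(part) for part in image_cuda_version.split("."))
    return host_cuda_version() >= (major, minor)


for tag in ("11.2.2", "11.6.0", "11.8.0"):
    print(f"CUDA {tag} image expected to run: {image_should_run(tag)}")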

matthewfeickert commented Dec 2, 2022

Perhaps

We require a machine with a modern version of the CUDA driver. CUDA drivers are
usually backwards compatible. So a machine with CUDA Driver version 10.1 should
be able to run containers built with older versions of CUDA.

explains the full extent of this, though. If CUDA drivers are mostly backwards compatible, then when you're trying to make a Docker image that works on most machines with CUDA installed, it is perhaps best to target slightly older releases(?).

Using my local machine as an example:

$ nvidia-smi | head -n 4
Thu Dec  1 18:35:01 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+

I'm able to run pyhf/cuda:0.7.0-jax-cuda-11.6.0-cudnn8

$ docker run --rm -ti --gpus all pyhf/cuda:0.7.0-jax-cuda-11.6.0-cudnn8                                                 
root@415cb7459135:/home/data# nvidia-smi 
Fri Dec  2 04:09:45 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   49C    P0    15W /  N/A |      4MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
root@415cb7459135:/home/data# echo "${CUDA_VERSION}"
11.6.0
root@415cb7459135:/home/data# python /docker/jax_detect_GPU.py 
XLA backend type: gpu

Number of GPUs found on system: 1

Active GPU index: 0
Active GPU name: NVIDIA GeForce RTX 3050 Ti Laptop GPU
root@415cb7459135:/home/data# 

and the older pyhf/cuda:0.6.3-jax-cuda-11.1

$ docker run --rm -ti --gpus all pyhf/cuda:0.6.3-jax-cuda-11.1 
root@1986c33106f6:/home/data# nvidia-smi 
Fri Dec  2 04:11:20 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   47C    P0    15W /  N/A |      4MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
root@1986c33106f6:/home/data# echo "${CUDA_VERSION}"
11.1.1
root@1986c33106f6:/home/data# curl -sL https://raw.githubusercontent.com/matthewfeickert/nvidia-gpu-ml-library-test/main/jax_detect_GPU.py | python
XLA backend type: gpu

Number of GPUs found on system: 1

Active GPU index: 0
Active GPU name: NVIDIA GeForce RTX 3050 Ti Laptop GPU
root@1986c33106f6:/home/data#

but the nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04 based pyhf/cuda:0.7.0-jax-cuda-11.8.0-cudnn8 image has a version of CUDA (11.8) that is newer than my local machine's (11.6), and so it fails with CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW:

$ docker run --rm -ti --gpus all pyhf/cuda:0.7.0-jax-cuda-11.8.0-cudnn8
root@8815d82e76c5:/home/data# nvidia-smi 
Fri Dec  2 04:15:15 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   50C    P0    15W /  N/A |      4MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
root@8815d82e76c5:/home/data# echo "${CUDA_VERSION}"
11.8.0
root@8815d82e76c5:/home/data# python -c 'from jax.lib import xla_bridge; xla_bridge.get_backend()'
2022-12-02 04:15:40.881759: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW
2022-12-02 04:15:40.881850: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:313] kernel version 510.85.2 does not match DSO version 520.61.5 -- cannot find working devices in this configuration
WARNING:jax._src.lib.xla_bridge:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
root@8815d82e76c5:/home/data#

There's still the question, I guess, of if/how Python wheels compiled for newer CUDA versions work with older CUDA versions. That is, does jaxlib v0.3.25+cuda11.cudnn82 work with even older versions of CUDA? I guess this can be tested by building images with older CUDA versions, checking the output of

cat /usr/include/x86_64-linux-gnu/cudnn_version_v*.h | grep '#define CUDNN_'

in them, and then seeing if more modern versions of jaxlib work. Though I think this would be easier if I better understood how to read the dependency requirements encoded in a wheel's name.
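
As a hypothetical convenience, the same check could be scripted; the header directory below is the one from the nvidia/cuda devel images used above and may differ elsewhere:

import re
from pathlib import Path


def cudnn_version(header_dir="/usr/include/x86_64-linux-gnu"):
    # The cuDNN headers carry lines like:
    #   #define CUDNN_MAJOR 8
    #   #define CUDNN_MINOR 2
    #   #define CUDNN_PATCHLEVEL 4
    for header in sorted(Path(header_dir).glob("cudnn_version*.h")):
        text = header.read_text()
        matches = [
            re.search(rf"#define CUDNN_{field} (\d+)", text)
            for field in ("MAJOR", "MINOR", "PATCHLEVEL")
        ]
        if all(matches):
            return tuple(int(match.group(1)) for match in matches)
    raise FileNotFoundError(f"no cudnn_version header found under {header_dir}")


# e.g. (8, 2, 4) would correspond to the cudnn82 part of a jaxlib wheel tag
print(cudnn_version())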

@kratsg given all this, what is the nvidia/cuda base image that you use for SLUGPU? You mentioned it used CUDA v11.2, but what is the actual base image tag? Can you try this image built on top of nvidia/cuda:11.2.2-devel-ubuntu20.04?

$ docker run --rm -ti --gpus all pyhf/cuda:0.7.0-jax-cuda-11.2.2       
root@54ea11142a60:/home/data# nvidia-smi 
Fri Dec  2 04:40:31 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   47C    P0    15W /  N/A |      4MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
root@54ea11142a60:/home/data# echo "${CUDA_VERSION}"
11.2.2
root@54ea11142a60:/home/data# python /docker/jax_detect_GPU.py 
XLA backend type: gpu

Number of GPUs found on system: 1

Active GPU index: 0
Active GPU name: NVIDIA GeForce RTX 3050 Ti Laptop GPU
root@54ea11142a60:/home/data#

matthewfeickert self-assigned this Dec 2, 2022

matthewfeickert commented
Not to abuse your time, but @bdice, as you're an expert's expert when it comes to CUDA: do you have general recommendations (or resources to look at) for building CUDA-enabled Docker images for software meant to run on GPUs on remote clusters, and for making all the CUDA versions and binaries play nicely together?

My current (weak) understanding is that you:

  • Probably want to target base images with older versions of CUDA so that they can run on as many machines as possible, given CUDA's backwards-compatibility strategy. Here you're hoping to find a machine with a CUDA version the same as or newer than yours, so that it will be able to run your container's CUDA.
  • Hope that the library version you want to use has CUDA-dependent binaries that are compatible with your image's CUDA version. If not, then play around with building an image with a CUDA version that can support your required library version (and is still within reach of the machine you want to run on).
    • Here I'm thinking about Python wheels, but I'll be honest in that I'm not sure how restrictive something like jaxlib v0.3.25+cuda11.cudnn82 actually is.
  • If you have the time, try to build multiple versions of the image with different nvidia/cuda base images to target as many CUDA versions as you can (a sketch of such a build matrix follows below).
    • e.g. pyhf/cuda:0.7.0-jax-cuda-11.2.2, pyhf/cuda:0.7.0-jax-cuda-11.6.0-cudnn8, pyhf/cuda:0.7.0-jax-cuda-11.8.0-cudnn8
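
For that last point, a hypothetical sketch of driving such a build matrix (assuming the Dockerfile accepts a BASE_IMAGE build argument, which may not match the real pyhf/cuda-images setup):

import subprocess

# nvidia/cuda base tags taken from the examples in this thread
BASE_IMAGES = [
    "nvidia/cuda:11.2.2-devel-ubuntu20.04",
    "nvidia/cuda:11.6.0-cudnn8-devel-ubuntu20.04",
    "nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04",
]

for base in BASE_IMAGES:
    # e.g. "11.6.0-cudnn8" from "nvidia/cuda:11.6.0-cudnn8-devel-ubuntu20.04"
    cuda_part = base.split(":")[1].split("-devel")[0]
    subprocess.run(
        [
            "docker", "build",
            "--build-arg", f"BASE_IMAGE={base}",
            "--tag", f"pyhf/cuda:0.7.0-jax-cuda-{cuda_part}",
            ".",
        ],
        check=True,
    )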

bdice commented Jan 23, 2023

@matthewfeickert Hi! Sorry, I've been working through an email backlog and just saw this.

I think your understanding is approximately correct. This webpage is the definitive resource for CUDA compatibility: https://docs.nvidia.com/deploy/cuda-compatibility/index.html It is admittedly complex, and I won't guarantee that my answers here are 100% correct. I have worked through a handful of exceptional cases for CUDA compatibility and I'm still learning a lot about the minutiae of this topic.

There are multiple kinds of compatibility described in that document above. I will attempt to summarize some of the pieces I think are most important to know:

  • Minor Version Compatibility: compile against any CUDA 11.x Toolkit, run on any driver >= 450.80.02 (a quick check of this floor is sketched after this list). This has two caveats:
    • If your application uses CUDA features that are not available in older drivers, then it will return an error code cudaErrorCallRequiresNewerDriver. An example of this would be calling cudaMallocAsync with a driver older than CUDA 11.2's, the release in which that API was introduced (blog post).
    • If your application compiles device code to PTX (like some JIT compiled features of various libraries), you need drivers that match the runtime. For deploying RAPIDS, we have designed some workarounds with tools like ptxcompiler for numba, but this is rather complex and I'd recommend just building containers for more CUDA versions if that can solve your problem.
  • Forward Compatibility: Use the cuda-compat package to enable compatibility with newer toolkits on older drivers
    • The nvidia/cuda Docker images come with cuda-compat installed (example) and that's why they can work with older drivers. I can't recall exactly how this works, though -- I feel like there might be requirements imposed by nvidia-container-runtime or nvidia-docker2? I would defer to others' knowledge of how containers and cuda-compat work together in practice.
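
A quick, hypothetical sketch of checking that driver floor via nvidia-smi's query interface:

import subprocess

# Drivers >= 450.80.02 support any CUDA 11.x toolkit under minor version compatibility
MINIMUM_DRIVER = (450, 80, 2)

driver = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout.strip().splitlines()[0]

driver_version = tuple(int(part) for part in driver.split("."))
print(f"driver {driver} minor-version compatible with CUDA 11.x:",
      driver_version >= MINIMUM_DRIVER)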

I haven't dealt with compatibility questions involving cudnn before. I know it is versioned separately, and I suspect that might have an impact on the compatibility matrix, but I am uncertain.

Conclusion: Leveraging CUDA compatibility is great if it works for your use case. If you're not sure about your application requirements (or your dependencies' requirements) or if things just aren't working, you can always build multiple containers for each version of CUDA you need to support and things should be fine.
