Newer Linux kernels are no longer compatible with EFA and closed-source Nvidia drivers on instances with GPU Direct RDMA support
The Linux kernel community recently introduced a change that is incompatible with EFA and Nvidia drivers. This change has propagated to recent releases of Linux distributions, including Amazon Linux. When using instance types with GPUDirect RDMA (the option to read/write directly between the EFA device and GPU memory), the EFA kernel module is unable to retrieve GPU memory information.
Nvidia introduced an open-source (OSS) version of their drivers, known as OpenRM, that is compatible with this kernel change. EFA released a new version, 1.29.0, that is compatible with recent kernels and with the OSS Nvidia driver.
Using P4 or P5 instance types with a recently released Linux kernel RPM, in combination with EFA and the non-OSS Nvidia drivers (ParallelCluster < 3.8.0), will cause the communication between your workload nodes (via EFA) to stop working.
Starting from ParallelCluster 3.8.0, the OSS Nvidia drivers and EFA 1.29.0 are installed by default in all official ParallelCluster AMIs, allowing customers to use recent kernels and safely ingest security fixes.
Unfortunately, the OSS Nvidia drivers can only be used on Turing, Ampere, Hopper, or later GPUs. The full list of compatible GPUs is available here. This means that P3, P2, G3 and G2 instances are no longer supported with official ParallelCluster 3.8.0+ AMIs.
P4 or P5 instance types in combination with EFA and the non-OSS Nvidia drivers (ParallelCluster <= 3.7.2) won’t work after updating the Linux kernel to or beyond the following versions: 4.14.326, 5.4.257, 5.10.195, 5.15.131, 6.1.52. In the logs you will find an error like the following:
kernel: failing symbol_get of non-GPLONLY symbol nvidia_p2p_get_pages.
P3, P2, G3 and G2 instances, with official ParallelCluster >= 3.8.0 AMIs, are unable to bootstrap. In the logs of the failing instances you will find an error like the following:
[2024-01-10T18:34:22+00:00] FATAL: Mixlib::ShellOut::ShellCommandFailed: gdrcopy[Configure gdrcopy] (aws-parallelcluster-platform::nvidia_config line 22) had an error: Mixlib::ShellOut::ShellCommandFailed: service[gdrcopy] (aws-parallelcluster-platform::nvidia_config line 103) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'
---- Begin output of ["/bin/systemctl", "--system", "start", "gdrcopy"] ----
STDOUT:
STDERR: Job for gdrcopy.service failed because the control process exited with error code. See "systemctl status gdrcopy.service" and "journalctl -xe" for details.
---- End output of ["/bin/systemctl", "--system", "start", "gdrcopy"] ----
Ran ["/bin/systemctl", "--system", "start", "gdrcopy"] returned
+ error_exit 'Failed to run bootstrap recipes. If --norollback was specified, check /var/log/cfn-init.log and /var/log/cloud-init-output.log.'
+ echo 'Bootstrap failed with error: Failed to run bootstrap recipes. If --norollback was specified, check /var/log/cfn-init.log and /var/log/cloud-init-output.log.'
Bootstrap failed with error: Failed to run bootstrap recipes. If --norollback was specified, check /var/log/cfn-init.log and /var/log/cloud-init-output.log.
How to check affected components:
- Kernel version:
uname -r
- Installed Nvidia driver version:
nvidia-smi
- Nvidia license of the kernel:
modinfo -F license nvidia
It will return Dual MIT/GPL or NVIDIA for the open-source or closed driver, respectively.
- GDRCopy version:
modinfo -F version gdrdrv
- EFA installed version:
cat /opt/amazon/efa_installed_packages | grep -E -o "EFA installer version: [0-9.]+"
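As a convenience, the checks above can be combined into a single shell script. This is a minimal sketch, assuming an Amazon Linux style system where the nvidia and gdrdrv modules are installed and the EFA installer manifest exists at the path shown above:
#!/usr/bin/env bash
# Sketch: print the components relevant to this issue in one pass.
echo "Kernel:         $(uname -r)"
echo "Nvidia driver:  $(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n1)"
echo "Nvidia license: $(modinfo -F license nvidia 2>/dev/null)"   # Dual MIT/GPL = OSS, NVIDIA = closed source
echo "GDRCopy:        $(modinfo -F version gdrdrv 2>/dev/null)"
echo "EFA installer:  $(grep -E -o 'EFA installer version: [0-9.]+' /opt/amazon/efa_installed_packages 2>/dev/null)"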
Mitigation for ParallelCluster == 3.8.0 - How to build a custom AMI with Closed Source Nvidia drivers for P3, P2, G3 and G2
If you want to use P3, P2, G3 or G2 instance types with ParallelCluster 3.8.0, you need to build your own custom AMI. The option to choose between open- and closed-source drivers was introduced in the Nvidia installer starting from Nvidia version 515.
To build a custom AMI for ParallelCluster == 3.8.0 with non-OSS Nvidia drivers, you can follow the build-image approach. Through an attribute specified at AMI creation time, it is possible to select the version of Nvidia drivers to install, but up to ParallelCluster 3.8.0 it is not possible to choose between closed- and open-source Nvidia drivers. This means that if you’re using ParallelCluster 3.8.0, you need to use an Nvidia version whose installer does not offer the open-source variant of the drivers (e.g. 470).
You should:
- use a GPU instance type for the build,
- select the ParentImage you prefer,
- pass an old non-OSS Nvidia driver version (<515) as ExtraChefAttributes (e.g. 470.223.02),
- Nvidia drivers, CUDA, Fabric Manager, GDRCopy and EFA will be installed by the build-image process.
Build:
  InstanceType: p2.xlarge # GPU instance type
  ParentImage: ami-0c0b74d29acd0cd97 # amzn2-ami-kernel-5.10-hvm-2.0.20240109.0-x86_64-gp2 us-east-1
  Imds:
    ImdsSupport: v2.0
DevSettings:
  Cookbook:
    ExtraChefAttributes: |
      {"cluster": {"nvidia": {"enabled": true, "driver_version": "470.223.02"}}}
If you want to use, in the same cluster, instance types with GPUDirect RDMA support, like P4 and P5, you should use the official ParallelCluster AMIs for the related Compute Resources.
Mitigation for ParallelCluster > 3.8.0 - How to build a custom AMI with Closed Source Nvidia drivers for P3, P2, G3 and G2
For ParallelCluster > 3.8.0 we added the possibility to select the type of Nvidia kernel module to install. Through an attribute specified at AMI creation time, it is possible to select both the version and the type of Nvidia drivers to install.
This is useful if you want to use P3, P2, G3 or G2 instances and want to create an AMI with a recent version of the non-OSS Nvidia drivers.
To build a custom AMI for ParallelCluster > 3.8.0 with non-OSS Nvidia drivers, you can follow the build-image approach:
- use a GPU instance type for the build,
- select the ParentImage you prefer,
- select the Nvidia driver version you prefer and pass it as ExtraChefAttributes (e.g. 535.183.01),
- Nvidia drivers, CUDA, Fabric Manager, GDRCopy and EFA will be installed by the build-image process.
Build:
  InstanceType: p2.xlarge # GPU instance type
  ParentImage: ami-0c0b74d29acd0cd97 # amzn2-ami-kernel-5.10-hvm-2.0.20240109.0-x86_64-gp2 us-east-1
  Imds:
    ImdsSupport: v2.0
DevSettings:
  Cookbook:
    ExtraChefAttributes: |
      {"cluster": {"nvidia": {"enabled": true, "driver_version": "535.183.01", "kernel_open": "false"}}}
If you want to use, in the same cluster, instance types with GPUDirect RDMA support, like P4 and P5, you should use the official ParallelCluster AMIs for the related Compute Resources, as sketched below.
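As an illustration, a partial cluster configuration could point the queue hosting P3/P2/G3/G2 compute resources at the custom AMI through the queue-level Image/CustomAmi setting, while leaving the GPUDirect RDMA queue on the official AMI. Queue names, instance types and the AMI id below are placeholders, and required sections such as HeadNode and queue Networking are omitted:
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: gpu-legacy # placeholder queue for P3/P2/G3/G2 with the custom closed-source-driver AMI
      Image:
        CustomAmi: ami-xxxxxxxxxxxxxxxxx # placeholder: the custom AMI built above
      ComputeResources:
        - Name: p3
          InstanceType: p3.2xlarge
          MinCount: 0
          MaxCount: 4
    - Name: gpu-rdma # placeholder queue for P4/P5 using the official ParallelCluster AMI (OSS drivers)
      ComputeResources:
        - Name: p4d
          InstanceType: p4d.24xlarge
          Efa:
            Enabled: true
          MinCount: 0
          MaxCount: 2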
Mitigation for ParallelCluster <= 3.7.2 - How to create a custom AMI with Open Source Nvidia drivers for P4 and P5
If you’re using ParallelCluster <= 3.7.2 with official ParallelCluster AMIs, you will be affected by the issue only if you update the kernel to a newer version.
A first alternative is to lock the kernel version. The instructions for that can be found in the kernel section of the DLAMI release notes: https://aws.amazon.com/releasenotes/aws-deep-learning-base-ami-amazon-linux-2/. Going forward, you should migrate to the OSS Nvidia driver to keep your ability to ingest kernel security fixes.
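On Amazon Linux 2, for example, the kernel can be pinned with the yum versionlock plugin. This is only a minimal sketch; refer to the DLAMI release notes linked above for the full procedure:
# Install the versionlock plugin and pin the currently installed kernel packages
sudo yum install -y yum-plugin-versionlock
sudo yum versionlock kernel kernel-headers kernel-devel
# To allow kernel updates again later:
# sudo yum versionlock delete kernel kernel-headers kernel-devel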
To build a custom AMI for 3.7.2 with an updated kernel and the OSS Nvidia drivers, you need to create a base AMI with EFA and the OSS Nvidia driver installed on it, and then use this image as ParentImage for the build-image approach, which installs the ParallelCluster software stack on top of it.
Select the base AMI you prefer and launch an instance from it. Log in to the instance and install Nvidia drivers, CUDA, Fabric Manager, GDRCopy and EFA:
- to install the OSS Nvidia drivers, the Nouveau drivers must first be disabled. Each Linux distribution has a different method for disabling Nouveau, see the official instructions (a sketch for Amazon Linux 2 follows this list),
- install the Nvidia drivers, selecting the driver version you prefer but adding the -m=kernel-open flag to the nvidia.run command:
sudo CC=/usr/bin/gcc10-gcc ./nvidia.run --silent --dkms --disable-nouveau --no-cc-version-check -m=kernel-open
- install CUDA,
- install Nvidia Fabric manager,
- install GDRCopy to improve libfabric performance,
- install EFA 1.29.0+ in the AMI, by following official documentation.
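As referenced in the first bullet, here is a minimal sketch of disabling Nouveau on Amazon Linux 2 by blacklisting the module and rebuilding the initramfs; consult Nvidia's official instructions for your distribution:
# Blacklist the Nouveau kernel module so it does not claim the GPU at boot
cat <<'EOF' | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF
# Rebuild the initramfs so the blacklist takes effect, then reboot
sudo dracut --force
sudo reboot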
Then create a new AMI starting from this instance.
An alternative approach is to start from an official ParallelCluster AMI, update the kernel, uninstall the existing Nvidia drivers, GDRCopy and EFA, and then install an updated GDRCopy and EFA version together with the OSS Nvidia drivers.
We will manually create the custom AMI by launching a GPU-based instance, installing the required versions of OpenRM, GDRCopy and EFA, and then creating the AMI using the EC2 Console/CLI.
NOTE: We recommend rebooting the instance after uninstalling the older version and installing the newer one, to prevent undesired side effects, since open calls still using the older version might cause instability. However, we have not experienced faulty behavior when skipping the reboot so far. The steps below can be run as custom scripts.
- Launch a GPU-based instance (e.g. g4dn.xlarge) with the official ParallelCluster alinux2 x86 AMI (e.g. ami-02be5d4c5f5696b63 in us-east-1)
$ pcluster list-official-images
{
"amiId": "ami-02be5d4c5f5696b63",
"os": "alinux2",
"name": "aws-parallelcluster-3.7.2-amzn2-hvm-x86_64-202310120952 2023-10-12T09-56-59.493Z",
"version": "3.7.2",
"architecture": "x86_64"
}
- Update the OS, reboot, and then check the kernel. We update the kernel because the issue surfaces when the kernel is 5.10.196+.
## Kernel Before Update
$ uname -r
5.10.192-183.736.amzn2.x86_64
$ sudo yum update -y && sudo reboot
## Kernel After Update
$ uname -r
5.10.199-190.747.amzn2.x86_64
- Check the License of Nvidia drivers
$ modinfo -F license nvidia
NVIDIA
- Download the OpenRM run file for NVIDIA 535.54.03 and provide executable permission to the run file
$ wget https://us.download.nvidia.com/tesla/535.54.03/NVIDIA-Linux-x86_64-535.54.03.run -O /tmp/nvidia.run
$ chmod +x /tmp/nvidia.run
- Uninstall the existing NVIDIA GPU drivers
$ cd /tmp && sudo ./nvidia.run --uninstall --silent
- Log in to the instance and verify NVIDIA is uninstalled by running the nvidia-smi command
$ nvidia-smi
-bash: /usr/bin/nvidia-smi: No such file or directory
- Install OpenRM Drivers
$ cd /tmp/ && sudo CC=/usr/bin/gcc10-gcc ./nvidia.run --silent --dkms --disable-nouveau --no-cc-version-check -m=kernel-open
- Verify the drivers are installed using the nvidia-smi command and that the license of the NVIDIA driver is Dual MIT/GPL
$ nvidia-smi
Mon Nov 27 21:49:33 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:1E.0 Off | 0 |
| N/A 53C P0 27W / 70W | 2MiB / 15360MiB | 7% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
$ modinfo -F license nvidia
Dual MIT/GPL
- Check GDRCopy version
$ modinfo -F version gdrdrv
2.3
- Uninstall GDRCopy
$ sudo rpm -e gdrcopy-kmod
Stopping gdrcopy (via systemctl): [ OK ]
Uninstalling and removing the driver.
This process may take a few minutes ...
$ sudo rpm -e gdrcopy-devel
$ sudo rpm -e gdrcopy
$ rpm -qa | grep gdrc*
- Download GDRCopy 2.4 as per Guide
$ sudo wget https://github.com/NVIDIA/gdrcopy/archive/refs/tags/v2.4.tar.gz -O /opt/parallelcluster/sources/gdrcopy-2.4.tar.gz
- Install dependencies
$ sudo yum -y install dkms rpm-build make check check-devel subunit subunit-devel
- Install GDRCopy 2.4
$ cd /opt/parallelcluster/sources
$ sudo tar -xf gdrcopy-2.4.tar.gz
$ cd gdrcopy-2.4/packages/
$ sudo CUDA=/usr/local/cuda ./build-rpm-packages.sh
$ sudo rpm -q gdrcopy-kmod-2.4-1dkms || sudo rpm -Uvh gdrcopy-kmod-2.4-1dkms.amzn-2.noarch.rpm
$ sudo rpm -q gdrcopy-2.4-1.x86_64 || sudo rpm -Uvh gdrcopy-2.4-1.amzn-2.x86_64.rpm
$ sudo rpm -q gdrcopy-devel-2.4-1.noarch || sudo rpm -Uvh gdrcopy-devel-2.4-1.amzn-2.noarch.rpm
- Verify GDRCopy version
$ modinfo -F version gdrdrv
2.4
- Check old EFA Version
$ cat /opt/amazon/efa_installed_packages | grep -E -o "EFA installer version: [0-9.]+"
EFA installer version: 1.26.1
- Download EFA 1.29.0 (see Get started with EFA and MPI - Amazon Elastic Compute Cloud)
$ sudo wget https://efa-installer.amazonaws.com/aws-efa-installer-1.29.0.tar.gz -P /opt/parallelcluster/sources/
- Uninstall the older EFA version
$ cd /opt/parallelcluster/sources/
$ sudo tar -xzf aws-efa-installer-1.29.0.tar.gz
$ cd aws-efa-installer
$ sudo ./efa_installer.sh -u
- Install EFA 1.29.0
$ sudo ./efa_installer.sh -y
- Reboot and Check packages Installed
$ sudo reboot
$ cat /opt/amazon/efa_installed_packages | grep -E -o "EFA installer version: [0-9.]+"
EFA installer version: 1.29.0
- Clean Up
$ rm -rf /tmp/nvidia.run
$ rm -rf /opt/parallelcluster/sources/aws-efa-installer
$ sudo /usr/local/sbin/ami_cleanup.sh
Now create a ParallelCluster 3.7.2 custom AMI from this instance, with Create Image on the EC2 Console or via the CLI:
$ aws ec2 create-image --instance-id <Instance-id> --name "My server" --description "An AMI for my server"
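The image name and description above are just examples. You can optionally wait for the new AMI to become available before referencing it in your cluster configuration; the AMI id placeholder below is the one returned by create-image:
$ aws ec2 wait image-available --image-ids <ami-id-returned-by-create-image>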
AWS
- EC2 Instance types and GPU correspondence
- GPU powered instance-types supported by EFA
- RDMA EFA support
- Instructions to pin kernel
Nvidia
Linux