NVIDIA Fabric Manager stops running on Ubuntu 18.04 and Ubuntu 20.04
On clusters created with the official Ubuntu 18.04 and Ubuntu 20.04 AMIs, nvidia-fabricmanager is automatically updated to an incompatible version and stops working when nodes are launched.
The impact is limited to EC2 instances and applications that make use of NVIDIA Fabric Manager. At the time of writing only p4d instances are affected.
Affected ParallelCluster versions: >= 2.10.0, <= 2.11.1
The issue started on Jul 21, 2021, when Ubuntu published the nvidia-fabricmanager package to its official repo: http://archive.ubuntu.com/ubuntu/pool/multiverse/f/fabric-manager-460/. Since then, unattended-upgrades, which is enabled by default on ParallelCluster Ubuntu AMIs, has been upgrading Fabric Manager to a version that is incompatible with the installed NVIDIA drivers.
While we work on addressing the issue and publishing a patched version of the product, here is how you can patch clusters created with affected versions.
- Create a bash script, e.g. fix-fabricmanager.sh, with the following content (or add the code to your existing pre-installation script):
#!/bin/bash
set -ex
# Count the NVIDIA devices with PCI ID 10de:1af1 (NVSwitch); patch only when more than one is found
nvswitches=$(lspci -d 10de:1af1 | wc -l)
if [ "${nvswitches}" -gt "1" ]; then
  # Add the NVIDIA CUDA repository for this distribution (e.g. ubuntu1804)
  # From https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html#ubuntu-lts
  distribution=$(. /etc/os-release; echo "${ID}${VERSION_ID}" | sed -e 's/\.//g')
  echo "deb http://developer.download.nvidia.com/compute/cuda/repos/${distribution}/x86_64 /" | sudo tee /etc/apt/sources.list.d/cuda.list
  sudo apt-get update --allow-releaseinfo-change
  # Install the Fabric Manager release that matches the installed driver version
  driver_version=$(nvidia-smi | grep -oP "(?<=Driver Version: )[0-9.]+")
  driver_major=$(echo "${driver_version}" | cut -d. -f1)
  sudo apt-get install -y --allow-downgrades "nvidia-fabricmanager-${driver_major}=${driver_version}*"
  # Hold the package so that later upgrades do not replace it again
  sudo apt-mark hold "nvidia-fabricmanager-${driver_major}"
  sudo systemctl enable nvidia-fabricmanager.service
  sudo systemctl start nvidia-fabricmanager.service
fi
- Upload the script to an S3 bucket with the correct permissions (see https://docs.aws.amazon.com/parallelcluster/latest/ug/pre_post_install.html), e.g.:
aws s3 cp fix-fabricmanager.sh s3://yourbucket/
- Add the following setting to your cluster configuration:
[cluster yourcluster]
pre_install = s3://yourbucket/fix-fabricmanager.sh
...
- Either create a new cluster or follow the next steps to update an existing cluster
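To make the version pinning in the patch script easier to follow, here is a minimal sketch of the two derivations it performs: the CUDA repository name from the OS release, and the Fabric Manager package pin from the driver version. The function names cuda_repo_distribution and fabricmanager_pkg_spec and the sample driver version 460.73.01 are assumptions made for illustration and are not part of the original script. sed is used here in place of grep -oP so that the sketch can parse text passed on stdin.

```shell
#!/bin/bash
# Hypothetical helpers; names and sample values are assumptions for illustration.

# Build the directory name used by the NVIDIA CUDA repository,
# e.g. ID=ubuntu and VERSION_ID=18.04 give "ubuntu1804".
cuda_repo_distribution() {
  local id="$1" version_id="$2"
  echo "${id}${version_id}" | sed -e 's/\.//g'
}

# Read nvidia-smi style text on stdin and print the apt package pin
# for the Fabric Manager release matching the reported driver version.
fabricmanager_pkg_spec() {
  local driver_version driver_major
  driver_version=$(sed -n 's/.*Driver Version: \([0-9.]*\).*/\1/p' | head -n 1)
  driver_major=$(echo "${driver_version}" | cut -d. -f1)
  echo "nvidia-fabricmanager-${driver_major}=${driver_version}*"
}

cuda_repo_distribution "ubuntu" "18.04"
echo "Driver Version: 460.73.01" | fabricmanager_pkg_spec
```

Running these helpers on a node before applying the patch shows which package version apt would be asked to install, which can help confirm that the pin matches the driver reported by nvidia-smi.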
To update an existing cluster with the pre-installation script configured in the previous steps:
- Stop the cluster with the pcluster stop command
- Update the cluster with the pcluster update command
- Restart the cluster with the pcluster start command
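The stop, update, and start sequence above could be wrapped in a small script like the following sketch. The cluster name yourcluster, the CLUSTER_NAME and DRY_RUN variables, and the run helper are assumptions introduced for this example. The sketch defaults to a dry run that only prints the commands. It does not wait for the compute fleet to finish stopping, and pcluster update may ask for interactive confirmation, so treat it as a starting point rather than a tested procedure.

```shell
#!/bin/bash
# Sketch of the stop / update / start sequence for ParallelCluster 2.x.
# CLUSTER_NAME and DRY_RUN are assumptions introduced for this example.

run() {
  # Print the command in dry-run mode (the default), otherwise execute it.
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

update_cluster() {
  local cluster_name="$1"
  run pcluster stop "${cluster_name}"
  run pcluster update "${cluster_name}"
  run pcluster start "${cluster_name}"
}

update_cluster "${CLUSTER_NAME:-yourcluster}"
```

Set DRY_RUN=0 to execute the commands for real once the printed sequence looks right.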
Alternatively, disable unattended upgrades in a custom AMI. This option applies only to new clusters.
- Follow the official documentation to modify an existing ParallelCluster AMI
- As part of the AMI customization step, connect to the instance and run the following commands:
sudo sed -i "s/Update-Package-Lists \"1\"/Update-Package-Lists \"0\"/g" /etc/apt/apt.conf.d/20auto-upgrades
sudo sed -i "s/Unattended-Upgrade \"1\"/Unattended-Upgrade \"0\"/g" /etc/apt/apt.conf.d/20auto-upgrades
- Complete the steps to create a custom AMI
- Create a cluster using the generated AMI with the custom_ami parameter.
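Before creating the custom AMI, it may be worth confirming that both substitutions took effect. The sketch below applies the same two sed commands to a file passed as an argument and then checks that neither setting is still "1". The helper names disable_auto_upgrades and auto_upgrades_disabled are hypothetical, and the demonstration runs on a temporary file with assumed contents rather than on the real /etc/apt/apt.conf.d/20auto-upgrades.

```shell
#!/bin/bash
# Hypothetical helpers; the file contents used below are an assumed example.

# Apply the two substitutions from the steps above to the given file.
disable_auto_upgrades() {
  local conf_file="$1"
  sed -i "s/Update-Package-Lists \"1\"/Update-Package-Lists \"0\"/g" "${conf_file}"
  sed -i "s/Unattended-Upgrade \"1\"/Unattended-Upgrade \"0\"/g" "${conf_file}"
}

# Succeed only if neither setting is still set to "1".
auto_upgrades_disabled() {
  local conf_file="$1"
  ! grep -Eq '(Update-Package-Lists|Unattended-Upgrade) "1"' "${conf_file}"
}

# Demonstrate on a temporary file instead of the real configuration.
tmp_conf=$(mktemp)
printf 'APT::Periodic::Update-Package-Lists "1";\nAPT::Periodic::Unattended-Upgrade "1";\n' > "${tmp_conf}"
disable_auto_upgrades "${tmp_conf}"
cat "${tmp_conf}"
auto_upgrades_disabled "${tmp_conf}" && echo "unattended upgrades disabled"
rm -f "${tmp_conf}"
```

On the AMI instance itself, the check would be run with sudo against /etc/apt/apt.conf.d/20auto-upgrades after the two sed commands.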