- HPC SDK 21.9 and up (HPC SDK 22.3+ includes a preview of cuFFTMp and is recommended.)
- A system with at least one Ampere (SM80) or Volta (SM70) GPU. When using multi GPUs, GPUs have to be peer-to-peer accessible to/from each other or connected using Infiniband.
- The
c2c_pencils
andr2c_c2r_pencils
samples require at least 4 GPUs.
- The
Please see the "Hardware and software requirments" session of documentation for the full list of requirements.
The following environmental variables need to be defined in order to build and run the samples, for example:
MPI_HOME=/hpc_sdk/Linux_x86_64/.../comm_libs/hpcx/latest/ompi
, the path to your MPI installation and should contain alib
andinclude
folderCUFFT_LIB=/hpc_sdk/Linux_x86_64/.../math_libs/lib64/
, wherelibcufftMp.so
is locatedCUFFT_INC=/hpc_sdk/Linux_x86_64/.../math_libs/include/cufftmp
, where all the cuFFT and cuFFTMp headers files are locatedNVSHMEM_LIB=/hpc_sdk/Linux_x86_64/.../comm_libs/nvshmem/lib
, wherenvshmem_bootstrap_mpi.so
is located
As cuFFTMP is released in HPC SDK 22.3 and up, to build and run the samples (or your applications) with cuFFTMp it is highly recommended to have $MPI_HOME, $CUFFT_LIB, $CUFFT_INC, and $NVSHMEM_LIB all pointing to the same HPC SDK version.
If you have to use an older version of HPC SDK (21.9 or 21.11), you can find the early-access version of cuFFTMp in cuFFTMP EA.
Then build and run the C2C sample by:
$ cd samples/c2c
$ make run
Hello from rank 1/2 using GPU 1 transform of size 16 x 16 x 16, local size 8 x 16 x 16
Hello from rank 0/2 using GPU 0 transform of size 16 x 16 x 16, local size 8 x 16 x 16
Shuffled (Y-Slabs) GPU data, global 3D index [0 8 0], local index 0, rank 1 is (-13.323235,-48.004234)
[...]
Shuffled (Y-Slabs) GPU data, global 3D index [0 0 9], local index 9, rank 0 is (15.618601,-9.228624)
Relative Linf error on rank 0, 3.226381e-07
Relative Linf error on rank 1, 3.109569e-07
PASSED on rank 1
PASSED on rank 0
If you see PASSED
, the test ran successfully.
- You can repeat the same procedure for the other samples
samples/c2c_pencils
samples/r2c_c2r
samples/r2c_c2r_shared_scratch
samples/r2c_c2r_pencils
samples/reshape
A Fortran wrapper library for cuFFTMp is provided in Fortran_wrappers_nvhpc subfolder. The wrapper library will be included in HPC SDK 22.5 and later.
The Fortran samples can be built and run similarly with make run
in each of the directories:
Fortran_samples/c2c
Fortran_samples/c2c_pencils
Fortran_samples/r2c_c2r
Fortran_samples/r2c_c2r_shared_scratch
Fortran_samples/r2c_c2r_pencils
Fortran_samples/reshape
Those samples use NVSHMEM. If the system doesn't have Infiniband, you can use
NVSHMEM_REMOTE_TRANSPORT=none
to avoid Infiniband initialization-related errors. This will then fallback to p2p (single-node) only.
In case a custom MPI, other than the MPI implementations provided in HPC SDK, is used, the bootstrapping plugin may fail with an error such as
src/bootstrap/bootstrap_loader.cpp:46: NULL value Bootstrap unable to load 'nvshmem_bootstrap_mpi.so'
libmpi.so.40: cannot open shared object file: No such file or directory
src/bootstrap/bootstrap.cpp:26: non-zero status: -1 bootstrap_loader_init returned error
src/init/init.cpp:90: non-zero status: 7 bootstrap_init failed
src/init/init.cpp:501: non-zero status: 7 nvshmem_bootstrap failed
indicating it cannot load libmpi.so.40
, most likely because a non-compatible version of MPI is used to link with the nvshmem bootstrapping library.
In this case a custom bootstrap library can be built to enable users to use its own MPI implementation. We include an extra_bootstraps folder in the samples to help creating the custom bootstrap library. Find more information at the "Bootstrapping Mechanism" session of the documentation.
HPC-SDK containers contain all the required dependencies. For instance,
docker pull nvcr.io/nvidia/nvhpc:22.3-devel-cuda11.6-ubuntu20.04