
Memory issue (cuSPARSE_STATUS_INSUFFICIENT_RESOURCES) when running Tandem-Static Mini-App on Leonardo HPC System #79

Open
mredenti opened this issue Oct 13, 2024 · 14 comments

@mredenti

mredenti commented Oct 13, 2024

Description

System Leonardo Booster
Branch dmay/petsc_dev_hip (fixes residual convergence difference between CPU and GPU)
Commit ID 1015c31d0f29eab4983497a3ad3f607057285388
Backends CUDA via PETSc
Target static

I'm encountering errors when running the Tandem mini-app static on the Leonardo Booster HPC system. Specifically,

  • The yateto kernels test fails during execution.
  • A cuSPARSE_STATUS_INSUFFICIENT_RESOURCES error occurs when launching the mini-app on fewer than roughly 48 nodes with 4 GPUs per node (48 × 4 × 64 GB of aggregate GPU memory). I am not sure whether the problem size is simply too large or whether something else is going on (see the rough arithmetic below).
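
For context, here is a back-of-the-envelope of the aggregate GPU memory available at a few job sizes, assuming 4 × 64 GB GPUs per Leonardo Booster node as stated above; this is just arithmetic to put the ~48-node threshold in perspective, not a measurement:

# Rough aggregate GPU memory per job size (assumption: 4 GPUs x 64 GB each per node)
mem_per_gpu_gb=64
gpus_per_node=4
for nodes in 4 8 48; do
  echo "${nodes} nodes -> $((nodes * gpus_per_node)) GPUs -> $((nodes * gpus_per_node * mem_per_gpu_gb)) GB aggregate HBM"
done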

Problem setup

Get audit scenario

wget https://syncandshare.lrz.de/dl/fi34J422UiAKKnYKNBkuTR/audit-scenario.zip
unzip audit-scenario.zip

Create intermediate size mesh with gmsh (same setup as Eviden-WP3)

gmsh fault_many_wide.geo -3 -setnumber h 10.0 -setnumber h_fault 0.25 -o fault_many_wide.msh

Change mesh in ridge.toml

mesh_file = "fault_many_wide.msh"
#mesh_file = "fault_many_wide_4_025.msh"

type = "elasticity"

matrix_free = true

ref_normal = [0, -1, 0]
lib = "scenario_ridgecrest.lua"
scenario = "shaker"
#[domain_output]

Steps to reproduce errors

Attempt 1: Use system installation of [email protected]


Load Modules

module purge
module load petsc/3.20.1--openmpi--4.1.6--gcc--12.2.0-cuda-12.1-mumps # <---petsc
module load cuda/12.1 
module load eigen/3.4.0--gcc--12.2.0-5jcagas
module load spack/0.21.0-68a
module load cmake/3.27.7

Spack environment for Lua and Python+Numpy dependencies

spack create -d ./spack-env-tandem
spack env activate ./spack-env-tandem -p
spack add py-numpy [email protected] 
spack concretize -f 
spack install

Install CSV module

luarocks install csv

Clone Tandem

git clone -b dmay/petsc_dev_hip https://github.com/TEAR-ERC/tandem.git tandem-petsc_dev_hip
cd tandem-petsc_dev_hip && git submodule update --init
cd ..

Build Tandem

Note: PETSc on Leonardo has been installed without an explicit value for --with-memalign.

When running the CMake configuration step

cmake -B ./build -S ./tandem-petsc_dev_hip -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx -DPOLYNOMIAL_DEGREE=4 -DDOMAIN_DIMENSION=3

I get the following error

-- Could NOT find LibxsmmGenerator (missing: LibxsmmGeneratorExecutable) 
CMake Error at app/CMakeLists.txt:72 (message):
  The memory alignment of PETSc is 16 bytes but an alignment of at least 32
  bytes is required for ARCH=hsw.  Please compile PETSc with
  --with-memalign=32.

and so I temporarily commented out Tandem's PETSc memory-alignment requirement in app/CMakeLists.txt (just to check whether I would hit the same error as with the custom PETSc installation):

#[=[
if(PETSC_MEMALIGN LESS ALIGNMENT)
    message(SEND_ERROR "The memory alignment of PETSc is ${PETSC_MEMALIGN} bytes but an alignment of "
                       "at least ${ALIGNMENT} bytes is required for ARCH=${ARCH}. "
                       "Please compile PETSc with --with-memalign=${ALIGNMENT}.")
endif()
#]=]

and then I build and run the tests on a login node

cmake --build ./build --parallel 4
ctest --test-dir ./build

where the yateto kernels test failed:

ctest --test-dir ./build --rerun-failed 
Start testing: Oct 13 11:39 CEST
----------------------------------------------------------
3/21 Testing: yateto kernels
3/21 Test: yateto kernels
Command: "/leonardo_work/cin_staff/mredenti/ChEESE/TANDEM/build/app/test-elasticity-kernel" "--test-case=yateto kernels"
Directory: /leonardo_work/cin_staff/mredenti/ChEESE/TANDEM/build/app
"yateto kernels" start time: Oct 13 11:39 CEST
Output:
----------------------------------------------------------
[doctest] doctest version is "2.3.7"
[doctest] run with "--help" for options
===============================================================================
/leonardo_work/cin_staff/mredenti/ChEESE/TANDEM/build/app/kernels/elasticity/test-kernel.cpp:10:
TEST CASE:  yateto kernels
  apply_inverse_mass

/leonardo_work/cin_staff/mredenti/ChEESE/TANDEM/build/app/kernels/elasticity/test-kernel.cpp:4938: ERROR: CHECK( sqrt(error/refNorm) < 2.22e-14 ) is NOT correct!
  values: CHECK( 0.0 <  0.0 )

===============================================================================
[doctest] test cases:      1 |      0 passed |      1 failed |      0 skipped
[doctest] assertions:     65 |     64 passed |      1 failed |
[doctest] Status: FAILURE!
<end of output>
Test time =   0.05 sec
----------------------------------------------------------
Test Failed.
"yateto kernels" end time: Oct 13 11:39 CEST
"yateto kernels" time elapsed: 00:00:00
----------------------------------------------------------

End testing: Oct 13 11:39 CEST

Running the audit scenario test case

#!/bin/bash
#SBATCH -A <account>
#SBATCH -p boost_usr_prod
#SBATCH --time 00:10:00     # format: HH:MM:SS
#SBATCH -N 4              # 4 nodes
#SBATCH --ntasks-per-node=4 # 4 tasks out of 32
#SBATCH --cpus-per-task=8
#SBATCH --exclusive
#SBATCH --gres=gpu:4        # 4 gpus per node out of 4
#SBATCH --job-name=my_batch_job

module purge
module load petsc/3.20.1--openmpi--4.1.6--gcc--12.2.0-cuda-12.1-mumps # <---petsc
module load cuda/12.1
module load eigen/3.4.0--gcc--12.2.0-5jcagas
module load spack/0.21.0-68a
module load cmake/3.27.7

# activate spack env
spack env activate $WORK/mredenti/ChEESE/TANDEM/spack-env-tandem

srun bash \
-c 'export CUDA_VISIBLE_DEVICES=$((SLURM_LOCALID % 4)); \
exec ./static \
ridge.toml \
--output ridgecrest \
--mg_strategy twolevel \
--mg_coarse_level 1 \
--petsc \
-ksp_view \
-ksp_monitor \
-ksp_converged_reason \
-ksp_max_it 40 \
-pc_type mg \
-mg_levels_ksp_max_it 4 \
-mg_levels_ksp_type cg \
-mg_levels_pc_type bjacobi \
-options_left \
-ksp_rtol 1.0e-6 \
-mg_coarse_pc_type gamg \
-mg_coarse_ksp_type cg \
-mg_coarse_ksp_rtol 1.0e-1 \
-mg_coarse_ksp_converged_reason \
-ksp_type gcr \
-vec_type cuda \
-mat_type aijcusparse \
-ksp_view -log_view'

I get the aforementioned cuSPARSE_STATUS_INSUFFICIENT_RESOURCES error. See log
slurm-tandem_cusparse_error.log

Note: It seems I have to go up to 48 nodes to have enough memory. See log
slurm-tandem_cusparse_success_48nodes.log
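
As an aside, to rule out a rank-to-GPU mapping problem (the job script above hands one of the node's 4 GPUs to each local rank via CUDA_VISIBLE_DEVICES), a minimal sanity check of that arithmetic can be run inside the same allocation; this is only a sketch using standard Slurm environment variables:

# Print which GPU index each rank would select; purely diagnostic, no CUDA involved.
srun bash -c 'echo "host=$(hostname) rank=${SLURM_PROCID} local=${SLURM_LOCALID} -> CUDA_VISIBLE_DEVICES=$((SLURM_LOCALID % 4))"'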

Attempt 2: Install [email protected] from source

Note: Even when I install PETSc from source, the outcome is no different from the errors documented in Attempt 1, so I will only report the PETSc installation steps here.


Set Petsc Version

export PETSC_VERSION=3.21.5

Clone Petsc

git clone -b v$PETSC_VERSION https://gitlab.com/petsc/petsc.git petsc-$PETSC_VERSION

Petsc Installation

#!/bin/bash
#SBATCH -A <account> 
#SBATCH -p lrd_all_serial
#SBATCH --time 00:30:00       
#SBATCH -N 1               
#SBATCH --ntasks-per-node=1 
#SBATCH --cpus-per-task=4
#SBATCH --exclusive 
#SBATCH --gres=gpu:0        
#SBATCH --job-name=petsc_installation_3_21_5

module load gcc/12.2.0 
module load openmpi/4.1.6--gcc--12.2.0
module load cuda/12.1   
module load superlu-dist/8.1.2--openmpi--4.1.6--gcc--12.2.0-cuda-12.1-zsspaca
module load metis/5.1.0--gcc--12.2.0
module load mumps/5.5.1--openmpi--4.1.6--gcc--12.2.0-4hwekmx
module load parmetis/4.0.3--openmpi--4.1.6--gcc--12.2.0
module load cmake/3.27.7
module load openblas/0.3.24--gcc--12.2.0
module load hypre/2.29.0--openmpi--4.1.6--gcc--12.2.0-cuda-12.1-iln2jw4 
module load netlib-scalapack/2.2.0--openmpi--4.1.6--gcc--12.2.0
module load eigen/3.4.0--gcc--12.2.0-5jcagas
module load fftw/3.3.10--openmpi--4.1.6--gcc--12.2.0
module load cmake/3.27.7
module load spack/0.21.0-68a

spack env activate $WORK/mredenti/ChEESE/TANDEM/spack-env-tandem

export PETSC_VERSION=3.21.5

cd petsc-${PETSC_VERSION}

./config/configure.py \
    --prefix=$WORK/mredenti/ChEESE/TANDEM/petsc-${PETSC_VERSION}-opt \
    --with-ssl=0 \
    --download-c2html=0 \
    --download-sowing=0 \
    --download-hwloc=0 \
    --with-cc=${MPICC} \
    --with-cxx=${MPICXX} \
    --with-fc=${MPIF90} \
    --with-precision=double \
    --with-scalar-type=real \
    --with-shared-libraries=1 \
    --with-debugging=0 \
    --with-openmp=0 \
    --with-64-bit-indices=0 \
    --with-blaslapack-lib=${OPENBLAS_LIB}/libopenblas.so \
    --with-x=0 \
    --with-clanguage=C \
    --with-cuda=1 \
    --with-cuda-dir=${CUDA_HOME} \
    --with-hip=0 \
    --with-metis=1 \
    --with-metis-include=${METIS_INC} \
    --with-metis-lib=${METIS_LIB}/libmetis.so \
    --with-hypre=1 \
    --with-hypre-include=${HYPRE_INC} \
    --with-hypre-lib=${HYPRE_LIB}/libHYPRE.so \
    --with-parmetis=1 \
    --with-parmetis-include=${PARMETIS_INC} \
    --with-parmetis-lib=${PARMETIS_LIB}/libparmetis.so \
    --with-kokkos=0 \
    --with-kokkos-kernels=0 \
    --with-superlu_dist=1 \
    --with-superlu_dist-include=${SUPERLU_DIST_INC} \
    --with-superlu_dist-lib=${SUPERLU_DIST_LIB}/libsuperlu_dist.so \
    --with-ptscotch=0 \
    --with-suitesparse=0 \
    --with-zlib=1 \
    --with-zlib-include=${ZLIB_INC} \
    --with-zlib-lib=${ZLIB_LIB}/libz.so \
    --with-mumps=1 \
    --with-mumps-include=${MUMPS_INC} \
    --with-mumps-lib="${MUMPS_LIB}/libcmumps.so ${MUMPS_LIB}/libsmumps.so ${MUMPS_LIB}/libdmumps.so ${MUMPS_LIB}/libzmumps.so ${MUMPS_LIB}/libmumps_common.so ${MUMPS_LIB}/libpord.so" \
    --with-trilinos=0 \
    --with-fftw=1 \
    --with-fftw-include=${FFTW_INC} \
    --with-fftw-lib="${FFTW_LIB}/libfftw3_mpi.so ${FFTW_LIB}/libfftw3.so" \
    --with-valgrind=0 \
    --with-gmp=0 \
    --with-libpng=0 \
    --with-giflib=0 \
    --with-mpfr=0 \
    --with-netcdf=0 \
    --with-pnetcdf=0 \
    --with-moab=0 \
    --with-random123=0 \
    --with-exodusii=0 \
    --with-cgns=0 \
    --with-memkind=0 \
    --with-memalign=64 \
    --with-p4est=0 \
    --with-saws=0 \
    --with-yaml=0 \
    --with-hwloc=0 \
    --with-libjpeg=0 \
    --with-scalapack=1 \
    --with-scalapack-lib=${NETLIB_SCALAPACK_LIB}/libscalapack.so \
    --with-strumpack=0 \
    --with-mmg=0 \
    --with-parmmg=0 \
    --with-tetgen=0 \
    --with-cuda-arch=80 \
    --FOPTFLAGS=-O3 \
    --CXXOPTFLAGS=-O3 \
    --COPTFLAGS=-O3


make PETSC_DIR=$WORK/mredenti/ChEESE/TANDEM/petsc-${PETSC_VERSION} PETSC_ARCH="arch-linux-c-opt" all
make PETSC_DIR=$WORK/mredenti/ChEESE/TANDEM/petsc-${PETSC_VERSION} PETSC_ARCH=arch-linux-c-opt install

Check Petsc installation on GPU node

#!/bin/bash
#SBATCH -A cin_staff 
#SBATCH -p boost_usr_prod
#SBATCH -q boost_qos_dbg
#SBATCH --time 00:10:00     
#SBATCH -N 1               
#SBATCH --ntasks-per-node=1 
#SBATCH --cpus-per-task=4
##SBATCH --exclusive 
#SBATCH --gres=gpu:1        
#SBATCH --job-name=petsc_test_installation

module load gcc/12.2.0 
module load openmpi/4.1.6--gcc--12.2.0
module load cuda/12.1 
module load superlu-dist/8.1.2--openmpi--4.1.6--gcc--12.2.0-cuda-12.1-zsspaca
module load metis/5.1.0--gcc--12.2.0
module load mumps/5.5.1--openmpi--4.1.6--gcc--12.2.0-4hwekmx
module load parmetis/4.0.3--openmpi--4.1.6--gcc--12.2.0
module load openblas/0.3.24--gcc--12.2.0
module load hypre/2.29.0--openmpi--4.1.6--gcc--12.2.0-cuda-12.1-iln2jw4 
module load netlib-scalapack/2.2.0--openmpi--4.1.6--gcc--12.2.0
module load eigen/3.4.0--gcc--12.2.0-5jcagas
module load fftw/3.3.10--openmpi--4.1.6--gcc--12.2.0
module load spack/0.21.0-68a
module load zlib/1.2.13--gcc--12.2.0-b3ocy4r
module load cmake/3.27.7

# activate spack env
spack env activate $WORK/mredenti/ChEESE/TANDEM/spack-env-tandem

export PETSC_VERSION=3.21.5

cd petsc-${PETSC_VERSION}

make PETSC_DIR=$WORK/mredenti/ChEESE/TANDEM/petsc-${PETSC_VERSION}-opt PETSC_ARCH="" check

make -j 4 -f $WORK/mredenti/ChEESE/TANDEM/petsc-${PETSC_VERSION}-opt/share/petsc/examples/gmakefile.test test
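
For completeness, pointing the Tandem configure step at this custom PETSc would then look roughly like the following. This is only a sketch: PETSC_DIR together with the installed pkg-config file is the usual way PETSc is discovered, but the exact mechanism Tandem's CMake uses may differ.

# Assumed install prefix from the configure step above; adjust paths as needed.
export PETSC_VERSION=3.21.5
export PETSC_DIR=$WORK/mredenti/ChEESE/TANDEM/petsc-${PETSC_VERSION}-opt
export PETSC_ARCH=""   # empty for a prefix installation
export PKG_CONFIG_PATH=$PETSC_DIR/lib/pkgconfig:$PKG_CONFIG_PATH
cmake -B ./build -S ./tandem-petsc_dev_hip \
    -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx \
    -DPOLYNOMIAL_DEGREE=4 -DDOMAIN_DIMENSION=3
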
@mredenti mredenti added the bug Something isn't working label Oct 13, 2024
@Thomas-Ulrich
Collaborator

Hi,

Thank you for this very detailed issue.
The problem is that tandem requires PETSc configured with:
--with-memalign=32 --with-64-bit-indices

See e.g.:
https://tandem.readthedocs.io/en/latest/getting-started/installation.html#install-petsc
(But to be honest, I tried to install with my usual Spack workflow and got
"Cannot use scalapack with 64-bit BLAS/LAPACK indices",
which is strange because I don't get this problem on our local cluster.
I also tried starting from the latest Spack.)

@mredenti
Author

mredenti commented Oct 23, 2024

Hi,

I am getting the same error when installing from source and enabling 64-bit indices. I've seen you opened an issue on Spack, so I will wait for their reply :D

@Thomas-Ulrich
Collaborator

Hi,
so I have installed tandem on Leonardo (and you should be able to use my module).
See:

https://github.com/TEAR-ERC/tandem/pull/70/files#diff-83c59cd7879a96bf1ef1af99dc062278a5eae6770030bb5cd7722e467000ad37

Now trying to run with:

#!/bin/bash
#SBATCH -p boost_usr_prod
#SBATCH --time 00:10:00     # format: HH:MM:SS
#SBATCH -N 4              # 4 nodes
#SBATCH --ntasks-per-node=4 # 4 tasks out of 32
#SBATCH --cpus-per-task=8
#SBATCH --exclusive
#SBATCH --gres=gpu:4        # 4 gpus per node out of 4
#SBATCH --job-name=tandem

module load tandem/develop-gcc-12.2.0-d3-p4-cuda-3nd3swf

srun bash \
-c 'export CUDA_VISIBLE_DEVICES=$((SLURM_LOCALID % 4)); \
exec static \
ridge.toml \
--output ridgecrest \
--mg_strategy twolevel \
--mg_coarse_level 1 \
--petsc \
-ksp_view \
-ksp_monitor \
-ksp_converged_reason \
-ksp_max_it 40 \
-pc_type mg \
-mg_levels_ksp_max_it 4 \
-mg_levels_ksp_type cg \
-mg_levels_pc_type bjacobi \
-options_left \
-ksp_rtol 1.0e-6 \
-mg_coarse_pc_type gamg \
-mg_coarse_ksp_type cg \
-mg_coarse_ksp_rtol 1.0e-1 \
-mg_coarse_ksp_converged_reason \
-ksp_type gcr \
-vec_type cuda \
-mat_type aijcusparse \
-ksp_view -log_view'                  

and getting the following error:

[1729762288.421559] [lrdn0300:3869301:0]     ucp_context.c:1849 UCX  WARN  UCP API version is incompatible: required >= 1.14, actual 1.13.0 (loaded from /lib64/libucp.so.0)

               ___          ___         _____         ___          ___
      ___     /  /\        /__/\       /  /::\       /  /\        /__/\
     /  /\   /  /::\       \  \:\     /  /:/\:\     /  /:/_      |  |::\
    /  /:/  /  /:/\:\       \  \:\   /  /:/  \:\   /  /:/ /\     |  |:|:\
   /  /:/  /  /:/~/::\  _____\__\:\ /__/:/ \__\:| /  /:/ /:/_  __|__|:|\:\
  /  /::\ /__/:/ /:/\:\/__/::::::::\\  \:\ /  /://__/:/ /:/ /\/__/::::| \:\
 /__/:/\:\\  \:\/:/__\/\  \:\~~\~~\/ \  \:\  /:/ \  \:\/:/ /:/\  \:\~~\__\/
 \__\/  \:\\  \::/      \  \:\  ~~~   \  \:\/:/   \  \::/ /:/  \  \:\
      \  \:\\  \:\       \  \:\        \  \::/     \  \:\/:/    \  \:\
       \__\/ \  \:\       \  \:\        \__\/       \  \::/      \  \:\
              \__\/        \__\/                     \__\/        \__\/

                          tandem version 5aa7407
                            Domain dimension 3
                            polynomial degree 4
                    Minimum order of quadrature rule 9

                       stack size limit = unlimited

                              Worker affinity
                    0123456789|0123456789|0123456789|01
                          Worker affinity on node
                    0123456789|0123456789|0123456789|01


DOFs: 8976660
Mesh size: 37.1951
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[1]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[1]PETSC ERROR: GPU error
[3]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[3]PETSC ERROR: GPU error
[3]PETSC ERROR: Cannot lazily initialize PetscDevice: cuda error 34 (cudaErrorStubLibrary) : CUDA driver is a stub library
[0]PETSC ERROR: GPU error
[0]PETSC ERROR: Cannot lazily initialize PetscDevice: cuda error 34 (cudaErrorStubLibrary) : CUDA driver is a stub library

@mredenti
Author

mredenti commented Oct 24, 2024

Ok, thank you. I will take a look at this
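
For the stub-library error above, the first thing I would check is whether the CUDA toolkit's stub libcuda.so (normally shipped under $CUDA_HOME/lib64/stubs and meant for link time only) ends up on the runtime search path instead of the real driver. A minimal diagnostic sketch, assuming that standard toolkit layout:

# Is the stubs directory on the runtime library path?
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -i stubs
# Which libcuda.so.1 does the loader know about, and what does the binary resolve?
ldconfig -p | grep libcuda.so.1
ldd "$(which static)" | grep -i cuda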

@Thomas-Ulrich
Collaborator

Hi,
I fixed the problem.
Now I can run the problem... but only on 8 nodes, again!
(otherwise I get the cuSPARSE_STATUS_INSUFFICIENT_RESOURCES error oO)

@mredenti
Author

Hi,

Ok, is this for the intermediate-size mesh like the one used in WP3, or the larger one?
Just out of curiosity:

  • Can the HIP version on LUMI-G run on a single node? If not, what is the minimum number of nodes required (of course taking into account that there are 8 GCDs per node)?
  • Can the CPU version run on a single node? If not, what is the minimum number of nodes required?

@Thomas-Ulrich
Collaborator

Thomas-Ulrich commented Oct 24, 2024

Yes.

LUMI-C, one node, 128 ranks

DOFs: 8976660
Mesh size: 37.1951
Multigrid P-levels: 1 4 
Assembly: 6.84564 s
Solver warmup: 8.1214 s
Solve: 38.534 s
Residual norm: 0.0160344
Iterations: 20

LUMI-G one node with 4 GPUs/node 

DOFs: 8976660
Mesh size: 37.1951
Multigrid P-levels: 1 4
Assembly: 93.7656 s
Solver warmup: 20.765 s
Solve: 95.2059 s
Residual norm: 0.019239
Iterations: 16


LUMI-G one node with 8 GPUs/node 

DOFs: 8976660
Mesh size: 37.1951
Multigrid P-levels: 1 4
Assembly: 48.5922 s
Solver warmup: 12.4144 s
Solve: 52.6558 s
Residual norm: 0.012698
Iterations: 17


Leonardo, 8 nodes with 4 GPUs per node (fewer than 8 nodes crashes).

DOFs: 8976660
Mesh size: 37.1951
Multigrid P-levels: 1 4
Assembly: 17.2949 s
Solver warmup: 5.88255 s
Solve: 15.075 s
Residual norm: 0.0195129
Iterations: 18
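
For quick reference, the relative solve-time speedups implied by these figures (simple ratios of the "Solve" times reported above, nothing new measured):

# Solve-time ratios computed from the numbers above.
awk 'BEGIN {
  printf "LUMI-G 4 GPUs  -> LUMI-G 8 GPUs:     %.2fx\n", 95.2059 / 52.6558
  printf "LUMI-G 8 GPUs  -> Leonardo 32 GPUs:  %.2fx\n", 52.6558 / 15.075
  printf "LUMI-C 128 rk  -> Leonardo 32 GPUs:  %.2fx\n", 38.534 / 15.075
}'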

@mredenti
Author

mredenti commented Oct 24, 2024

Ok, I can check whether applying the modifications discussed in "Reduce GPU memory consumption and avoid the CUSPARSE_STATUS_INSUFFICIENT_RESOURCES" reduces the memory requirements, as suggested in the cuSPARSE Algorithms documentation. We can then check that we obtain the same results.

@Thomas-Ulrich
Collaborator

Thomas-Ulrich commented Oct 25, 2024

I've added the proposed change to PETSc (from your GitLab issue), but I still can't run with 4 nodes of 4 GPUs.

@mredenti
Author

So has it reduced the minimum number of nodes required from 8 to 4?

@Thomas-Ulrich
Collaborator

no

@mredenti
Author

mredenti commented Nov 11, 2024


Hi @Thomas-Ulrich,

I am running the static mini-app on a GH machine, but it seems I am running a different problem size from yours. Following your instructions, I am running this version of Tandem

System Leonardo Booster
Branch dmay/petsc_dev_hip (fixes residual convergence difference between CPU and GPU)
Commit ID 1015c31d0f29eab4983497a3ad3f607057285388
Backends CUDA via PETSc
Target static

and the same problem setup as described at the top of this issue (audit scenario, fault_many_wide.geo meshed with gmsh at h 10.0 / h_fault 0.25, elasticity, matrix_free = true).

However, I see from my output that

DOFs: 60906720
Mesh size: 19.1229

whereas yours shows

DOFs: 8976660
Mesh size: 37.1951

Could you please share the exact problem setup you are using on Leonardo and LUMI? Your problem size is larger in DOFs than yours reports, so we are clearly not meshing the same model. Otherwise, it might be easiest if you just share the fault_many_wide.msh file.

@Thomas-Ulrich
Collaborator

Here is the mesh:
https://syncandshare.lrz.de/getlink/fi6U78D9rk9HYP4MZpjUQx/fault_many_wide.msh
static is compiled with polynomial degree 4 (p4).

@mredenti
Author


Could you also extract the GPU MFlop/s for each of these runs on LUMI-G and Leonardo? It should be as simple as passing one or both of the flags -log_view and -log_view_gpu_time to the PETSc options. This way we can then compute the percentage of peak performance achieved by the solver; see the sketch below.
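
For the percentage-of-peak estimate, a back-of-the-envelope along these lines should be enough; the achieved MFlop/s and the per-GPU peak below are placeholders/assumptions (≈9.7 TFLOP/s FP64 per A100-class GPU, non tensor core), to be replaced with the numbers -log_view actually reports:

# Hypothetical achieved figure; replace with the GPU MFlop/s reported by -log_view.
achieved_mflops=500000
peak_gflops_per_gpu=9700      # assumption: FP64 peak of one A100-64GB, non tensor core
ngpus=32                      # e.g. 8 nodes x 4 GPUs on Leonardo
awk -v a="$achieved_mflops" -v p="$peak_gflops_per_gpu" -v n="$ngpus" \
    'BEGIN { printf "%.2f%% of aggregate FP64 peak\n", 100 * (a / 1000) / (p * n) }'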
