Memory issue (cuSPARSE_STATUS_INSUFFICIENT_RESOURCES) when running Tandem-Static Mini-App on Leonardo HPC System
#79
Comments
Hi, thank you for this very detailed issue.
I am getting the same error when installing from source and enabling 64-bit indices. I've seen you opened an issue on Spack, so I will wait to see their reply :D
Hi, now trying to run with:

and getting the following error:
Ok, thank you. I will take a look at this.
Hi,
Hi, ok. Is this for the intermediate-size mesh like the one used in WP3, or the larger one?

Yes.
Ok. I can see if applying the modifications discussed here (Reduce GPU memory consumption and avoid the CUSPARSE_STATUS_INSUFFICIENT_RESOURCES) might reduce the memory requirements, as mentioned in the cuSPARSE Algorithms Documentation. We can then check that we obtain the same results.
I've added the proposed change to PETSc (from your GitLab issue), but still can't run with 4 nodes of 4 GPUs.
So has it reduced the minimum number of nodes required from 8 to 4?

No.
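For context on the node counts being debated here: the issue description below gives 4 GPUs per node with 64 GB each, so the aggregate GPU memory at each configuration is simple arithmetic (a quick check derived from the numbers in this thread, not something the participants posted):

```python
# Aggregate GPU memory for the node counts discussed in this thread,
# using the 4 GPUs/node and 64 GB/GPU figures from the issue description.
gpus_per_node = 4
gb_per_gpu = 64

for nodes in (4, 8, 48):
    total_gb = nodes * gpus_per_node * gb_per_gpu
    print(f"{nodes:2d} nodes -> {total_gb} GB aggregate GPU memory")
```

So the working 48-node configuration has 12 TB of aggregate GPU memory, versus 1 TB at the 4-node target.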
Hi @Thomas-Ulrich, I am running the static mini-app on a GH machine, but it seems I am running a different problem size from yours. Following your instructions, I am running this version of Tandem

and the following problem setup:

Problem setup

Get audit scenario

```
wget https://syncandshare.lrz.de/dl/fi34J422UiAKKnYKNBkuTR/audit-scenario.zip
unzip audit-scenario.zip
```

Create intermediate size mesh with gmsh (same setup as Eviden-WP3)

```
gmsh fault_many_wide.geo -3 -setnumber h 10.0 -setnumber h_fault 0.25 -o fault_many_wide.msh
```

Change mesh in ridge.toml

```
mesh_file = "fault_many_wide.msh"
#mesh_file = "fault_many_wide_4_025.msh"
type = "elasticity"
matrix_free = true
ref_normal = [0, -1, 0]
lib = "scenario_ridgecrest.lua"
scenario = "shaker"
#[domain_output]
```

However, I see from my output that

whereas in yours it is

Could you please share with me the same problem setup you are using on Leonardo and LUMI? Your problem size is smaller than mine. Otherwise, it might be easier if you just share the
Here is the mesh:
Could you also extract the
Description

- Branch: dmay/petsc_dev_hip (fixes residual convergence difference between CPU and GPU)
- Commit: 1015c31d0f29eab4983497a3ad3f607057285388
- App: static
I'm encountering errors when running the Tandem mini-app static on the Leonardo Booster HPC system. Specifically, the yateto kernels test fails during execution, and a cuSPARSE_STATUS_INSUFFICIENT_RESOURCES error occurs when launching the mini-app on fewer than (~)48 nodes with 4 GPUs per node (4*64*48 GB). I am not sure whether the problem size is simply too large or something else is wrong.

Problem setup
Get audit scenario
Create intermediate size mesh with gmsh (same setup as Eviden-WP3)
Change mesh in ridge.toml
Steps to reproduce errors
Attempt 1: Use system installation of [email protected]
Load Modules

```
module purge
module load petsc/3.20.1--openmpi--4.1.6--gcc--12.2.0-cuda-12.1-mumps # <--- petsc
module load cuda/12.1
module load eigen/3.4.0--gcc--12.2.0-5jcagas
module load spack/0.21.0-68a
module load cmake/3.27.7
```
Spack environment for Lua and Python+Numpy dependencies
Install CSV module
Clone Tandem
Build Tandem
Note: PETSc on Leonardo has been installed without a specific value for --with-memalign. When running the CMake configuration step, I get the following error:
```
-- Could NOT find LibxsmmGenerator (missing: LibxsmmGeneratorExecutable)
CMake Error at app/CMakeLists.txt:72 (message):
  The memory alignment of PETSc is 16 bytes but an alignment of at least 32
  bytes is required for ARCH=hsw. Please compile PETSc with --with-memalign=32.
```
and so I temporarily commented out Tandem's requirement on PETSc memory alignment in app/CMakeLists.txt (just to verify whether I got the same error as with the custom installation of PETSc). I then built and ran the tests on a login node, where the yateto kernels test failed. Running the audit scenario test case, I get the aforementioned cuSPARSE_STATUS_INSUFFICIENT_RESOURCES error. See log slurm-tandem_cusparse_error.log
Note: It seems I have to go up to 48 nodes to have enough memory. See log slurm-tandem_cusparse_success_48nodes.log
Attempt 2: Install [email protected] from source
Note: Even when I install PETSc from source, the outcome is no different from the errors documented in Attempt 1, so I will only report the PETSc installation steps.
Set Petsc Version

```
export PETSC_VERSION=3.21.5
```
Clone Petsc
Petsc Installation
Check Petsc installation on GPU node
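The three steps above lost their command blocks in this copy of the issue. As a hedged reconstruction only (every URL and flag here is an assumption informed by the thread: CUDA support and 64-bit indices from the comments, --with-memalign=32 from the CMake error in Attempt 1, MUMPS from the system module name), a typical from-source build looks like:

```shell
# Hypothetical reconstruction -- the original commands were not preserved.
export PETSC_VERSION=3.21.5
git clone -b v${PETSC_VERSION} https://gitlab.com/petsc/petsc.git
cd petsc
./configure --with-cuda=1 \
            --with-memalign=32 \
            --with-64-bit-indices=1 \
            --download-mumps --download-scalapack
make all
# On a GPU node, run PETSc's built-in smoke tests:
make check
```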