[Issue] Recovering from out of memory error #289

Open
Samev opened this issue Jan 11, 2024 · 3 comments

Samev commented Jan 11, 2024

Describe the issue

When running AMGX on a case that is too large for the GPU, it reports the following error:

Thrust failure: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
File and line number are not available for this exception.

when calling AMGX_solver_setup. Following this, we try to reset the AMGX solver, but the call to AMGX_solver_destroy crashes the application (despite being made inside a try-catch block) with the following:

terminate called after throwing an instance of 'amgx::amgx_exception'
  what():  Cuda failure: 'an illegal memory access was encountered'

 /<censored>/lib/libamgxsh.so : amgx::handle_signals(int)+0xa2
 /lib/x86_64-linux-gnu/libc.so.6 : ()+0x42520
 /lib/x86_64-linux-gnu/libc.so.6 : pthread_kill()+0x12c
 /lib/x86_64-linux-gnu/libc.so.6 : raise()+0x16
 /lib/x86_64-linux-gnu/libc.so.6 : abort()+0xd3
 /lib/x86_64-linux-gnu/libstdc++.so.6 : ()+0xa2b9e
 /lib/x86_64-linux-gnu/libstdc++.so.6 : ()+0xae20c
 /lib/x86_64-linux-gnu/libstdc++.so.6 : ()+0xad1e9
 /lib/x86_64-linux-gnu/libstdc++.so.6 : __gxx_personality_v0()+0x99
 /lib/x86_64-linux-gnu/libgcc_s.so.1 : ()+0x16884
 /lib/x86_64-linux-gnu/libgcc_s.so.1 : _Unwind_RaiseException()+0x311
 /lib/x86_64-linux-gnu/libstdc++.so.6 : __cxa_throw()+0x3b
 /<censored>/lib/libamgxsh.so : amgx::dense_lu_solver::DenseLUSolver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::~DenseLUSolver()+0x998
 /<censored>/lib/libamgxsh.so : amgx::dense_lu_solver::DenseLUSolver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::~DenseLUSolver()+0xd
 /<censored>/lib/libamgxsh.so : amgx::AMG<(AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2>::~AMG()+0x42
 /<censored>/lib/libamgxsh.so : amgx::AlgebraicMultigrid_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::~AlgebraicMultigrid_Solver()+0x26
 /<censored>/lib/libamgxsh.so : amgx::AlgebraicMultigrid_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::~AlgebraicMultigrid_Solver()+0xd
 /<censored>/lib/libamgxsh.so : amgx::PBiCGStab_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::~PBiCGStab_Solver()+0x35
 /<censored>/lib/libamgxsh.so : amgx::PBiCGStab_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::~PBiCGStab_Solver()+0xd
 /<censored>/lib/libamgxsh.so : amgx::AMG_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::~AMG_Solver()+0x180
 /<censored>/lib/libamgxsh.so : std::_Sp_counted_ptr<amgx::AMG_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x16
 /<censored>/lib/libamgxsh.so : std::_Sp_counted_ptr<amgx::CWrapHandle<AMGX_solver_handle_struct*, amgx::AMG_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > >*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x56
 /<censored>/lib/libamgxsh.so : ()+0x1394590
 /<censored>/lib/libamgxsh.so : AMGX_solver_destroy()+0xe24

Is it intended to be possible to recover from out-of-memory errors like this? I looked through the documentation and couldn't find anything indicating that a failure in AMGX_solver_setup needs special handling.

Obviously AMGX won't be able to handle this particular matrix and solver combination on this particular GPU. The problem is that the crash prevents us from destructing our AMGX solver object when we hit this limit, so the whole application goes down.

I tried skipping the call to AMGX_solver_destroy (proceeding with the rest of the *destroy and finalize commands), but then I run into the "!!! detected some memory leaks in the code: trying to free non-empty temporary device pool !!!" error, which makes sense since the solver object isn't destroyed in the intended order.
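
A note on why the try-catch around AMGX_solver_destroy may not help. In the trace above, __cxa_throw is called from inside ~DenseLUSolver() and the unwinder goes straight to abort(). Since C++11, destructors are implicitly noexcept, and an exception that escapes a noexcept function calls std::terminate, so no catch block in the caller is ever reached. This is only a reading of the trace and has not been checked against the AMGX source. The sketch below uses two made-up types (ImplicitDtor and ThrowingDtor, not AMGX classes) to show the language rule:

```cpp
#include <cassert>
#include <stdexcept>
#include <type_traits>

// Hypothetical stand-ins for illustration only; these are not AMGX types.
struct ImplicitDtor {
    // No exception specification: implicitly noexcept(true) since C++11.
    // A throw escaping this destructor would call std::terminate.
    ~ImplicitDtor() {}
};

struct ThrowingDtor {
    // Explicit opt-out: an exception may leave this destructor and be caught,
    // as long as it is not thrown while another exception is unwinding.
    ~ThrowingDtor() noexcept(false) { throw std::runtime_error("cleanup failed"); }
};

static_assert(std::is_nothrow_destructible<ImplicitDtor>::value,
              "destructors are noexcept by default");
static_assert(!std::is_nothrow_destructible<ThrowingDtor>::value,
              "noexcept(false) opts out of the default");

// Returns true if the exception thrown by the destructor reached the catch block.
bool destructor_exception_is_catchable() {
    try {
        ThrowingDtor t;  // destroyed at the end of this scope; the destructor throws
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}

// Small usage check for the catchable case.
void run_demo() {
    assert(destructor_exception_is_catchable());
}
```

If that reading is right, a try-catch on the application side cannot prevent the abort, and the change would have to happen inside the library's destructors.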

Environment information:

  • OS: Ubuntu 22.04 (through WSL on Windows 11)
  • CUDA runtime: CUDA 11.7.1
  • MPI version (if applicable): Not applicable
  • AMGX version or commit hash: v2.3.0 + cherry-picked 8bb693b42acc64c1893835d95858cad350c790c1
  • NVIDIA driver: 528.24 (probably the Windows driver version as nvidia-smi reports the same version in Windows + WSL)
  • NVIDIA GPU: RTX4080
  • Any related environment variables information: Not applicable

The same problem has also been reported with the same build on at least one RTX 3090 card.

AMGX solver configuration

config_version=2,
determinism_flag=0,
solver(mainSolver)=PBICGSTAB,
mainSolver:preconditioner(precon)=AMG,
precon:cycle=V,
precon:max_levels=15,
precon:selector=PMIS,
precon:smoother(smooth)=BLOCK_JACOBI,
precon:presweeps=1,
precon:postsweeps=1,
precon:max_iters=1,
precon:interpolator=D2,
precon:interp_max_elements=6,
mainSolver:monitor_residual=1,
mainSolver:store_res_history=1,
mainSolver:norm=L2,
mainSolver:print_vis_data=1,
mainSolver:max_iters=10000,
mainSolver:tolerance=1e-09,
mainSolver:gmres_n_restart=30,
mainSolver:convergence=RELATIVE_INI_CORE

Matrix Data

I'm not able to share the matrix I'm currently using. If needed, I can try to reproduce the crash with a matrix that isn't sensitive.

Reproduction steps

Call order:

  • Setup:
    • AMGX_solver_register_print_callback
    • AMGX_initialize
    • AMGX_initialize_plugins
    • AMGX_install_signal_handler
    • AMGX_config_create (global config)
    • AMGX_resources_create_simple
    • AMGX_config_create (for the specific solver)
    • AMGX_matrix_create
    • AMGX_vector_create (both rhs and solution)
    • AMGX_solver_create
    • AMGX_matrix_upload_all
    • AMGX_solver_setup
      • Crash due to insufficient memory, exception is caught
  • Try to tear down AMGX
    • AMGX_solver_destroy
      • Results in process crashing, can't catch the exception

Additional context

-

Samev added the bug label Jan 11, 2024

@hamsteri15

I'm getting the same error, and it is quite confusing. While "illegal memory access" is probably why the solver crashes, the error message should say that the illegal access is caused by running out of memory. The error can be reproduced using the amgx_mpi_poisson7 example:

mpirun -np 1 ./amgx_mpi_poisson7 -mode dDDI -p 600 600 600 1 1 1 -c ./../configs/PCG_AGGREGATION_JACOBI.json

log500.txt
log600.txt

For a 500³ grid on an A100 80GB the solver passes, but for a 600³ grid it crashes. We use AMGX as part of a flow solver that has other GPU memory requirements, so in practice the error already occurs at around 20M cells.
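
For a sense of scale, here is a rough estimate of the fine-level matrix alone for these two grid sizes. It assumes a non-periodic 7-point stencil stored in CSR with 8-byte values and 4-byte indices (which is what the dDDI mode in the command above implies), and it ignores the multigrid hierarchy, vectors and workspace entirely. The function names are made up for this sketch:

```cpp
#include <cassert>
#include <cstdint>

// Nonzeros of a 7-point stencil on an n x n x n grid without periodic wrap:
// 7 per cell, minus one entry for every cell face on the domain boundary
// (6 sides with n * n faces each).
std::int64_t poisson7_nonzeros(std::int64_t n) {
    assert(n > 0);
    return 7 * n * n * n - 6 * n * n;
}

// Bytes for plain CSR storage: 8-byte value plus 4-byte column index per
// nonzero, and a 4-byte row offset for each row plus one.
std::int64_t csr_bytes(std::int64_t rows, std::int64_t nnz) {
    assert(rows > 0 && nnz >= 0);
    return nnz * (8 + 4) + (rows + 1) * 4;
}
```

Calling csr_bytes(n * n * n, poisson7_nonzeros(n)) gives roughly 11 GB for n = 500 and roughly 19 GB for n = 600 under these assumptions, so everything else (coarse levels, setup temporaries, Krylov vectors) has to fit in what remains.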

marsaev (Collaborator) commented Nov 5, 2024

@hamsteri15 Classical multigrid is quite memory hungry. I suggest adding aggressive_levels and/or max_row_sum to the AMG configuration to reduce memory usage (see the examples at https://github.com/NVIDIA/AMGX/blob/main/src/configs/AMG_CLASSICAL_AGGRESSIVE_L1_TRUNC.json or https://github.com/NVIDIA/AMGX/blob/main/src/configs/FGMRES_CLASSICAL_AGGRESSIVE_PMIS.json).
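
As a sketch of what that could look like with the key=value config string from the original report, the helper below appends the two options under a given solver scope (for example the "precon" scope used there). The helper name is made up and is not part of AMGX, and the values in the usage note are placeholders to be checked against the linked example configs, not tuned recommendations:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper (not an AMGX function): appends aggressive_levels and
// max_row_sum to a comma-separated key=value config string, scoped to the
// named solver, e.g. "precon:aggressive_levels=1".
std::string with_memory_options(const std::string& config,
                                const std::string& scope,
                                int aggressive_levels,
                                double max_row_sum) {
    assert(!scope.empty());
    std::ostringstream out;
    out << config
        << ", " << scope << ":aggressive_levels=" << aggressive_levels
        << ", " << scope << ":max_row_sum=" << max_row_sum;
    return out.str();
}
```

For instance, with_memory_options(cfg, "precon", 1, 0.9) would add precon:aggressive_levels=1 and precon:max_row_sum=0.9 to an existing string cfg before it is handed to AMGX_config_create.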

Samev (Author) commented Nov 13, 2024

@marsaev Do you have any input on the original issue? I.e., should it be possible to gracefully destruct the AMGX solver after running into an out-of-memory error?

Or maybe this isn't an out-of-memory error at all and we are simply misinterpreting it as such?
