When running AMGX on a case that is too large for the GPU, it reports the following error:
Thrust failure: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
File and line number are not available for this exception.
when calling AMGX_solver_setup. Following this, we try to reset the AMGX solver, but when AMGX_solver_destroy is called it crashes the application (despite the call being made inside a try-catch block) with the following:
I'm wondering whether it is intended to be possible to recover from out-of-memory errors like this. I looked in the documentation and couldn't find anything specific indicating that a failure in AMGX_solver_setup needs special handling.
Obviously AMGX won't be able to handle this particular matrix and solver combination on this particular GPU. However, the crash currently prevents us from destructing our AMGX solver object when we run into this limit, which is a problem since the whole application goes down with it.
I tried skipping the call to AMGX_solver_destroy (proceeding with the rest of the *destroy and finalize calls), but then I run into the !!! detected some memory leaks in the code: trying to free non-empty temporary device pool !!! error, which makes sense since the solver object isn't destroyed in the intended order.
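As a stopgap while the crash in AMGX_solver_destroy cannot be caught, one option is to run the solve in a separate process so that only the child dies. The sketch below is a generic illustration of that idea and is not something the AMGX documentation prescribes. The helper name `run_isolated` is hypothetical, the crashing solver is simulated with `os.abort()`, and the signal detection assumes a POSIX system such as the Ubuntu/WSL setup described here.

```python
import subprocess
import sys


def run_isolated(argv, timeout=None):
    """Run argv in a child process and report how it ended."""
    proc = subprocess.run(argv, timeout=timeout)
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child was killed by
        # a signal (for example SIGABRT or SIGSEGV); the parent survives.
        return ("crashed", -proc.returncode)
    return ("exited", proc.returncode)


if __name__ == "__main__":
    # Stand-in for a solver process that aborts during teardown.
    crashing_child = [sys.executable, "-c", "import os; os.abort()"]
    print(run_isolated(crashing_child))

    # A child that finishes normally, for comparison.
    print(run_isolated([sys.executable, "-c", "raise SystemExit(0)"]))
```

The cost is that the matrix and the solution have to cross the process boundary, so this only makes sense if the host application can afford that transfer.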
Environment information:
OS: Ubuntu 22.04 (through WSL on Windows 11)
CUDA runtime: CUDA 11.7.1
MPI version (if applicable): Not applicable
AMGX version or commit hash: v2.3.0 + cherry picked 8bb693b42acc64c1893835d95858cad350c790c1
NVIDIA driver: 528.24 (probably the Windows driver version as nvidia-smi reports the same version in Windows + WSL)
NVIDIA GPU: RTX4080
Any related environment variables information: Not applicable
The same problem has been reported on the same build with at least an RTX 3090 card as well.
I'm getting the same error and it is quite confusing. While "Illegal memory access" is probably why the solver crashes, the error message should probably say that the illegal access is caused by running out of memory. The error can be reproduced using the amg_mpi_poisson7 example:
For a 500³ grid on an A100 80GB the solver passes, but for a 600³ grid the solver crashes. We use AMGX as part of a flow solver that has other GPU memory requirements, so in practice the error already occurs at cell counts of 20M.
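For a rough sense of scale, the sketch below estimates only the fine-level CSR storage of a 7-point Poisson matrix on an n×n×n grid. It assumes 8-byte values, 4-byte indices and a single GPU holding the whole matrix; these are assumptions, not settings confirmed by the example. It ignores the AMG hierarchy, vectors and temporary buffers, so real usage during setup will be higher. The function name is made up for illustration.

```python
def poisson7_csr_bytes(n, value_bytes=8, index_bytes=4):
    """Estimate CSR storage for a 7-point Poisson matrix on an n^3 grid."""
    rows = n ** 3
    # Each cell has a diagonal entry plus up to 6 neighbours; each of the
    # 6 grid faces removes n*n neighbour links.
    nnz = 7 * rows - 6 * n ** 2
    values = nnz * value_bytes
    col_indices = nnz * index_bytes
    row_offsets = (rows + 1) * index_bytes
    return rows, nnz, values + col_indices + row_offsets


if __name__ == "__main__":
    for n in (500, 600):
        rows, nnz, total = poisson7_csr_bytes(n)
        print(f"{n}^3: rows={rows:,} nnz={nnz:,} "
              f"csr={total / 2**30:.1f} GiB "
              f"nnz_fits_int32={nnz < 2**31}")
```

Under these assumptions the fine-level matrix alone grows by roughly 1.7x between the two grid sizes, and its nonzero count still fits in a signed 32-bit integer at 600³, so this arithmetic by itself does not settle whether the failure is memory exhaustion.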
@marsaev Do you have any input on the original issue? That is, should it be possible to gracefully destruct the AMGX solver if one runs into an out-of-memory error?
Or maybe this isn't an out-of-memory error at all and we are simply misinterpreting it as such?
AMGX solver configuration
Matrix Data
My currently used matrix I'm not able to share. If you need me to I can see if I can recreate this crash with a matrix that isn't sensitive.
Reproduction steps
Call order:
AMGX_solver_register_print_callback
AMGX_initialize
AMGX_initialize_plugins
AMGX_install_signal_handler
AMGX_config_create (global config)
AMGX_resources_create_simple
AMGX_config_create (for the specific solver)
AMGX_matrix_create
AMGX_vector_create (both rhs and solution)
AMGX_solver_create
AMGX_matrix_upload_all
AMGX_solver_setup
AMGX_solver_destroy
Additional context
-