Simulation freezes #5494

Open
lisajulia opened this issue Jul 30, 2024 · 2 comments

Comments

@lisajulia
Contributor

lisajulia commented Jul 30, 2024

I wrote a test for something different, and on Jenkins the simulation froze:
Datafile: https://github.com/lisajulia/opm-tests/blob/8b84e28bd63d705ec659976a789cbc5cd7f0a80a/actionx/ACTIONX_COMPDAT_SHORT.DATA
Log file from Jenkins with the frozen simulation, which was eventually ended by a timeout: https://ci.opm-project.org/job/opm-simulators-PR-builder/6452/testReport/junit/(root)/mpi/compareSeparateECLFiles_flow_actionx_compdat_8_procs/

Flow compiled with the following commits:
opm-common: d075bc889ead20424c695382a077275ddb1c66a3
opm-models: 29582a9f59feec1c9d04286977ab6adef89b12e3
opm-grid: bc501ad7f48676918c594d0c8dd42c405958f758
opm-simulators: ed5f371

I ran flow on 8 processes.

If this error is gone when this is tested again, please close the issue!

@blattms
Member

blattms commented Jul 30, 2024

Sigh, one of my favorite parallel deadlocks in opm-flow:

7 processes are in

(gdb) bt
#0  0x00007ffdcd906f94 in ?? ()
   from /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so
#1  0x00007fffeac21e1c in opal_progress ()
   from /lib/x86_64-linux-gnu/libopen-pal.so.40
#2  0x00007fffee512bc5 in ompi_request_default_wait ()
   from /lib/x86_64-linux-gnu/libmpi.so.40
#3  0x00007fffee56e35b in ompi_coll_base_sendrecv_actual ()
   from /lib/x86_64-linux-gnu/libmpi.so.40
#4  0x00007fffee56f9e0 in ompi_coll_base_allreduce_intra_recursivedoubling ()
   from /lib/x86_64-linux-gnu/libmpi.so.40
#5  0x00007ffdcd80d8eb in ompi_coll_tuned_allreduce_intra_dec_fixed ()
   from /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so
#6  0x00007fffee52a31a in PMPI_Allreduce ()
   from /lib/x86_64-linux-gnu/libmpi.so.40
#7  0x0000555555b61a50 in Dune::Communication<ompi_communicator_t*>::allreduce<Dune::Max<Opm::ExceptionType::ExcEnum>, Opm::ExceptionType::ExcEnum> (
    this=0x7fffffffbc80, in=0x7fffffffbc9c, out=0x7fffffffbc6c, len=1)
    at /usr/include/dune/common/parallel/mpicommunication.hh:457
#8  0x0000555555b5df25 in Dune::Communication<ompi_communicator_t*>::max<Opm::ExceptionType::ExcEnum> (this=0x7fffffffbc80, 
    in=@0x7fffffffbc9c: Opm::ExceptionType::NONE)
    at /usr/include/dune/common/parallel/mpicommunication.hh:253
#9  0x0000555555c51b48 in (anonymous namespace)::_throw (
    exc_type=Opm::ExceptionType::NONE, 
    message="BlackoilWellModel::initializeWellState() failed: ", comm=...)
    at .../opm-simulators/opm/simulators/utils/DeferredLoggingErrorHelpers.hpp:77
#10 0x0000555555c7b473 in checkForExceptionsAndThrow (
    exc_type=Opm::ExceptionType::NONE, 
    message="BlackoilWellModel::initializeWellState() failed: ", comm=...)
    at .../opm-simulators/opm/simulators/utils/DeferredLoggingErrorHelpers.hpp:108
#11 0x0000555555d61d51 in Opm::BlackoilWellModel<Opm::Properties::TTag::FlowProblemTPFA>::initializeWellState (this=0x55555ca17248, timeStepIdx=6)
    at .../opm-simulators/opm/simulators/wells/BlackoilWellModel_impl.hpp:835
#12 0x0000555555d32a6d in Opm::BlackoilWellModel<Opm::Properties::TTag::FlowProblemTPFA>::initializeLocalWellStructure (this=0x55555ca17248, reportStepIdx=6, 
    enableWellPIScaling=true)
    at .../opm-simulators/opm/simulators/wells/BlackoilWellModel_impl.hpp:327
#13 0x0000555555cff11a in Opm::BlackoilWellModel<Opm::Properties::TTag::FlowProblemTPFA>::beginReportStep (this=0x55555ca17248, timeStepIdx=6)
    at .../opm-simulators/opm/simulators/wells/BlackoilWellModel_impl.hpp:270
#14 0x0000555555d673e1 in Opm::BlackoilWellModel<Opm::Properties::TTag::FlowProblemTPFA>::beginEpisode (this=0x55555ca17248)
    at .../opm-simulators/opm/simulators/wells/BlackoilWellModel.hpp:204
#15 0x0000555555d36c33 in Opm::FlowProblem<Opm::Properties::TTag::FlowProblemTPFA>::beginEpisode (this=0x55555ca16460)
    at .../opm-simulators/opm/simulators/flow/FlowProblem.hpp:566
#16 0x0000555555cffdf3 in Opm::BlackoilModel<Opm::Properties::TTag::FlowProblemTPFA>::beginReportStep (this=0x55555babef60)
    at .../opm-simulators/opm/simulators/flow/BlackoilModel.hpp:1175
#17 0x0000555555ccbf23 in Opm::SimulatorFullyImplicitBlackoil<Opm::Properties::TTag::FlowProblemTPFA>::runStep (this=0x55555d58dd20, timer=...)
    at .../opm-simulators/opm/simulators/flow/SimulatorFullyImplicitBlackoil.hpp:373
#18 0x0000555555cb7545 in Opm::SimulatorFullyImplicitBlackoil<Opm::Properties::TTag::FlowProblemTPFA>::run (this=0x55555d58dd20, timer=...)
    at .../opm-simulators/opm/simulators/flow/SimulatorFullyImplicitBlackoil.hpp:268
#19 0x0000555555ca0a6b in Opm::FlowMain<Opm::Properties::TTag::FlowProblemTPFA>::runSimulatorRunCallback_ (this=0x7fffffffcf70)
    at .../opm-simulators/opm/simulators/flow/FlowMain.hpp:484
...

and 1 threw an unexpected exception:

#9  0x0000555555c7b5db in logAndCheckForExceptionsAndThrow (
    deferred_logger=..., exc_type=Opm::ExceptionType::RUNTIME_ERROR, 
    message="Failed to initialize local well structure: [.../opm-simulators/opm/simulators/wells/ParallelWellInfo.cpp:708] Cells with these i,j,k indices were not found in grid (well = PROD3)"..., 
    terminal_output=false, comm=...)
    at .../opm-simulators/opm/simulators/utils/DeferredLoggingErrorHelpers.hpp:121
121         Opm::DeferredLogger global_deferredLogger = gatherDeferredLogger(deferred_logger, comm);
(gdb) 
#10 0x0000555555d32be0 in Opm::BlackoilWellModel<Opm::Properties::TTag::FlowProblemTPFA>::initializeLocalWellStructure (this=0x55555d3283b8, reportStepIdx=6, 
    enableWellPIScaling=true)
    at .../opm-simulators/opm/simulators/wells/BlackoilWellModel_impl.hpp:342
342             OPM_END_PARALLEL_TRY_CATCH_LOG(local_deferredLogger,
(gdb) bt
#0  0x00007fffeac798b5 in ?? () from /lib/x86_64-linux-gnu/libopen-pal.so.40
#1  0x00007fffeac21ce7 in ?? () from /lib/x86_64-linux-gnu/libopen-pal.so.40
#2  0x00007fffeac21e74 in opal_progress ()
   from /lib/x86_64-linux-gnu/libopen-pal.so.40
#3  0x00007fffee512bc5 in ompi_request_default_wait ()
   from /lib/x86_64-linux-gnu/libmpi.so.40
#4  0x00007fffee56e35b in ompi_coll_base_sendrecv_actual ()
   from /lib/x86_64-linux-gnu/libmpi.so.40
#5  0x00007fffee56caf3 in ompi_coll_base_allgather_intra_recursivedoubling ()
   from /lib/x86_64-linux-gnu/libmpi.so.40
#6  0x00007ffdcd80eb1a in ompi_coll_tuned_allgather_intra_dec_fixed ()
   from /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so
#7  0x00007fffee529437 in PMPI_Allgather ()
   from /lib/x86_64-linux-gnu/libmpi.so.40
#8  0x00007ffff6792f2b in Opm::gatherDeferredLogger (local_deferredlogger=..., 
    mpi_communicator=...)
    at .../opm-simulators/opm/simulators/utils/gatherDeferredLogger.cpp:145
#9  0x0000555555c7b5db in logAndCheckForExceptionsAndThrow (
    deferred_logger=..., exc_type=Opm::ExceptionType::RUNTIME_ERROR, 
    message="Failed to initialize local well structure: [.../opm-simulators/opm/simulators/wells/ParallelWellInfo.cpp:708] Cells with these i,j,k indices were not found in grid (well = PROD3)"..., 
    terminal_output=false, comm=...)
    at .../opm-simulators/opm/simulators/utils/DeferredLoggingErrorHelpers.hpp:121
#10 0x0000555555d32be0 in Opm::BlackoilWellModel<Opm::Properties::TTag::FlowProblemTPFA>::initializeLocalWellStructure (this=0x55555d3283b8, reportStepIdx=6, 
    enableWellPIScaling=true)
    at .../opm-simulators/opm/simulators/wells/BlackoilWellModel_impl.hpp:342
#11 0x0000555555cff11a in Opm::BlackoilWellModel<Opm::Properties::TTag::FlowProblemTPFA>::beginReportStep (this=0x55555d3283b8, timeStepIdx=6)
    at .../opm-simulators/opm/simulators/wells/BlackoilWellModel_impl.hpp:270
#12 0x0000555555d673e1 in Opm::BlackoilWellModel<Opm::Properties::TTag::FlowProblemTPFA>::beginEpisode (this=0x55555d3283b8)
    at .../opm-simulators/opm/simulators/wells/BlackoilWellModel.hpp:204
#13 0x0000555555d36c33 in Opm::FlowProblem<Opm::Properties::TTag::FlowProblemTPFA>::beginEpisode (this=0x55555d3275d0)
    at .../opm-simulators/opm/simulators/flow/FlowProblem.hpp:566
#14 0x0000555555cffdf3 in Opm::BlackoilModel<Opm::Properties::TTag::FlowProblemTPFA>::beginReportStep (this=0x55555babef60)
    at .../opm-simulators/opm/simulators/flow/BlackoilModel.hpp:1175
#15 0x0000555555ccbf23 in Opm::SimulatorFullyImplicitBlackoil<Opm::Properties::TTag::FlowProblemTPFA>::runStep (this=0x55555d4a8f90, timer=...)
    at .../opm-simulators/opm/simulators/flow/SimulatorFullyImplicitBlackoil.hpp:373
#16 0x0000555555cb7545 in Opm::SimulatorFullyImplicitBlackoil<Opm::Properties::TTag::FlowProblemTPFA>::run (this=0x55555d4a8f90, timer=...)
    at .../opm-simulators/opm/simulators/flow/SimulatorFullyImplicitBlackoil.hpp:268
#17 0x0000555555ca0a6b in Opm::FlowMain<Opm::Properties::TTag::FlowProblemTPFA>::runSimulatorRunCallback_ (this=0x7fffffffcf70)
    at .../opm-simulators/opm/simulators/flow/FlowMain.hpp:48

I think this is due to COMPDAT in ACTIONX. Outside of ACTIONX this check is performed on process 0, and the cell is known there. Now it is performed on the parallel load-balanced grid (without our futureCompletions), and the cell may be on another process?
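
To illustrate the suspicion above, here is a minimal sketch in plain C++ without MPI. The names `Cell`, `countOwners`, and `missingEverywhere` are hypothetical and not OPM code. It only shows why a lookup on one rank's part of a distributed grid can fail for a cell that does exist on another rank, so that only the count over all ranks says whether the cell is really missing:

```cpp
#include <array>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical stand-in for an (i, j, k) cell index.
using Cell = std::array<int, 3>;

// Number of simulated ranks whose local part of the grid contains `cell`.
// Each entry of `ownedCells` is the set of cells held by one rank. In a real
// parallel run this sum would have to come from a global reduction.
std::size_t countOwners(const std::vector<std::set<Cell>>& ownedCells,
                        const Cell& cell)
{
    std::size_t owners = 0;
    for (const auto& local : ownedCells)
        owners += local.count(cell);
    return owners;
}

// A cell is only an input error if no rank at all holds it.
bool missingEverywhere(const std::vector<std::set<Cell>>& ownedCells,
                       const Cell& cell)
{
    return countOwners(ownedCells, cell) == 0;
}
```

With two simulated ranks where only the second one holds a given cell, a lookup in the first rank's set fails, yet `missingEverywhere` returns false.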

The real problem here is that our simulator should fail gracefully and not deadlock, even without your upcoming PR #5488, which will close this issue.
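
In the backtraces above, seven ranks wait in an allreduce while the rank that hit the error waits in an allgather, so neither collective can complete. One way to picture failing gracefully is that no rank acts on its local error state alone. A minimal sketch, assuming plain C++ with the ranks simulated as vector entries instead of MPI processes (`ErrorLevel`, `globalMax`, and `checkAndThrow` are hypothetical names, not the OPM helpers):

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical error levels, ordered so that a maximum picks the worst one.
enum class ErrorLevel { None = 0, RuntimeError = 1 };

// Stand-in for a max-reduction over all ranks: every rank contributes its
// local level and every rank would receive the same global result.
ErrorLevel globalMax(const std::vector<ErrorLevel>& localLevels)
{
    if (localLevels.empty())
        return ErrorLevel::None;
    return *std::max_element(localLevels.begin(), localLevels.end());
}

// What every simulated rank does after its local work: first obtain the
// agreed level, then either all ranks throw or all ranks continue.
void checkAndThrow(const std::vector<ErrorLevel>& localLevels,
                   const std::string& message)
{
    if (globalMax(localLevels) != ErrorLevel::None)
        throw std::runtime_error(message);
}
```

If a single entry out of eight is `ErrorLevel::RuntimeError`, `checkAndThrow` throws for the whole group. The point of the pattern is that all ranks reach the same decision through the same collective step, so none is left waiting in a different call.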

@lisajulia
Contributor Author

Yes, true, then the simulator should stop.
I'd suggest keeping this issue open so we can have a look later, since it is reproducible and debuggable with the commit IDs.
