
ELM might fail on Crusher with MPI_Bcast error #5554

Open · dqwu opened this issue Mar 24, 2023 · 41 comments
Labels: Crusher, Frontier, Land, Performance, SCORPIO (The E3SM I/O library, derived from PIO)


dqwu commented Mar 24, 2023

Steps to reproduce with compset I1850GSWCNPRDCTCBC and res hcru_hcru:

git clone https://github.com/E3SM-Project/E3SM.git
cd E3SM
git submodule update --init

cd cime/scripts
./create_newcase --case I1850GSWCNPRDCTCBC_hcru_hcru --compset I1850GSWCNPRDCTCBC --res hcru_hcru --pecount 1344 --walltime 00:05:00
cd I1850GSWCNPRDCTCBC_hcru_hcru

./case.setup

cat <<EOF >> user_nl_elm
&elm_inparam
 hist_mfilt = 24
 hist_nhtfrq = -1
 hist_dov2xy = .true.
 nyears_ad_carbon_only = 25
 spinup_mortality_factor = 10
 metdata_type = 'gswp3'
 metdata_bypass = '/gpfs/alpine/cli115/world-shared/e3sm_olcf/inputdata/atm/datm7/atm_forcing.datm7.GSWP3.0.5d.v2.c180716/cpl_bypass_full'
 co2_file = '/gpfs/alpine/cli115/world-shared/e3sm_olcf/inputdata/atm/datm7/CO2/fco2_datm_1765-2007_c100614.nc'
 aero_file = '/gpfs/alpine/cli115/world-shared/e3sm_olcf/inputdata/atm/cam/chem/trop_mozart_aero/aero/aerosoldep_rcp4.5_monthly_1849-2104_1.9x2.5_c100402.nc'
EOF

./xmlchange MOSART_MODE=NULL
./xmlchange STOP_N=1
./xmlchange REST_OPTION="none"
./xmlchange PIO_NETCDF_FORMAT="64bit_data"

./case.setup --reset

./case.build

./case.submit

Error logs:

...
1008: MPICH ERROR [Rank 1008] [job id 288171.0] [Thu Mar 23 17:35:52 2023] [crusher174] - Abort(740892687) (rank 1008 in comm 0): Fatal error in PMPI_Bcast: Other MPI error, error stack:
1008: PMPI_Bcast(454)..........: MPI_Bcast(buf=0x7fffffff1314, count=1, MPI_INT, root=0, comm=comm=0xc4000086) failed
1008: PMPI_Bcast(439)..........:
1008: MPIR_CRAY_Bcast(437).....:
1008: MPIR_CRAY_Bcast_Tree(162):
1008: (unknown)(): Other MPI error
1008:
1008: aborting job:
1008: Fatal error in PMPI_Bcast: Other MPI error, error stack:
1008: PMPI_Bcast(454)..........: MPI_Bcast(buf=0x7fffffff1314, count=1, MPI_INT, root=0, comm=comm=0xc4000086) failed
1008: PMPI_Bcast(439)..........:
1008: MPIR_CRAY_Bcast(437).....:
1008: MPIR_CRAY_Bcast_Tree(162):
1008: (unknown)(): Other MPI error
 448: MPICH ERROR [Rank 448] [job id 288171.0] [Thu Mar 23 17:35:52 2023] [crusher164] - Abort(1012473999) (rank 448 in comm 0): Fatal error in PMPI_Bcast: Other MPI error, error stack:
 448: PMPI_Bcast(454)................: MPI_Bcast(buf=0x7fffffff1354, count=1, MPI_INT, root=0, comm=comm=0xc4000086) failed
 448: PMPI_Bcast(439)................:
 448: MPIR_CRAY_Bcast(437)...........:
 448: MPIR_CRAY_Bcast_Tree(162)......:
 448: MPIC_Recv(197).................:
 448: MPIC_Wait(71)..................:
 448: MPIR_Wait_impl(41).............:
 448: MPID_Progress_wait(184)........:
 448: MPIDI_Progress_test(80)........:
 448: MPIDI_OFI_handle_cq_error(1059): OFI poll failed (ofi_events.c:1061:MPIDI_OFI_handle_cq_error:Message too long - OK)
 448: MPIR_CRAY_Bcast(493)...........:
 448: MPIR_CRAY_Bcast_Tree(250)......: Failure during collective
 448:
 448: aborting job:
 448: Fatal error in PMPI_Bcast: Other MPI error, error stack:
 448: PMPI_Bcast(454)................: MPI_Bcast(buf=0x7fffffff1354, count=1, MPI_INT, root=0, comm=comm=0xc4000086) failed
...

Could be an issue similar to E3SM-Project/scream#1920 on Perlmutter.

Like the workaround in PR #5291, we just need to add an environment variable in config_machines.xml for Crusher (works for the above case):
<env name="MPICH_COLL_SYNC">MPI_Bcast</env>
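
For a quick check outside the CIME configuration, the same variable could be exported in an interactive job environment before launching; this is an illustrative sketch only (the launch line is a placeholder, and the proper E3SM change is the config_machines.xml entry above).

# Illustrative only: exercise the workaround in an interactive allocation on Crusher.
export MPICH_COLL_SYNC=MPI_Bcast   # force MPICH to add a barrier before every MPI_Bcast (as described below)
srun -N 24 -n 1344 ./e3sm.exe      # placeholder launch line; the case is normally run via ./case.submit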


ndkeen commented Mar 24, 2023

Because we see this same issue on another machine, it seems likely the root cause is in our code and/or the MPI/netCDF layers. There have been updates to the HW/SW on Perlmutter, but I've not yet tried testing without that environment variable set.

@bishtgautam

@dqwu I'm definitely not the right person to debug/fix this issue.


rljacob commented Mar 24, 2023

What is it about this case config that is different from a regular I case ?

rljacob assigned grnydawn and unassigned bishtgautam Mar 24, 2023

dqwu commented Mar 24, 2023

What is it about this case config that is different from a regular I case ?

The above test case is simplified from a benchmark I case (listed on the E3SM Confluence page). The original benchmark I case (STOP_N=10) can also reproduce this issue on Crusher.


sarats commented Mar 24, 2023

Generally speaking, I would really like someone from the land group to take over these debugging activities now that the initial machine readiness stuff is done.

This specific issue could even be a transient interconnect issue. We can check once Frontier becomes available and the network stabilizes. No rush on running standalone ELM on Crusher/Frontier anyway.


rljacob commented Mar 24, 2023

I thought the fix was to modify the crusher config?

@dqwu does the error trace point to a routine in the land model? There's no context for those MPICH errors above.


dqwu commented Mar 24, 2023

I thought the fix was to modify the crusher config?

@dqwu does the error trace point to a routine in the land model? There's no context for those MPICH errors above.

The workaround seems to work (it is already used on Perlmutter).
The error trace does not show detailed call stacks so far. However, there could be many MPI_Bcast calls made inside SCORPIO with this specific test case. See E3SM-Project/scorpio#493 for more information (that PR has reduced some MPI_Bcast calls).

dqwu self-assigned this and unassigned grnydawn Mar 24, 2023

dqwu commented Mar 24, 2023

@grnydawn I will modify the crusher config to add MPICH_COLL_SYNC ENV variable. Before that, I will also check whether the failed MPI_Bcast calls are from SCORPIO. You can test and integrate the fix later.

dqwu added the SCORPIO label Mar 24, 2023

dqwu commented Mar 24, 2023

In E3SM-Project/scream#1920, a SCREAM developer also reported a similar error trace related to MPI_Bcast calls on Perlmutter:

 1536: MPICH ERROR [Rank 1536] [job id 3458129.0] [Sat Oct 22 15:35:58 2022] [nid005474] - Abort(134243855) (rank 1536 in comm 0): Fatal error in MPIR_CRAY_Bcast_Tree: Other MPI error, error stack:
 1536: MPIR_CRAY_Bcast_Tree(183): message sizes do not match across processes in the collective routine: Received -32766 but expected 1
 1536:
 1536: aborting job:
 1536: Fatal error in MPIR_CRAY_Bcast_Tree: Other MPI error, error stack:
 1536: MPIR_CRAY_Bcast_Tree(183): message sizes do not match across processes in the collective routine: Received -32766 but expected 1

@grnydawn

@grnydawn I will modify the crusher config to add MPICH_COLL_SYNC ENV variable. Before that, I will also check whether the failed MPI_Bcast calls are from SCORPIO. You can test and integrate the fix later.

@dqwu Ok. I will run tests on my side once you finish the fix.


sarats commented Mar 24, 2023

Honestly, at least for Crusher/Frontier, this issue can wait until the Slingshot network stabilizes.

I will discuss this with the Cray folks during our next call. However, they will probably ask for a small reproducer, which again requires effort to put together. If this is truly a blocker for SCREAM or MMF on Frontier, we can assign people to work on it.

That flag imposes additional synchronization overhead, which is probably not desirable for the performance of the rest of the model configurations, so a machine-wide default is not warranted. As alluded to in the SCREAM issue, there are probably too many Bcasts being issued, which exhaust some internal resources. If that's the root cause, we should identify and increase the necessary thresholds.


sarats commented Mar 25, 2023

Btw, as Pat pointed out in the discussion at E3SM-Project/scorpio#493, if the root cause is a need for flow control, we should address that.


dqwu commented Mar 28, 2023

Here is some updated info:

  • The issue manifests either as an MPI_Bcast failure (with error messages) or as a hang on one MPI_Bcast call
  • "MPICH_COLL_SYNC=MPI_Bcast" always works, by enforcing a barrier before each MPI_Bcast call (inside or outside SCORPIO)
  • For STOP_N=1, there are more than 768K (768,664) MPI_Bcast calls inside SCORPIO
  • For STOP_N=10, there are more than 6.8M (6,896,197) MPI_Bcast calls inside SCORPIO
  • To avoid the MPI_Bcast failure or hang, SCORPIO does not have to add a barrier before every MPI_Bcast call. For STOP_N=1, the issue occurs after about 20K MPI_Bcast calls, and adding a barrier after every 2K accumulated calls works.

Alternatively, if we do not use MPI barriers, setting a particular libfabric environment variable (for the cxi provider, mentioned in PR #5275) also works:

FI_CXI_DEFAULT_CQ_SIZE
  Change the provider default completion queue size.  This may be useful for applications which
  rely on middleware, and middleware defaults the completion queue size to the provider default.

It seems that we do not need to explicitly set FI_CXI_RX_MATCH_MODE, FI_CXI_REQ_BUF_SIZE, or FI_UNIVERSE_SIZE (they can all use their default settings). In https://docs.nersc.gov/performance/network, it is mentioned that "setting FI_CXI_RX_MATCH_MODE=hardware can cause jobs to fail when they exhaust the hardware message queue (usually by sending too many MPI messages)."

However, we do need a larger value for FI_CXI_DEFAULT_CQ_SIZE (the value suggested in PR #5275 is only 70K), even when FI_CXI_RX_MATCH_MODE is explicitly set to "software" (see the sketch after the list below):

  • For STOP_N=1, FI_CXI_DEFAULT_CQ_SIZE=100K works, but 90K does not.
  • For STOP_N=10, FI_CXI_DEFAULT_CQ_SIZE=300K works, but 200K does not.
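
For reference, here is a minimal sketch of applying these settings in the job environment; the 131072 value matches the documented Frontier default rather than a tuned recommendation, and the launch line is a placeholder, not part of the original report.

# Illustrative job-environment settings for the libfabric cxi provider (Slingshot 11).
export FI_CXI_DEFAULT_CQ_SIZE=131072   # enlarge the provider completion queue (Frontier default)
# FI_CXI_RX_MATCH_MODE can be left at its default; "software" was also tested above.
srun -N 24 -n 1344 ./e3sm.exe          # placeholder launch line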


rljacob commented Mar 28, 2023

The total number of calls to MPI_Bcast over a run shouldn't matter, so I don't understand how a 10-day run would need a larger CQ_SIZE than a 1-day run.


sarats commented Mar 29, 2023

Looking on Frontier, the default according to the intro_mpi man page appears to be 131,072, which should suffice for your first case. However, I also don't understand why a 10-day run needs more resources, as completed MPI_Bcasts should release any internal resources.

Evidently, FI_CXI_RX_MATCH_MODE defaults to 'hardware' on Frontier. Out of curiosity, have you tried 'hybrid' mode, and did you face the same issues? 'hybrid' might be better, as the switch to software mode is done on a rank-by-rank basis.

hybrid: Message matching begins fully offloaded to the NIC, but if hardware resources become exhausted at any point, the message matching will transition to a "hybrid" of both hardware and software matching. This is done on a rank by rank basis. If a rank exhausts its hardware resources, that rank will transparently transition to software endpoint mode. ...

FI_CXI_DEFAULT_CQ_SIZE
           This is a cxi libfabric ENV variable. It specifies the maximum number of entries in the CXI provider completion queue. Too small of a queue can result in "Cassini Event Queue overflow detected" errors. Only applies to Slingshot 11.

       Default: 131072


sarats commented Mar 29, 2023

I wonder what the error message regarding OFI poll failure "Message too long" refers to.

448: MPIDI_OFI_handle_cq_error(1059): OFI poll failed (ofi_events.c:1061:MPIDI_OFI_handle_cq_error:Message too long - OK)

Btw, do you see the "LE resources not recovered during flow control" error message when using 'hardware' mode?

@abbotts Does an application need to clear/flush the CXI provider completion queue periodically?


sarats commented Mar 29, 2023

Just noticed that Crusher has a FI_CXI_DEFAULT_CQ_SIZE default of 32k. So, OLCF has increased this on Frontier.


abbotts commented Mar 29, 2023

The increase in the default CQ size should actually be due to either an MPI or libfabric update. I forget which. The network software on Crusher is the oldest of the three machines (crusher, frontier, perlmutter). It's almost a year old and is missing a lot of fixes.

Does an application need to clear/flush the CXI provider completion queue periodically?

No, this isn't something you should need to do. As incoming messages are matched to posted receives the queue should empty. The barrier you insert by setting MPICH_COLL_SYNC is effectively saying "all pending communication needs to finish", which should empty out the queue.

We have been seeing applications with a many-to-one communication pattern running out of queue space and going into flow control as they scale out. For example, if there's something like a root process that posts a receive with MPI_ANY_SOURCE to handle client requests, and every client talks to it at once, then the root process might drop into flow control.

In principle, all our MPI collectives should have internal throttling to avoid dropping into flow control. There could be a bug in MPI_Bcast, or it could be that the Bcast is the victim rather than the cause.

What does the communication pattern look like between the Bcasts?
If there's no communication between the Bcasts, then maybe we can reproduce with just a long series of broadcasts.

Evidently, the FI_CXI_RX_MATCH_MODE by default is 'hardware' on Frontier. Curious, have you tried 'hybrid' mode and did you face the same issues? 'hybrid' might be better as the switch to software mode is done on a rank by rank basis.

I'd stick with "software" for now, especially on Crusher. Long term "hybrid" will probably be the ideal solution, but the transition from hardware to software matching still needs more testing.

Does this issue occur only at small scale, or only as you scale out?


dqwu commented Mar 29, 2023

@sarats @rljacob
I originally tested with modified SCORPIO code (for debugging purposes only), which might affect the behavior of the MPI_Bcast calls. For my latest runs with the default SCORPIO submodule (no code changes), setting FI_CXI_DEFAULT_CQ_SIZE to 100K seems sufficient for the 10-day run as well. More precisely, 96K works but 95K still hangs.

For this issue on Crusher, maybe we can set the FI_CXI_DEFAULT_CQ_SIZE ENV variable to 128K (the default value on Frontier) in config_machines.xml?


sarats commented Mar 29, 2023

AFAIK, we haven't run into this issue running MMF or SCREAM even at scale. If I understand correctly, you encounter this issue on 24 nodes (1344 PEs), right?

@dqwu Do you know why this issue presents with Land I/O? Are they doing a lot of small broadcasts or using MPI_ANY_SOURCE?


sarats commented Mar 29, 2023

The above test case is simplified from a benchmark I case (listed in E3SM confluence page).

Can you point to this page?


dqwu commented Mar 29, 2023

AFAIK, we haven't run into this issue running MMF or SCREAM even at scale. If I understand correctly, you encounter this issue on 24 nodes (1344 PEs), right?

@dqwu Do you know why this issue presents with Land I/O? Are they doing a lot of small broadcasts or using MPI_ANY_SOURCE?

Maybe the I case invokes a lot of pio_inq calls in SCORPIO, which use MPI_Bcast. For the 10-day run there are more than 6M MPI_Bcast calls.


dqwu commented Mar 29, 2023

The above test case is simplified from a benchmark I case (listed in E3SM confluence page).

Can you point to this page?

https://acme-climate.atlassian.net/wiki/spaces/EPG/pages/1161691492/High+resolution+I+O+Benchmark+process#I-Case


rljacob commented Mar 29, 2023

@thorntonpe or @bishtgautam : who is an expert in how ELM does I/O ?

@bishtgautam

@rljacob, I don't believe we have an expert within the land team and have been using @jayeshkrishna's and @dqwu's expertise.

@peterdschwartz, are you familiar with land I/O?

@dqwu With the following changes, you are writing ELM output every time step:

 hist_mfilt = 24
 hist_nhtfrq = -1

Are you outputting ELM data at this high frequency just to reproduce the error? Does anyone think this error could be avoided if ELM didn't use a round-robin domain decomposition?


jayeshkrishna commented Mar 29, 2023

My two cents: this issue seems to be a bug in the MPI/libfabric library. We have an E3SM reproducer at the top of this issue that reproduces the problem on Crusher. With regard to I/O patterns in ELM, the code shouldn't crash with the current MPI/IO usage in the component (unless there is a bug in the code, which seems unlikely, since the code/case works on other machines, and one of the ways to avoid the crash appears to be related to buffer-size/synchronization settings for the libfabric/MPI library).

We can continue to work on reproducers that only use MPI, but that may or may not reliably reproduce the issue.


dqwu commented Mar 29, 2023

@bishtgautam The original I case benchmark test intentionally outputs ELM data at high frequency to generate large files for testing the write performance of the PnetCDF and ADIOS I/O types.


sarats commented Mar 29, 2023

I'm still curious about what in ELM's I/O triggers this issue on the Slingshot interconnect on both Perlmutter and Frontier.

It's understandable if this issue presents with high-frequency output from other components as well. We should plan on running high-res SCREAM and ocean benchmarks on Frontier.
Let's try running this case once the machine becomes available and see if the improved network stack resolves this.


rljacob commented Mar 29, 2023

@bishtgautam SCORPIO and the layers below it are indeed up to Jayesh and Danqing. But how ELM calls SCORPIO is ultimately up to the land team to understand and help optimize. For example, someone on the land team has to own components/elm/src/main/histfileMod.F90.


sarats commented Mar 31, 2023

@bishtgautam / @thorntonpe / @peterdschwartz

A suggestion: add a barrier periodically (every n steps, etc.) in the land driver when performing high-frequency I/O to flush the communication queues/buffers. That localizes and minimizes the synchronization overhead to just land I/O and allows fine-tuning as needed.

Maybe the I case invokes a lot of pio_inq calls in SCORPIO which use MPI_Bcast.

Something to follow up on in the future.


sarats commented Jun 1, 2023

Can the above suggestion (periodic sync in land) be implemented to see if it addresses the issue?

sarats added the Frontier label Jun 1, 2023
@bishtgautam

@sarats Sure, I can implement it. I'm considering implementing it as a namelist option to control the frequency of syncing. Sounds good?


sarats commented Jun 2, 2023

Perfect, that would allow fine tuning.

@bishtgautam

@sarats #5741 is now open for you to experiment with.


sarats commented Jun 2, 2023

@dqwu Check the above branch and see what mpi_sync_nstep_freq value avoids the problem for the benchmark.
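
For reference, a minimal sketch of trying that option with the reproducer at the top of this issue, following the same user_nl_elm pattern as the original steps; the frequency value below is a hypothetical starting point, not a recommendation from #5741.

cat <<EOF >> user_nl_elm
 mpi_sync_nstep_freq = 48   ! hypothetical value: add an MPI_Barrier every 48 land steps
EOF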

@bishtgautam

@dqwu Just curious if you had a chance to try out #5741 to see if that fixes this issue.


dqwu commented Jun 22, 2023

@dqwu Just curious if you had a chance to try out #5741 to see if that fixes this issue.

Not yet. I will test that fix later.


sarats commented Jul 14, 2023

@dqwu Please report back whether this issue is resolved with the above PR and what frequency was needed.


dqwu commented Jul 19, 2023

@sarats It seems that #5741 has been merged into master. I tested the latest E3SM master and this issue is no longer reproducible. I did not explicitly add "mpi_sync_nstep_freq = XX" in user_nl_elm, though; according to that PR, the MPI_Barrier is not called by default.


sarats commented Jul 19, 2023

Did you test it today or earlier? There have been a bunch of system updates yesterday.


dqwu commented Jul 19, 2023

Did you test it today or earlier? There have been a bunch of system updates yesterday.

Today.
