
ELM might fail on Crusher with MPI_Bcast error #5554

Open · dqwu opened this issue Mar 24, 2023 · 41 comments
Labels: Crusher, Frontier, Land, Performance, SCORPIO (The E3SM I/O library, derived from PIO)


dqwu commented Mar 24, 2023

Steps to reproduce with compset I1850GSWCNPRDCTCBC and res hcru_hcru:

git clone https://github.com/E3SM-Project/E3SM.git
cd E3SM
git submodule update --init

cd cime/scripts
./create_newcase --case I1850GSWCNPRDCTCBC_hcru_hcru --compset I1850GSWCNPRDCTCBC --res hcru_hcru --pecount 1344 --walltime 00:05:00
cd I1850GSWCNPRDCTCBC_hcru_hcru

./case.setup

cat <<EOF >> user_nl_elm
&elm_inparam
 hist_mfilt = 24
 hist_nhtfrq = -1
 hist_dov2xy = .true.
 nyears_ad_carbon_only = 25
 spinup_mortality_factor = 10
 metdata_type = 'gswp3'
 metdata_bypass = '/gpfs/alpine/cli115/world-shared/e3sm_olcf/inputdata/atm/datm7/atm_forcing.datm7.GSWP3.0.5d.v2.c180716/cpl_bypass_full'
 co2_file = '/gpfs/alpine/cli115/world-shared/e3sm_olcf/inputdata/atm/datm7/CO2/fco2_datm_1765-2007_c100614.nc'
 aero_file = '/gpfs/alpine/cli115/world-shared/e3sm_olcf/inputdata/atm/cam/chem/trop_mozart_aero/aero/aerosoldep_rcp4.5_monthly_1849-2104_1.9x2.5_c100402.nc'
EOF

./xmlchange MOSART_MODE=NULL
./xmlchange STOP_N=1
./xmlchange REST_OPTION="none"
./xmlchange PIO_NETCDF_FORMAT="64bit_data"

./case.setup --reset

./case.build

./case.submit

Error logs:

...
1008: MPICH ERROR [Rank 1008] [job id 288171.0] [Thu Mar 23 17:35:52 2023] [crusher174] - Abort(740892687) (rank 1008 in comm 0): Fatal error in PMPI_Bcast: Other MPI error, error stack:
1008: PMPI_Bcast(454)..........: MPI_Bcast(buf=0x7fffffff1314, count=1, MPI_INT, root=0, comm=comm=0xc4000086) failed
1008: PMPI_Bcast(439)..........:
1008: MPIR_CRAY_Bcast(437).....:
1008: MPIR_CRAY_Bcast_Tree(162):
1008: (unknown)(): Other MPI error
1008:
1008: aborting job:
1008: Fatal error in PMPI_Bcast: Other MPI error, error stack:
1008: PMPI_Bcast(454)..........: MPI_Bcast(buf=0x7fffffff1314, count=1, MPI_INT, root=0, comm=comm=0xc4000086) failed
1008: PMPI_Bcast(439)..........:
1008: MPIR_CRAY_Bcast(437).....:
1008: MPIR_CRAY_Bcast_Tree(162):
1008: (unknown)(): Other MPI error
 448: MPICH ERROR [Rank 448] [job id 288171.0] [Thu Mar 23 17:35:52 2023] [crusher164] - Abort(1012473999) (rank 448 in comm 0): Fatal error in PMPI_Bcast: Other MPI error, error stack:
 448: PMPI_Bcast(454)................: MPI_Bcast(buf=0x7fffffff1354, count=1, MPI_INT, root=0, comm=comm=0xc4000086) failed
 448: PMPI_Bcast(439)................:
 448: MPIR_CRAY_Bcast(437)...........:
 448: MPIR_CRAY_Bcast_Tree(162)......:
 448: MPIC_Recv(197).................:
 448: MPIC_Wait(71)..................:
 448: MPIR_Wait_impl(41).............:
 448: MPID_Progress_wait(184)........:
 448: MPIDI_Progress_test(80)........:
 448: MPIDI_OFI_handle_cq_error(1059): OFI poll failed (ofi_events.c:1061:MPIDI_OFI_handle_cq_error:Message too long - OK)
 448: MPIR_CRAY_Bcast(493)...........:
 448: MPIR_CRAY_Bcast_Tree(250)......: Failure during collective
 448:
 448: aborting job:
 448: Fatal error in PMPI_Bcast: Other MPI error, error stack:
 448: PMPI_Bcast(454)................: MPI_Bcast(buf=0x7fffffff1354, count=1, MPI_INT, root=0, comm=comm=0xc4000086) failed
...

Could be an issue similar to E3SM-Project/scream#1920 on Perlmutter.

Like the workaround in PR #5291, we just need to add an environment variable in config_machines.xml for Crusher (works for the above case):
<env name="MPICH_COLL_SYNC">MPI_Bcast</env>
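
For a quick check outside the CIME configuration, the same variable could be exported in an interactive job environment before launching; this is an illustrative sketch only (the launch line is a placeholder, and the proper E3SM change is the config_machines.xml entry above).

# Illustrative only: exercise the workaround in an interactive allocation on Crusher.
export MPICH_COLL_SYNC=MPI_Bcast   # force MPICH to add a barrier before every MPI_Bcast (as described below)
srun -N 24 -n 1344 ./e3sm.exe      # placeholder launch line; the case is normally run via ./case.submit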


ndkeen commented Mar 24, 2023

Because we see this same issue on another machine, it seems likely the root cause is in our code and/or the MPI/netCDF layers. There have been updates to the HW/SW on Perlmutter, but I've not yet tried testing without that environment variable set.

@bishtgautam

@dqwu I'm definitely not the right person to debug/fix this issue.


rljacob commented Mar 24, 2023

What is it about this case config that is different from a regular I case ?

rljacob assigned grnydawn and unassigned bishtgautam Mar 24, 2023

dqwu commented Mar 24, 2023

What is it about this case config that is different from a regular I case ?

The above test case is simplified from a benchmark I case (listed on the E3SM Confluence page). The original benchmark I case (STOP_N=10) can also reproduce this issue on Crusher.


sarats commented Mar 24, 2023

Generally speaking, I would really like someone from the land group to take over these debugging activities now that the initial machine readiness stuff is done.

This specific issue could even be a transient interconnect issue. We can check once Frontier becomes available and the network stabilizes. No rush on running standalone ELM on Crusher/Frontier anyway.


rljacob commented Mar 24, 2023

I thought the fix was to modify the crusher config?

@dqwu does the error trace point to a routine in the land model? There's no context for those MPICH errors above.


dqwu commented Mar 24, 2023

I thought the fix was to modify the crusher config?

@dqwu does the error trace point to a routine in the land model? There's no context for those MPICH errors above.

The workaround seems to work (it is already used on Perlmutter).
The error trace does not show detailed call stacks so far. However, there could be many MPI_Bcast calls made inside SCORPIO with this specific test case. See E3SM-Project/scorpio#493 for more information (that PR has reduced some MPI_Bcast calls).

dqwu self-assigned this and unassigned grnydawn Mar 24, 2023

dqwu commented Mar 24, 2023

@grnydawn I will modify the crusher config to add MPICH_COLL_SYNC ENV variable. Before that, I will also check whether the failed MPI_Bcast calls are from SCORPIO. You can test and integrate the fix later.

dqwu added the SCORPIO label Mar 24, 2023

dqwu commented Mar 24, 2023

In E3SM-Project/scream#1920, a SCREAM developer also reported a similar error trace related to MPI_Bcast calls on Perlmutter:

 1536: MPICH ERROR [Rank 1536] [job id 3458129.0] [Sat Oct 22 15:35:58 2022] [nid005474] - Abort(134243855) (rank 1536 in comm 0): Fatal error in MPIR_CRAY_Bcast_Tree: Other MPI error, error stack:
 1536: MPIR_CRAY_Bcast_Tree(183): message sizes do not match across processes in the collective routine: Received -32766 but expected 1
 1536:
 1536: aborting job:
 1536: Fatal error in MPIR_CRAY_Bcast_Tree: Other MPI error, error stack:
 1536: MPIR_CRAY_Bcast_Tree(183): message sizes do not match across processes in the collective routine: Received -32766 but expected 1

@grnydawn

@grnydawn I will modify the crusher config to add MPICH_COLL_SYNC ENV variable. Before that, I will also check whether the failed MPI_Bcast calls are from SCORPIO. You can test and integrate the fix later.

@dqwu Ok. I will run tests on my side once you finish the fix.


sarats commented Mar 24, 2023

Honestly, at least for Crusher/Frontier, this issue can wait until the Slingshot network stabilizes.

I will discuss this with the Cray folks during our next call. However, they will probably ask for a small reproducer, which again requires effort to put together. If this is truly a blocker for SCREAM or MMF on Frontier, we can assign people to work on it.

That flag imposes additional synchronization overhead, which is probably not desirable for the performance of the rest of the model configurations, so a machine-wide default is not warranted. As alluded to in the SCREAM issue, there are probably too many Bcasts being issued, which exhaust some internal resources. If that's the root cause, we should identify and increase the necessary thresholds.


sarats commented Mar 25, 2023

Btw, as Pat pointed out in the discussion at E3SM-Project/scorpio#493, if the root cause is a need for flow control, we should address that.


dqwu commented Mar 28, 2023

Here is some updated info:

  • The issue manifests either as an MPI_Bcast failure (with error messages) or as a hang on one MPI_Bcast call
  • "MPICH_COLL_SYNC=MPI_Bcast" always works, by enforcing a barrier before each MPI_Bcast call (inside or outside SCORPIO)
  • For STOP_N=1, there are more than 768K (768,664) MPI_Bcast calls inside SCORPIO
  • For STOP_N=10, there are more than 6.8M (6,896,197) MPI_Bcast calls inside SCORPIO
  • To avoid the MPI_Bcast failure or hang, SCORPIO does not have to add a barrier before every MPI_Bcast call. For STOP_N=1, the issue occurs after about 20K MPI_Bcast calls, and adding a barrier after every 2K accumulated calls works.

Alternatively, if we do not use MPI barriers, setting a particular libfabric environment variable (for the cxi provider, mentioned in PR #5275) also works:

FI_CXI_DEFAULT_CQ_SIZE
  Change the provider default completion queue size.  This may be useful for applications which
  rely on middleware, and middleware defaults the completion queue size to the provider default.

It seems that we do not need to explicitly set FI_CXI_RX_MATCH_MODE, FI_CXI_REQ_BUF_SIZE, or FI_UNIVERSE_SIZE (they can all use their default settings). In https://docs.nersc.gov/performance/network, it is mentioned that "setting FI_CXI_RX_MATCH_MODE=hardware can cause jobs to fail when they exhaust the hardware message queue (usually by sending too many MPI messages)."

However, we do need a larger value for FI_CXI_DEFAULT_CQ_SIZE (the value suggested in PR #5275 is only 70K), even when FI_CXI_RX_MATCH_MODE is explicitly set to "software" (see the sketch after the list below):

  • For STOP_N=1, FI_CXI_DEFAULT_CQ_SIZE=100K works, but 90K does not.
  • For STOP_N=10, FI_CXI_DEFAULT_CQ_SIZE=300K works, but 200K does not.
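
For reference, here is a minimal sketch of applying these settings in the job environment; the 131072 value matches the documented Frontier default rather than a tuned recommendation, and the launch line is a placeholder, not part of the original report.

# Illustrative job-environment settings for the libfabric cxi provider (Slingshot 11).
export FI_CXI_DEFAULT_CQ_SIZE=131072   # enlarge the provider completion queue (Frontier default)
# FI_CXI_RX_MATCH_MODE can be left at its default; "software" was also tested above.
srun -N 24 -n 1344 ./e3sm.exe          # placeholder launch line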


rljacob commented Mar 28, 2023

The total number of calls to MPI_Bcast over a run shouldn't matter, so I don't understand how a 10-day run would need a larger CQ_SIZE than a 1-day run.


sarats commented Mar 29, 2023

Looking on Frontier, the default according to the intro_mpi man page appears to be 131,072, which should suffice for your first case. However, I also don't understand why a 10-day run needs more resources, as completed MPI_Bcasts should release any internal resources.

Evidently, FI_CXI_RX_MATCH_MODE defaults to 'hardware' on Frontier. Out of curiosity, have you tried 'hybrid' mode, and did you face the same issues? 'hybrid' might be better, as the switch to software mode is done on a rank-by-rank basis.

hybrid: Message matching begins fully offloaded to the NIC, but if hardware resources become exhausted at any point, the message matching will transition to a "hybrid" of both hardware and software matching. This is done on a rank by rank basis. If a rank exhausts its hardware resources, that rank will transparently transition to software endpoint mode. ...

FI_CXI_DEFAULT_CQ_SIZE
           This is a cxi libfabric ENV variable. It specifies the maximum number of entries in the CXI provider completion queue. Too small of a queue can result in "Cassini Event Queue overflow detected" errors. Only applies to Slingshot 11.

       Default: 131072


sarats commented Mar 29, 2023

I wonder what the error message regarding OFI poll failure "Message too long" refers to.

448: MPIDI_OFI_handle_cq_error(1059): OFI poll failed (ofi_events.c:1061:MPIDI_OFI_handle_cq_error:Message too long - OK)

Btw, do you see the "LE resources not recovered during flow control" error message when using 'hardware' mode?

@abbotts Does an application need to clear/flush the CXI provider completion queue periodically?


sarats commented Mar 29, 2023

Just noticed that Crusher has a FI_CXI_DEFAULT_CQ_SIZE default of 32k. So, OLCF has increased this on Frontier.


abbotts commented Mar 29, 2023

The increase in the default CQ size should actually be due to either an MPI or libfabric update. I forget which. The network software on Crusher is the oldest of the three machines (crusher, frontier, perlmutter). It's almost a year old and is missing a lot of fixes.

Does an application need to clear/flush the CXI provider completion queue periodically?

No, this isn't something you should need to do. As incoming messages are matched to posted receives the queue should empty. The barrier you insert by setting MPICH_COLL_SYNC is effectively saying "all pending communication needs to finish", which should empty out the queue.

We have been seeing applications with a many-to-one communication pattern running out of queue space and going into flow control as they scale out. For example, if there's something like a root process that posts a receive with MPI_ANY_SOURCE to handle client requests, and every client talks to it at once, then the root process might drop into flow control.

In principle, all our MPI collectives should have internal throttling to avoid dropping into flow control. There could be a bug in MPI_Bcast, or it could be that the Bcast is the victim rather than the cause.

What does the communication pattern look like between the Bcasts?
If there's no communication between the Bcasts, then maybe we can reproduce with just a long series of broadcasts.

Evidently, the FI_CXI_RX_MATCH_MODE by default is 'hardware' on Frontier. Curious, have you tried 'hybrid' mode and did you face the same issues? 'hybrid' might be better as the switch to software mode is done on a rank by rank basis.

I'd stick with "software" for now, especially on Crusher. Long term "hybrid" will probably be the ideal solution, but the transition from hardware to software matching still needs more testing.

Does this issue occur only at small scale, or only as you scale out?


dqwu commented Mar 29, 2023

@sarats @rljacob
I originally tested with modified SCORPIO code (for debugging purposes only), which might affect the behavior of the MPI_Bcast calls. For my latest runs with the default SCORPIO submodule (no code changes), setting FI_CXI_DEFAULT_CQ_SIZE to 100K seems sufficient for the 10-day run as well. More precisely, 96K works but 95K still hangs.

For this issue on Crusher, maybe we can set the FI_CXI_DEFAULT_CQ_SIZE ENV variable to 128K (the default value on Frontier) in config_machines.xml?


sarats commented Mar 29, 2023

AFAIK, we haven't run into this issue running MMF or SCREAM even at scale. If I understand correctly, you encounter this issue on 24 nodes (1344 PEs), right?

@dqwu Do you know why this issue presents with Land I/O? Are they doing a lot of small broadcasts or using MPI_ANY_SOURCE?


sarats commented Mar 29, 2023

The above test case is simplified from a benchmark I case (listed in E3SM confluence page).

Can you point to this page?


dqwu commented Mar 29, 2023

AFAIK, we haven't run into this issue running MMF or SCREAM even at scale. If I understand correctly, you encounter this issue on 24 nodes (1344 PEs), right?

@dqwu Do you know why this issue presents with Land I/O? Are they doing a lot of small broadcasts or using MPI_ANY_SOURCE?

Maybe the I case invokes a lot of pio_inq calls in SCORPIO, which use MPI_Bcast. For the 10-day run there are more than 6M MPI_Bcast calls.


dqwu commented Mar 29, 2023

The above test case is simplified from a benchmark I case (listed in E3SM confluence page).

Can you point to this page?

https://acme-climate.atlassian.net/wiki/spaces/EPG/pages/1161691492/High+resolution+I+O+Benchmark+process#I-Case


rljacob commented Mar 29, 2023

@thorntonpe or @bishtgautam : who is an expert in how ELM does I/O ?

@bishtgautam

@rljacob, I don't believe we have an expert within the land team and have been using @jayeshkrishna's and @dqwu's expertise.

@peterdschwartz, are you familiar with land I/O?

@dqwu With the following changes, you are writing ELM output every time step:

 hist_mfilt = 24
 hist_nhtfrq = -1

Are you outputting ELM data at this high frequency just to reproduce the error? Does anyone think this error could be avoided if ELM didn't use a round-robin domain decomposition?


jayeshkrishna commented Mar 29, 2023

My two cents: this issue seems to be a bug in the MPI/libfabric library. We have an E3SM reproducer at the top of this issue that reproduces the problem on Crusher. With regard to I/O patterns in ELM, the code shouldn't crash with the current MPI/IO usage in the component (unless there is a bug in the code, which seems unlikely, since the code/case works on other machines, and one of the ways to avoid the crash appears to be related to buffer-size/synchronization settings for the libfabric/MPI library).

We can continue to work on reproducers that only use MPI, but that may or may not reliably reproduce the issue.


dqwu commented Mar 29, 2023

@bishtgautam The original I case benchmark test intentionally outputs ELM data at high frequency to generate large files for testing the write performance of the PnetCDF and ADIOS I/O types.


sarats commented Mar 29, 2023

I'm still curious about what in ELM's I/O triggers this issue on the Slingshot interconnect on both Perlmutter and Frontier.

It's understandable if this issue presents with high-frequency output from other components as well. We should plan on running high-res SCREAM and ocean benchmarks on Frontier.
Let's try running this case once the machine becomes available and see if the improved network stack resolves this.


rljacob commented Mar 29, 2023

@bishtgautam SCORPIO and the layers below it are indeed up to Jayesh and Danqing. But how ELM calls SCORPIO is ultimately up to the land team to understand and help optimize. For example, someone on the land team has to own components/elm/src/main/histfileMod.F90.


sarats commented Mar 31, 2023

@bishtgautam / @thorntonpe / @peterdschwartz

A suggestion: add a barrier periodically (every n steps, etc.) in the land driver when performing high-frequency I/O to flush the communication queues/buffers. That localizes and minimizes the synchronization overhead to just land I/O and allows fine-tuning as needed.

Maybe the I case invokes a lot of pio_inq calls in SCORPIO which use MPI_Bcast.

Something to follow up on in the future.


sarats commented Jun 1, 2023

Can the above suggestion (periodic sync in land) be implemented to see if it addresses the issue?

sarats added the Frontier label Jun 1, 2023
@bishtgautam

@sarats Sure, I can implement it. I'm considering implementing it as a namelist option to control the frequency of syncing. Sounds good?


sarats commented Jun 2, 2023

Perfect, that would allow fine tuning.

@bishtgautam

@sarats #5741 is now open for you to experiment with.


sarats commented Jun 2, 2023

@dqwu Check the above branch and see what mpi_sync_nstep_freq value avoids the problem for the benchmark.
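
For reference, a minimal sketch of trying that option with the reproducer at the top of this issue, following the same user_nl_elm pattern as the original steps; the frequency value below is a hypothetical starting point, not a recommendation from #5741.

cat <<EOF >> user_nl_elm
 mpi_sync_nstep_freq = 48   ! hypothetical value: add an MPI_Barrier every 48 land steps
EOF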

@bishtgautam

@dqwu Just curious if you had a chance to try out #5741 to see if that fixes this issue.


dqwu commented Jun 22, 2023

@dqwu Just curious if you had a chance to try out #5741 to see if that fixes this issue.

Not yet. I will test that fix later.


sarats commented Jul 14, 2023

@dqwu Please report back whether this issue is resolved with the above PR and what frequency was needed.


dqwu commented Jul 19, 2023

@sarats It seems that #5741 has been merged into master. I tested the latest E3SM master and this issue is no longer reproducible. I did not explicitly add "mpi_sync_nstep_freq = XX" in user_nl_elm, though; according to that PR, the MPI_Barrier is not called by default.


sarats commented Jul 19, 2023

Did you test it today or earlier? There have been a bunch of system updates yesterday.


dqwu commented Jul 19, 2023

Did you test it today or earlier? There have been a bunch of system updates yesterday.

Today.
