ELM might fail on Crusher with MPI_Bcast error #5554
Comments
Because we see this same issue on another machine, it seems likely the root cause is in our code and/or MPI/netcdf layers. There have been updates to HW/SW on Perlmutter, but I've not yet tried testing without that environment variable set. @ndkeen |
@dqwu I'm definitely not the right person to debug/fix this issue. |
What is it about this case config that is different from a regular I case? |
The above test case is simplified from a benchmark I case (listed on the E3SM Confluence page). The original benchmark I case (STOP_N=10) can also reproduce this issue on Crusher. |
Generally speaking, I would really like someone from the land group to take over these debugging activities now that the initial machine readiness stuff is done. This specific issue could even be a transient interconnect issue. We can check once Frontier becomes available and the network stabilizes. No rush on running standalone ELM on Crusher/Frontier anyway. |
I thought the fix was to modify the crusher config? @dqwu does the error trace point to a routine in the land model? There's no context for those MPICH errors above. |
The workaround seems to work (used by Perlmutter already). |
@grnydawn I will modify the crusher config to add the MPICH_COLL_SYNC ENV variable. Before that, I will also check whether the failed MPI_Bcast calls are from SCORPIO. You can test and integrate the fix later. |
In E3SM-Project/scream#1920, a scream developer also reported a similar error trace related to MPI_Bcast calls on Perlmutter. |
Honestly, at least for Crusher/Frontier: this issue can wait until the Slingshot network stabilizes. I will discuss this with Cray folks during our next call. However, they will probably ask for a small reproducer, which again requires effort to put together. If this is truly a blocker for SCREAM or MMF on Frontier, we can assign people to work on it. That flag imposes additional sync overhead, which is probably not desirable for the performance of the rest of the model configurations, so a machine-wide default is not warranted. As alluded to in the Scream issue, there are probably too many Bcasts being issued, which are exhausting some internal resources. If that's the root cause, we should identify and increase the necessary thresholds. |
Btw, as Pat pointed out in the discussion at E3SM-Project/scorpio#493, if the root cause is a need for flow control, we should address that. |
Here is some updated info:
Alternatively, if we do not use MPI barriers, a particular libfabric environment variable (for the cxi provider, mentioned in PR #5275) also works.
It seems that we do not need to explicitly set FI_CXI_RX_MATCH_MODE, FI_CXI_REQ_BUF_SIZE, and FI_UNIVERSE_SIZE (they can all use the default settings). In https://docs.nersc.gov/performance/network, it is mentioned that "setting FI_CXI_RX_MATCH_MODE=hardware can cause jobs to fail when they exhaust the hardware message queue (usually by sending too many MPI messages)." However, we do need a larger value for FI_CXI_DEFAULT_CQ_SIZE (the suggested value in PR #5275 is only 70K), even when FI_CXI_RX_MATCH_MODE is explicitly set to "software":
|
The total number of calls to MPI_Bcast over a run shouldn't matter, so I don't understand how a 10-day run would need a larger CQ_SIZE than a 1-day run. |
Looking on Frontier, the default FI_CXI_DEFAULT_CQ_SIZE is already 128K there. Evidently, the default has been increased relative to Crusher. |
I wonder what the error message regarding OFI poll failure "Message too long" refers to.
Btw, do you see the "LE resources not recovered during flow control" error message when using 'hardware' mode? @abbotts Does an application need to clear/flush the CXI provider completion queue periodically? |
Just noticed that Crusher has a |
The increase in the default CQ size should actually be due to either an MPI or libfabric update. I forget which. The network software on Crusher is the oldest of the three machines (crusher, frontier, perlmutter). It's almost a year old and is missing a lot of fixes.
No, this isn't something you should need to do. As incoming messages are matched to posted receives, the queue should empty. The barrier you insert by setting MPICH_COLL_SYNC is effectively saying "all pending communication needs to finish", which should empty out the queue. We have been seeing applications with a many-to-one communication pattern running out of queue space and going into flow control as they scale out. For example, if there's something like a root process that posts a receive with MPI_ANY_SOURCE to handle client requests, and every client talks to it at once, then the root process might drop into flow control. In principle, all our MPI collectives should do internal throttling to avoid dropping into flow control. There could be a bug in MPI_Bcast, or it could be that the Bcast is the victim, not the cause. What does the communication pattern look like between the Bcasts?
I'd stick with "software" for now, especially on Crusher. Long term, "hybrid" will probably be the ideal solution, but the transition from hardware to software matching still needs more testing. Does this issue occur only at small scale, or only as you scale out? |
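To make the many-to-one pattern described above concrete, here is a minimal Fortran sketch (the program and variable names are made up for illustration, not taken from E3SM): every non-root rank sends a small request to rank 0, which receives with MPI_ANY_SOURCE, and at scale the resulting burst of unexpected messages on rank 0 is the kind of traffic that can exhaust matching resources and push that rank into flow control.

    ! many_to_one.F90 -- illustrative sketch only, not E3SM code
    program many_to_one
      use mpi
      implicit none
      integer :: ierr, rank, nranks, i, req
      integer :: status(MPI_STATUS_SIZE)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

      if (rank == 0) then
         ! root services one request per client, posted with MPI_ANY_SOURCE
         do i = 1, nranks - 1
            call MPI_Recv(req, 1, MPI_INTEGER, MPI_ANY_SOURCE, 0, &
                          MPI_COMM_WORLD, status, ierr)
         end do
      else
         ! every client sends to the root at roughly the same time
         req = rank
         call MPI_Send(req, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, ierr)
      end if

      call MPI_Finalize(ierr)
    end program many_to_one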
@sarats @rljacob For this issue on Crusher, maybe we can set the FI_CXI_DEFAULT_CQ_SIZE ENV variable to 128K (the default value on Frontier) in config_machines.xml? |
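If we go that route, the entry would presumably sit alongside the existing MPICH_COLL_SYNC workaround in config_machines.xml; a sketch of what it might look like (131072 = 128K, the Frontier default mentioned above):

    <env name="FI_CXI_DEFAULT_CQ_SIZE">131072</env>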
AFAIK, we haven't run into this issue running MMF or SCREAM even at scale. If I understand correctly, you encounter this issue on 24 nodes (1344 PEs), right? @dqwu Do you know why this issue presents with Land I/O? Are they doing a lot of small broadcasts or using MPI_ANY_SOURCE? |
Can you point to this page? |
Maybe the I case invokes a lot of pio_inq calls in SCORPIO, which use MPI_Bcast. For a 10-day run there are more than 6M MPI_Bcast calls. |
@thorntonpe or @bishtgautam : who is an expert in how ELM does I/O ? |
@rljacob, I don't believe we have an expert within the land team and have been using @jayeshkrishna's and @dqwu's expertise. @peterdschwartz, are you familiar with land I/O? @dqwu With the following changes, you are writing ELM output every time step:
Are you outputting ELM data at this high frequency just to reproduce the error? Does anyone think this error could be avoided if ELM didn't use a round-robin domain decomposition? |
My two cents: This issue seems to be a bug in the MPI/libfabric library. We have an E3SM reproducer at the top of the issue to reproduce the problem on Crusher. With regard to I/O patterns in ELM, the code shouldn't crash with the current MPI/IO usage in the component (unless there is a bug in the code, which seems unlikely since the code/case works on other machines, and one of the ways to avoid the crash seems to be related to buffer_size/synchronization settings for the libfabric/MPI lib). We can continue to work on reproducers that only use MPI, but that may or may not reliably reproduce the issue. |
@bishtgautam The original I case benchmark test intentionally outputs ELM data at high frequency to produce large files for testing the write performance of the PnetCDF and ADIOS I/O types. |
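For context, per-time-step ELM history output is normally requested through the history namelist settings in user_nl_elm; the values below are purely illustrative and are not the benchmark's actual configuration:

    ! user_nl_elm (illustrative values only)
    hist_nhtfrq = 1     ! a positive value writes history every N model time steps
    hist_mfilt  = 48    ! number of time samples packed into each history file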
I'm still curious what it is about ELM's I/O that triggers this issue on the Slingshot interconnect on both Perlmutter and Frontier. It would not be surprising if this issue presents with high-frequency output from other components as well. We should plan on running high-res Scream and Ocean benchmarks on Frontier. |
@bishtgautam SCORPIO and the layers below that are indeed up to Jayesh and Danqing. But how ELM calls SCORPIO is ultimately up to the land team to understand and help optimize. For example, someone on the land team has to own components/elm/src/main/histfileMod.F90. |
@bishtgautam / @thorntonpe / @peterdschwartz A suggestion: Add a barrier periodically (every n steps, etc.) in the land driver when performing high-frequency I/O to flush the communication queues/buffers. It localizes and minimizes the sync overhead to just land I/O and allows fine-tuning as needed.
Something to follow up on in the future. |
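A minimal sketch of that suggestion, assuming a hypothetical namelist-controlled interval (hist_sync_nstep and the loop structure below are illustrative, not actual ELM driver code):

    ! periodic_sync.F90 -- illustrative sketch only, not ELM code
    program periodic_sync
      use mpi
      implicit none
      integer :: ierr, nstep
      integer, parameter :: nsteps = 480          ! e.g. 10 days of 30-minute steps
      integer, parameter :: hist_sync_nstep = 48  ! would come from a namelist option

      call MPI_Init(ierr)
      do nstep = 1, nsteps
         ! ... land physics and high-frequency history writes happen here ...
         if (hist_sync_nstep > 0 .and. mod(nstep, hist_sync_nstep) == 0) then
            ! drain outstanding traffic so queues/buffers do not keep growing
            call MPI_Barrier(MPI_COMM_WORLD, ierr)
         end if
      end do
      call MPI_Finalize(ierr)
    end program periodic_sync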
Can the above suggestion (periodic sync in land) be implemented to see if it addresses the issue? |
@sarats Sure, I can implement it. I'm considering implementing it as a namelist option to control the frequency of syncing. Sounds good? |
Perfect, that would allow fine tuning. |
@dqwu Check the above branch and see what frequency works. |
@dqwu Please report back if this issue is resolved with above PR and what frequency was needed. |
Did you test it today or earlier? There have been a bunch of system updates yesterday. |
Today. |
Steps to reproduce with compset I1850GSWCNPRDCTCBC and res hcru_hcru:
Error logs:
Could be an issue similar to E3SM-Project/scream#1920 on Perlmutter.
Like the workaround in PR #5291, we just need to add an environment variable in config_machines.xml for Crusher (works for the above case):
<env name="MPICH_COLL_SYNC">MPI_Bcast</env>
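For reference, in config_machines.xml such entries live inside the machine's environment_variables block; the placement sketched below is an assumption about the surrounding context, which may carry additional attributes:

    <environment_variables>
      <env name="MPICH_COLL_SYNC">MPI_Bcast</env>
    </environment_variables>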