Data partitioning mechanism used in E3SM #6807
Replies: 15 comments
-
@wkliao Not sure you really want to take this on, but ... I have code to do online partitioning of the ocean meshes (and also code to compute the offsets for parallel I/O) that we wrote for the new Omega model and that matches the MPAS-Ocean decompositions. It requires linking to and calling Metis routines. The cells are partitioned first, and the edge/vertex locations are then partitioned based on the cell partitioning. The MPAS-seaice partitioning is more complicated because they do some further balancing after the Metis partitioning, and I can't help with that. You'd end up with a good-sized chunk of code to maintain.
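A minimal sketch of the "cells first, then edges" idea described above, in plain Python. It does not call Metis; `cell_owner` stands in for the output of a graph partitioner. The function names, the `-1` marker for a missing neighbour, and the "first valid neighbouring cell owns the edge" rule are all assumptions made for illustration, not the Omega or MPAS-Ocean implementation. The offset helper also assumes each rank's entities form one contiguous block, which may not match the actual file layout.

```python
def partition_edges(cell_owner, cells_on_edge):
    """Assign each edge to the rank that owns its first valid neighbouring cell.

    cell_owner[c]    : rank owning cell c (e.g. output of a graph partitioner)
    cells_on_edge[e] : pair of cell indices; -1 marks a missing neighbour
                       (hypothetical convention for boundary edges)
    """
    edge_owner = []
    for cells in cells_on_edge:
        valid = [c for c in cells if c >= 0]
        # Assumed ownership rule: the first valid cell decides the edge owner.
        edge_owner.append(cell_owner[valid[0]])
    return edge_owner


def io_offsets(owner, nranks):
    """Count entities per rank and compute an exclusive prefix sum.

    starts[r] is where rank r's block would begin if every rank wrote its
    entities as a single contiguous block (an assumption of this sketch).
    """
    counts = [0] * nranks
    for rank in owner:
        counts[rank] += 1
    starts = []
    total = 0
    for n in counts:
        starts.append(total)
        total += n
    return counts, starts


if __name__ == "__main__":
    cell_owner = [0, 0, 1, 1]
    cells_on_edge = [(0, 1), (1, 2), (2, 3), (3, -1), (-1, 0)]
    edge_owner = partition_edges(cell_owner, cells_on_edge)
    print(edge_owner)
    print(io_offsets(edge_owner, 2))
```

Deriving edge ownership from cell ownership keeps each edge on a rank that already holds one of its cells, which is the property the comment describes. The real code may break ties differently.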
-
Are the "pre-generated data partitioning patterns" you refer to files like "mpas-o.graph.info.230422.part"? Because an I case should not require any of those.
-
Hi, @rljacob
My understanding of the I case is that there are 5 such I/O decompositions and
Hi, @philipwjones
I am hoping to be able to generate such file offset lists, through a library
I do not intend to obtain a library or utility program that can generate the exact
I learned that in E3SM the file offsets written by each process are highly
-
@wkliao
-
@wkliao: Different model components in E3SM use different grids for their computation, which results in different I/O decompositions and I/O patterns. Maybe we can start by adding support for the ATM/EAM grid (cubed-sphere grid) in the E3SM I/O benchmark code first and then proceed with the other components (it might be tricky for components that use other software to efficiently partition their grids across processes). Also note that within the same component the model variables can be written out with different decompositions, so it's not enough to capture just the grid; any load-balancing mechanism used must be captured as well. How soon do you need these changes in the benchmark? We can work on enhancing the benchmark to support/simulate different grids. But as @philipwjones noted above, please keep in mind that this would increase the amount and complexity of the code in the I/O benchmark.
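As a rough illustration of how a problem size could be derived from a cubed-sphere resolution, here is a small sketch. The column count uses the usual spectral-element relation `6 * ne^2 * (np - 1)^2 + 2` with `np = 4`. The even block split that follows is purely an assumption for illustration and is not how EAM actually distributes columns or balances load. The function names are hypothetical.

```python
def cubed_sphere_columns(ne, np_=4):
    """Number of unique columns on a cubed-sphere grid with ne x ne
    elements per cube face and np_ x np_ points per element."""
    return 6 * ne * ne * (np_ - 1) ** 2 + 2


def block_ranges(n, nranks):
    """Split n items into nranks contiguous (start, count) blocks.

    The first (n % nranks) ranks receive one extra item. This even split
    is an assumption of the sketch, not the model's partitioning scheme.
    """
    base, extra = divmod(n, nranks)
    ranges = []
    start = 0
    for rank in range(nranks):
        count = base + (1 if rank < extra else 0)
        ranges.append((start, count))
        start += count
    return ranges


if __name__ == "__main__":
    ncol = cubed_sphere_columns(30)
    print(ncol)
    print(block_ranges(ncol, 4))
```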
-
Another thing to keep in mind is that, from the E3SM perspective, we are mostly interested in I/O performance for a limited set of configuration settings (grid resolution/configuration, number of MPI processes, model output configuration). These can currently be analyzed by reading the I/O decomposition files dumped by SCORPIO, which the E3SM I/O benchmark tool already supports.
-
@jayeshkrishna @dqwu and I have been investigating an MPI-IO hanging problem we
-
OK. For now, to debug the hang issue, @dqwu, can you generate the I/O decomposition maps with a lower-resolution run (ne256?) and use them with the I/O benchmark code?
-
@wkliao Since your alltomany.c test can indeed reproduce the hanging issue on Aurora (with len set to 1200), I think that among the E3SM-IO datasets perhaps only the F case (with its large number of small non-contiguous sub-array requests) can reproduce the hanging issue.
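To make "a large number of small non-contiguous requests" concrete, the sketch below coalesces a process's list of element offsets into offset-length pairs. The number of resulting pairs is the number of separate requests that process would issue. The function name is hypothetical, and the sketch assumes the offsets are unique element indices; it is not taken from E3SM-IO or SCORPIO.

```python
def coalesce(offsets):
    """Merge element offsets into (offset, length) pairs.

    Consecutive offsets are merged into one pair; every gap starts a new
    pair. Offsets are assumed to be unique.
    """
    pairs = []
    for off in sorted(offsets):
        if pairs and pairs[-1][0] + pairs[-1][1] == off:
            # Extends the current run by one element.
            pairs[-1] = (pairs[-1][0], pairs[-1][1] + 1)
        else:
            # A gap: start a new non-contiguous request.
            pairs.append((off, 1))
    return pairs


if __name__ == "__main__":
    contiguous = coalesce(range(0, 8))     # one large request
    strided = coalesce(range(0, 24, 3))    # many one-element requests
    print(len(contiguous), contiguous)
    print(len(strided), strided)
```

A process holding the same number of elements can end up with one request or with as many requests as elements, depending only on how its offsets are laid out.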
-
@wkliao Also, since your column_wise.c can reproduce the hanging issue as well when given a large len argument, I think you can use it as a special map generator too.
-
@jayeshkrishna To debug the hang issue, we already have some pure MPI programs (with or without I/O) as simple reproducers.
-
My guess is that the hang may also happen at other problem sizes or run scales. Based on my understanding of ROMIO, I developed 'alltomany.c' and
-
@wkliao Also note that because the BOX rearranger is used by default (so the PnetCDF library receives contiguous sub-array requests from SCORPIO), real E3SM I/O might not be able to reproduce this hanging issue.
-
Is there an issue for the hang? This would at least be an easy experiment to see if it makes a difference:
-
@ndkeen The hang is reproducible on Aurora but not on Perlmutter, and I have created a ticket (RITM0381937) for ALCF. Steps to reproduce the hang on Aurora:
[Download the reproduction code]
[Compile the program]
[Create a job script]
[Submit the job]
The hang occurs in the MPI_Waitall call. Changing FI_MR_CACHE_MONITOR from disabled to kdreg2 does not help either.
-
I am a developer of PnetCDF and created the E3SM-IO benchmark to study the
I/O performance of E3SM on DOE parallel computers. This benchmark has
also been used to help improve the design of Scorpio.
However, E3SM-IO currently includes only 3 cases, namely the F, G, and I cases.
All of them require an input file containing pre-generated data partitioning patterns,
i.e. file offset-length pairs per MPI process (such offset-length pairs are also
referred to as 'decomposition maps' in Scorpio). Relying on pre-generated
map files prevents testing other E3SM problem domain sizes.
I wonder if I can obtain some assistance from the E3SM team to help me understand
the data partitioning mechanism used in the E3SM code. My goal is to add a function to
E3SM-IO that can generate the decomposition maps at run time, given any
problem domain size.
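For illustration, here is a minimal sketch of what a run-time map generator could look like for the simplest possible pattern: a column-wise block partition of a row-major 2D array. The function name and the partitioning rule are assumptions for this sketch. It does not reproduce the decompositions E3SM actually uses, which is exactly what the question above asks about.

```python
def column_wise_map(nrows, ncols, nranks):
    """Generate per-rank (offset, length) pairs for a column-wise block
    partition of a row-major nrows x ncols array.

    Each rank owns a contiguous band of columns, so it gets one pair per
    row. Offsets and lengths are in array elements, not bytes.
    """
    base, extra = divmod(ncols, nranks)
    maps = []
    col_start = 0
    for rank in range(nranks):
        # The first `extra` ranks take one additional column.
        width = base + (1 if rank < extra else 0)
        if width > 0:
            pairs = [(row * ncols + col_start, width) for row in range(nrows)]
        else:
            pairs = []  # more ranks than columns: this rank owns nothing
        maps.append(pairs)
        col_start += width
    return maps


if __name__ == "__main__":
    for rank, pairs in enumerate(column_wise_map(2, 5, 2)):
        print(rank, pairs)
```

Because every rank gets `nrows` separate pairs, this pattern produces many small non-contiguous requests as the array grows, which makes it a convenient stress case even though it is synthetic.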