
On PM-CPU, WaveWatchIII text input file takes too long to read in. #6670

Open
erinethomas opened this issue Oct 7, 2024 · 12 comments
Labels: input file, inputdata, pm-cpu, Wave

Comments

@erinethomas
Contributor

erinethomas commented Oct 7, 2024

WW3 requires two large text files to be read in during the wave model initialization (the unresolved obstacles files).
These files are stored on global CFS with the rest of the E3SM data:
/global/cfs/cdirs/e3sm/inputdata/wav/ww3/obstructions_local.wQU225Icos30E3r5sp36x36.rtd.in (size = 348M)
/global/cfs/cdirs/e3sm/inputdata/wav/ww3/obstructions_shadow.wQU225Icos30E3r5sp36x36.rtd.in (size = 626M)

These files take too long to be read in (short tests often fail with errors because the wall-clock limit runs out). For example, I have run the following tests:

  1. A fully coupled E3SMv3+WW3 test on 8 nodes: the model initialization time is 2722 seconds (~45 minutes).
  2. A fully coupled E3SMv3+WW3 test on 4 nodes: the model initialization time is 1365 seconds.

These initialization times are WAY too long. They also suggest the problem scales with the number of nodes (twice as many nodes take about twice as much time to initialize the model).

I have found two possible workarounds, which suggest the issue is reading the big files from the global CFS directory:

  1. Copying the files to my local scratch directory: running on 8 nodes, this reduces the init time to 180 seconds (about the same time observed on other machines, such as Chrysalis).
  2. Changing DIN_LOC_ROOT to equal "/dvs_ro/cfs/cdirs/e3sm/inputdata": running on 8 nodes, this reduces the init time to 160 seconds (again, similar to the time observed on other machines).
@erinethomas added the Wave, input file, inputdata, and pm-cpu labels on Oct 7, 2024
@erinethomas
Contributor Author

@ndkeen @mahf708 - new issue on the large time needed on PM-CPU for reading in files by WW3 - conversation/suggestions on this issue are welcome.

@erinethomas changed the title from "On PM-CPU, WaveWatchIII input file takes too long to read in." to "On PM-CPU, WaveWatchIII text input file takes too long to read in." on Oct 7, 2024
@rljacob
Member

rljacob commented Oct 7, 2024

Why aren't these in netCDF format? You can never read a text file that large fast.

@ndkeen
Contributor

ndkeen commented Oct 7, 2024

First, as a sanity check of transfer speeds from CFS, CFS with dvs_ro, and scratch, I tried the simple experiment below to show that they are all "about the same" in this scenario. I think dvs_ro is generally only faster with smaller file sizes, but... as to Rob's point, since these are text files, they are likely NOT being read with a parallel method.

If possible, it is better to use a different file format with supported mechanisms to read in parallel. But if reading text "manually", you for sure don't want every MPI rank reading the entire file at the same time. I don't yet know if that's happening here. Yes, this will be slower (how much slower really depends), but more importantly, it's error-prone and can cause filesystem problems (especially with increasing MPI ranks -- including other jobs trying to read the same file). You could put together a quick patch to have rank 0 read the file and use MPI_Bcast() to communicate the data to other ranks.
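
A minimal sketch of that rank-0-read + MPI_Bcast pattern (illustrative only, not taken from the WW3 source; the C wiring, file handling, and file name here are assumptions):

/* Rank 0 reads the whole text file into a buffer, then broadcasts the size
 * and contents; all other ranks parse the in-memory buffer instead of
 * touching the filesystem. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Hypothetical path; the real file lives under DIN_LOC_ROOT/wav/ww3/ */
    const char *fname = "obstructions_local.wQU225Icos30E3r5sp36x36.rtd.in";
    char *buf = NULL;
    long nbytes = 0;

    if (rank == 0) {
        /* Only rank 0 touches the filesystem. */
        FILE *fp = fopen(fname, "rb");
        if (fp == NULL)
            MPI_Abort(MPI_COMM_WORLD, 1);
        fseek(fp, 0, SEEK_END);
        nbytes = ftell(fp);
        rewind(fp);
        buf = malloc(nbytes);
        if (fread(buf, 1, nbytes, fp) != (size_t)nbytes)
            MPI_Abort(MPI_COMM_WORLD, 1);
        fclose(fp);
    }

    /* Every rank learns the size, allocates, and receives the contents;
     * even the ~626M file still fits within MPI_Bcast's int count. */
    MPI_Bcast(&nbytes, 1, MPI_LONG, 0, MPI_COMM_WORLD);
    if (rank != 0)
        buf = malloc(nbytes);
    MPI_Bcast(buf, (int)nbytes, MPI_CHAR, 0, MPI_COMM_WORLD);

    /* ... each rank now parses buf in memory instead of re-reading the file ... */

    free(buf);
    MPI_Finalize();
    return 0;
}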

Time (in seconds) to copy each file from X location to scratch on perlmutter:

           CFS    dvs_ro CFS   scratch
ob local   0.15   0.16         0.12
ob shadow  0.28   0.30         0.22

perlmutter-login18% pwd
/global/homes/n/ndk/tmp

rm ob*in

perlmutter-login18% time cp /global/cfs/cdirs/e3sm/inputdata/wav/ww3/obstructions_local.wQU225Icos30E3r5sp36x36.rtd.in .
0.000u 0.149s 0:00.25 56.0% 0pf+0w

perlmutter-login18% time cp /global/cfs/cdirs/e3sm/inputdata/wav/ww3/obstructions_shadow.wQU225Icos30E3r5sp36x36.rtd.in .
0.000u 0.284s 0:00.48 58.3% 0pf+0w

rm ob*in

perlmutter-login18% time cp /dvs_ro/cfs/cdirs/e3sm/inputdata/wav/ww3/obstructions_local.wQU225Icos30E3r5sp36x36.rtd.in .
0.000u 0.158s 0:00.24 62.5% 0pf+0w

perlmutter-login18% time cp /dvs_ro/cfs/cdirs/e3sm/inputdata/wav/ww3/obstructions_shadow.wQU225Icos30E3r5sp36x36.rtd.in .
0.000u 0.298s 0:00.45 64.4% 0pf+0w


rm ob*in

perlmutter-login18% time cp /pscratch/sd/n/ndk/inputdata/wav/ww3/obstructions_local.wQU225Icos30E3r5sp36x36.rtd.in .
0.000u 0.124s 0:00.22 54.5% 0pf+0w

perlmutter-login18% time cp /pscratch/sd/n/ndk/inputdata/wav/ww3/obstructions_shadow.wQU225Icos30E3r5sp36x36.rtd.in .
0.000u 0.215s 0:00.38 55.2% 0pf+0w

@sarats
Member

sarats commented Oct 8, 2024

As reading an ASCII or small file seems like a recurring pattern, I would suggest we put a read-and-broadcast operation in Scorpio and have everyone use it rather than everyone implementing this in their sub-model. Of course, it's trivial to do this right, but one reusable routine is better for maintenance. cc @jayeshkrishna

The best path is for any input files that need to be read in parallel to be in netCDF.

@rljacob
Member

rljacob commented Oct 8, 2024

The better place for that would be E3SM/share/util. SCORPIO should remain focused on large-scale parallel reads/writes.

@erinethomas
Contributor Author

@ndkeen - I'm pretty sure WW3 IS, in fact, reading the entire file with each task. Not good.

@philipwjones
Contributor

@erinethomas Just for clarification - is this occurring in source under our control? i.e. Does this occur within the WWIII source? Or are these reads taking place within MPAS for use in WWIII?

@erinethomas
Contributor Author

> @erinethomas Just for clarification - is this occurring in source under our control? i.e. Does this occur within the WWIII source? Or are these reads taking place within MPAS for use in WWIII?

This is happening within the WW3 source (not in MPAS) - we have a fork of WW3 source code for use within E3SM (as a submodule) that we have full control over and can modify to suit our needs.

@philipwjones
Contributor

So it seems like the most appropriate solution (besides changing the file location) is to modify the WWIII source. If this is reading a table or set of values that are shared by all tasks, we should do a read from master and broadcast. If the values are meant to be distributed (i.e., each task needs a subset of values), we should do proper parallel I/O. Let us know if you need help - the broadcast is relatively easy, but the parallel I/O is a bit more involved.
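
For the distributed case, a rough sketch of what proper parallel I/O could look like with MPI-IO, under the (hypothetical) assumption that the data were stored in a fixed-width binary layout rather than formatted text, so each rank can read just its own contiguous slice; the file name is illustrative:

/* Each rank opens the file collectively, computes its own byte range,
 * and reads only that slice with a collective MPI-IO read. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Hypothetical binary version of the obstruction data. */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "obstructions_local.bin",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    MPI_Offset fsize;
    MPI_File_get_size(fh, &fsize);

    /* Split the file into nearly equal contiguous chunks, one per rank;
     * the last rank picks up the remainder. */
    MPI_Offset chunk  = fsize / nranks;
    MPI_Offset offset = (MPI_Offset)rank * chunk;
    MPI_Offset count  = (rank == nranks - 1) ? (fsize - offset) : chunk;

    char *buf = malloc((size_t)count);
    MPI_File_read_at_all(fh, offset, buf, (int)count, MPI_CHAR,
                         MPI_STATUS_IGNORE);

    /* ... each rank now holds only its own slice of the data ... */

    free(buf);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}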

@ndkeen
Contributor

ndkeen commented Oct 9, 2024

A few comments:
a) Looks like you are on the right track, and we all agree that, at least for production cases, we don't want each MPI rank reading the same file serially.
b) It's not clear to me if this is actually what is slowing you down. I think you said the init as a whole is faster with scratch (or using dvs_ro), but that can include other things besides reading these 2 files. I can look at your cases to learn more and/or try to reproduce.
c) Yes, I have been communicating with NERSC about using scratch (Lustre) space to experiment with as a location for inputdata. It would be non-purged, but there are still some other details to work out (like unix groups -- I was hoping to avoid the concept of collab accounts -- ideally we want it to behave exactly the same way as CFS does for us now), and then start experimenting. I already have my own personal copy of oft-used inputs in my scratch space /pscratch/sd/n/ndk/inputdata that I have been occasionally experimenting with for quite a while. I've found that it sometimes helps, sometimes does not -- so it can depend on what we are doing. I was actually trying to steer us toward using /dvs_ro with CFS first, but that may not be worth it as a global solution. I do think the huge files we are starting to add (like those for 3 km SCREAM runs) may well be better read from scratch in whatever way we think is best. Could we have the concept of small vs large files in inputdata? I will explore the option of inputdata on shared scratch more with NERSC and can make a different issue for that discussion.

@ndkeen
Contributor

ndkeen commented Oct 17, 2024

Can you please post a way to reproduce this issue?

@erinethomas
Contributor Author

erinethomas commented Oct 24, 2024

It will be easiest to reproduce after the ICOS mesh for WaveWatchIII PR is finalized (#6706). I will post as soon as that is complete.
In the meantime, this test run on Perlmutter, /pscratch/sd/e/ethomas/E3SMv3/ICOStest, shows the very large initialization time. It is not obvious from a successful run, however, that a large amount of that init time is taken by WW3, since WW3 doesn't output timing stats in the log file...
