
On PM-CPU, WaveWatchIII text input file takes too long to read in. #6670

Open
erinethomas opened this issue Oct 7, 2024 · 12 comments
Labels: input file, inputdata, pm-cpu, Wave

Comments

@erinethomas
Contributor

erinethomas commented Oct 7, 2024

WW3 requires two large text files to be read in during the wave model initialization (the unresolved obstacles files).
These files are stored on global CFS with the rest of the E3SM data:
/global/cfs/cdirs/e3sm/inputdata/wav/ww3/obstructions_local.wQU225Icos30E3r5sp36x36.rtd.in (size = 348M)
/global/cfs/cdirs/e3sm/inputdata/wav/ww3/obstructions_shadow.wQU225Icos30E3r5sp36x36.rtd.in (size = 626M)

These files take too long to be read in (short tests often fail with errors because the wall-clock limit runs out). For example, I have run the following tests:

  1. A fully coupled E3SMv3+WW3 test on 8 nodes: the model initialization time is 2722 seconds (~45 minutes).
  2. A fully coupled E3SMv3+WW3 test on 4 nodes: the model initialization time is 1365 seconds.

These initialization times are WAY too long. They also suggest the problem scales with the number of nodes (twice as many nodes take about twice as much time to initialize the model).

I have found two possible workarounds, which suggest the issue is reading the big files from the global CFS directory:

  1. Copying the files to my local scratch directory: running on 8 nodes, this reduces the init time to 180 seconds (about the same time observed on other machines, such as Chrysalis).
  2. Changing DIN_LOC_ROOT to equal "/dvs_ro/cfs/cdirs/e3sm/inputdata": running on 8 nodes, this reduces the init time to 160 seconds (again, similar to the time observed on other machines).
@erinethomas added the Wave, input file, inputdata, and pm-cpu labels on Oct 7, 2024
@erinethomas
Contributor Author

@ndkeen @mahf708 - new issue on the large time needed on PM-CPU for reading in files by WW3 - conversation/suggestions on this issue are welcome.

@erinethomas changed the title from "On PM-CPU, WaveWatchIII input file takes too long to read in." to "On PM-CPU, WaveWatchIII text input file takes too long to read in." on Oct 7, 2024
@rljacob
Member

rljacob commented Oct 7, 2024

Why aren't these in netCDF format? You can never read a text file that large fast.

@ndkeen
Contributor

ndkeen commented Oct 7, 2024

First, as a sanity check of transfer speeds from CFS, CFS with dvs_ro, and scratch, I tried the simple experiment below to show that they are all "about the same" in this scenario. I think dvs_ro is generally only faster with smaller file sizes, but... as to Rob's point, since these are text files, they are likely NOT being read with a parallel method.

If possible, it is better to use a different file format with supported mechanisms to read in parallel. But if reading text "manually", you for sure don't want every MPI rank reading the entire file at the same time. I don't yet know if that's happening here. Yes, this will be slower (how much slower really depends), but more importantly, it's error-prone and can cause filesystem problems (especially with increasing MPI ranks -- including other jobs trying to read the same file). You could put together a quick patch to have rank 0 read the file and use MPI_Bcast() to communicate the data to other ranks.
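
A minimal sketch of that rank-0-read + MPI_Bcast pattern (illustrative only, not taken from the WW3 source; the C wiring, file handling, and file name here are assumptions):

/* Rank 0 reads the whole text file into a buffer, then broadcasts the size
 * and contents; all other ranks parse the in-memory buffer instead of
 * touching the filesystem. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Hypothetical path; the real file lives under DIN_LOC_ROOT/wav/ww3/ */
    const char *fname = "obstructions_local.wQU225Icos30E3r5sp36x36.rtd.in";
    char *buf = NULL;
    long nbytes = 0;

    if (rank == 0) {
        /* Only rank 0 touches the filesystem. */
        FILE *fp = fopen(fname, "rb");
        if (fp == NULL)
            MPI_Abort(MPI_COMM_WORLD, 1);
        fseek(fp, 0, SEEK_END);
        nbytes = ftell(fp);
        rewind(fp);
        buf = malloc(nbytes);
        if (fread(buf, 1, nbytes, fp) != (size_t)nbytes)
            MPI_Abort(MPI_COMM_WORLD, 1);
        fclose(fp);
    }

    /* Every rank learns the size, allocates, and receives the contents;
     * even the ~626M file still fits within MPI_Bcast's int count. */
    MPI_Bcast(&nbytes, 1, MPI_LONG, 0, MPI_COMM_WORLD);
    if (rank != 0)
        buf = malloc(nbytes);
    MPI_Bcast(buf, (int)nbytes, MPI_CHAR, 0, MPI_COMM_WORLD);

    /* ... each rank now parses buf in memory instead of re-reading the file ... */

    free(buf);
    MPI_Finalize();
    return 0;
}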

Time (in seconds) to copy each file from X location to scratch on perlmutter:

           CFS    dvs_ro CFS   scratch
ob local   0.15   0.16         0.12
ob shadow  0.28   0.30         0.22

perlmutter-login18% pwd
/global/homes/n/ndk/tmp

rm ob*in

perlmutter-login18% time cp /global/cfs/cdirs/e3sm/inputdata/wav/ww3/obstructions_local.wQU225Icos30E3r5sp36x36.rtd.in .
0.000u 0.149s 0:00.25 56.0% 0pf+0w

perlmutter-login18% time cp /global/cfs/cdirs/e3sm/inputdata/wav/ww3/obstructions_shadow.wQU225Icos30E3r5sp36x36.rtd.in .
0.000u 0.284s 0:00.48 58.3% 0pf+0w

rm ob*in

perlmutter-login18% time cp /dvs_ro/cfs/cdirs/e3sm/inputdata/wav/ww3/obstructions_local.wQU225Icos30E3r5sp36x36.rtd.in .
0.000u 0.158s 0:00.24 62.5% 0pf+0w

perlmutter-login18% time cp /dvs_ro/cfs/cdirs/e3sm/inputdata/wav/ww3/obstructions_shadow.wQU225Icos30E3r5sp36x36.rtd.in .
0.000u 0.298s 0:00.45 64.4% 0pf+0w


rm ob*in

perlmutter-login18% time cp /pscratch/sd/n/ndk/inputdata/wav/ww3/obstructions_local.wQU225Icos30E3r5sp36x36.rtd.in .
0.000u 0.124s 0:00.22 54.5% 0pf+0w

perlmutter-login18% time cp /pscratch/sd/n/ndk/inputdata/wav/ww3/obstructions_shadow.wQU225Icos30E3r5sp36x36.rtd.in .
0.000u 0.215s 0:00.38 55.2% 0pf+0w

@sarats
Member

sarats commented Oct 8, 2024

As reading an ASCII or small file seems like a recurring pattern, I would suggest we put a read-and-broadcast operation in Scorpio and have everyone use it rather than everyone implementing this in their sub-model. Of course, it's trivial to do this right, but one reusable routine is better for maintenance. cc @jayeshkrishna

The best path is for any input files that need to be read in parallel to be in netCDF.

@rljacob
Member

rljacob commented Oct 8, 2024

The better place for that would be E3SM/share/util. SCORPIO should remain focused on large-scale parallel reads/writes.

@erinethomas
Contributor Author

@ndkeen - I'm pretty sure WW3 IS, in fact, reading the entire file with each task. Not good.

@philipwjones
Contributor

@erinethomas Just for clarification - is this occurring in source under our control? i.e. Does this occur within the WWIII source? Or are these reads taking place within MPAS for use in WWIII?

@erinethomas
Contributor Author

> @erinethomas Just for clarification - is this occurring in source under our control? i.e. Does this occur within the WWIII source? Or are these reads taking place within MPAS for use in WWIII?

This is happening within the WW3 source (not in MPAS) - we have a fork of WW3 source code for use within E3SM (as a submodule) that we have full control over and can modify to suit our needs.

@philipwjones
Contributor

So it seems like the most appropriate solution (besides changing the file location) is to modify the WWIII source. If this is reading a table or set of values that are shared by all tasks, we should do a read from master and broadcast. If the values are meant to be distributed (i.e., each task needs a subset of values), we should do proper parallel I/O. Let us know if you need help - the broadcast is relatively easy, but the parallel I/O is a bit more involved.
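
For the distributed case, a rough sketch of what proper parallel I/O could look like with MPI-IO, under the (hypothetical) assumption that the data were stored in a fixed-width binary layout rather than formatted text, so each rank can read just its own contiguous slice; the file name is illustrative:

/* Each rank opens the file collectively, computes its own byte range,
 * and reads only that slice with a collective MPI-IO read. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Hypothetical binary version of the obstruction data. */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "obstructions_local.bin",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    MPI_Offset fsize;
    MPI_File_get_size(fh, &fsize);

    /* Split the file into nearly equal contiguous chunks, one per rank;
     * the last rank picks up the remainder. */
    MPI_Offset chunk  = fsize / nranks;
    MPI_Offset offset = (MPI_Offset)rank * chunk;
    MPI_Offset count  = (rank == nranks - 1) ? (fsize - offset) : chunk;

    char *buf = malloc((size_t)count);
    MPI_File_read_at_all(fh, offset, buf, (int)count, MPI_CHAR,
                         MPI_STATUS_IGNORE);

    /* ... each rank now holds only its own slice of the data ... */

    free(buf);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}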

@ndkeen
Contributor

ndkeen commented Oct 9, 2024

A few comments:
a) Looks like you are on the right track, and we all agree that, at least for production cases, we don't want each MPI rank reading the same file serially.
b) It's not clear to me if this is actually what is slowing you down. I think you said the init as a whole is faster with scratch (or using dvs_ro), but that can include other things besides reading these 2 files. I can look at your cases to learn more and/or try to reproduce.
c) Yes, I have been communicating with NERSC about using scratch (Lustre) space to experiment with as a location for inputdata. It would be non-purged, but there are still some other details to work out (like unix groups -- I was hoping to avoid the concept of collab accounts -- ideally we want it to behave exactly the same way as CFS does for us now), and then start experimenting. I already have my own personal copy of oft-used inputs in my scratch space /pscratch/sd/n/ndk/inputdata that I have been occasionally experimenting with for quite a while. I've found that it sometimes helps, sometimes does not -- so it can depend on what we are doing. I was actually trying to steer us toward using /dvs_ro with CFS first, but that may not be worth it as a global solution. I do think the huge files we are starting to add (like those for 3 km SCREAM runs) may well be better read from scratch in whatever way we think is best. Could we have the concept of small vs large files in inputdata? I will explore the option of inputdata on shared scratch more with NERSC and can make a different issue for that discussion.

@ndkeen
Contributor

ndkeen commented Oct 17, 2024

Can you please post a way to reproduce this issue?

@erinethomas
Contributor Author

erinethomas commented Oct 24, 2024

It will be easiest to reproduce after the ICOS mesh for WaveWatchIII PR is finalized (#6706). I will post as soon as that is complete.
In the meantime, this test run on Perlmutter, /pscratch/sd/e/ethomas/E3SMv3/ICOStest, shows the very large initialization time. It is not obvious from a successful run, however, that a large amount of that init time is taken by WW3, since WW3 doesn't output timing stats in the log file...
