Update chicoma-cpu modules #112

Closed

Collaborator

@xylar xylar commented Oct 11, 2024

Following the recent DST, this merge updates the module files and environment variables on Chicoma-CPU. We note that these updates work well for gnu and nvidia compilers but not yet for intel, which we are continuing to work on. A separate update will be needed to address Chicoma-GPU as well.

@xylar
Collaborator Author

xylar commented Oct 11, 2024

This is just a draft for now. I've had no luck with either gnu or intel on Chicoma-CPU, and I haven't tried anything else yet.

@xylar
Collaborator Author

xylar commented Oct 11, 2024

I've contacted LANL IC about the trouble I'm having with gnu:

/lustre/scratch5/xylar/E3SM/scratch/chicoma-cpu/SMS_D.TL319_oQU240wLI_ais8to30.MPAS_LISIO_JRA1p5.chicoma-cpu_gnu.20241011_152619_r3zumf/bld/e3sm.exe: /opt/cray/pe/gcc-libs/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /lustre/scratch5/xylar/E3SM/scratch/chicoma-cpu/SMS_D.TL319_oQU240wLI_ais8to30.MPAS_LISIO_JRA1p5.chicoma-cpu_gnu.20241011_152619_r3zumf/bld/e3sm.exe)

While it seems clear that there's an RPATH being set to /opt/cray/pe/gcc-libs, I haven't been able to track down where that's coming from. Setting the LD_LIBRARY_PATH didn't help.
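
A note on why setting LD_LIBRARY_PATH may not have helped: the dynamic loader searches a DT_RPATH entry before LD_LIBRARY_PATH, but a DT_RUNPATH entry after it. The sketch below is one way to check which tag an executable carries. It assumes `readelf` from binutils is available, and `classify_dyn_paths` is a hypothetical helper name, not something from E3SM or CIME.

```shell
# Hypothetical helper: read `readelf -d` output on stdin and report which
# search-path tags are present and how they rank against LD_LIBRARY_PATH.
classify_dyn_paths() {
    awk '
        /\(RPATH\)/   { print "RPATH (searched before LD_LIBRARY_PATH):", $NF }
        /\(RUNPATH\)/ { print "RUNPATH (searched after LD_LIBRARY_PATH):", $NF }
    '
}

# Usage on a real build (the path to the executable is up to you):
#   readelf -d bld/e3sm.exe | classify_dyn_paths
# To see whether a given libstdc++ provides the required symbol version:
#   strings /opt/cray/pe/gcc-libs/libstdc++.so.6 | grep GLIBCXX_3.4.29
```

If the helper reports an RPATH entry pointing at /opt/cray/pe/gcc-libs, that would be consistent with LD_LIBRARY_PATH being ignored for libstdc++.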

@xylar
Collaborator Author

xylar commented Oct 11, 2024

On the intel side, it's not finding NetCDF-C or -Fortran, even though we're passing a NETCDF_PATH environment variable that seems correct.

@xylar
Collaborator Author

xylar commented Oct 14, 2024

The gnu issue seems similar to E3SM-Project#6677

@jonbob
Collaborator

jonbob commented Oct 17, 2024

With the commits I just pushed, I was able to successfully build and run:

  • SMS.TL319_oQU240wLI_ais8to30.MPAS_LISIO_JRA1p5.chicoma-cpu_gnu
  • SMS_D.TL319_oQU240wLI_ais8to30.MPAS_LISIO_JRA1p5.chicoma-cpu_gnu

So I think at this point we can say we support gnu on chicoma. I'll poke around at intel as well
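
For readers less familiar with CIME test names: the two tests above differ only in the `_D` modifier, which requests a debug build of the same SMS (smoke) test. A minimal sketch of how such a name breaks down follows. `parse_test_name` is a hypothetical helper written for illustration, and it assumes the four-part form `TESTTYPE[_MODIFIERS].GRID.COMPSET.MACHINE_COMPILER` with no test-mods suffix.

```shell
# Hypothetical helper: split a CIME test name of the form
#   TESTTYPE[_MODIFIERS].GRID.COMPSET.MACHINE_COMPILER
# into its parts. Test-mod suffixes are not handled in this sketch.
parse_test_name() {
    testtype=${1%%.*}      # e.g. SMS_D
    rest=${1#*.}
    grid=${rest%%.*}       # e.g. TL319_oQU240wLI_ais8to30
    rest=${rest#*.}
    compset=${rest%%.*}    # e.g. MPAS_LISIO_JRA1p5
    mach_comp=${rest#*.}   # e.g. chicoma-cpu_gnu

    # The _D modifier marks a debug build.
    case "$testtype" in
        *_D|*_D_*) debug=yes ;;
        *)         debug=no ;;
    esac

    printf 'type=%s debug=%s grid=%s compset=%s machine=%s compiler=%s\n' \
        "${testtype%%_*}" "$debug" "$grid" "$compset" \
        "${mach_comp%_*}" "${mach_comp##*_}"
}

parse_test_name SMS_D.TL319_oQU240wLI_ais8to30.MPAS_LISIO_JRA1p5.chicoma-cpu_gnu
```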

Comment on lines -4218 to +3940
-<CCSM_CPRNC>/usr/projects/climate/SHARED_CLIMATE/software/badger/cprnc</CCSM_CPRNC>
+<CCSM_CPRNC>/usr/projects/e3sm/software/chicoma-cpu/cprnc</CCSM_CPRNC>
Collaborator Author

Oh, I bet I know what happened here. I deleted this thinking that it was old and no longer used. In my defense, it has badger in the path...

Collaborator

And it no longer exists! Not just the machine but that file. But I think that if that line is missing, each test is forced to try to build cprnc itself.

Collaborator Author

> And it no longer exists!

That's what I was saying. I think I deleted it trying to free up space in /usr/projects/climate because I couldn't imagine we were still using software built for Badger.

Comment on lines +4266 to +3992
<command name="unload">PrgEnv-gnu</command>
<command name="unload">PrgEnv-intel</command>
<command name="unload">PrgEnv-nvidia</command>
<command name="unload">PrgEnv-cray</command>
<command name="unload">PrgEnv-aocc</command>
Collaborator Author

I found that these needed to be unloaded after their corresponding compiler modules or there would be an error about an undefined environment variable name.
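
The ordering constraint described here (compiler modules first, then the `PrgEnv-*` modules) can be sketched as a small shell helper. This is only an illustration: `prgenv_last` is a hypothetical function, and `gcc` and `intel` in the usage comment are placeholder compiler module names that do not come from this PR.

```shell
# Hypothetical helper: print the given module names one per line, with
# every PrgEnv-* entry moved after the non-PrgEnv entries.
prgenv_last() {
    # First pass: everything that is not a PrgEnv-* module, in the order given.
    for m in "$@"; do
        case "$m" in PrgEnv-*) ;; *) printf '%s\n' "$m" ;; esac
    done
    # Second pass: the PrgEnv-* modules, in the order given.
    for m in "$@"; do
        case "$m" in PrgEnv-*) printf '%s\n' "$m" ;; esac
    done
}

# Usage on a machine with Lmod ("gcc" and "intel" are placeholder compiler
# module names, not taken from the PR):
#   for m in $(prgenv_last PrgEnv-gnu gcc PrgEnv-intel intel); do
#       module unload "$m"
#   done
```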

@xylar xylar force-pushed the machine/update-chicoma-modules branch from df21910 to 9c0b308 Compare October 17, 2024 16:30
@xylar xylar changed the base branch from master to alternate October 17, 2024 16:32
@xylar xylar changed the base branch from alternate to master October 17, 2024 16:32
@xylar
Collaborator Author

xylar commented Oct 17, 2024

@jonbob, I'm trying to run a test:

./create_test SMS_D.TL319_oQU240wLI_ais8to30.MPAS_LISIO_JRA1p5.chicoma-cpu_gnu --walltime 00:30:00 --wait -p w23_freddy

This looks to be what you ran successfully. But for me it just seems to be hanging. It hasn't got to ocean time stepping yet and there's very little output in the e3sm log file.

Could you have a quick look and let me know if you see anything obvious?

/users/xylar/scratch5/E3SM/scratch/chicoma-cpu/SMS_D.TL319_oQU240wLI_ais8to30.MPAS_LISIO_JRA1p5.chicoma-cpu_gnu.20241017_103208_7tp6zr

@jonbob
Collaborator

jonbob commented Oct 17, 2024

@xylar -- let me take a peek

@jonbob
Collaborator

jonbob commented Oct 17, 2024

@xylar - it seems to be struggling with the atm data? That doesn't make much sense

@xylar
Collaborator Author

xylar commented Oct 17, 2024

In the meantime, I'm trying an optimized run to see how that goes.

@xylar
Collaborator Author

xylar commented Oct 17, 2024

SMS_D.TL319_oQU240wLI_ais8to30.MPAS_LISIO_JRA1p5.chicoma-cpu_gnu passed for me in the end. It just took 25 minutes and didn't get to time stepping for a long time. It seems like it might be a file system issue with /usr/projects/e3sm.

@xylar
Collaborator Author

xylar commented Oct 17, 2024

SMS.TL319_oQU240wLI_ais8to30.MPAS_LISIO_JRA1p5.chicoma-cpu_gnu passed for me as well.

@xylar
Collaborator Author

xylar commented Oct 17, 2024

I tested SMS_D.TL319_oQU240wLI_ais8to30.MPAS_LISIO_JRA1p5.chicoma-cpu_nvidia. It built fine and appeared to be running, but it hit the 30-minute walltime I gave it before finishing (same file system issues as above). I have a longer test waiting in the queue.

@xylar
Collaborator Author

xylar commented Oct 18, 2024

I realize it's not a high priority for us but SMS_D.TL319_oQU240wLI_ais8to30.MPAS_LISIO_JRA1p5.chicoma-cpu_nvidia passed for me with a longer job runtime.

Comment on lines +4002 to +4003
<command name="load">PrgEnv-nvidia/8.5.0</command>
<command name="load">nvidia/24.7</command>
Collaborator Author

I successfully tested these updated modules as well.

This should be set to:
```
export GNU_CRAY_LDFLAGS="-Wl,--enable-new-dtags"
```
on Chicoma-CPU with gcc.
@@ -396,11 +396,11 @@ gnu-cray:
 "FFLAGS_OPT = -O3 -m64 -ffree-line-length-none -fconvert=big-endian -ffree-form -ffpe-summary=none $${EXTRA_FFLAGS}" \
 "CFLAGS_OPT = -O3 -m64" \
 "CXXFLAGS_OPT = -O3 -m64" \
-"LDFLAGS_OPT = -O3 -m64" \
+"LDFLAGS_OPT = -O3 -m64 $(GNU_CRAY_LDFLAGS)" \
Collaborator Author

@matthewhoffman, this environment variable (or argument to make) needs to be set to:

export GNU_CRAY_LDFLAGS="-Wl,--enable-new-dtags"

on Chicoma for now. I'll make sure Compass and Polaris do this. If someone is building for Chicoma outside of Compass or Polaris (good luck!), they would need to set this manually.

Are you okay with this fix? I don't want to put anything into the Makefile that tries to detect the machine or anything crazy like that.

Collaborator

@xylar , this seems like the best solution given the circumstances
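
For context on the flag discussed in this thread: `--enable-new-dtags` tells GNU ld to record the runtime search path as DT_RUNPATH rather than DT_RPATH, so LD_LIBRARY_PATH is consulted first at run time. Below is a rough sketch of how a build wrapper might set the variable per machine, keeping machine detection out of the Makefile as preferred above. `gnu_cray_ldflags_for` is a hypothetical function, not code from Compass or Polaris.

```shell
# Hypothetical helper: print the extra linker flags a build wrapper could
# export for a given machine name. Only chicoma-cpu gets the flag here.
gnu_cray_ldflags_for() {
    case "$1" in
        chicoma-cpu) printf '%s\n' '-Wl,--enable-new-dtags' ;;
        *)           printf '\n' ;;
    esac
}

# Export the variable so that $(GNU_CRAY_LDFLAGS) in the Makefile picks it
# up from the environment when the gnu-cray target is built.
GNU_CRAY_LDFLAGS="$(gnu_cray_ldflags_for chicoma-cpu)"
export GNU_CRAY_LDFLAGS
echo "GNU_CRAY_LDFLAGS=$GNU_CRAY_LDFLAGS"
```

The same value could instead be passed as an argument to make (`GNU_CRAY_LDFLAGS="-Wl,--enable-new-dtags"`), which is the other option mentioned in the comment above.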

@xylar xylar mentioned this pull request Oct 21, 2024
35 tasks
@xylar xylar force-pushed the machine/update-chicoma-modules branch from 870b287 to d379003 Compare October 21, 2024 03:44
@xylar
Collaborator Author

xylar commented Oct 21, 2024

@jonbob, at the risk of delaying this further, I think we probably want to follow what Noel is doing on Perlmutter:
https://github.com/E3SM-Project/E3SM/pull/6702/files
That should at least save us from having to make yet another PR in the near future.

@xylar xylar force-pushed the machine/update-chicoma-modules branch from 871a4b7 to b33833c Compare October 21, 2024 17:51
This is to match proposed updates to Perlmutter-CPU
E3SM-Project#6702
@xylar xylar force-pushed the machine/update-chicoma-modules branch from b33833c to c34336e Compare October 21, 2024 18:35
Comment on lines 3973 to 3975
<command name="unload">cray-parallel-netcdf</command>
<command name="unload">cray-netcdf</command>
<command name="unload">cray-hdf5</command>
Collaborator Author

I tried removing cpe like Noel did here:
https://github.com/E3SM-Project/E3SM/blob/bdcc2f551cfae2fca53bd8aa4ec604601ddf1c68/cime_config/machines/config_machines.xml#L193
But I got a nasty error like:

        Lmod has detected the following error: These module(s) or extension(s) exist
        but cannot be loaded as requested: "git", "cmake/3.27.7"
           Try: "module spider git cmake/3.27.7" to see how to load the module(s).

@xylar
Collaborator Author

xylar commented Oct 21, 2024

I have gnu and nvidia tests in the queue with the latest updates.

@xylar
Collaborator Author

xylar commented Oct 21, 2024

The following both passed:

SMS_D.TL319_oQU240wLI_ais8to30.MPAS_LISIO_JRA1p5.chicoma-cpu_gnu
SMS_D.TL319_oQU240wLI_ais8to30.MPAS_LISIO_JRA1p5.chicoma-cpu_nvidia

@xylar
Collaborator Author

xylar commented Oct 22, 2024

Closed in favor of E3SM-Project#6705
