module changes after perlmutter downtime - maint-2.1 #6155

Open
nanr opened this issue Jan 18, 2024 · 7 comments

nanr commented Jan 18, 2024

I am having trouble submitting jobs on pm using maint-2.1. The problem started today, notably after the machine downtime yesterday.

I followed the error codes to load upgraded modules, but I'm not able to figure out how to get past this error:
v21.LR.BSMYLE.1995-11.001/case_scripts.014> module --ignore-cache load "cray-netcdf-hdf5parallel/4.9.0.7"
Lmod has detected the following error: The following module(s) are unknown: "cray-netcdf-hdf5parallel/4.9.0.7"

Here are my env_mach_specific.xml settings:

  <command name="load">craype</command>
  <command name="load">cray-libsci</command>
  <command name="load">cray-mpich/8.1.28</command>
  <command name="load">cray-hdf5-parallel/1.12.2.9</command>
  <command name="load">cray-netcdf-hdf5parallel/4.9.0.7</command>
  <command name="load">cray-parallel-netcdf/1.12.3.9</command>
  <command name="load">cmake/3.22.0</command>

I also added this directly to my env_mach_specific.xml file:

PrgEnv-intel/8.5.0
intel/2023.2.0

Thanks in advance for any ideas!

Contributor

ndkeen commented Jan 18, 2024

OK, looks like I need to update the branches. E3SM master does have module versions that will work if you want to copy those for now.

Author

nanr commented Jan 18, 2024

Thanks! Can you point me to where I can find a list of the working module versions?

Thank you!

Contributor

ndkeen commented Jan 18, 2024

Actually, it looks like that branch already had updated modules. I think you just have not pulled recently enough.

With a fresh clone of maint-2.1, you should see:

      <modules>
        <command name="load">craype-accel-host</command>
        <command name="load">craype/2.7.20</command>
        <command name="load">cray-mpich/8.1.25</command>
        <command name="load">cray-hdf5-parallel/1.12.2.3</command>
        <command name="load">cray-netcdf-hdf5parallel/4.9.0.3</command>
        <command name="load">cray-parallel-netcdf/1.12.3.3</command>
        <command name="load">cmake/3.24.3</command>
      </modules>

I'm still going to make a change to this maint branch and others to update PE layouts.

Author

nanr commented Jan 18, 2024

I had to make these module updates in order to do a case.setup:

<modules>
  <command name="load">craype-accel-host</command>
  <command name="load">craype/2.7.20</command>
  <command name="load">cray-mpich/8.1.28</command>
  <command name="load">cray-hdf5-parallel/1.12.2.9</command>
  <command name="load">cray-netcdf-hdf5parallel/4.9.0.3</command>
  <command name="load">cray-parallel-netcdf/1.12.3.9</command>
  <command name="load">cmake/3.24.3</command>
</modules>

But I'm still getting this error:

ERROR: module command /usr/share/lmod/lmod/libexec/lmod python load craype-accel-host craype/2.7.20 cray-mpich/8.1.28 cray-hdf5-parallel/1.12.2.9 cray-netcdf-hdf5parallel/4.9.0.3 cray-parallel-netcdf/1.12.3.9 cmake/3.24.3 failed with message:
Lmod has detected the following error: These module(s) or extension(s) exist but cannot be loaded as requested:
"cray-netcdf-hdf5parallel/4.9.0.3"
Try: "module spider cray-netcdf-hdf5parallel/4.9.0.3" to see how to load the module(s).
v21.LR.BSMYLE.1995-11.001/case_scripts.014> module avail cray-netcdf-hdf5parallel
No module(s) or extension(s) found!
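
For reference, the two Lmod messages quoted in this thread appear to be different failures. As far as I understand Lmod, "unknown" means no module with that name/version is known at all, while "exist but cannot be loaded as requested" means the module exists somewhere in the hierarchy but is not reachable from the currently loaded environment, which is why Lmod suggests `module spider`. Below is a minimal Python sketch for telling the two apart in a log. `classify_lmod_error` is a hypothetical helper (not part of Lmod or CIME), and it only recognizes the two message strings quoted above.

```python
import re

# Message fragments copied from the two errors quoted in this thread.
UNKNOWN = "The following module(s) are unknown:"
NOT_LOADABLE = "exist but cannot be loaded as requested:"


def classify_lmod_error(message):
    """Return (kind, modules) for an Lmod error message.

    kind is "unknown", "not_loadable", or "other" (hypothetical labels).
    """
    if UNKNOWN in message:
        kind, tail = "unknown", message.split(UNKNOWN, 1)[1]
    elif NOT_LOADABLE in message:
        kind, tail = "not_loadable", message.split(NOT_LOADABLE, 1)[1]
    else:
        return "other", []
    # Module names are double-quoted; ignore the quoted 'Try: "module spider ..."' hint.
    tail = tail.split("Try:", 1)[0]
    return kind, re.findall(r'"([^"]+)"', tail)


if __name__ == "__main__":
    first = ('Lmod has detected the following error: The following module(s) '
             'are unknown: "cray-netcdf-hdf5parallel/4.9.0.7"')
    print(classify_lmod_error(first))
```

A "not_loadable" result suggests checking what is already loaded (compiler, PrgEnv) rather than just trying another version number.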

Contributor

ndkeen commented Jan 19, 2024

You may have made other changes. Can you try checking out a fresh clone of maint-2.1 and building a test there? If that works, it means you just have some differences between your config_machines.xml and the one in the repo.
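
One way to spot such differences is to compare the `<command name="load">` entries directly. Here is a rough Python sketch, assuming you paste in a single `<modules>` block from each file. The helper names `load_commands` and `diff_modules` are made up for illustration, and the real config_machines.xml nests several `<modules>` blocks per machine and compiler, which this does not handle. The two blocks below are the ones posted earlier in this thread.

```python
import xml.etree.ElementTree as ET


def load_commands(xml_text):
    """Map module name -> version (None if unversioned) for each load command."""
    root = ET.fromstring(xml_text)
    modules = {}
    for cmd in root.iter("command"):
        if cmd.get("name") == "load" and cmd.text:
            name, _, version = cmd.text.strip().partition("/")
            modules[name] = version or None
    return modules


def diff_modules(repo_xml, local_xml):
    """Return {module: (repo_version, local_version)} for entries that differ."""
    repo, local = load_commands(repo_xml), load_commands(local_xml)
    return {
        name: (repo.get(name), local.get(name))
        for name in sorted(set(repo) | set(local))
        if repo.get(name) != local.get(name)
    }


# Block from a fresh clone of maint-2.1, as posted above.
REPO = """<modules>
  <command name="load">craype-accel-host</command>
  <command name="load">craype/2.7.20</command>
  <command name="load">cray-mpich/8.1.25</command>
  <command name="load">cray-hdf5-parallel/1.12.2.3</command>
  <command name="load">cray-netcdf-hdf5parallel/4.9.0.3</command>
  <command name="load">cray-parallel-netcdf/1.12.3.3</command>
  <command name="load">cmake/3.24.3</command>
</modules>"""

# Locally edited block, as posted above.
LOCAL = """<modules>
  <command name="load">craype-accel-host</command>
  <command name="load">craype/2.7.20</command>
  <command name="load">cray-mpich/8.1.28</command>
  <command name="load">cray-hdf5-parallel/1.12.2.9</command>
  <command name="load">cray-netcdf-hdf5parallel/4.9.0.3</command>
  <command name="load">cray-parallel-netcdf/1.12.3.9</command>
  <command name="load">cmake/3.24.3</command>
</modules>"""

if __name__ == "__main__":
    for name, (repo_v, local_v) in diff_modules(REPO, LOCAL).items():
        print(f"{name}: repo={repo_v} local={local_v}")
```

This only reports version mismatches in the load list; it says nothing about which versions are actually installed on the machine.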

Contributor

ndkeen commented Jan 19, 2024

Note I just merged the following PR to maint-2.1, but it should have no impact here as there are no needed module version changes.

#6158

Contributor

ndkeen commented Dec 9, 2024

I don't think this is still an issue?
