add aurora machine to e3sm #6117

Merged: 7 commits, Dec 23, 2023

Contributor

@xyuan commented Dec 11, 2023

Add ALCF Aurora to E3SM machines.

[BFB]

github-actions bot commented Dec 11, 2023

PR Preview Action v1.4.6
🚀 Deployed preview to https://E3SM-Project.github.io/E3SM/pr-preview/pr-6117/
on branch gh-pages at 2023-12-21 21:12 UTC

@rljacob requested review from oksanaguba and removed request for amametjanov December 11, 2023 19:31
<SAVE_TIMING_DIR>/lus/gecko/CSC249ADSE15_CNDA/performance_archive</SAVE_TIMING_DIR>
<SAVE_TIMING_DIR_PROJECTS>.*</SAVE_TIMING_DIR_PROJECTS>
<CIME_OUTPUT_ROOT>/lus/gecko/projects/CSC249ADSE15_CNDA/$USER/scratch</CIME_OUTPUT_ROOT>
<DIN_LOC_ROOT>/lus/gecko/projects/CSC249ADSE15_CNDA/inputdata</DIN_LOC_ROOT>
Member

What project is "CSC249ADSE15_CNDA"?

Contributor Author

This is the ECP Aurora early-access project; it is valid through Apr. 2024.

@amametjanov added the Machine Files, BFB (PR leaves answers BFB), and Aurora labels Dec 11, 2023
@oksanaguba
Contributor

@xyuan, does this setup build and run on Aurora, for both GPU and CPU setups?

Also update to CMake-friendly vars and rm diag queue.
amametjanov added a commit that referenced this pull request Dec 21, 2023
Add ALCF Aurora to E3SM machines.

[BFB]
amametjanov added a commit that referenced this pull request Dec 23, 2023
Re-merge to next to update PATH on Aurora
@amametjanov
Member

Pushed a lot of updates to the branch to get e3sm_integration mostly passing: 100 out of 122 PASS.
CDash: https://my.cdash.org/viewTest.php?buildid=2460744

  • 19 debug-mode failures also occur on Sunspot; a fix is expected in a future OneAPI module update.
  • 3 threading hangs: I'll look into these in future work.

@amametjanov merged commit 9a65cea into master Dec 23, 2023
3 checks passed
@amametjanov deleted the xyuan/e3sm_aurora branch December 23, 2023 02:10
<env name="ONEAPI_DEVICE_SELECTOR">level_zero:gpu</env>
<env name="ONEAPI_MPICH_GPU">NO_GPU</env>
<env name="MPIR_CVAR_ENABLE_GPU">0</env>
<env name="romio_cb_read">disable</env>
Member

Are these flags (romio_cb_*) still causing issues on Aurora?

Comment on lines +3161 to +3162
<env name="FI_CXI_DEFAULT_CQ_SIZE">131072</env>
<env name="FI_CXI_CQ_FILL_PERCENT">20</env>
Member

It would be good to add context for these flags, explaining why we had to steer away from the defaults for these Slingshot network variables.

<env name="romio_cb_write">disable</env>
<env name="SYCL_CACHE_PERSISTENT">1</env>
<env name="GATOR_INITIAL_MB">4000MB</env>
<env name="GATOR_DISABLE">0</env>
Member

Isn't GATOR_DISABLE 0 by default? Perhaps this was set as 1 while debugging?

</environment_variables>
<environment_variables compiler="oneapi-ifxgpu">
<env name="ONEAPI_DEVICE_SELECTOR">level_zero:gpu</env>
<env name="ONEAPI_MPICH_GPU">NO_GPU</env>
Member

Is this disabling GPU-to-GPU MPI?

@oksanaguba restored the xyuan/e3sm_aurora branch January 29, 2024 20:28
<env name="NETCDF_FORTRAN_PATH">/lus/gecko/projects/CSC249ADSE15_CNDA/software/netcdf-fortran/4.6.1/oneapi.eng.2023.05.15.007</env>
<env name="PNETCDF_PATH">/lus/gecko/projects/CSC249ADSE15_CNDA/software/pnetcdf/1.12.3/oneapi.eng.2023.05.15.007</env>
<env name="LD_LIBRARY_PATH">/lus/gecko/projects/CSC249ADSE15_CNDA/software/pnetcdf/1.12.3/oneapi.eng.2023.05.15.007/lib:/lus/gecko/projects/CSC249ADSE15_CNDA/software/netcdf-fortran/4.6.1/oneapi.eng.2023.05.15.007/lib:/lus/gecko/projects/CSC249ADSE15_CNDA/software/netcdf-c/4.9.2/oneapi.eng.2023.05.15.007/lib:$ENV{LD_LIBRARY_PATH}</env>
<env name="PATH">/lus/gecko/projects/CSC249ADSE15_CNDA/software/pnetcdf/1.12.3/oneapi.eng.2023.05.15.007/bin:/lus/gecko/projects/CSC249ADSE15_CNDA/software/netcdf-fortran/4.6.1/oneapi.eng.2023.05.15.007/bin:/lus/gecko/projects/CSC249ADSE15_CNDA/software/netcdf-c/4.9.2/oneapi.eng.2023.05.15.007/bin:$ENV{PATH}</env>
Contributor

Do these blocks ("modules", "env variables", ...) get executed in the order they appear in the file? If so, does it make sense to append env variables before modules are loaded?

It may be a user error, but I am in a situation where a module is loaded and presumably modifies PATH, and then, I think, the PATH setting above "erases" that module's path, perhaps because the $ENV{PATH} value it uses is from before the module was loaded.
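
The concern above can be sketched with a toy model. This is not CIME's actual code: `expand_env_refs`, `load_module`, and the shortened paths are hypothetical names chosen for illustration, and whether `$ENV{PATH}` really gets expanded before modules are loaded is exactly the open question. The sketch only shows that if the expansion happens first, the module's PATH entry is lost, whereas expanding after the load keeps it.

```python
import re


def expand_env_refs(value, env):
    """Replace each $ENV{NAME} in value with env's current NAME (empty if unset)."""
    return re.sub(r"\$ENV\{(\w+)\}", lambda m: env.get(m.group(1), ""), value)


def load_module(env, module_bin):
    """Toy stand-in for 'module load': prepend the module's bin directory to PATH."""
    env["PATH"] = module_bin + ":" + env["PATH"]


# Hypothetical, shortened version of the PATH entry from the machine file.
PATH_SETTING = "/sw/pnetcdf/bin:$ENV{PATH}"

# Case 1 (suspected problem): the setting is expanded BEFORE the module load,
# and the stale result is applied afterwards, overwriting the module's change.
env = {"PATH": "/usr/bin"}
stale_value = expand_env_refs(PATH_SETTING, env)
load_module(env, "/opt/some-module/bin")
env["PATH"] = stale_value
early = env["PATH"]

# Case 2: the setting is expanded AFTER the module load, so the module's
# directory is still part of $ENV{PATH} when it is substituted.
env = {"PATH": "/usr/bin"}
load_module(env, "/opt/some-module/bin")
env["PATH"] = expand_env_refs(PATH_SETTING, env)
late = env["PATH"]

print("expanded before module load:", early)
print("expanded after module load: ", late)
```

Here `early` comes out without `/opt/some-module/bin`, while `late` keeps it, which would match the symptom described if the expansion order is indeed the cause.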

Labels
Aurora BFB PR leaves answers BFB Machine Files