Gaea C6 support for UFSWM #2448

Open
wants to merge 29 commits into
base: develop

Conversation

BrianCurtis-NOAA
Collaborator

@BrianCurtis-NOAA BrianCurtis-NOAA commented Oct 2, 2024

Commit Queue Requirements:

  • Fill out all sections of this template.
  • All sub component pull requests have been reviewed by their code managers.
  • Run the full Intel+GNU RT suite (compared to current baselines) on either Hera/Derecho/Hercules
  • Commit 'test_changes.list' from previous step

Description:

This PR will bring in all changes necessary to provide Gaea C6 support for UFSWM

Commit Message:

* UFSWM - Gaea C6 Support

Priority:

  • Normal

Git Tracking

UFSWM:

Sub component Pull Requests:

  • None

UFSWM Blocking Dependencies:

  • Blocked by #
  • None

Changes

Regression Test Changes (Please commit test_changes.list):

  • No Baseline Changes. (just adds logs for Gaea C6)

Input data Changes:

  • None.

Library Changes/Upgrades:

  • No Updates

Testing Log:

  • RDHPCS
    • Hera
    • Orion
    • Hercules
    • Jet
    • Gaea
    • Derecho
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
  • opnReqTest (complete task if unnecessary)

@BrianCurtis-NOAA
Collaborator Author

cpld_control_p8 intel fails for timing out, so there's work to tweak the configs to better match the C6 hardware.

I think there are still lots of other items to check here; this is just a placeholder for now. Please feel free to send PRs to my fork/branch to add/adjust/fix any issues.

@BrianCurtis-NOAA
Collaborator Author

Also, once things start falling into place, we'll need to make sure intelllvm support is available for c6.

@RatkoVasic-NOAA
Collaborator

@BrianCurtis-NOAA, name change suggestion:

gaea -----> gaea-c5
gaeac6 ---> gaea-c6

@sanAkel

sanAkel commented Oct 4, 2024

@BrianCurtis-NOAA Shall I re-try building with these modulefiles/ufs_gaeac6.intel.lua in this PR?

tests/compile.sh Outdated
@@ -95,7 +98,7 @@ export SUITES
 set -ex
 
 # Valid applications
-if [[ ${MACHINE_ID} != gaea ]] || [[ ${RT_COMPILER} != intelllvm ]]; then # skip MOM6SOLO on gaea with intelllvm
+if [[ ${MACHINE_ID} != gaea-c5 && ${MACHINE_ID} != gaea-c6 ]] || [[ ${RT_COMPILER} != intelllvm ]]; then # skip MOM6SOLO on gaea with intelllvm
Collaborator

Why do we even need this logic here, adding or not adding -DMOM6SOLO=ON? As far as I know, we do not regression test MOM6SOLO. Can we remove this block of code entirely from this script?

Collaborator Author

Good question. It was added there for a reason, and I don't recall if we ever RT'd MOM6SOLO. @junwang-noaa do you recall what this block of code was used for?

Collaborator

If I remember correctly, this is to support standalone MOM testing. @jiandewang Do you know why MOM6 SOLO does not work on gaea?


Hi! I'm new to the UFS, but AFAIK, nobody seems to use -DMOM6SOLO=ON, though I would defer to @junwang-noaa.

@sanAkel sanAkel Oct 4, 2024

@junwang-noaa My understanding from @jiandewang is that he (and others) are no longer routinely testing MOM solo config; I have always built using instructions at MOM6-examples

Collaborator

If I remember correctly, this is to support standalone MOM testing. @jiandewang Do you know why MOM6 SOLO does not work on gaea?

It was added here many years ago and we never tried this SOLO on any platform. My understanding is with nuopc_cap it has to be coupled with something.

Collaborator

@junwang-noaa My understanding from @jiandewang is that he (and others) are no longer routinely testing MOM solo config; I have always built using instructions at MOM6-examples

Yes, we use MOM6-examples for standalone tests when some debugging is needed (to help GFDL narrow down the issue when a big PR of theirs is not working as expected in UWM).

Collaborator

So you do not use tests/compile.sh to build standalone test, is that correct?

Collaborator

So you do not use tests/compile.sh to build standalone test, is that correct?

correct

Collaborator

@DusanJovic-NOAA DusanJovic-NOAA Oct 4, 2024

Then we should remove it from compile.sh

@BrianCurtis-NOAA
Collaborator Author

cpld_control_p8 fails with:

  5: MPICH ERROR [Rank 5] [job id 207188364.0] [Fri Oct  4 13:33:08 2024] [c6n0210] - Abort(941244175) (rank 5 in comm 0): Fatal error in PMPI_Win_create: Other MPI error, error stack:
  5: PMPI_Win_create(294)......: MPI_Win_create(base=0x7ffe81f20fe0, size=0, disp_unit=1, MPI_INFO_NULL, comm=0xc4000060, win=0x7ffe81f2113c) failed
  5: MPID_Win_create(89).......:
  5: MPIDIG_mpi_win_create(872):
  5: win_allgather(246)........: OFI mr_enable failed (ofi_win.c:246:win_allgather:Address already in use)

and control_p8 runs to completion:

0: * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . 
  0: *****************RESOURCE STATISTICS*******************************
  0: The total amount of wall time                        = 853.216145
  0: The total amount of time in user mode                = 216.242551
  0: The total amount of time in sys mode                 = 410.041583
  0: The maximum resident set size (KB)                   = 1720560
  0: Number of page faults without I/O activity           = 131391
  0: Number of page faults with I/O activity              = 173
  0: Number of times filesystem performed INPUT           = 1024
  0: Number of times filesystem performed OUTPUT          = 0
  0: Number of Voluntary Context Switches                 = 16903
  0: Number of InVoluntary Context Switches               = 9006
  0: *****************END OF RESOURCE STATISTICS*************************

@BrianCurtis-NOAA
Collaborator Author

@DusanJovic-NOAA this look ok?:

diff --git a/tests/compile.sh b/tests/compile.sh
index 2c3c7796..26e3a788 100755
--- a/tests/compile.sh
+++ b/tests/compile.sh
@@ -97,17 +97,6 @@ SUITES=$(grep -Po "\-DCCPP_SUITES=\K[^ ]*" <<< "${MAKE_OPT}")
 export SUITES
 set -ex
 
-# Valid applications
-if [[ ${MACHINE_ID} != gaea-c5 && ${MACHINE_ID} != gaea-c6 ]] || [[ ${RT_COMPILER} != intelllvm ]]; then # skip MOM6SOLO on gaea with intelllvm
-  if [[ "${MAKE_OPT}" == *"-DAPP=S2S"* ]]; then
-      CMAKE_FLAGS+=" -DMOM6SOLO=ON"
-  fi
-
-  if [[ "${MAKE_OPT}" == *"-DAPP=NG-GODAS"* ]]; then
-      CMAKE_FLAGS+=" -DMOM6SOLO=ON"
-  fi
-fi
-
 CMAKE_FLAGS=$(set -e; trim "${CMAKE_FLAGS}")
 echo "CMAKE_FLAGS = ${CMAKE_FLAGS}"

@DusanJovic-NOAA
Collaborator

@DusanJovic-NOAA this look ok?:


Yes.

@ulmononian
Collaborator

ulmononian commented Oct 16, 2024

@BrianCurtis-NOAA @jkbk2004 @FernandoAndrade-NOAA i believe EPIC now has full access to the bil-fire8 project (disk space and compute resources). i was able to run a control_c48 test using this allocation in /gpfs/f6/bil-fire8/scratch/role.epic/ufs-wm_2448 with run_dir at /gpfs/f6/bil-fire8/scratch/role.epic/RT_RUNDIRS/role.epic/FV3_RT/rt_1552059, but i had to create new baselines since they are not yet staged on c6. seems like rocoto should be installed on c6 as well (@natalie-perlin).

@jkbk2004
Collaborator

@BrianCurtis-NOAA can you sync up branch? I think I am able to create baseline on c6: /gpfs/f6/bil-fire8/world-shared/role.epic/UFS-WM_RT/NEMSfv3gfs.

@jkbk2004
Collaborator

Continue to see failures with various cases.

atmaero_control_p8_intel failed in run_test
cpld_bmark_p8_intel failed in run_test
cpld_control_ciceC_p8_intel failed in run_test
cpld_control_p8_faster_intel failed in run_test
cpld_control_p8_intel failed in run_test
cpld_control_p8_mixedmode_intel failed in run_test
cpld_control_p8.v2.sfc_intel failed in run_test
cpld_debug_p8_intel failed in run_test
hafs_regional_storm_following_1nest_atm_ocn_wav_inline_intel failed in run_test
hafs_regional_storm_following_1nest_atm_ocn_wav_intel failed in run_test
hafs_regional_storm_following_1nest_atm_ocn_wav_mom6_intel failed in run_test
regional_atmaq_debug_intel failed in run_test

About 3 different behaviors and error messages:

- cpld_bmark_p8_intel:
 769: libfabric:2470915:1729115914::cxi:core:cxip_ux_onload_cb():2657<warn> c6n0025: RXC (0x8b2:1) PtlTE 495:[Fatal] LE resources not recovered during flow control. FI_CXI_RX_MATCH_MODE=[hybrid|software] is required
- hafs_regional_storm_following_1nest_atm_ocn_wav_inline_intel:
592: PE 592: MPICH WARNING: OFI is failing to make progress on posting a send. MPICH suspects a hang due to rendezvous message resource exhaustion. If running Slingshot 2.1 or later, setting environment variable FI_CXI_DEFAULT_TX_SIZE large enough to handle the maximum number of outstanding rendezvous messages per rank should prevent this scenario. [ If running on a Slingshot release prior to version 2.1, setting environment variable FI_CXI_RDZV_THRESHOLD to a larger value may circumvent this scenario by sending more messages via the eager path.]  OFI retry continuing...
592: 0: slurmstepd: error: *** STEP 207205202.0 ON c6n0220 CANCELLED AT 2024-10-16T17:59:54 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 207205202 ON c6n0220 CANCELLED AT 2024-10-16T17:59:54 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
192: forrtl: error (78): process killed (SIGTERM)
- regional_atmaq_debug_intel:
srun: error: c6n0014: tasks 0-191: Killed
srun: Terminating StepId=207205194.0
327: forrtl: error (78): process killed (SIGTERM)
327: Image              PC                Routine            Line        Source
327: libpthread-2.31.s  00007F643D216910  Unknown               Unknown  Unknown
327: libc-2.31.so       00007F643A43EB57  __sched_yield         Unknown  Unknown
327: libmpi_intel.so.1  00007F643BECB44F  Unknown               Unknown  Unknown
327: libmpi_intel.so.1  00007F643BF5C4B6  Unknown               Unknown  Unknown
327: libmpi_intel.so.1  00007F643A7DE41D  MPI_Bcast             Unknown  Unknown
- all other failed cases :
 16: MPICH ERROR [Rank 16] [job id 207205189.0] [Wed Oct 16 21:12:57 2024] [c6n0220] - Abort(1009925903) (rank 16 in comm 0): Fatal error in PMPI_Win_create: Other MPI error, error stack:
 16: PMPI_Win_create(294)................: MPI_Win_create(base=0x7ffce7fce7a0, size=0, disp_unit=1, MPI_INFO_NULL, comm=0xc400002a, win=0x7ffce7fce8fc) failed

@ulmononian @RatkoVasic-NOAA we need troubleshooting from lib side.

@aerorahul
Contributor

aerorahul commented Oct 17, 2024

@BrianCurtis-NOAA, name change suggestion:

gaea -----> gaea-c5
gaeac6 ---> gaea-c6

@RatkoVasic-NOAA @BrianCurtis-NOAA
Would using _ be amenable instead of -? As in gaea_c5 and gaea_c6.
Having no delimiter would be even better, as in gaeac5 and gaeac6. Most
MACHINE_IDs are formed that way, so it retains that consistency and makes operations such as cut -d. -f unambiguous.
Thanks for your consideration.
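
To illustrate the delimiter point, here is a minimal sketch. It assumes, hypothetically, that some script recovers the machine name from a module file name of the form ufs_<machine>.<compiler>.lua (the pattern of modulefiles/ufs_gaeac6.intel.lua mentioned earlier) by splitting on `_` and then on `.`. The helper name machine_from_modulefile is made up for this example and is not part of the UFSWM scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract <machine> from ufs_<machine>.<compiler>.lua
# by taking the 2nd "_"-separated field, then the 1st "."-separated field.
machine_from_modulefile() {
  local name="$1"
  printf '%s\n' "${name}" | cut -d_ -f2 | cut -d. -f1
}

machine_from_modulefile "ufs_gaeac6.intel.lua"   # prints gaeac6
machine_from_modulefile "ufs_gaea_c6.intel.lua"  # prints gaea (the "_c6" part is lost)
machine_from_modulefile "ufs_gaea-c6.intel.lua"  # prints gaea-c6
```

With an underscore inside the machine name, the split silently returns the wrong value. A hyphen survives this particular split, but it would fail in the same way in any script that splits on `-`, which is why a delimiter-free name is the safest of the three.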

@RatkoVasic-NOAA
Collaborator

@BrianCurtis-NOAA, name change suggestion:

gaea -----> gaea-c5
gaeac6 ---> gaea-c6

@RatkoVasic-NOAA @BrianCurtis-NOAA Would using _ be amenable instead of -? As in gaea_c5 and gaea_c6. Having no delimiter would be even better as in gaeac5 and gaeac6 Most MACHINE_ID's are formed that way, so it retains that consistency, and makes operations such as cut -d. -f unambiguous. Thanks for your consideration.

Any combination is OK, as long as they are the same length.

@ulmononian
Collaborator

@BrianCurtis-NOAA, name change suggestion:

gaea -----> gaea-c5
gaeac6 ---> gaea-c6

@RatkoVasic-NOAA @BrianCurtis-NOAA Would using _ be amenable instead of -? As in gaea_c5 and gaea_c6. Having no delimiter would be even better as in gaeac5 and gaeac6 Most MACHINE_ID's are formed that way, so it retains that consistency, and makes operations such as cut -d. -f unambiguous. Thanks for your consideration.

@MichaelLueken just fyi regarding c5/c6 naming conventions. i recall there was a desire to sync the srw ci/cd pipeline w/ certain gaea c5/c6 naming conventions.

@BrianCurtis-NOAA
Collaborator Author

@BrianCurtis-NOAA, name change suggestion:

gaea -----> gaea-c5
gaeac6 ---> gaea-c6

@RatkoVasic-NOAA @BrianCurtis-NOAA Would using _ be amenable instead of -? As in gaea_c5 and gaea_c6. Having no delimiter would be even better as in gaeac5 and gaeac6 Most MACHINE_ID's are formed that way, so it retains that consistency, and makes operations such as cut -d. -f unambiguous. Thanks for your consideration.

@MichaelLueken just fyi regarding c5/c6 naming conventions. i recall there was a desire to sync the srw ci/cd pipeline w/ certain gaea c5/c6 naming conventions.

I'll be going with gaeac6 and gaeac5, FYI. I'll make those changes at some point tomorrow.

@RatkoVasic-NOAA
Collaborator

RatkoVasic-NOAA commented Oct 17, 2024

@BrianCurtis-NOAA @ulmononian @jkbk2004
Since Gaea C5 and Gaea C6 are almost identical, I suggest you expand this PR to include changes to C5 as well.

Changes in rt.sh:
    export LD_PRELOAD=/usr/lib64/libstdc++.so.6
    module load PrgEnv-intel/8.5.0
    module load intel-classic/2023.2.0
    module load cray-mpich/8.1.28
    module load python/3.9.12
Change in ./modulefiles/ufs_gaea.intel.lua:
    stack_intel_ver=os.getenv("stack_intel_ver") or "2023.2.0"
    load(pathJoin("stack-intel", stack_intel_ver))
    stack_cray_mpich_ver=os.getenv("stack_cray_mpich_ver") or "8.1.28"
    load(pathJoin("stack-cray-mpich", stack_cray_mpich_ver))
Change in ./tests/run_test.sh:
-    module load stack-intel/2023.1.0 stack-cray-mpich/8.1.25
+    module load stack-intel/2023.2.0 stack-cray-mpich/8.1.28

Also adding in ./tests/fv3_conf/fv3_slurm.IN_gaea:
export FI_VERBS_PREFER_XRC=0

@ulmononian
Collaborator

Continue to see failures with various cases.



@ulmononian @RatkoVasic-NOAA we need troubleshooting from lib side.

please try what @RatkoVasic-NOAA has suggested in your job cards, before fv3.exe is run: export FI_VERBS_PREFER_XRC=0.

this is a known issue inherent to the c5 system. may also try for c6.

@RatkoVasic-NOAA
Collaborator

@jkbk2004 @BrianCurtis-NOAA
I just ran one of the tests that was failing on C6 (atmaero_control_p8_intel) and used export FI_VERBS_PREFER_XRC=0 in the job card. It passed on C5 (/gpfs/f5/epic/scratch/Ratko.Vasic/RT_RUNDIRS/Ratko.Vasic/FV3_RT/rt_3061724/atmaero_control_p8_intel/)
Can you try it on C6 as well?
It is related to the new system installation, and @ulmononian found the fix in the admins' notes.

@RatkoVasic-NOAA
Collaborator

@BrianCurtis-NOAA @jkbk2004 @ulmononian
All tests passed on Gaea C5:

/gpfs/f5/epic/scratch/Ratko.Vasic/WM-1.6.0/ufs-weather-model/tests
/gpfs/f5/epic/scratch/Ratko.Vasic/RT_RUNDIRS/Ratko.Vasic/FV3_RT/rt_432914
ECFLOW Tasks Remaining: 0/231
rt_utils.sh: ECFLOW tasks completed, cleaning up suite
rt.sh: Generating Regression Testing Log...

Performing Cleanup...
REGRESSION TEST RESULT: SUCCESS
******Regression Testing Script Completed******

If more work is needed on Gaea C6, I can make a PR now. Only 4 files needed changes, provided here.
Did you have time to try the same fix on C6?

@BrianCurtis-NOAA
Collaborator Author

Let me put all of this together and update this PR.

@RatkoVasic-NOAA RatkoVasic-NOAA marked this pull request as ready for review December 3, 2024 01:26
@RatkoVasic-NOAA
Collaborator

Ran full regression tests on Gaea-C6. Logs attached.

@DeniseWorthen
Collaborator

@RatkoVasic-NOAA What is the reason behind setting the TPN<192 for many of the cpld tests? Do the jobs otherwise hang, or fail or time-out? What exactly is the failure?

@natalie-perlin
Collaborator

natalie-perlin commented Dec 3, 2024

@DeniseWorthen - several tests fail with TPN=192, e.g., cpld_debug_p8_intel and cpld_control_p8, but other tests, such as control_c192_intel, do not. The error messages were not very diagnostic: some odd errors referring to MPI or library issues, as shown further below.
Unit tests for MPI, and specifically for MPI_Win, were successful and did not indicate any software problems.

Memory issues, on the other hand, are hard to diagnose but very common for memory-demanding tests. TPN=192 on Gaea-c6 corresponds to 2GB/core, which from previous experience is known to be insufficient for some tests. Lowering TPN gives each task more memory, which resolved the runtime errors.

Example of the runtime error in cpld_debug_p8_intel with TPN=192:

  9: PMPI_Win_create(294)......: MPI_Win_create(base=0x7ffd68b76660, size=0, disp_unit=1, MPI_INFO_NULL, comm=0xc4000060, win=0x7ffd68b767bc) failed
  9: MPID_Win_create(89).......:
  9: MPIDIG_mpi_win_create(872):
  9: win_allgather(246)........: OFI mr_enable failed (ofi_win.c:246:win_allgather:Address already in use)
  9:
  9: aborting job:
  9: Fatal error in PMPI_Win_create: Other MPI error, error stack:
  9: PMPI_Win_create(294)......: MPI_Win_create(base=0x7ffd68b76660, size=0, disp_unit=1, MPI_INFO_NULL, comm=0xc4000060, win=0x7ffd68b767bc) failed
  9: MPID_Win_create(89).......:
  9: MPIDIG_mpi_win_create(872):
  9: win_allgather(246)........: OFI mr_enable failed (ofi_win.c:246:win_allgather:Address already in use)

@DeniseWorthen
Collaborator

DeniseWorthen commented Dec 3, 2024

@natalie-perlin Are memory resources on Hera, where we also run these same tests w/o halving the TPN, so much more?

@jkbk2004
Collaborator

jkbk2004 commented Dec 3, 2024

C6 has 1.5 times as many cores per node as C5 (192 vs. 128), but the same memory per core. It looks reasonable to increase resources for tests:

The C5 compute nodes consist of [2x] 64 core AMD EPYC Zen 2 CPUs, with two hardware threads per physical core and 256 GiB of physical memory (2 GiB per core).
The C6 compute nodes consist of [2x] 96 core AMD EPYC Zen 4 CPUs, with two hardware threads per physical core and 384 GiB of physical memory (2 GiB per core).

@DeniseWorthen
Collaborator

@jkbk2004 That doesn't address the question of why such modification is not required on Hera. You're simply propagating changes that you already put in for C5 to C6.

@jkbk2004
Collaborator

jkbk2004 commented Dec 3, 2024

Why should we compare against hera? They are different machines. https://docs.rdhpcs.noaa.gov/systems/hera_user_guide.html#system-overview

@DeniseWorthen
Collaborator

DeniseWorthen commented Dec 3, 2024

@jkbk2004 Because, in Natalie's words, the "memory-demanding tests" are identical on both machines. The model is configured the same (i.e., domain, WGC etc). So why are we not facing memory issues on Hera but they're so severe on C6 that we can only use 1/2 the available nodes?

@natalie-perlin
Collaborator

natalie-perlin commented Dec 3, 2024

Hera has 40 cores/node and 96 GB, so the memory allocation is ~ 2.4GB/core.
Gaea c6 has 2GB/core (TPN=192, 384GB/node). Memory requirements for a job can sometimes be set with an SBATCH directive in the launch script ("job card"), but Slurm on Gaea-c6 is not configured to allow it.

A possible solution is to find a TPN < 192 that works for all the tests.

@DeniseWorthen
Collaborator

DeniseWorthen commented Dec 3, 2024

@natalie-perlin Thanks for addressing the issue I'm raising. So, per core, Gaea C6 has ~80% of the memory of Hera, is that right? But we're reducing the TPN by 50% (in most cases). Why, if the issue is memory use, aren't we using more like 154 TPN on C6 (80% of 192)? Do those jobs fail?

@junwang-noaa
Collaborator

To use comparable memory, maybe we can use 144 TPN on C6? That gives us 2.6GB/core.
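
The per-core figures quoted in this exchange can be checked with a little integer arithmetic. This is only a sketch: it takes the node sizes stated above (96 GB on Hera, 384 GB on C6), treats GB as GiB, assumes memory is shared evenly among tasks, and ignores whatever the operating system reserves. The function name mem_per_task_mib is made up for the example.

```shell
#!/usr/bin/env bash
# Memory available per MPI task, in MiB, if node memory is shared evenly.
# Usage: mem_per_task_mib <node_memory_GiB> <tasks_per_node>
mem_per_task_mib() {
  local node_gib="$1" tpn="$2"
  echo $(( node_gib * 1024 / tpn ))   # integer division, rounds down
}

echo "Hera, TPN=40:  $(mem_per_task_mib 96 40) MiB"    # 2457
echo "C6,   TPN=192: $(mem_per_task_mib 384 192) MiB"  # 2048
echo "C6,   TPN=144: $(mem_per_task_mib 384 144) MiB"  # 2730
echo "C6,   TPN=96:  $(mem_per_task_mib 384 96) MiB"   # 4096
```

Under these assumptions, TPN=144 on C6 gives slightly more memory per task than Hera at 40 tasks per node, while TPN=96 doubles the default C6 share.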

@natalie-perlin
Collaborator

@DeniseWorthen @junwang-noaa -
Yes, this layout could definitely be tested.

@RatkoVasic-NOAA
Collaborator

@DeniseWorthen @junwang-noaa @natalie-perlin @jkbk2004 sorry for late reply.
I already tested all combinations for several tests and chose the maximum TPN (minimum number of nodes) that worked. The default is set to 192 TPN, and several tests (exactly as on Derecho) needed fewer tasks per node; these are set in tests/tests/test-name. It depended on the total number of tasks needed for each test (i.e. 288, 384 and 864). In those cases we have TPN=144, TPN=128 and TPN=96.

@RatkoVasic-NOAA
Collaborator

@RatkoVasic-NOAA What is the reason behind setting the TPN<192 for many of the cpld tests? Do the jobs otherwise hang, or fail or time-out? What exactly is the failure?

@DeniseWorthen Yes, as @natalie-perlin said, some tests needed more memory per node. Some tests using 384 tasks were failing on 2 nodes (TPN=192) with the message that Natalie shared. On 3 nodes (TPN=128), the test hung while writing the gocart.inst_aod.20210322_1200z.nc4 file. Only with 4 nodes (TPN=96) did that test run without problems.
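
The node counts mentioned here follow from a ceiling division of the total task count by TPN. A minimal sketch, with a made-up function name (nodes_needed) and assuming every node is filled up to TPN tasks before the next one is used:

```shell
#!/usr/bin/env bash
# Number of nodes needed for a job: ceiling of total_tasks / tasks_per_node.
nodes_needed() {
  local tasks="$1" tpn="$2"
  echo $(( (tasks + tpn - 1) / tpn ))
}

# The 384-task case discussed above, at the three TPN values that were tried.
for tpn in 192 128 96; do
  echo "384 tasks, TPN=${tpn}: $(nodes_needed 384 "${tpn}") nodes"
done
```

This prints 2, 3 and 4 nodes respectively, matching the three configurations described in the comment.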

@DeniseWorthen
Collaborator

@RatkoVasic-NOAA Thanks for that additional information. The fact that the issue w/ TPN=128 is failing in writing the nc4 file from gocart should be documented I think. This puzzles me. What is going on w/ the file writing....is it really a memory issue or something else?

@RatkoVasic-NOAA
Collaborator

@RatkoVasic-NOAA Thanks for that additional information. The fact that the issue w/ TPN=128 is failing in writing the nc4 file from gocart should be documented I think. This puzzles me. What is going on w/ the file writing....is it really a memory issue or something else?

It was hard to know, because a hanging job leaves no log info. The last log line was something like "writing gocart.*.nc" and... nothing. The only thing I know is that increasing the number of nodes made the problem disappear. I want to point out that an almost identical setup already exists for Derecho, so whatever was happening there might be happening on gaea-c6.

@DeniseWorthen
Collaborator

DeniseWorthen commented Dec 3, 2024

@RatkoVasic-NOAA I believe the issue might be w/ ESMF-managed threading. I ran 3 test cases on C6 w/ this branch (cpld control, the c192 and the bmark). For the cpld_control, I switched off ESMF-managed threading, set tpn to 192 and used srun -n 200. The job ran fine.

/gpfs/f6/drsa-hurr1/proj-shared/Denise.Worthen/RT_RUNDIRS/Denise.Worthen/FV3_RT/rt_1011815/test2

@jkbk2004
Collaborator

jkbk2004 commented Dec 3, 2024

@DeniseWorthen Permission issue on /gpfs/f6/drsa-hurr1/proj-shared, /gpfs/f6/drsa-hurr1/world-shared is a good place to share.

@DeniseWorthen
Collaborator

I copied it here /gpfs/f6/drsa-hurr1/world-shared/scrub/Denise.Worthen/test2. But there were only three changes to job_card and ufs_configure.

Set globalResourceControl: false
set --ntasks-per-node=192
use srun --label -n 200 ./fv3.exe
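
As a rough sketch of the first of these three changes, the setting could be flipped with sed. This assumes the configure file contains a line of the form `globalResourceControl: true`; the helper name disable_managed_threading is hypothetical, and the example runs on a scratch file rather than a real run directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: switch ESMF-managed threading off in a configure file
# by rewriting "globalResourceControl: true" to "... false" in place.
disable_managed_threading() {
  local cfg="$1"
  sed -i.bak -E \
    's/^([[:space:]]*globalResourceControl:[[:space:]]*)true/\1false/' "${cfg}"
  rm -f "${cfg}.bak"
}

# Demonstrate on a scratch file.
scratch="$(mktemp)"
printf 'globalResourceControl: true\n' > "${scratch}"
disable_managed_threading "${scratch}"
cat "${scratch}"   # prints: globalResourceControl: false
rm -f "${scratch}"
```

The other two changes are edits to the job card itself: the --ntasks-per-node=192 setting and the srun --label -n 200 ./fv3.exe launch line, exactly as listed above.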

@RatkoVasic-NOAA
Collaborator

@DeniseWorthen what is your suggestion to change in this PR? Where do you set globalResourceControl? Is it going to affect other runs? ...

@DeniseWorthen
Collaborator

@RatkoVasic-NOAA We were just talking about this offline. I think the point I was trying to make is that calling it a memory issue distracts from the real issue, which is (as we've seen on other platforms) more than likely the current implementation of ESMF-managed threading.

The globalResourceControl is in the ufs-configure; true means to utilize ESMF managed threading. I know ESMF is hoping to have a fix for this soon, but for now, I suspect we will continue to use it in the RTs because switching away and then back would be disruptive. But I think it is important to understand why we're required to run w/ this kind of under-resourcing on C6 (and probably derecho).

@RatkoVasic-NOAA
Collaborator

@DeniseWorthen Thanks for the explanation!

@ulmononian
Collaborator

@jkbk2004 @RatkoVasic-NOAA @DeniseWorthen sounds like moving forward with the current c6 resource configuration @RatkoVasic-NOAA & @natalie-perlin have demonstrated as functional should work, as turning esmf threading on/off in the $UFS_CONFIGURE file would propagate inconsistencies if switched on/off before ESMF issues a permanent fix (as pointed out by @DeniseWorthen) and could impact other tests using that $UFS_CONFIGURE.

@jkbk2004
Collaborator

jkbk2004 commented Dec 3, 2024

@uturuncoglu FYI: a possible item regarding ESMF-managed threading

aerorahul pushed a commit to NOAA-EMC/gfs-utils that referenced this pull request Dec 4, 2024
After the recent Gaea-C5 OS upgrade, gfs_utils fails to build.
This corrects Gaea-C5 build, adds Gaea-C6 build capability (following
ufs-wx-model ufs-community/ufs-weather-model#2448),
and adds containerized build capability.

Refs NOAA-EMC/global-workflow#3011
Refs NOAA-EMC/global-workflow#3025
Resolve #86
---------

Co-authored-by: Mark A Potts
Development

Successfully merging this pull request may close these issues.

Enable ufs-weather-model on Gaea-C6