Gaea C6 support for UFSWM #2448

BrianCurtis-NOAA · 2024-10-02T17:48:48Z

Commit Queue Requirements:

Fill out all sections of this template.
All sub component pull requests have been reviewed by their code managers.
Run the full Intel+GNU RT suite (compared to current baselines) on either Hera/Derecho/Hercules
Commit 'test_changes.list' from previous step

Description:

This PR brings in all changes necessary to provide Gaea C6 support for UFSWM.

Commit Message:

* UFSWM - Gaea C6 Support

Priority:

Normal

Git Tracking

UFSWM:

Closes Enable ufs-weather-model on Gaea-C6 #2407
None

Sub component Pull Requests:

None

UFSWM Blocking Dependencies:

Blocked by #
None

Changes

Regression Test Changes (Please commit test_changes.list):

No Baseline Changes. (just adds logs for Gaea C6)

Input data Changes:

None.

Library Changes/Upgrades:

No Updates

Testing Log:

BrianCurtis-NOAA · 2024-10-02T17:53:59Z

cpld_control_p8 with intel fails by timing out, so the configs still need tweaking to better match the C6 hardware.

I think there are still lots of other items to check here; this is just a placeholder for now. Please feel free to send PRs to my fork/branch to add, adjust, or fix anything.

BrianCurtis-NOAA · 2024-10-02T17:55:49Z

Also, once things start falling into place, we'll need to make sure intelllvm support is available for c6.

RatkoVasic-NOAA · 2024-10-04T00:50:55Z

@BrianCurtis-NOAA, name change suggestion:

gaea -----> gaea-c5
gaeac6 ---> gaea-c6

sanAkel · 2024-10-04T01:01:09Z

@BrianCurtis-NOAA Shall I retry building with the modulefiles/ufs_gaeac6.intel.lua in this PR?

DusanJovic-NOAA · 2024-10-04T13:43:02Z

tests/compile.sh

@@ -95,7 +98,7 @@ export SUITES
 set -ex

 # Valid applications
-if [[ ${MACHINE_ID} != gaea ]] || [[ ${RT_COMPILER} != intelllvm ]]; then # skip MOM6SOLO on gaea with intelllvm
+if [[ ${MACHINE_ID} != gaea-c5 && ${MACHINE_ID} != gaea-c6 ]] || [[ ${RT_COMPILER} != intelllvm ]]; then # skip MOM6SOLO on gaea with intelllvm


Why do we even need this logic here, adding or not adding -DMOM6SOLO=ON? As far as I know, we do not regression test MOM6SOLO. Can we remove this block of code entirely from this script?

Good question. It was added there for a reason, and I don't recall if we ever RT'd MOM6SOLO. @junwang-noaa do you recall what this block of code was used for?

If I remember correctly, this is to support standalone MOM testing. @jiandewang Do you know why MOM6 SOLO does not work on gaea?

Hi! I'm new to the UFS, but AFAIK, nobody seems to use -DMOM6SOLO=ON, though I would defer to @junwang-noaa.

@junwang-noaa My understanding from @jiandewang is that he (and others) are no longer routinely testing MOM solo config; I have always built using instructions at MOM6-examples

It was added here many years ago and we never tried this SOLO on any platform. My understanding is with nuopc_cap it has to be coupled with something.

Yes, we use MOM6-examples for standalone tests when debug work is needed (to help GFDL narrow down the issue when one of their big PRs is not working as expected in UWM).

So you do not use tests/compile.sh to build standalone test, is that correct?

correct

Then we should remove it from compile.sh

BrianCurtis-NOAA · 2024-10-04T14:12:49Z

cpld_control_p8 fails with:

  5: MPICH ERROR [Rank 5] [job id 207188364.0] [Fri Oct  4 13:33:08 2024] [c6n0210] - Abort(941244175) (rank 5 in comm 0): Fatal error in PMPI_Win_create: Other MPI error, error stack:
  5: PMPI_Win_create(294)......: MPI_Win_create(base=0x7ffe81f20fe0, size=0, disp_unit=1, MPI_INFO_NULL, comm=0xc4000060, win=0x7ffe81f2113c) failed
  5: MPID_Win_create(89).......:
  5: MPIDIG_mpi_win_create(872):
  5: win_allgather(246)........: OFI mr_enable failed (ofi_win.c:246:win_allgather:Address already in use)

and control_p8 runs to completion:

0: * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . 
  0: *****************RESOURCE STATISTICS*******************************
  0: The total amount of wall time                        = 853.216145
  0: The total amount of time in user mode                = 216.242551
  0: The total amount of time in sys mode                 = 410.041583
  0: The maximum resident set size (KB)                   = 1720560
  0: Number of page faults without I/O activity           = 131391
  0: Number of page faults with I/O activity              = 173
  0: Number of times filesystem performed INPUT           = 1024
  0: Number of times filesystem performed OUTPUT          = 0
  0: Number of Voluntary Context Switches                 = 16903
  0: Number of InVoluntary Context Switches               = 9006
  0: *****************END OF RESOURCE STATISTICS*************************

BrianCurtis-NOAA · 2024-10-04T14:51:32Z

@DusanJovic-NOAA this look ok?:

diff --git a/tests/compile.sh b/tests/compile.sh
index 2c3c7796..26e3a788 100755
--- a/tests/compile.sh
+++ b/tests/compile.sh
@@ -97,17 +97,6 @@ SUITES=$(grep -Po "\-DCCPP_SUITES=\K[^ ]*" <<< "${MAKE_OPT}")
 export SUITES
 set -ex
 
-# Valid applications
-if [[ ${MACHINE_ID} != gaea-c5 && ${MACHINE_ID} != gaea-c6 ]] || [[ ${RT_COMPILER} != intelllvm ]]; then # skip MOM6SOLO on gaea with intelllvm
-  if [[ "${MAKE_OPT}" == *"-DAPP=S2S"* ]]; then
-      CMAKE_FLAGS+=" -DMOM6SOLO=ON"
-  fi
-
-  if [[ "${MAKE_OPT}" == *"-DAPP=NG-GODAS"* ]]; then
-      CMAKE_FLAGS+=" -DMOM6SOLO=ON"
-  fi
-fi
-
 CMAKE_FLAGS=$(set -e; trim "${CMAKE_FLAGS}")
 echo "CMAKE_FLAGS = ${CMAKE_FLAGS}"

DusanJovic-NOAA · 2024-10-04T15:03:16Z

Yes.

ulmononian · 2024-10-16T17:21:29Z

@BrianCurtis-NOAA @jkbk2004 @FernandoAndrade-NOAA i believe EPIC now has full access to the bil-fire8 project (disk space and compute resources). i was able to run a control_c48 test using this allocation in /gpfs/f6/bil-fire8/scratch/role.epic/ufs-wm_2448 with run_dir at /gpfs/f6/bil-fire8/scratch/role.epic/RT_RUNDIRS/role.epic/FV3_RT/rt_1552059, but i had to create new baselines since they are not yet staged on c6. seems like rocoto should be installed on c6 as well (@natalie-perlin).

jkbk2004 · 2024-10-16T19:59:59Z

@BrianCurtis-NOAA can you sync up branch? I think I am able to create baseline on c6: /gpfs/f6/bil-fire8/world-shared/role.epic/UFS-WM_RT/NEMSfv3gfs.

jkbk2004 · 2024-10-17T13:00:25Z

Continue to see failures with various cases.

atmaero_control_p8_intel failed in run_test
cpld_bmark_p8_intel failed in run_test
cpld_control_ciceC_p8_intel failed in run_test
cpld_control_p8_faster_intel failed in run_test
cpld_control_p8_intel failed in run_test
cpld_control_p8_mixedmode_intel failed in run_test
cpld_control_p8.v2.sfc_intel failed in run_test
cpld_debug_p8_intel failed in run_test
hafs_regional_storm_following_1nest_atm_ocn_wav_inline_intel failed in run_test
hafs_regional_storm_following_1nest_atm_ocn_wav_intel failed in run_test
hafs_regional_storm_following_1nest_atm_ocn_wav_mom6_intel failed in run_test
regional_atmaq_debug_intel failed in run_test

About 3 different behaviors and error messages:

- cpld_bmark_p8_intel:
 769: libfabric:2470915:1729115914::cxi:core:cxip_ux_onload_cb():2657<warn> c6n0025: RXC (0x8b2:1) PtlTE 495:[Fatal] LE resources not recovered during flow control. FI_CXI_RX_MATCH_MODE=[hybrid|software] is required
- hafs_regional_storm_following_1nest_atm_ocn_wav_inline_intel:
592: PE 592: MPICH WARNING: OFI is failing to make progress on posting a send. MPICH suspects a hang due to rendezvous message resource exhaustion. If running Slingshot 2.1 or later, setting environment variable FI_CXI_DEFAULT_TX_SIZE large enough to handle the maximum number of outstanding rendezvous messages per rank should prevent this scenario. [ If running on a Slingshot release prior to version 2.1, setting environment variable FI_CXI_RDZV_THRESHOLD to a larger value may circumvent this scenario by sending more messages via the eager path.]  OFI retry continuing...
592: 0: slurmstepd: error: *** STEP 207205202.0 ON c6n0220 CANCELLED AT 2024-10-16T17:59:54 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 207205202 ON c6n0220 CANCELLED AT 2024-10-16T17:59:54 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
192: forrtl: error (78): process killed (SIGTERM)
- regional_atmaq_debug_intel:
srun: error: c6n0014: tasks 0-191: Killed
srun: Terminating StepId=207205194.0
327: forrtl: error (78): process killed (SIGTERM)
327: Image              PC                Routine            Line        Source
327: libpthread-2.31.s  00007F643D216910  Unknown               Unknown  Unknown
327: libc-2.31.so       00007F643A43EB57  __sched_yield         Unknown  Unknown
327: libmpi_intel.so.1  00007F643BECB44F  Unknown               Unknown  Unknown
327: libmpi_intel.so.1  00007F643BF5C4B6  Unknown               Unknown  Unknown
327: libmpi_intel.so.1  00007F643A7DE41D  MPI_Bcast             Unknown  Unknown
- all other failed cases :
 16: MPICH ERROR [Rank 16] [job id 207205189.0] [Wed Oct 16 21:12:57 2024] [c6n0220] - Abort(1009925903) (rank 16 in comm 0): Fatal error in PMPI_Win_create: Other MPI error, error stack:
 16: PMPI_Win_create(294)................: MPI_Win_create(base=0x7ffce7fce7a0, size=0, disp_unit=1, MPI_INFO_NULL, comm=0xc400002a, win=0x7ffce7fce8fc) failed

@ulmononian @RatkoVasic-NOAA we need troubleshooting from lib side.

aerorahul · 2024-10-17T14:10:36Z

@RatkoVasic-NOAA @BrianCurtis-NOAA
Would using _ instead of - be amenable? As in gaea_c5 and gaea_c6.
Having no delimiter would be even better, as in gaeac5 and gaeac6. Most MACHINE_IDs are formed that way, so it retains that consistency and makes operations such as cut -d. -f unambiguous.
Thanks for your consideration.
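
To illustrate the parsing concern, here is a minimal sketch. It assumes modulefile names follow the pattern ufs_<machine>.<compiler>.lua (as in modulefiles/ufs_gaeac6.intel.lua from this PR); the helper name machine_from_modulefile is hypothetical and not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: recover the machine name from a modulefile name of the
# form ufs_<machine>.<compiler>.lua (pattern assumed from ufs_gaeac6.intel.lua).
machine_from_modulefile() {
  local base
  base=$(basename "$1" .lua)             # e.g. ufs_gaeac6.intel
  base=${base%%.*}                       # e.g. ufs_gaeac6 (compiler stripped)
  printf '%s\n' "${base}" | cut -d_ -f2  # second "_"-separated field
}

machine_from_modulefile modulefiles/ufs_gaeac6.intel.lua    # prints gaeac6
machine_from_modulefile modulefiles/ufs_gaea_c6.intel.lua   # prints gaea (c6 is lost)
```

With an extra "_" inside the machine name, a field-based split silently drops the "c6" part, which is the kind of ambiguity a delimiter-free MACHINE_ID avoids.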

RatkoVasic-NOAA · 2024-10-17T14:55:10Z

Any combination is OK, as long as they are the same length.

ulmononian · 2024-10-17T16:26:28Z

@MichaelLueken just fyi regarding c5/c6 naming conventions. i recall there was a desire to sync the srw ci/cd pipeline w/ certain gaea c5/c6 naming conventions.

BrianCurtis-NOAA · 2024-10-17T18:00:26Z

I'll be going with gaeac6 and gaeac5, FYI. I'll make those changes at some point tomorrow.

RatkoVasic-NOAA · 2024-10-17T18:29:48Z

@BrianCurtis-NOAA @ulmononian @jkbk2004
Since Gaea C5 and Gaea C6 are almost identical, I suggest you expand this PR to include changes to C5 as well.

Changes in rt.sh:
    export LD_PRELOAD=/usr/lib64/libstdc++.so.6
    module load PrgEnv-intel/8.5.0
    module load intel-classic/2023.2.0
    module load cray-mpich/8.1.28
    module load python/3.9.12
Change in ./modulefiles/ufs_gaea.intel.lua:
    stack_intel_ver=os.getenv("stack_intel_ver") or "2023.2.0"
    load(pathJoin("stack-intel", stack_intel_ver))
    stack_cray_mpich_ver=os.getenv("stack_cray_mpich_ver") or "8.1.28"
    load(pathJoin("stack-cray-mpich", stack_cray_mpich_ver))
Change in ./tests/run_test.sh:
-    module load stack-intel/2023.1.0 stack-cray-mpich/8.1.25
+    module load stack-intel/2023.2.0 stack-cray-mpich/8.1.28

Also adding in ./tests/fv3_conf/fv3_slurm.IN_gaea:
export FI_VERBS_PREFER_XRC=0

ulmononian · 2024-10-17T18:49:08Z

please try what @RatkoVasic-NOAA has suggested in your job cards, before fv3.exe is run: export FI_VERBS_PREFER_XRC=0.

this is a known issue inherent to the c5 system. may also try for c6.
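
As a rough sketch of where that line would sit in a job card: only the export line and the srun form come from this thread; the #SBATCH values and the helper name render_job_card are hypothetical placeholders, not the contents of fv3_slurm.IN_gaea.

```shell
#!/usr/bin/env bash
# Hypothetical generator for a minimal Slurm job card. Only the export line and
# the srun form appear in the thread; the #SBATCH values are placeholders.
render_job_card() {
  local ntasks=$1 tpn=$2
  cat <<EOF
#!/bin/bash
#SBATCH --ntasks=${ntasks}
#SBATCH --ntasks-per-node=${tpn}

# Workaround suggested in the thread: set before the executable starts.
export FI_VERBS_PREFER_XRC=0

srun --label -n ${ntasks} ./fv3.exe
EOF
}

render_job_card 200 192
```

The point of the sketch is ordering: the variable has to be exported in the batch script's environment before srun launches fv3.exe, so that every rank inherits it.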

RatkoVasic-NOAA · 2024-10-17T22:09:19Z

@jkbk2004 @BrianCurtis-NOAA
I just ran one of the tests that was failing on C6 (atmaero_control_p8_intel) and used export FI_VERBS_PREFER_XRC=0 in the job card. It passed on C5 (/gpfs/f5/epic/scratch/Ratko.Vasic/RT_RUNDIRS/Ratko.Vasic/FV3_RT/rt_3061724/atmaero_control_p8_intel/)
Can you try it on C6 as well?
It was caused by the new system installation, and @ulmononian found the fix in the admins' notes.

RatkoVasic-NOAA · 2024-10-18T03:34:15Z

@BrianCurtis-NOAA @jkbk2004 @ulmononian
All tests passed on Gaea C5:

/gpfs/f5/epic/scratch/Ratko.Vasic/WM-1.6.0/ufs-weather-model/tests
/gpfs/f5/epic/scratch/Ratko.Vasic/RT_RUNDIRS/Ratko.Vasic/FV3_RT/rt_432914

ECFLOW Tasks Remaining: 0/231
rt_utils.sh: ECFLOW tasks completed, cleaning up suite
rt.sh: Generating Regression Testing Log...

Performing Cleanup...
REGRESSION TEST RESULT: SUCCESS
******Regression Testing Script Completed******

If more work is needed on Gaea C6, I can make a PR now. Only 4 files needed changes, provided here.
Did you have time to try the same fix for C6?

BrianCurtis-NOAA · 2024-10-18T12:07:44Z

Let me put all of this together and update this PR.

RatkoVasic-NOAA · 2024-12-03T01:28:59Z

Ran full regression tests on Gaea-C6. Logs attached.

DeniseWorthen · 2024-12-03T13:34:52Z

@RatkoVasic-NOAA What is the reason behind setting the TPN<192 for many of the cpld tests? Do the jobs otherwise hang, or fail or time-out? What exactly is the failure?

natalie-perlin · 2024-12-03T13:53:18Z

@DeniseWorthen - several tests fail with TPN=192 (e.g., cpld_debug_p8_intel and cpld_control_p8), while others, such as control_c192_intel, do not. The error messages were not very diagnostic: they point to MPI or library issues, as shown further below.
Unit tests for MPI, and specifically for MPI_Win, were successful and did not indicate any software problems.

Memory issues, on the other hand, are hard to diagnose but very common for memory-demanding tests. TPN=192 on Gaea-C6 corresponds to 2 GB/core, which previous experience has shown to be insufficient for some tests. Lowering TPN gives each task more memory, which resolved the runtime errors.

Example of the runtime error in cpld_debug_p8_intel with TPN=192:

  9: PMPI_Win_create(294)......: MPI_Win_create(base=0x7ffd68b76660, size=0, disp_unit=1, MPI_INFO_NULL, comm=0xc4000060, win=0x7ffd68b767bc) failed
  9: MPID_Win_create(89).......:
  9: MPIDIG_mpi_win_create(872):
  9: win_allgather(246)........: OFI mr_enable failed (ofi_win.c:246:win_allgather:Address already in use)
  9:
  9: aborting job:
  9: Fatal error in PMPI_Win_create: Other MPI error, error stack:
  9: PMPI_Win_create(294)......: MPI_Win_create(base=0x7ffd68b76660, size=0, disp_unit=1, MPI_INFO_NULL, comm=0xc4000060, win=0x7ffd68b767bc) failed
  9: MPID_Win_create(89).......:
  9: MPIDIG_mpi_win_create(872):
  9: win_allgather(246)........: OFI mr_enable failed (ofi_win.c:246:win_allgather:Address already in use)

DeniseWorthen · 2024-12-03T13:58:11Z

@natalie-perlin Are memory resources on Hera, where we also run these same tests w/o halving the TPN, so much more?

jkbk2004 · 2024-12-03T14:12:54Z

C6 has 1.5 times as many cores per node as C5 (192 vs. 128) but pretty much the same memory per core. It looks reasonable to increase resources for tests:

The C5 compute nodes consist of [2x] 64 core AMD EPYC Zen 2 CPUs, with two hardware threads per physical core and 256 GiB of physical memory (2 GiB per core).
The C6 compute nodes consist of [2x] 96 core AMD EPYC Zen 4 CPUs, with two hardware threads per physical core and 384 GiB of physical memory (2 GiB per core).

DeniseWorthen · 2024-12-03T14:16:34Z

@jkbk2004 That doesn't address the question of why such modification is not required on Hera. You're simply propagating changes that you already put in for C5 to C6.

jkbk2004 · 2024-12-03T14:19:26Z

Why should we compare against hera? They are different machines. https://docs.rdhpcs.noaa.gov/systems/hera_user_guide.html#system-overview

DeniseWorthen · 2024-12-03T14:24:00Z

@jkbk2004 Because, in Natalie's words, the "memory-demanding tests" are identical on both machines. The model is configured the same (i.e., domain, WGC, etc.). So why are we not facing memory issues on Hera, while they're so severe on C6 that we can only use 1/2 the available nodes?

natalie-perlin · 2024-12-03T14:35:47Z

Hera has 40 cores/node and 96 GB, so the memory allocation is ~2.4 GB/core.
Gaea C6 has 2 GB/core (TPN=192, 384 GB/node). Memory requirements for a job can sometimes be set with an SBATCH directive in the launch script ("job card"), but Slurm on Gaea C6 is not configured to allow it.

A possible solution is to find a TPN < 192 that works for all the tests.

DeniseWorthen · 2024-12-03T14:56:53Z

@natalie-perlin Thanks for addressing the issue I'm raising. So, per core, Gaea has ~80% of the memory of Hera, is that right? But we're reducing the TPN by 50% (in most cases). Why, if the issue is memory use, aren't we using more like 154 TPN on C6 (80% of 192)? Do those jobs fail?

junwang-noaa · 2024-12-03T15:19:42Z

To use comparable memory, maybe we can use 144 TPN on C6? That gives us about 2.7 GB/core (384 GB / 144).
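
The per-task memory figures discussed above can be reproduced with a small sketch. It assumes a node's memory is shared evenly among the tasks placed on it; the node sizes are the ones quoted in this thread, and the helper name mem_per_task_gb is hypothetical.

```shell
#!/usr/bin/env bash
# Memory available per MPI task, assuming a node's memory is split evenly
# among the tasks placed on it. Node sizes are the figures quoted in the thread.
mem_per_task_gb() {
  local node_mem_gb=$1 tpn=$2
  awk -v m="${node_mem_gb}" -v t="${tpn}" 'BEGIN { printf "%.2f\n", m / t }'
}

mem_per_task_gb 96 40     # Hera, 40 tasks per node     -> 2.40
mem_per_task_gb 384 192   # Gaea C6, full node          -> 2.00
mem_per_task_gb 384 144   # Gaea C6, proposed TPN=144   -> 2.67
```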

natalie-perlin · 2024-12-03T15:59:24Z

@DeniseWorthen @junwang-noaa -
Yes, this layout could definitely be tested.

RatkoVasic-NOAA · 2024-12-03T16:00:18Z

@DeniseWorthen @junwang-noaa @natalie-perlin @jkbk2004 sorry for late reply.
I already tested all combinations for several tests and chose the maximum TPN (i.e., the minimum number of nodes) that worked. The default is set to 192 TPN, and several tests (exactly as on Derecho) needed fewer tasks per node; those are set in tests/tests/test-name. The value depended on the total number of tasks each test needs (i.e., 288, 384, and 864). In those cases we have TPN=144, TPN=128, and TPN=96.

RatkoVasic-NOAA · 2024-12-03T16:09:59Z

@DeniseWorthen Yes, as @natalie-perlin said, some tests needed more memory per node. Some tests using 384 tasks failed on 2 nodes (TPN=192) with the message that Natalie shared. On 3 nodes (TPN=128), the run hung while writing the gocart.inst_aod.20210322_1200z.nc4 file. Only 4 nodes (TPN=96) let that test run without problems.
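
The node counts quoted for the 384-task case follow from a ceiling division of total tasks by TPN. A minimal sketch, with a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Number of nodes required to place ntasks MPI tasks at a given TPN.
nodes_needed() {
  local ntasks=$1 tpn=$2
  echo $(( (ntasks + tpn - 1) / tpn ))   # integer ceiling division
}

for tpn in 192 128 96; do
  echo "384 tasks at TPN=${tpn}: $(nodes_needed 384 "${tpn}") nodes"
done
# 384 tasks at TPN=192: 2 nodes
# 384 tasks at TPN=128: 3 nodes
# 384 tasks at TPN=96: 4 nodes
```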

DeniseWorthen · 2024-12-03T16:44:32Z

@RatkoVasic-NOAA Thanks for that additional information. The fact that the issue w/ TPN=128 is failing in writing the nc4 file from gocart should be documented I think. This puzzles me. What is going on w/ the file writing....is it really a memory issue or something else?

RatkoVasic-NOAA · 2024-12-03T16:49:45Z

It was hard to know because a hanging job leaves no log info. The last log line was something like "writing gocart.*.nc" and then nothing. The only thing I know is that increasing the number of nodes made the problem disappear. I want to point out that an almost identical setup already exists for Derecho, so whatever was happening there might also be happening on gaea-c6.

DeniseWorthen · 2024-12-03T19:56:18Z

@RatkoVasic-NOAA I believe the issue might be w/ ESMF-managed threading. I ran 3 test cases on C6 w/ this branch (cpld control, the c192 and the bmark). For the cpld_control, I switched off ESMF-managed threading, set tpn to 192 and used srun -n 200. The job ran fine.

/gpfs/f6/drsa-hurr1/proj-shared/Denise.Worthen/RT_RUNDIRS/Denise.Worthen/FV3_RT/rt_1011815/test2

jkbk2004 · 2024-12-03T20:12:28Z

@DeniseWorthen Permission issue on /gpfs/f6/drsa-hurr1/proj-shared, /gpfs/f6/drsa-hurr1/world-shared is a good place to share.

DeniseWorthen · 2024-12-03T20:19:06Z

I copied it here /gpfs/f6/drsa-hurr1/world-shared/scrub/Denise.Worthen/test2. But there were only three changes to job_card and ufs_configure.

Set globalResourceControl: false
set --ntasks-per-node=192
use srun --label -n 200 ./fv3.exe
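
A sketch of how those three edits could be scripted against a copy of a run directory. The file names (job_card, ufs.configure) and the line patterns matched below are assumptions, not confirmed by this thread, and the helper name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: apply the three edits to the files in a run directory.
# Assumed (not confirmed by the thread): the files are named job_card and
# ufs.configure, and the lines to change match the patterns used below.
apply_no_managed_threading() {
  local rundir=$1 ntasks=$2 tpn=$3
  # 1. Turn off ESMF-managed threading.
  sed -i.bak -E 's/^(globalResourceControl:[[:space:]]*).*/\1false/' \
    "${rundir}/ufs.configure"
  # 2. Fill every core on the node; 3. launch with an explicit task count.
  sed -i.bak -E \
    -e "s/^(#SBATCH --ntasks-per-node=).*/\1${tpn}/" \
    -e "s|^srun .*|srun --label -n ${ntasks} ./fv3.exe|" \
    "${rundir}/job_card"
}
```

It would be called as, for example, `apply_no_managed_threading /path/to/rundir 200 192`; sed keeps a .bak copy of each file so the original settings can be restored.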

RatkoVasic-NOAA · 2024-12-03T20:55:13Z

@DeniseWorthen what is your suggestion to change in this PR? Where do you set globalResourceControl? Is it going to affect other runs? ...

DeniseWorthen · 2024-12-03T21:08:43Z

@RatkoVasic-NOAA We were just talking about this offline. I think the point I was trying to make is that calling it a memory issue distracts from the real issue, which (as we've seen on other platforms) is more than likely the current implementation of ESMF-managed threading.

The globalResourceControl is in the ufs-configure; true means to utilize ESMF managed threading. I know ESMF is hoping to have a fix for this soon, but for now, I suspect we will continue to use it in the RTs because switching away and then back would be disruptive. But I think it is important to understand why we're required to run w/ this kind of under-resourcing on C6 (and probably derecho).

RatkoVasic-NOAA · 2024-12-03T21:11:35Z

@DeniseWorthen Thanks for the explanation!

ulmononian · 2024-12-03T21:37:45Z

@jkbk2004 @RatkoVasic-NOAA @DeniseWorthen sounds like moving forward with the current c6 resource configuration @RatkoVasic-NOAA & @natalie-perlin have demonstrated as functional should work, as turning esmf threading on/off in the $UFS_CONFIGURE file would propagate inconsistencies if switched on/off before ESMF issues a permanent fix (as pointed out by @DeniseWorthen) and could impact other tests using that $UFS_CONFIGURE.

jkbk2004 · 2024-12-03T21:43:11Z

@uturuncoglu FYI: a possible item regarding ESMF-managed threading

After the recent Gaea-C5 OS upgrade, gfs_utils fails to build. This corrects Gaea-C5 build, adds Gaea-C6 build capability (following ufs-wx-model [2448](ufs-community/ufs-weather-model#2448)), and adds containerized build capability.

Refs NOAA-EMC/global-workflow [3011](NOAA-EMC/global-workflow#3011)
Refs NOAA-EMC/global-workflow [3025](NOAA-EMC/global-workflow#3025)
Resolve #86

Co-authored-by: Mark A Potts

initial testing to get UFSWM working on Gaea C6

b968b96

Merge branch 'develop' of github.com:ufs-community/ufs-weather-model into gaeac6

efe342e

jkbk2004 mentioned this pull request Oct 4, 2024

Enable ufs-weather-model on Gaea-C6 #2407

Open

BrianCurtis-NOAA added 4 commits October 4, 2024 08:53

gaea->gaea-c5 and gaeac6->gaea-c6

7476837

Fixed linter issue

742a7c2

Update to 192 cores on Gaea-c6

5bee5b2

Update tests to gaea-c5 and added gaea-c6 where necessary

63a56ac

DusanJovic-NOAA reviewed Oct 4, 2024

Remove MOM6SOLO from compile.sh

bb83396

Merge branch 'develop' into gaeac6

532f418

RatkoVasic-NOAA mentioned this pull request Oct 18, 2024

Gaea C5 lib issue #2472

Closed

RatkoVasic-NOAA added 2 commits December 2, 2024 18:21

Ajdust some TPNs in coupled runs.

d31b872

Add log file RegressionTests_gaeac6.log (and test_changes.list).

5edbfd1

RatkoVasic-NOAA marked this pull request as ready for review December 3, 2024 01:26

Rename gaea to gaeac5 in rt.conf

ff58542

RatkoVasic-NOAA mentioned this pull request Dec 4, 2024

[develop] Port SRW to Gaea C6 ufs-community/ufs-srweather-app#1163

Draft

39 tasks
