Gaea C6 support for UFSWM #2448
base: develop
Conversation
cpld_control_p8 intel fails by timing out, so there's work to do to tweak the configs to better match the C6 hardware. I think there are still lots of other items to check here; this is just a placeholder for now. Please feel free to send PRs to my fork/branch to add/adjust/fix any issues.
Also, once things start falling into place, we'll need to make sure intelllvm support is available for C6.
@BrianCurtis-NOAA, name change suggestion:
@BrianCurtis-NOAA Shall I re-try building with these?
tests/compile.sh (outdated)

@@ -95,7 +98,7 @@ export SUITES
 set -ex

 # Valid applications
-if [[ ${MACHINE_ID} != gaea ]] || [[ ${RT_COMPILER} != intelllvm ]]; then # skip MOM6SOLO on gaea with intelllvm
+if [[ ${MACHINE_ID} != gaea-c5 && ${MACHINE_ID} != gaea-c6 ]] || [[ ${RT_COMPILER} != intelllvm ]]; then # skip MOM6SOLO on gaea with intelllvm
Why do we even need this logic here, adding or not adding -DMOM6SOLO=ON? As far as I know, we do not regression test MOM6SOLO. Can we remove this block of code entirely from this script?
Good question. It was added there for a reason, and I don't recall if we ever RT'd MOM6SOLO. @junwang-noaa do you recall what this block of code was used for?
If I remember correctly, this is to support standalone MOM testing. @jiandewang Do you know why MOM6 SOLO does not work on gaea?
Hi! I'm new to the UFS, but AFAIK nobody seems to use -DMOM6SOLO=ON, though I would defer to @junwang-noaa.
@junwang-noaa My understanding from @jiandewang is that he (and others) are no longer routinely testing MOM solo config; I have always built using instructions at MOM6-examples
If I remember correctly, this is to support standalone MOM testing. @jiandewang Do you know why MOM6 SOLO does not work on gaea?
It was added here many years ago and we never tried this SOLO build on any platform. My understanding is that with the nuopc_cap it has to be coupled with something.
@junwang-noaa My understanding from @jiandewang is that he (and others) are no longer routinely testing MOM solo config; I have always built using instructions at MOM6-examples
Yes, we use MOM6-examples to do standalone tests when it's needed for debug work (to help GFDL narrow down issues when their big PR is not working as expected in UWM).
So you do not use tests/compile.sh to build the standalone test, is that correct?
So you do not use tests/compile.sh to build the standalone test, is that correct?

Correct.
Then we should remove it from compile.sh
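To make the proposed cleanup concrete, here is a hedged sketch (not the actual script; the variable name MAKE_OPT is assumed, and the real tests/compile.sh may accumulate CMake flags differently) of what dropping the guard amounts to:

```shell
#!/bin/bash
# Sketch only: assumes the guarded block in tests/compile.sh appends
# -DMOM6SOLO=ON to a MAKE_OPT-style variable. Since MOM6SOLO is not
# regression-tested, the whole conditional can simply be deleted.
MAKE_OPT=""

# Previously (simplified):
#   if [[ ${MACHINE_ID} != gaea-c5 && ${MACHINE_ID} != gaea-c6 ]] || \
#      [[ ${RT_COMPILER} != intelllvm ]]; then
#     MAKE_OPT+=" -DMOM6SOLO=ON"
#   fi

# After removal, no machine/compiler special-casing remains and the
# flag is never added:
case "${MAKE_OPT}" in
  *MOM6SOLO*) echo "MOM6SOLO still enabled" ;;
  *)          echo "MOM6SOLO flag removed"  ;;
esac
```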
cpld_control_p8 fails with:
and control_p8 runs to completion:
@DusanJovic-NOAA does this look ok?

Yes.
@BrianCurtis-NOAA @jkbk2004 @FernandoAndrade-NOAA I believe EPIC now has full access to the
@BrianCurtis-NOAA can you sync up the branch? I think I am able to create a baseline on C6: /gpfs/f6/bil-fire8/world-shared/role.epic/UFS-WM_RT/NEMSfv3gfs.
We continue to see failures with various cases. About 3 different behaviors and error messages:

@ulmononian @RatkoVasic-NOAA we need troubleshooting from the library side.
@RatkoVasic-NOAA @BrianCurtis-NOAA

Any combination is OK, as long as they are the same length.
@MichaelLueken just FYI regarding the C5/C6 naming conventions. I recall there was a desire to sync the SRW CI/CD pipeline w/ certain Gaea C5/C6 naming conventions.
I'll be going with gaeac6 and gaeac5, FYI. I'll make those changes at some point tomorrow.
@BrianCurtis-NOAA @ulmononian @jkbk2004
Also adding in ./tests/fv3_conf/fv3_slurm.IN_gaea:
Please try what @RatkoVasic-NOAA has suggested in your job cards, before fv3.exe is run: export FI_VERBS_PREFER_XRC=0. This is a known issue inherent to the C5 system; it may also be worth trying for C6.
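As a sketch, the suggested tweak sits in the job card just before the model launch (the real tests/fv3_conf/fv3_slurm.IN_gaea template uses @[...] substitution tokens; the srun line and the TOTAL_TASKS default here are illustrative, not the actual template contents):

```shell
#!/bin/bash
# Illustrative job-card fragment: disable XRC in the libfabric verbs
# provider before launching the model (suggested workaround on Gaea C5,
# worth trying on C6 as well).
export FI_VERBS_PREFER_XRC=0

# Launch the model; falls back to an echo on systems without srun so
# this sketch stays runnable outside a Slurm environment.
if command -v srun >/dev/null 2>&1; then
  srun --label -n "${TOTAL_TASKS:-192}" ./fv3.exe
else
  echo "FI_VERBS_PREFER_XRC=${FI_VERBS_PREFER_XRC}"
fi
```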
@jkbk2004 @BrianCurtis-NOAA
@BrianCurtis-NOAA @jkbk2004 @ulmononian
If more work is needed on Gaea C6, I can make a PR now. There are only 4 files that needed changes, provided here.
Let me put all of this together and update this PR.
Ran full regression tests on Gaea-C6. Logs attached.
@RatkoVasic-NOAA What is the reason behind setting TPN<192 for many of the cpld tests? Do the jobs otherwise hang, fail, or time out? What exactly is the failure?
@DeniseWorthen - several tests fail with TPN=192, e.g., cpld_debug_p8_intel and cpld_control_p8, but not others, such as control_c192_intel. The error messages were not very diagnostic: some odd errors referring to MPI or library issues, as shown further below. Memory issues, on the other hand, are hard to diagnose but very common for memory-demanding tests. TPN=192 on Gaea-C6 corresponds to 2 GB/core, which from previous experience was known to be insufficient for some tests. Setting TPN lower allowed more memory per core for the job, which happened to resolve the runtime errors. Example of a runtime error in cpld_debug_p8_intel with TPN=192:
@natalie-perlin Are memory resources on Hera, where we also run these same tests w/o halving the TPN, so much greater?
C6 has twice as many CPUs per node but pretty much the same memory configuration as C5. It looks reasonable to increase resources for the tests:
@jkbk2004 That doesn't address the question of why such a modification is not required on Hera. You're simply propagating changes that you already put in for C5 to C6.
Why should we compare against Hera? They are different machines. https://docs.rdhpcs.noaa.gov/systems/hera_user_guide.html#system-overview
@jkbk2004 Because, in Natalie's words, the "memory-demanding tests" are identical on both machines. The model is configured the same (i.e., domain, WGC, etc.). So why are we not facing memory issues on Hera, while they're so severe on C6 that we can only use 1/2 the available cores?
Hera has 40 cores/node and 96 GB, so the memory allocation is ~2.4 GB/core. A possible solution is to find a TPN<192 that works for all the tests.
@natalie-perlin Thanks for addressing the issue I'm raising. So, per core, Gaea has ~80% of the memory of Hera, is that right? But we're reducing the TPN by 50% (in most cases). Why, if the issue is memory use, aren't we using more like 154 TPN on C6 (80% of 192)? Do those jobs fail?
To use comparable memory, maybe we can use 144 TPN on C6? That gives us ~2.7 GB/core.
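The back-of-envelope numbers in this exchange (assuming ~384 GB per C6 node, i.e. 2 GB/core at 192 cores, and 96 GB per 40-core Hera node; actual usable memory per node may be somewhat less) can be checked like so:

```shell
# Memory per core for a given node memory (GB) and tasks-per-node.
gb_per_core() { awk -v m="$1" -v t="$2" 'BEGIN { printf "%.2f\n", m/t }'; }

gb_per_core 96  40    # Hera, TPN=40  -> 2.40 GB/core
gb_per_core 384 192   # C6,   TPN=192 -> 2.00 GB/core
gb_per_core 384 144   # C6,   TPN=144 -> 2.67 GB/core
gb_per_core 384 96    # C6,   TPN=96  -> 4.00 GB/core
```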
@DeniseWorthen @junwang-noaa -
@DeniseWorthen @junwang-noaa @natalie-perlin @jkbk2004 sorry for the late reply.
@DeniseWorthen Yes, as @natalie-perlin said, some tests needed more memory per task. Some tests using 384 tasks were failing on 2 nodes (TPN=192) with the message Natalie shared. When 3 nodes were used (TPN=128), the run hung while writing the gocart.inst_aod.20210322_1200z.nc4 file. Only 4 nodes (TPN=96) allowed that test to run without problem.
@RatkoVasic-NOAA Thanks for that additional information. The fact that the run w/ TPN=128 fails while writing the nc4 file from gocart should be documented, I think. This puzzles me. What is going on w/ the file writing... is it really a memory issue or something else?
It was hard to know when there's no log info (the job hangs). The last log line was something like "writing gocart.*.nc" and... nothing. The only thing I know is that by increasing the number of nodes the problem disappeared. I want to point out that an almost identical setting already exists for Derecho, so whatever was happening there might be happening on gaea-c6.
@RatkoVasic-NOAA I believe the issue might be w/ ESMF-managed threading. I ran 3 test cases on C6 w/ this branch (cpld control, the c192, and the bmark). For the cpld_control, I switched off ESMF managed threading, set TPN to 192, and used
@DeniseWorthen There is a permission issue on /gpfs/f6/drsa-hurr1/proj-shared; /gpfs/f6/drsa-hurr1/world-shared is a good place to share.
I copied it here. Set globalResourceControl: false
@DeniseWorthen what is your suggestion to change in this PR? Where do you set globalResourceControl? Is it going to affect other runs? ...
@RatkoVasic-NOAA We were just talking about this offline. I think the point I was trying to make is that calling this a memory issue distracts from the real issue, which is (as we've seen on other platforms) more than likely the current implementation of ESMF-managed threading. The globalResourceControl setting is in ufs.configure; true means to utilize ESMF-managed threading. I know ESMF is hoping to have a fix for this soon, but for now I suspect we will continue to use it in the RTs, because switching away and then back would be disruptive. But I think it is important to understand why we're required to run w/ this kind of under-resourcing on C6 (and probably Derecho).
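For reference, the knob being discussed is a top-level attribute in the ufs.configure file; this fragment is a sketch (surrounding component entries and run-sequence lines omitted):

```
# ufs.configure fragment (sketch): true enables ESMF-managed threading,
# false disables it so tasks are laid out by the batch system instead.
globalResourceControl:  false
```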
@DeniseWorthen Thanks for the explanation!
@jkbk2004 @RatkoVasic-NOAA @DeniseWorthen Sounds like moving forward with the current C6 resource configuration that @RatkoVasic-NOAA & @natalie-perlin have demonstrated as functional should work. Turning ESMF threading on/off in the $UFS_CONFIGURE file would propagate inconsistencies if switched before ESMF issues a permanent fix (as pointed out by @DeniseWorthen) and could impact other tests using that $UFS_CONFIGURE.
@uturuncoglu FYI: a possible item regarding ESMF-managed threading
After the recent Gaea-C5 OS upgrade, gfs_utils fails to build. This corrects the Gaea-C5 build, adds Gaea-C6 build capability (following ufs-weather-model #2448), and adds containerized build capability. Refs NOAA-EMC/global-workflow #3011. Refs NOAA-EMC/global-workflow #3025. Resolves #86. Co-authored-by: Mark A Potts <[email protected]>
Commit Queue Requirements:
Description:
This PR will bring in all changes necessary to provide Gaea C6 support for the UFSWM.
Commit Message:
Priority:
Git Tracking
UFSWM:
Sub component Pull Requests:
UFSWM Blocking Dependencies:
Changes
Regression Test Changes (Please commit test_changes.list):
Input data Changes:
Library Changes/Upgrades:
Testing Log: