Reproducibility issues in HEMCO_CESM and investigation #31
What I've found is that In
In
For
In the MPI configuration:
Note how the numbers on the first CPU match bit-for-bit but those on the second one do not. The values at i=1 should be contiguous from CPU 0 to CPU 1, matching the values on the only CPU in the single-core result. The domain decomposition for the 10x15 domain is:
For 2 CPUs, the domain is chopped horizontally in the middle where each CPU covers all longitudes but half of the latitudes.
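The latitude-band split described above can be sketched in plain Python. This is only an illustration, not how CAM or HEMCO_CESM actually assigns columns to tasks: the 19 x 24 grid size, the rule that gives leftover rows to the lowest-numbered tasks, and the helper names `split_latitudes`, `scatter` and `gather` are all assumptions made for the example. It shows the property the debug printouts are checking, namely that stacking the per-CPU pieces in task order should reproduce the single-core field exactly.

```python
def split_latitudes(nlat, ntasks):
    """Return [(start, stop), ...] latitude index ranges, one per task.

    Assumption: leftover rows go to the lowest-numbered tasks.
    """
    base, extra = divmod(nlat, ntasks)
    ranges = []
    start = 0
    for task in range(ntasks):
        stop = start + base + (1 if task < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def scatter(field, ntasks):
    """Split a field stored as field[lat][lon] into per-task latitude bands."""
    return [field[a:b] for a, b in split_latitudes(len(field), ntasks)]


def gather(bands):
    """Stack per-task bands back into one global field, task 0 first."""
    return [row for band in bands for row in band]


# Hypothetical 19 x 24 (lat x lon) field; each value only encodes its position.
NLAT, NLON = 19, 24
field = [[j * NLON + i for i in range(NLON)] for j in range(NLAT)]

bands = scatter(field, 2)
print(split_latitudes(NLAT, 2))   # which latitude rows each task owns
print(gather(bands) == field)     # every task keeps all longitudes
```

If the reassembled field differs from the single-core one only in the rows owned by the second task, that points at an indexing offset in how the second task's band is addressed rather than at the data itself.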
The print outputs are directly from I've tried shifting the longitude edges for the HEMCO grid in |
This was partially fixed with #41. More work is needed to fix the ERP tests, which change the task/threading count upon restart.
This issue thread serves to note the reproducibility issues in HEMCO within CESM2 which should eventually be fixed for: ESCOMP/CAM#856
For the purposes of debugging HEMCO_CESM, it is suggested to use CAM-chem compsets (e.g., FCnudged, FCclimo2010, ...) because CAM-chem is known to be bit-for-bit (b4b) reproducible and GEOS-Chem compsets are likely not. The goal of this issue is to ensure that the physics buffer and history fields (e.g., HCO_NO, HCO_NH3, HCO_CO, ...) match bit-for-bit across restarts, different MPI decompositions, and different OpenMP threading scenarios.
Test/debug workflow
This setup will help debug the issues.
ESCOMP/CESM: cesm2_3_alpha17c was used here, but any release with HEMCO (post-cam6_3_118) should do.
./manage_externals/checkout_externals
hplin/debug_parallel from jimmielin/HEMCO_CESM for components/cam/src/hemco may be useful, as it has some debug printouts which will appear in cesm.log.
./create_newcase --case ~/2403_dev_hco_2.3/2403_dev_hco_2.3-f10_singlecore --compset FC2010climo_HCO --res f10_f10_mg37 --run-unsupported --mach derecho --project UHAR0022
-- the f10_f10_mg37 resolution is 10x15 degrees and coarse enough to run on 1 core. I suggest using FC2010climo or something that is not FCnudged, so that configuring nudging / met fields in user_nl_cam can be avoided.
cd to the case directory, then ./xmlchange NTASKS=1 for a single core or NTASKS=2 for two cores, etc. In the 10x15 case, use NTHRDS=1 (I have not successfully run with more than 1 thread on this grid).
./case.setup --reset, then fill user_nl_cam with:
The /glade/u/home/hplin/2403_dev_hco_2.3/HEMCO_Config.CC.TestOnly.c240331.rc test config file only has CEDS with NO, CO and NH3, with NO having a 1x1 gridded scale factor. This makes it easier to debug and much quicker to run.
./case.build -v
numactl (binding to certain cores returning invalid on 5x5_amazon resolution NCAR/mpibind#5) - edit env_batch.xml and change the command in <directive gpu_enabled="false"> to always request 128 cores from the scheduler (change {{ max_tasks_per_node}} to 128):
env_run.xml: RUN_STARTDATE=2016-01-01, STOP_OPTION=nhours, STOP_N=3 (shorter runs may not work due to coupling intervals)
Debugging output is in cesm.log.* and organized per CPU.
The cprnc tool is very useful to compare two netCDF files for bit-for-bit matches: I use this in my .zshrc
Usage:
cprnc <file1> <file2>
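cprnc does this comparison on whole netCDF files. For a quick check on a handful of values already loaded in memory, what "bit-for-bit" means can be sketched in Python as below. This is not part of cprnc or CESM: the helper names `bits` and `first_mismatch` and the sample values are hypothetical, and the sketch assumes the values are held as double-precision floats. It compares raw IEEE-754 bit patterns rather than using `==`, so differences such as 0.0 versus -0.0 are also reported.

```python
import struct


def bits(value):
    """Return the raw IEEE-754 double-precision bit pattern as an integer."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def first_mismatch(a, b):
    """Return the first index where a and b differ bitwise, or None if none do."""
    if len(a) != len(b):
        raise ValueError("sequences differ in length")
    for idx, (x, y) in enumerate(zip(a, b)):
        if bits(x) != bits(y):
            return idx
    return None


# Hypothetical values standing in for one field from two runs.
single_core = [1.0e-12, 2.5e-11, 0.1 + 0.2]
two_cores = [1.0e-12, 2.5e-11, 0.3]

# The last entries differ only in the final bits of the mantissa.
print(first_mismatch(single_core, two_cores))
```

Values copied out of cesm.log are only as exact as the number of digits printed, so a check like this is a rough first pass; the netCDF comparison with cprnc remains the authoritative one.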