Reproducibility issues in HEMCO_CESM and investigation #31

Open
jimmielin opened this issue Apr 1, 2024 · 2 comments
Labels
bug Something isn't working

Comments


jimmielin commented Apr 1, 2024

This issue thread tracks the reproducibility issues in HEMCO within CESM2, which should eventually be fixed for: ESCOMP/CAM#856

For the purposes of debugging HEMCO_CESM, it is suggested to use CAM-chem compsets (e.g., FCnudged, FCclimo2010, ...) because CAM-chem is known to be b4b reproducible and GEOS-Chem compsets likely are not. The goal of this issue is to ensure that the physics buffer and history fields (e.g., HCO_NO, HCO_NH3, HCO_CO, ...) match bit-for-bit across restart, different MPI decompositions, and different OpenMP threading scenarios.

Test/debug workflow

This setup will help debug the issues; a consolidated command sketch follows the step list below.

  • Check out https://github.com/ESCOMP/CESM (ESCOMP/CESM). cesm2_3_alpha17c was used here, but any release with HEMCO (post-cam6_3_118) should do.
  • ./manage_externals/checkout_externals
  • Using the hplin/debug_parallel branch (https://github.com/jimmielin/HEMCO_CESM/tree/hplin/debug_parallel) from jimmielin/HEMCO_CESM for components/cam/src/hemco may be useful, as it adds debug printouts that appear in cesm.log.
  • Create a case: ./create_newcase --case ~/2403_dev_hco_2.3/2403_dev_hco_2.3-f10_singlecore --compset FC2010climo_HCO --res f10_f10_mg37 --run-unsupported --mach derecho --project UHAR0022 -- the f10_f10_mg37 resolution is 10x15 degrees and coarse enough to run on 1 core. I suggest FC2010climo or another compset that is not FCnudged, so that configuring nudging / met fields in user_nl_cam can be avoided.
  • cd to the case directory, then ./xmlchange NTASKS=1 for single core, NTASKS=2 for two cores, etc. In the 10x15 case, NTHRDS=1 (I have not successfully run with more than 1 thread on this grid).
  • ./case.setup --reset, then fill user_nl_cam with:
hemco_config_file = '/glade/u/home/hplin/2403_dev_hco_2.3/HEMCO_Config.CC.TestOnly.c240331.rc',

cam_physics_mesh = '/glade/campaign/cesm/cesmdata/inputdata/share/meshes/10x15_nomask_c110308_ESMFmesh.nc'
hemco_grid_xdim = 24,
hemco_grid_ydim = 19,

fincl1 = 'T', 'HCO_CO', 'HCO_NO', 'HCO_NH3', 'CO', 'O3', 'NO', 'HCO_EDGAR_TODNOX'
mfilt = 1,
nhtfrq = 1,

The /glade/u/home/hplin/2403_dev_hco_2.3/HEMCO_Config.CC.TestOnly.c240331.rc test config file only enables CEDS for NO, CO, and NH3, with NO carrying a 1x1 gridded scale factor. This makes it easier to debug and much quicker to run.

<directive> -l select={{ num_nodes }}:ncpus=128:mpiprocs={{ tasks_per_node }}:ompthreads={{ thread_count }}:mem=230GB</directive>
  • Change env_run.xml: RUN_STARTDATE=2016-01-01, STOP_OPTION=nhours, STOP_N=3 (shorter may not work due to coupling intervals)
  • Submit the case
  • Create multiple case directories for 1 core, 2 cores, etc., because a clean recompile is needed to change the core configuration.
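
A minimal end-to-end command sketch of the steps above, for derecho. The checkout method, checkout directory, 2-core case variant, and build step are illustrative additions; adapt the case name, project, and paths to your environment.

# checkout and case creation
git clone -b cesm2_3_alpha17c https://github.com/ESCOMP/CESM cesm
cd cesm
./manage_externals/checkout_externals
cd cime/scripts
./create_newcase --case ~/2403_dev_hco_2.3/2403_dev_hco_2.3-f10_singlecore \
    --compset FC2010climo_HCO --res f10_f10_mg37 \
    --run-unsupported --mach derecho --project UHAR0022

# case configuration (use NTASKS=2 in a separate case for the MPI run)
cd ~/2403_dev_hco_2.3/2403_dev_hco_2.3-f10_singlecore
./xmlchange NTASKS=1,NTHRDS=1
./xmlchange RUN_STARTDATE=2016-01-01,STOP_OPTION=nhours,STOP_N=3
./case.setup --reset
# ... edit user_nl_cam as shown above ...
./case.build
./case.submit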

Debugging output is in cesm.log.* and organized per CPU.
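
To extract the per-rank debug printouts from the newest cesm.log (the hplin/debug_parallel branch tags them with "hcdebug"), something like the following works from the run directory, whose location depends on your setup:

# show the hcdebug lines for MPI rank 1; each cesm.log line is prefixed with its rank ("0:", "1:", ...)
grep "hcdebug" $(ls -t cesm.log.* | head -1) | grep "^1:"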

The cprnc tool is very useful for comparing two netCDF files for bit-for-bit matches; I use this alias in my .zshrc:

alias cprnc="/glade/campaign/cesm/cesmdata/cseg/tools/cime/tools/cprnc/cprnc"

Usage: cprnc <file1> <file2>
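
For example, to check whether the single-core and 2-core cases produce bit-for-bit identical history output (the 2-core case name, run paths, and history file timestamp below are illustrative):

# compare matching h0 history files from the two cases; cprnc prints per-field
# statistics and states whether the two files are identical
cprnc \
    /glade/derecho/scratch/$USER/2403_dev_hco_2.3-f10_singlecore/run/2403_dev_hco_2.3-f10_singlecore.cam.h0.2016-01-01-10800.nc \
    /glade/derecho/scratch/$USER/2403_dev_hco_2.3-f10_2core/run/2403_dev_hco_2.3-f10_2core.cam.h0.2016-01-01-10800.nc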

jimmielin added the bug label Apr 1, 2024

jimmielin commented Apr 1, 2024

What I've found is that HCO_CO and HCO_NH3 match bit-for-bit (files are identical) between single-core and MPI (2 cores), but HCO_NO does not. This is because HCO_NO applies another field, EDGAR_TODNOX, to it, and this field is somehow different in the two runs.

In the single-core run's cesm.log, for the EDGAR_TODNOX field:

0:  hcdebug: (edgar, i=           1 )  -9.9999998E+30 -9.9999998E+30 -9.9999998E+30
0:  -9.9999998E+30 -9.9999998E+30 -9.9999998E+30 -9.9999998E+30   1.302642
0:    1.306328       1.306328       1.306328       1.306328       1.306328
0:    1.306328       1.306328       1.319497       1.210987       1.210987
0:    1.020795
0:  hcdebug: (edgar, i=           2 )  -9.9999998E+30 -9.9999998E+30 -9.9999998E+30
0:  -9.9999998E+30 -9.9999998E+30 -9.9999998E+30 -9.9999998E+30   1.361859
0:    1.361859       1.361859       1.361859       1.372795       1.372795
0:    1.372795       1.372795       1.391346       1.385491       1.385491
0:    1.020795

In the MPI run, note how on both CPUs (0: and 1:) the first 5 grid boxes have the fill value:

0:  hcdebug: (edgar, i=           1 )  -9.9999998E+30 -9.9999998E+30 -9.9999998E+30
0:  -9.9999998E+30 -9.9999998E+30 -9.9999998E+30 -9.9999998E+30   1.302642
0:    1.306328       1.306328
1:  hcdebug: (edgar, i=           1 )  -9.9999998E+30 -9.9999998E+30 -9.9999998E+30
1:  -9.9999998E+30 -9.9999998E+30   1.374258       1.210987       1.210987
1:    1.019067
0:  hcdebug: (edgar, i=           2 )  -9.9999998E+30 -9.9999998E+30 -9.9999998E+30
0:  -9.9999998E+30 -9.9999998E+30 -9.9999998E+30 -9.9999998E+30   1.361859
0:    1.361859       1.361859
1:  hcdebug: (edgar, i=           2 )  -9.9999998E+30   1.405800       1.405800
1:    1.405800       1.405800       1.405477       1.385491       1.385491
1:    1.019067

For HCO_NO emissions at the surface:

0:  hcdebug: writing out lvl-sfc at present dt
0:  hcdebug: (i=           1 )   0.000000000000000E+000  0.000000000000000E+000
0:   0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
0:   0.000000000000000E+000  0.000000000000000E+000  2.544706480154566E-014
0:   2.170436195605955E-014  3.528874920258346E-017  0.000000000000000E+000
0:   1.880868963362995E-016  1.772452777726410E-015  0.000000000000000E+000
0:   3.712318333349080E-013  2.388847495132503E-014  3.953244476404797E-014
0:   0.000000000000000E+000  0.000000000000000E+000

In the MPI configuration:

0:  hcdebug: (i=           1 )   0.000000000000000E+000  0.000000000000000E+000
0:   0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
0:   0.000000000000000E+000  0.000000000000000E+000  2.544706480154566E-014
0:   2.170436195605955E-014  3.528874920258346E-017
1:  hcdebug: (i=           1 )   0.000000000000000E+000  1.439813645198005E-016
1:   1.356820567806388E-015  0.000000000000000E+000  2.841796369544945E-013
1:   2.487986842177408E-014  3.953244476404797E-014  0.000000000000000E+000
1:   0.000000000000000E+000

Note how the numbers on the first CPU match bit-for-bit but not on the second one. The values at i=1 should be contiguous going from CPU 0 to CPU 1, matching the values from the single CPU in the single-core result.

The domain decomposition for the 10x15 domain is:
Global domain size: 24 longitudes, 19 latitudes (24x19)

  • Lon centers: -180 -165 -150 ... 150 165 180
  • Lon edges: -187.5 -172.5 ... 157.5 172.5 (verified that the behavior of edges being < -180.0 is consistent with GEOS-Chem Classic.)
  • Lat edges: -90 -85 -75 ... -5 5 15 ... 85 90 (half-sized polar grid boxes)

For 2 CPUs, the domain is chopped horizontally in the middle where each CPU covers all longitudes but half of the latitudes.

  • CPU0: lat edges -90 -85 ... -5 5
  • CPU1: lat edges 5 15 25 ... 85 90
    So there is no reason why CPU1 should be seeing fill values, since the single-core data shows valid values at i=1 (lon=-180) for lat=5, 15, ...

The print outputs come directly from HCO_GetPtr or from %Emis%Val, so they are from HEMCO upstream code, before it hits the regridder. So I feel there is a bug somewhere upstream in HEMCO or in Map_A2A, but I have not gone down the rabbit hole of hco_readlist_mod, hcoio_read_std_mod, and map_a2a yet.

I've tried shifting the longitude edges for the HEMCO grid in hco_esmf_grid.F90, as having -187.5 as a starting value (equal to -180 minus 15/2) seemed sketchy to me. But using either -180.0 or -187.5 as the leftmost edge only introduces numerical differences (expected, since the grid presented to HEMCO changes) and does not make the difference between 1 core and 2 cores disappear.


lizziel commented Dec 4, 2024

This was partially fixed with #41. More work is needed to fix the ERP tests, which change the task/thread count upon restart.
