Reproducibility issues in HEMCO_CESM and investigation #31

Open
jimmielin opened this issue Apr 1, 2024 · 2 comments
Labels
bug Something isn't working

Comments


jimmielin commented Apr 1, 2024

This issue thread tracks the reproducibility issues in HEMCO within CESM2, which should eventually be fixed for: ESCOMP/CAM#856

For the purposes of debugging HEMCO_CESM, it is suggested to use CAM-chem compsets (e.g., FCnudged, FCclimo2010, ...) because CAM-chem is known to be b4b reproducible and GEOS-Chem compsets likely are not. The goal of this issue is to ensure that the physics buffer and history fields (e.g., HCO_NO, HCO_NH3, HCO_CO, ...) match bit-for-bit across restart, different MPI decompositions, and different OpenMP threading scenarios.

Test/debug workflow

This setup will help debug the issues; a consolidated command sketch follows the step list below.

  • Check out https://github.com/ESCOMP/CESM (ESCOMP/CESM). cesm2_3_alpha17c was used here, but any release with HEMCO (post-cam6_3_118) should do.
  • ./manage_externals/checkout_externals
  • Using the hplin/debug_parallel branch (https://github.com/jimmielin/HEMCO_CESM/tree/hplin/debug_parallel) from jimmielin/HEMCO_CESM for components/cam/src/hemco may be useful, as it adds debug printouts that appear in cesm.log.
  • Create a case: ./create_newcase --case ~/2403_dev_hco_2.3/2403_dev_hco_2.3-f10_singlecore --compset FC2010climo_HCO --res f10_f10_mg37 --run-unsupported --mach derecho --project UHAR0022 -- the f10_f10_mg37 resolution is 10x15 degrees and coarse enough to run on 1 core. I suggest FC2010climo or another compset that is not FCnudged, so that configuring nudging / met fields in user_nl_cam can be avoided.
  • cd to the case directory, then ./xmlchange NTASKS=1 for single core, NTASKS=2 for two cores, etc. In the 10x15 case, NTHRDS=1 (I have not successfully run with more than 1 thread on this grid).
  • ./case.setup --reset, then fill user_nl_cam with:
hemco_config_file = '/glade/u/home/hplin/2403_dev_hco_2.3/HEMCO_Config.CC.TestOnly.c240331.rc',

cam_physics_mesh = '/glade/campaign/cesm/cesmdata/inputdata/share/meshes/10x15_nomask_c110308_ESMFmesh.nc'
hemco_grid_xdim = 24,
hemco_grid_ydim = 19,

fincl1 = 'T', 'HCO_CO', 'HCO_NO', 'HCO_NH3', 'CO', 'O3', 'NO', 'HCO_EDGAR_TODNOX'
mfilt = 1,
nhtfrq = 1,

The /glade/u/home/hplin/2403_dev_hco_2.3/HEMCO_Config.CC.TestOnly.c240331.rc test config file only enables CEDS for NO, CO, and NH3, with NO carrying a 1x1 gridded scale factor. This makes it easier to debug and much quicker to run.

<directive> -l select={{ num_nodes }}:ncpus=128:mpiprocs={{ tasks_per_node }}:ompthreads={{ thread_count }}:mem=230GB</directive>
  • Change env_run.xml: RUN_STARTDATE=2016-01-01, STOP_OPTION=nhours, STOP_N=3 (shorter may not work due to coupling intervals)
  • Submit the case
  • Create multiple case directories for 1 core, 2 cores, etc., because a clean recompile is needed to change the core configuration.
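
A minimal end-to-end command sketch of the steps above, for derecho. The checkout method, checkout directory, 2-core case variant, and build step are illustrative additions; adapt the case name, project, and paths to your environment.

# checkout and case creation
git clone -b cesm2_3_alpha17c https://github.com/ESCOMP/CESM cesm
cd cesm
./manage_externals/checkout_externals
cd cime/scripts
./create_newcase --case ~/2403_dev_hco_2.3/2403_dev_hco_2.3-f10_singlecore \
    --compset FC2010climo_HCO --res f10_f10_mg37 \
    --run-unsupported --mach derecho --project UHAR0022

# case configuration (use NTASKS=2 in a separate case for the MPI run)
cd ~/2403_dev_hco_2.3/2403_dev_hco_2.3-f10_singlecore
./xmlchange NTASKS=1,NTHRDS=1
./xmlchange RUN_STARTDATE=2016-01-01,STOP_OPTION=nhours,STOP_N=3
./case.setup --reset
# ... edit user_nl_cam as shown above ...
./case.build
./case.submit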

Debugging output is in cesm.log.* and organized per CPU.
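
To extract the per-rank debug printouts from the newest cesm.log (the hplin/debug_parallel branch tags them with "hcdebug"), something like the following works from the run directory, whose location depends on your setup:

# show the hcdebug lines for MPI rank 1; each cesm.log line is prefixed with its rank ("0:", "1:", ...)
grep "hcdebug" $(ls -t cesm.log.* | head -1) | grep "^1:"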

The cprnc tool is very useful for comparing two netCDF files for bit-for-bit matches; I use this alias in my .zshrc:

alias cprnc="/glade/campaign/cesm/cesmdata/cseg/tools/cime/tools/cprnc/cprnc"

Usage: cprnc <file1> <file2>
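
For example, to check whether the single-core and 2-core cases produce bit-for-bit identical history output (the 2-core case name, run paths, and history file timestamp below are illustrative):

# compare matching h0 history files from the two cases; cprnc prints per-field
# statistics and states whether the two files are identical
cprnc \
    /glade/derecho/scratch/$USER/2403_dev_hco_2.3-f10_singlecore/run/2403_dev_hco_2.3-f10_singlecore.cam.h0.2016-01-01-10800.nc \
    /glade/derecho/scratch/$USER/2403_dev_hco_2.3-f10_2core/run/2403_dev_hco_2.3-f10_2core.cam.h0.2016-01-01-10800.nc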

jimmielin added the bug label Apr 1, 2024

jimmielin commented Apr 1, 2024

What I've found is that HCO_CO and HCO_NH3 match bit-for-bit (files are identical) between single-core and MPI (2 cores), but HCO_NO does not. This is because HCO_NO applies another field, EDGAR_TODNOX, to it, and this field is somehow different in the two runs.

In the single-core run's cesm.log, for the EDGAR_TODNOX field:

0:  hcdebug: (edgar, i=           1 )  -9.9999998E+30 -9.9999998E+30 -9.9999998E+30
0:  -9.9999998E+30 -9.9999998E+30 -9.9999998E+30 -9.9999998E+30   1.302642
0:    1.306328       1.306328       1.306328       1.306328       1.306328
0:    1.306328       1.306328       1.319497       1.210987       1.210987
0:    1.020795
0:  hcdebug: (edgar, i=           2 )  -9.9999998E+30 -9.9999998E+30 -9.9999998E+30
0:  -9.9999998E+30 -9.9999998E+30 -9.9999998E+30 -9.9999998E+30   1.361859
0:    1.361859       1.361859       1.361859       1.372795       1.372795
0:    1.372795       1.372795       1.391346       1.385491       1.385491
0:    1.020795

In the MPI run, note how on both CPUs (0: and 1:) the first 5 grid boxes have the fill value:

0:  hcdebug: (edgar, i=           1 )  -9.9999998E+30 -9.9999998E+30 -9.9999998E+30
0:  -9.9999998E+30 -9.9999998E+30 -9.9999998E+30 -9.9999998E+30   1.302642
0:    1.306328       1.306328
1:  hcdebug: (edgar, i=           1 )  -9.9999998E+30 -9.9999998E+30 -9.9999998E+30
1:  -9.9999998E+30 -9.9999998E+30   1.374258       1.210987       1.210987
1:    1.019067
0:  hcdebug: (edgar, i=           2 )  -9.9999998E+30 -9.9999998E+30 -9.9999998E+30
0:  -9.9999998E+30 -9.9999998E+30 -9.9999998E+30 -9.9999998E+30   1.361859
0:    1.361859       1.361859
1:  hcdebug: (edgar, i=           2 )  -9.9999998E+30   1.405800       1.405800
1:    1.405800       1.405800       1.405477       1.385491       1.385491
1:    1.019067

For HCO_NO emissions at the surface:

0:  hcdebug: writing out lvl-sfc at present dt
0:  hcdebug: (i=           1 )   0.000000000000000E+000  0.000000000000000E+000
0:   0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
0:   0.000000000000000E+000  0.000000000000000E+000  2.544706480154566E-014
0:   2.170436195605955E-014  3.528874920258346E-017  0.000000000000000E+000
0:   1.880868963362995E-016  1.772452777726410E-015  0.000000000000000E+000
0:   3.712318333349080E-013  2.388847495132503E-014  3.953244476404797E-014
0:   0.000000000000000E+000  0.000000000000000E+000

In the MPI configuration:

0:  hcdebug: (i=           1 )   0.000000000000000E+000  0.000000000000000E+000
0:   0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
0:   0.000000000000000E+000  0.000000000000000E+000  2.544706480154566E-014
0:   2.170436195605955E-014  3.528874920258346E-017
1:  hcdebug: (i=           1 )   0.000000000000000E+000  1.439813645198005E-016
1:   1.356820567806388E-015  0.000000000000000E+000  2.841796369544945E-013
1:   2.487986842177408E-014  3.953244476404797E-014  0.000000000000000E+000
1:   0.000000000000000E+000

Note how the numbers on the first CPU match bit-for-bit but not on the second one. The values at i=1 should be contiguous going from CPU 0 to CPU 1, matching the values from the single CPU in the single-core result.

The domain decomposition for the 10x15 domain is:
Global domain size: 24 longitudes, 19 latitudes (24x19)

  • Lon centers: -180 -165 -150 ... 150 165 180
  • Lon edges: -187.5 -172.5 ... 157.5 172.5 (verified that the behavior of edges being < -180.0 is consistent with GEOS-Chem Classic.)
  • Lat edges: -90 -85 -75 ... -5 5 15 ... 85 90 (half-sized polar grid boxes)

For 2 CPUs, the domain is chopped horizontally in the middle where each CPU covers all longitudes but half of the latitudes.

  • CPU0: lat edges -90 -85 ... -5 5
  • CPU1: lat edges 5 15 25 ... 85 90
    So there is no reason why CPU1 should be seeing fill values, since the single-core data shows valid values at i=1 (lon=-180) for lat=5, 15, ...

The print outputs come directly from HCO_GetPtr or from %Emis%Val, so they are from HEMCO upstream code, before it hits the regridder. So I feel there is a bug somewhere upstream in HEMCO or in Map_A2A, but I have not gone down the rabbit hole of hco_readlist_mod, hcoio_read_std_mod, and map_a2a yet.

I've tried shifting the longitude edges for the HEMCO grid in hco_esmf_grid.F90, as having -187.5 as a starting value (equal to -180 minus 15/2) seemed sketchy to me. But using either -180.0 or -187.5 as the leftmost edge only introduces numerical differences (expected, since the grid presented to HEMCO changes) and does not make the difference between 1 core and 2 cores disappear.


lizziel commented Dec 4, 2024

This was partially fixed with #41. More work is needed to fix the ERP tests, which change the task/thread count upon restart.
