Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Tracers infrastructure #133

Merged
merged 5 commits into from
Oct 4, 2024

Conversation

grnydawn
Copy link

@grnydawn grnydawn commented Sep 26, 2024

Created the Tracers class and implemented its methods. Unit tests and documentation were also added.

Tracer defintion file: "omega/src/ocn/TracerDefs.inc"
Tracer configuration file: "omega/configs/Default.yml"

Checklist

  • Documentation:
    • Design document has been generated and added to the docs
    • User's Guide has been updated
    • Developer's Guide has been updated
  • Testing
    • CTest unit tests for new features have been added per the approved design.

@grnydawn grnydawn assigned grnydawn and philipwjones and unassigned grnydawn Sep 26, 2024
@grnydawn grnydawn force-pushed the ykim/omega/tracerinfra branch 2 times, most recently from 89bf794 to 30a52b9 Compare September 26, 2024 23:13
@grnydawn
Copy link
Author

grnydawn commented Sep 26, 2024

Revised several getXX functions to return an integer error code and to let the first argument as an output argument.

@grnydawn grnydawn force-pushed the ykim/omega/tracerinfra branch 2 times, most recently from 399ece2 to 2e30794 Compare September 27, 2024 13:52
Copy link

@philipwjones philipwjones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @grnydawn this looks good. Just a few comments that may require others to chime in on naming and having public indices.

const I4 TimeLevel, ///< [in] Time level index
const I4 TracerIndex ///< [in] Global tracer index
);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update docs for the changes in retrieval functions

Copy link
Author

@grnydawn grnydawn Sep 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

//---------------------------------------------------------------------------
std::string Tracers::packTracerFieldName(const std::string &TracerName) {
return "Tracer" + TracerName;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer not to have the Tracer prefix on the field names. We want a consistent name between tracers, streams and wherever the field is referenced. So the tracer name is sufficient

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefix is not used now, and packTracerFieldName function is removed.

TracerGroups[GroupName] =
std::pair<I4, I4>(GroupStartIndex, TracerIndex - GroupStartIndex);
FieldGroup::create("TracerGroup" + GroupName);
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here - I would prefer not to have the TracerGroup prefix. Just use the tracer group name as the field group name so it will be the same in Tracers, Field, Streams, etc

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The group prefix is not used now.

}

// Read tracer definitions from file
#include "TracerDefs.inc"
Copy link

@philipwjones philipwjones Sep 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment on tracer indices and the include file

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@philipwjones is your suggestion somewhat equivalent to taking this part of the init routine and adding it to the include file? Or are you proposing to not iterate over the groups in the tracer config and just having hard coded indices for each tracer no matter if they are active or not?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vanroekel Not hard-coded indices. We would still initialize at run time, just provide a shared named index that we could refer to rather than retrieving by name. But it looks like that's going to be more complicated than I thought so I think we'll stick with the current implementation for now.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, so mostly moving the init to the .inc file, but I did see that Youngsung found it doesn't work as easily as possible. Thanks.

/// definitions selected in the Tracers section of the OMEGA YAML configuration
/// file will be allocated into memory.
//===----------------------------------------------------------------------===//

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the reasons for the getByIndex function was to avoid the more expensive string comparison in getByName. In the current implementation, we would still need to get the index by name first and then use the index. I'm wondering whether in the include file, we could include a list of shared, public indices for each tracer (set when they are defined). Like IndxTemp, IndxSalt, IndxDebug1, etc. Since we only want to edit one file when adding tracers, that would mean putting the indices at the top of this include file and including it in Tracers.h. That means we would have to wrap all the defines in a defineAll function that is called where the current inc file is included.

What do others think?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@philipwjones , what are the benefits of having a list of shared and public indices for each tracers?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By way of explanation, this is what I'm thinking. The TracersDef.inc file would look like:

// Index defined for each tracer. The value of these tracer indices are still set in Tracers::init()  and
// tracers that have not been selected would have standard invalid value: Tracer::IndxInvalid = -1
static int IndxTemp;
static int IndxSalt;
static int IndxMyBGCTracer;
static int IndxAnotherBGCTracer;
static int IndxDebug1;
static int IndxDebug2;
etc. for all defined tracers

// Tracer definitions packaged in a defineAllTracers function
static void defineAllTracers() {

  // here we put all the define calls as they exist in the current TracersDef.inc...
}

The TracersDef.inc file would then be included in the Tracers.h file in the public part of the Tracer class definition so that they are all available using eg Tracer::IndxSalt. In the Tracers.cpp file where the include currently sits, we would replace the include statement with a call to defineAllTracers. The init function would still assign consecutive indices to groups.

This would allow everyone to access the indices directly, providing speedier access without an explicit call to search for the index by name. I realize this is more of a Fortran style and not the usual C++ approach but I think it is both simpler and faster than a search by name or having all modules do the lookup on init.

This is just a proposal - others should weigh in...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@philipwjones, I tried to implement your suggested code in this branch (https://github.com/grnydawn/Omega/tree/ykim/omega/tracerinfra-def) but encountered a linking issue:

ld.lld: error: undefined symbol: OMEGA::Tracers::IndxTemp
>>> referenced by Tracers.cpp
>>>               Tracers.cpp.o:(OMEGA::Tracers::defineAllTracers()) in archive ../src/libOmegaLib.a

To resolve the linker error, I added the following near the top of Tracers.cpp:

I4 Tracers::IndxTemp = 0;
I4 Tracers::IndxSalt = 0;

This fix worked, but it makes me think that an additional user-provided include file might be necessary. I’m still learning C++, so I may be missing something.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right - forgot about that. The header just has the forward declaration but we still need to initialize in the implementation (.cpp) file. We might be able to include the same file in both with some creative ifdefs but this is starting to get more complicated than I had intended. Let me think about it some more

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's stick with your initial implementation for now

Copy link
Member

@mwarusz mwarusz Oct 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am bit late to this conversation, but if you declare the member index variables to be inline static then you can initialize them in the header file. Like this:

inline static int IndxTemp = -1;

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify, and echo a comment from @xylar at the last meeting: this is just a convenience/optimization that can be done for certain tracers and not a requirement, correct?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mwarusz, thanks for the tip about using inline. I have updated Tracers with the inline, and it worked well.

@philipwjones, I am using the variable names for the static indexes from your example code above. If you need to add more tracer index variable names in "TracerDefs.inc", let me know, or we can update the file later.

@grnydawn grnydawn force-pushed the ykim/omega/tracerinfra branch from 2e30794 to 17af062 Compare September 28, 2024 17:22
@hyungyukang
Copy link

hyungyukang commented Sep 30, 2024

All ctest passed on pm-cpu.

I believe loading tracers from a file at initial time has not been implemented in this PR. I think we need this functionality for tracers as in OceanState, as many tests use tracer information from files (e.g., a cosine bell test in Polaris):

DefaultOceanState->loadStateFromFile(DefHorzMesh->MeshFileName, DefDecomp);

@grnydawn
Copy link
Author

@hyungyukang, thank you for your review. We certainly need to add the feature of saving and loading the tracer to/from an external file within the tracer infrastructure. However, after discussions with the other members, we have decided to implement these features through IOStream. Therefore, the features will be added separately after this PR.

@hyungyukang
Copy link

@hyungyukang, thank you for your review. We certainly need to add the feature of saving and loading the tracer to/from an external file within the tracer infrastructure. However, after discussions with the other members, we have decided to implement these features through IOStream. Therefore, the features will be added separately after this PR.

Got it. Thanks for clarifying, @grnydawn !

@mark-petersen
Copy link

@grnydawn I tested on Frontier cpus and get the error below. I suspect I did something wrong in the compile or linking. Could you check? I'm at the airport and don't have time to look into it.

The error is:

27/27 Test #27: TRACERS_TEST .....................***Failed    2.45 sec
2: non-void function did not return a value
3: non-void function did not return a value
0: non-void function did not return a value
7: non-void function did not return a value
5: non-void function did not return a value
6: non-void function did not return a value
1: non-void function did not return a value
4: non-void function did not return a value
srun: error: frontier00637: tasks 1-7: Illegal instruction
srun: Terminating StepId=2609730.26
0: slurmstepd: error: *** STEP 2609730.26 ON frontier00637 CANCELLED AT 2024-10-01T09:41:08 ***
srun: error: frontier00637: task 0: Illegal instruction (core dumped)
Here are my steps:


######### Frontier cpu ############
CODE_DIR=opr
RUNDIR=241001_omega
mkdir /lustre/orion/cli115/scratch/mpetersen/runs/$RUNDIR
cd !$

cd /ccs/home/mpetersen/repos/E3SM/${CODE_DIR}
git submodule update --init --recursive externals/YAKL externals/ekat externals/scorpio cime
cd /lustre/orion/cli115/scratch/mpetersen/runs/$RUNDIR

module load cmake
rm -rf build
mkdir build
cd build
export PARMETIS_ROOT=$PROJWORK/cli115/pwjones/frontierlibs-cray/parmetis
cmake \
   -DOMEGA_CIME_COMPILER=crayclang \
   -DOMEGA_BUILD_TYPE=Release \
   -DOMEGA_CIME_MACHINE=frontier \
   -DOMEGA_PARMETIS_ROOT=${PARMETIS_ROOT}\
   -DOMEGA_BUILD_TEST=ON \
   -Wno-dev \
   -S /ccs/home/mpetersen/repos/E3SM/${CODE_DIR}/components/omega -B .
./omega_build.sh

# linking:
cd test
ln -isf ~/meshes/omega/O*nc .
cp /ccs/home/mpetersen/repos/E3SM/${CODE_DIR}/components/omega/configs/Default.yml omega.yml


salloc -A cli115 -J inter -t 2:00:00 -q debug -N 1 -S 0
RUNDIR=241001_omega
cd /lustre/orion/cli115/scratch/mpetersen/runs/$RUNDIR/build
./omega_ctest.sh

@grnydawn grnydawn force-pushed the ykim/omega/tracerinfra branch from 17af062 to 57521d0 Compare October 1, 2024 14:51
@grnydawn
Copy link
Author

grnydawn commented Oct 1, 2024

@mark-petersen , Indeed, there was an issue with not returning an error code. The Cray compiler caught this error, while the NVIDIA and Intel compilers did not. The issue is now fixed, and my test on Frontier has passed. Thank you very much for identifying this error.

@xylar
Copy link

xylar commented Oct 1, 2024

@grnydawn, could you rebase onto develop so we can see if CI works again?

Created the Tracers class and implemented its methods.
Unit tests and documentation were also added.
@grnydawn grnydawn force-pushed the ykim/omega/tracerinfra branch from 57521d0 to 3cfa5cf Compare October 1, 2024 20:02
@grnydawn
Copy link
Author

grnydawn commented Oct 1, 2024

@xylar , thank you very much for the quick fix. I rebased onto develop and CI test has passed.

@brian-oneill
Copy link

Thanks so much for your work on this @grnydawn. Unit tests passed for me on Chrysalis and pm-cpu and pm-gpu (aside from an unrelated issue #143). This all looks good to me.

Copy link
Member

@mwarusz mwarusz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good thanks @grnydawn. Aside from #143, everything passed for me on frontier. My only serious question is is about time level conventions.

std::map<std::string, std::pair<I4, I4>> Tracers::TracerGroups;
std::map<std::string, I4> Tracers::TracerIndexes;
std::map<I4, std::string> Tracers::TracerNames;
std::vector<std::string> Tracers::TracerDimNames = {"NCells", "NVertLevels"};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TracerDimNames is used only inside one function. Maybe it could be a local variable instead of a member ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that’s better. I will move it to the local scope. Thanks!


//---------------------------------------------------------------------------
// get Tracer arrays
// TimeLevel == [0:current, -1:previous, -2:two times ago, ...]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This convention for time levels looks different than what the state class uses. In there time levels are nonnegative, NextLevel = NTimeLevels - 1, and CurLevel = NTimeLevels - 2. Would it be possible to follow one convention ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for bringing this issue up. I think it's worth some discussion because the difference in indexing conventions between Tracers and State stems from how the time level index is managed.

Both State and Tracers have a std::vector (let's call it ARR) as a private member in their respective classes to maintain multiple time levels.

In State, NextLevel is always set to the highest index of ARR, and CurLevel is NextLevel - 1. Since the ARR indices for the time levels are fixed, to increment the time level in State, the elements of ARR are swapped. I guess this approach may be better aligned with the established convention.

In Tracers, a private integer member (let's call it IDX) keeps track of the array (ARR) index for the current time level. To increment the time level, IDX is simply increased by 1. If IDX exceeds the size of ARR, it is reset to zero. I chose this approach to maximize performance because adding 1 is computationally cheap.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both State and Tracers have a std::vector (let's call it ARR) as a private member in their respective classes to maintain multiple time levels.

Small correction, state stores its variables as a public Kokkos::Array<Array2DReal, MaxTimeLevels>.

Copy link
Member

@mwarusz mwarusz Oct 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I explained why I had suggested using Kokkos::Array for the State in the response to @sbrus89 comment below and why I no longer think it is necessary. As for the index convention, it is a separate issue, and I don't have a strong preference. One thing that would be nice is to have a common function for checking that a time index is valid.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@grnydawn and @mwarusz, after looking at this a little more, I think we should change State to follow the time index convention here (as well as using the std::vector approach).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mwarusz , @sbrus89 , Agreed. We definitely need to have the same time level convention between State and Tracers. During my testing, I had to subtract (1) from the State time level to advance Tracers.

Copy link

@mark-petersen mark-petersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now passes on Frontier with crayclang compiler. Approving based on others comments above, and my own inspection.

Comment on lines +30 to +33
static std::vector<Array3DReal>
TracerArrays; ///< TimeLevels -> [Tracer, Cell, Vert]
static std::vector<HostArray3DReal>
TracerArraysH; ///< TimeLevels -> [Tracer, Cell, Vert]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the State we (based on @mwarusz's suggestion) used:

Kokkos::Array<Array2DReal, MaxTimeLevels>
       LayerThickness;

to handle the time levels, to avoid any device incompatibility issues with std::vector. Is that something we would want to do here as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was done to allow calling LayerThickness[TimeLevel] directly from a device kernel. However, I don't think we ended up using this capability anywhere. The approach taken in this PR is to have getter functions for tracer arrays that perform index checks and log errors. The logging alone prevents from calling them on GPU, but it is nice for safety. I think using std::vector here is fine and we should change the State to use it as well.

Comment on lines +408 to +409
All other requirements will need to be tested in later system tests once
I/O streams and time integration schemes have been implemented.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be testing the validation of tracer fields using the min/max valid values or will this come later?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@philipwjones, do you have any thoughts on this?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't necessarily have to be a part of this PR, but it might be good to have a function like validate and/or validateAll to check tracer values for obvious errors like out-of-range. We would need a similar function in State. It might need to be up to the time integrator and/or a run-time variable to determine how frequently to check since it would impact performance. So maybe a later PR where we can decide how/when to call it?

@grnydawn
Copy link
Author

grnydawn commented Oct 3, 2024

@philipwjones, I have updated this to include the trace indices defined in "TracerDefs.inc," as @mwarusz suggested. I believe I have made the updates per the reviews except the recent validation feature. @hyungyukang, the basic algorithm of the tracer infrastructure has not changed, though some implementation details have been modified. Please feel free to review the updates and approve them when you think they are ready.

Copy link

@philipwjones philipwjones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@grnydawn Thanks for the changes. I think a quick update to the documentation to reflect the addition of tracer indices in the include is needed and then I had a separate comment about removing undefined indices. With those changes, I'm fine with this and will approve on the assumption those changes are made.

inline static I4 IndxAnotherBGCTracer = Tracers::IndxInvalid;
inline static I4 IndxDebug1 = Tracers::IndxInvalid;
inline static I4 IndxDebug2 = Tracers::IndxInvalid;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the index entries that don't have a corresponding define function. That way the index list serves as a list of potentially available tracers.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the unused entries.

@grnydawn
Copy link
Author

grnydawn commented Oct 3, 2024

@philipwjones , updated docs to add tracer indices in TracerDefs.inc file. Thanks!

Copy link

@hyungyukang hyungyukang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@grnydawn , thank you very much for your hard work on this PR. Codes look good to me. I have tested this PR with #132 and #140 simultaneously by implementing a cosine bell advection test case that requires 1) tracer array and info manipulations through this PR, 2) writing history output through IOStream PR, and 3) tracer advection through tracer tendency terms PR

During my testing, I had to separately read initial tracers from the initial file, but @grnydawn confirmed that this feature will be added to IOStreams separately. I have tested several functions in this PR, and they all worked fine.

In this cosine bell test, I disabled all tendency terms except for tracer advection.

tracer1 at t=0 day
image

tracer1 at t=24 days
image

Due to the rushed implementation of this test, there may be small bugs in the test. However, the functionality of this PR is working as intended

There was a concern about the time level convention for Tracers that differs from State. During testing, I had to subtract (1) from the State time level to advance Tracers. As @mwarusz and @sbrus89 mentioned, I believe this will be fixed soon.

I'm approving this PR based on my testing and the testing and visual inspection of others. Thanks again @grnydawn !


//---------------------------------------------------------------------------
// get Tracer arrays
// TimeLevel == [0:current, -1:previous, -2:two times ago, ...]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mwarusz , @sbrus89 , Agreed. We definitely need to have the same time level convention between State and Tracers. During my testing, I had to subtract (1) from the State time level to advance Tracers.

Copy link

@cbegeman cbegeman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@grnydawn Thanks for addressing my comments and for your hard work on this! Approving based on visual inspection and the testing of others.

Copy link

@sbrus89 sbrus89 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes look great to me. I'll work on a PR to make the time handling consistent in State. Approving based on inspection and testing of others. Thanks your work on this @grnydawn!

@sbrus89
Copy link

sbrus89 commented Oct 4, 2024

@philipwjones, I think this one is ready to merge, if you want to do the honors. I checked with Luke and he gave his go-ahead as well (but said he was having trouble with getting the PR to load so he could approve.)

static void defineAllTracers() {

define("Temp", ///< [in] Name of tracer
"Potential Temperature", ///< [in] Long name or description

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noting here and in similar places, for TEOS-10 (new EOS) this should actually be Conservative Temperature not Potential Temperature and sea_water_conservative_temperature for the CF name

Similarkly salinity is now Absolute Salinity and not Practical Salinity units are g/kg instead of psu

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be changed later.

Copy link

@vanroekel vanroekel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving on visual inspection, but noting we'll need to update temperature and salinity naming and metadata at a future time to be consistent with TEOS-10. This can wait.

@philipwjones
Copy link

@vanroekel I'll make the change before merging, thanks.

  - also added indices for defined debug tracers
@philipwjones philipwjones merged commit 8b50f3c into E3SM-Project:develop Oct 4, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants