Add decomp test case to the drying slope test group #695
Conversation
Testing: I ran both
@gcapodag This is the branch I used to replicate your issue. I figured I'd add it to compass. It would be great if you had a chance to check it out and run it (maybe both with master and your fix).
@xylar I realize that a threads test will also be handy. I plan to add this with other extensions of the
Force-pushed from cfa846c to e0ba5f0
Thank you @cbegeman. I tested on Perlmutter with gnu. I ran
right before this code in RK4:
I wonder if the reason why we need this update should be investigated further. As I mentioned to @xylar, the
Looking at the computation of
I am a little confused by the above computation: why is a halo update not needed at present? If we are looping over
@gcapodag Thanks for doing that testing. I'm happy to take a look at the RK4 routine to try to understand why the line is needed. Upwinding with split-explicit does pass a 1-proc vs. 12-proc decomp test when applied to the baroclinic channel test case. It cannot be tested with the drying slope test case because I have not yet completed the W&D implementation for split-explicit.
@gcapodag With respect to this question, I don't think we need to worry about looping over
Hi @cbegeman, by outside the domain do you mean outside the actual global domain or outside the partition handled by a given process?
@gcapodag Oh, I misunderstood. I was talking about the whole domain. You're concerned with the partition. I believe I was motivated by trying to define
I suppose you could try replacing

Am I getting warmer in terms of addressing your concerns? I'm not the most knowledgeable about halos in MPAS-O.
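The stale-halo mechanism being discussed here can be illustrated without MPI. Below is a minimal, hypothetical Python sketch (names like `own0`/`halo1` are illustrative only, not MPAS-O code) of a 1-D field split between two ranks, each holding one halo cell: after a local update, the halo copy is stale until an explicit exchange refreshes it.

```python
import numpy as np

# Hypothetical 1-D field split across two "ranks"; each rank stores its
# owned cells plus one halo cell mirroring the neighbour's boundary cell.
field = np.arange(8, dtype=float)
own0, own1 = field[:4].copy(), field[4:].copy()
halo0 = own1[0]    # rank 0's halo copy of rank 1's first owned cell
halo1 = own0[-1]   # rank 1's halo copy of rank 0's last owned cell

# Each rank updates only the cells it owns (think: one RK4 substage
# advancing the layer thickness on the locally owned mesh).
own0 *= 2.0
own1 *= 2.0

# The halo copies are now stale: any edge diagnostic computed across the
# partition boundary would use an out-of-date value on one rank.
assert halo1 != own0[-1]

# An explicit halo exchange restores consistency (and hence BFB
# agreement with a single-process run).
halo0, halo1 = own1[0], own0[-1]
assert halo1 == own0[-1]
```

In MPAS-O the exchange is a real MPI communication rather than an assignment, but the consistency requirement is the same: diagnostics computed in the halo are only trustworthy after the prognostic fields they depend on have been exchanged.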
Sorry I lost track of the review request here. I haven't run the test myself but the code looks great! I'm happy to have you merge when you're ready.
I think it's better if it fails, to remind us to merge the fix. I guess we could wait and merge the test after the fix is in, but I think it's fine if the test fails as long as it's not in the
@gcapodag Did you get a chance to check out this branch and try running it? If so, could you approve the PR? You're welcome to provide other feedback if you have any. If you don't have time to review, let me know.
@gcapodag That works for me. Thanks!
Hi @cbegeman, I tested again on Perlmutter and I think we can merge. I took some time to look further into why the tests are failing in the current setup, and found out that the problem is with `layerThickness` and not with `layerThickEdgeFlux`. There is a freaky situation happening in which, for some not yet identified reason, `layerThickness` is no longer BFB at some point during the RK4 time stepping, and that affects the computation of `layerThickEdgeFlux`. If I add a halo update in RK4 right before the final computation of the diagnostic variables, the BFB issue is gone (so we do not need a halo update on `layerThickEdgeFlux`). The interesting/weird/troubling thing is that the issue with `layerThickEdgeFlux` is actually also present with 8 procs, for instance, but, due to the process configuration I suppose, the "bad" value is never actually used in the computation of the `hadv_coriolis` tendency contribution. This is a print from inside the loop over edges on edge in `hadv_coriolis` for 8 procs (focusing on `eoe` such that `indexToEdgeID(eoe) == 14`):
```
from tendency: 1.3944039866367177 11 18
from tendency: 1.3944039866367177 11 18
from tendency: 1.3944039866367177 11 18
from tendency: 1.3944039866367177 11 18
from tendency: 1.3944039866367177 11 18
from tendency: 1.3944039866367177 11 18
from tendency: 1.3944039866367177 11 18
from tendency: 1.3944039866367177 11 18
```
The first number is `layerThickEdgeFlux(k,eoe)` and the other two are the two cells, in local numbering, that share that edge. Note that the edge `eoe` that is picked up is always the same, which happens to be the one with the correct value of `layerThickEdgeFlux`. For 10 procs, one time the loop picks up the edge with the wrong (i.e. not halo-updated) value:
```
from tendency: 1.3944039866367177 7 14
from tendency: 1.3944039866367177 7 14
from tendency: 1.3944039866367177 7 14
from tendency: 1.3944039866367177 7 14
from tendency: 10.045894683899487 31 45
from tendency: 1.3944039866367177 7 14
from tendency: 1.3944039866367177 7 14
from tendency: 1.3944039866367177 7 14
```
So now, there are two options in my opinion: (1) add a halo update on the layer thickness before the final computation of the diagnostics in RK4, or (2) spend time to understand why the layer thickness is not BFB in the first place. Obviously, one route is less painful than the other.
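What the decomp test ultimately checks for either option is bitwise (BFB) agreement of fields between partitionings. A minimal, hypothetical sketch of such a comparison (the array names and values are illustrative, echoing the prints above; compass's actual validation code differs):

```python
import numpy as np

def is_bfb(a, b):
    """Bitwise comparison of two float64 arrays: reinterpret the raw
    bits as unsigned integers so that even a single-ULP difference,
    or a differing NaN payload, fails the check."""
    return a.shape == b.shape and np.array_equal(
        a.view(np.uint64), b.view(np.uint64))

# With a proper halo update, both decompositions produce identical bits;
# a stale halo value (like the 10.045... print above) does not.
run_1proc  = np.array([1.3944039866367177, 1.3944039866367177])
run_10proc = np.array([1.3944039866367177, 10.045894683899487])

print(is_bfb(run_1proc, run_1proc))   # True
print(is_bfb(run_1proc, run_10proc))  # False
```

Comparing the reinterpreted bits rather than the float values is the standard way to make "BFB" literal: floating-point `==` would treat `NaN != NaN` and `-0.0 == 0.0`, which a bitwise check does not.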
@gcapodag Thanks for your testing and this additional context! I'm moving away from using RK4 configurations on my projects, and since RK4 is also not used in E3SM configurations, this may really be a decision for ICoM. I think that, given ICoM's priorities, the former option would be sufficient. However, given the shifting priorities, you could raise this with Rob Hetland and see what he thinks.
This PR adds a decomp test case to the drying slope test group.
Checklist
- API documentation (`api.rst`) has any new or modified class, method and/or functions listed
- Testing in this PR documents any testing that was used to verify the changes

This test case helps evaluate #686.