Ocean/layer volume weighted mask #1292
base: ocean/develop
Conversation
@mark-petersen here are my proposed changes to the layer-volume-weighted AM. I have left the don't-merge tag on for now, until I can confirm that the output is consistent with what the head of ocean/develop produces. There may be other areas for performance enhancement as well, and I would greatly appreciate any suggestions. Also, if @philipwjones or @pwolfram have additional thoughts on areas for improvement, please let me know.
Hi Luke,
Great work! I've noted some places that may present additional performance gain opportunities in this file. Specific ideas are highlighted, but here are some high-level ones:
- If we could reduce kBufferLength = nOceanRegionsTmp*nLayerVolWeightedAvgFields*nVertLevels, it would be helpful, because this buffer must be communicated in parallel.
- Could loop unrolling be used to speed up computation? Is it possible the compiler isn't doing this automatically, for example?
Lastly, and least importantly: do loops aren't indented properly in several places in the file, e.g., line 498.
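(As an aside on the loop-unrolling bullet, a minimal sketch of manual unrolling with a hypothetical loop and names — most compilers already do this automatically at -O2/-O3, which is worth verifying before hand-unrolling:)

! Baseline: do i = 1, n; s = s + a(i); end do
! Unrolled by 4; assumes n is a multiple of 4, otherwise a remainder loop is needed:
do i = 1, n, 4
   s = s + a(i) + a(i+1) + a(i+2) + a(i+3)
end do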
workBufferSum(kBuffer) = workBufferSum(kBuffer) + workSum(iDataField)
workBufferMin(kBuffer) = min(workBufferMin(kBuffer), workMin(iDataField))
workBufferMax(kBuffer) = max(workBufferMax(kBuffer), workMax(iDataField))
do iRegion=1,nOceanRegionsTmp
@vanroekel, doesn't the first index vary fastest in Fortran (e.g., http://jblevins.org/log/efficient-code)? Unless I'm confused here, the loop order may be switchable for some performance gain.
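(For reference, a minimal illustration of the column-major point, with hypothetical array names and bounds:)

! Fortran stores arrays column-major: a(i,j) and a(i+1,j) are adjacent in memory,
! so the innermost loop should run over the first index for unit-stride access.
do j = 1, n
   do i = 1, m
      a(i, j) = a(i, j) + b(i, j)   ! unit stride
   end do
end do
! Swapping the two loops gives stride-m access and can be much slower for large arrays.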
My mistake, you are certainly right. I will fix this.
workSum(3) = workSum(3) + cellVolume ! volume sum
do iDataField=4,nDefinedDataFields
   workSum(iDataField) = workSum(iDataField) + cellVolume*workArray(iDataField,iCell) ! volume-weighted sum
do iRegion=1,nOceanRegionsTmp
@vanroekel, it looks like the indentation may be off here.
workSum(iRegion, 3) = workSum(iRegion, 3) + cellVolume ! volume sum
do iDataField=4,nDefinedDataFields
   workSum(iRegion, iDataField) = workSum(iRegion, iDataField) + cellVolume*workArray(iDataField,iCell) ! volume-weighted sum
enddo
Is the code below right? iDataField appears to be the last value of nDefinedDataFields. Should this code be moved up in the loop? This isn't 100% obvious at first glance; a comment here would be helpful.
Moved into what loop? It is in the iRegion loop (sorry for the poor indenting). I'm not sure why this would be moved into the cells loop. This loops over all the physical variables (1,2,3 are mask, area, and volume respectively).
I think I see what you were referencing here. It is the two lines below the iDataField loop; you are right, these needed to be moved, and they have been.
@@ -333,21 +336,21 @@
<var name="workMinLayerVolume"
     persistence="scratch"
     type="real"
-    dimensions="nLayerVolWeightedAvgFields Time"
+    dimensions="nOceanRegionsTmp nLayerVolWeightedAvgFields Time"
Is the array order in memory here consistent with the fastest array-access pattern? This is a key thing to check that could help performance.
Nope, this needs to be fixed as mentioned above.
@@ -250,17 +336,17 @@ subroutine ocn_compute_layer_volume_weighted_averages(domain, timeLevel, err)!{{
   ! get pointers to scratch variables
   call mpas_pool_get_subpool(block % structs, 'layerVolumeWeightedAverageAMScratch', scratchPool)
   call mpas_pool_get_field(scratchPool, 'workArrayLayerVolume', workArrayField)
-  call mpas_pool_get_field(scratchPool, 'workMaskLayerVolume', workMaskField)
+  ! call mpas_pool_get_field(scratchPool, 'workMaskLayerVolume', workMaskField)
Why is this not just removed?
This is a mistaken holdover. I have removed it.
@@ -312,18 +398,11 @@ subroutine ocn_compute_layer_volume_weighted_averages(domain, timeLevel, err)!{{
   workBufferMin(:) = +1.0e20_RKIND
   workBufferMax(:) = -1.0e20_RKIND

   ! loop over all ocean regions
   do iRegion=1,nOceanRegionsTmp

      ! loop over the vertical
      do iLevel=1,nVertLevels
Do we absolutely have to do this for each level? Could we get away with every other or every n levels, for example?
Probably not, but I don't think the work required to skip levels would pay off relative to the efficiency gained.
One more performance thought, hopefully the most helpful one: communication is in the loop, e.g., at https://github.com/vanroekel/MPAS/blob/74f8f668ff9b97bc845567102dffb2772ef94af4/src/core_ocean/analysis_members/mpas_ocn_layer_volume_weighted_averages.F#L449. It may be better to cache the results and communicate at the end in one call, to avoid the latency of repeated communication.
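(A minimal sketch of the buffering pattern being suggested, written against raw MPI with hypothetical names — the AM actually communicates through MPAS's dmpar wrappers, so treat this purely as an illustration of one reduction call replacing many:)

! Fill one flat buffer inside the loops, indexing as, e.g.,
!   kBuffer = iLevel + (iDataField-1)*nVertLevels + (iRegion-1)*nVertLevels*nDefinedDataFields
! then reduce the whole buffer in a single call after the loops:
call MPI_Allreduce(workBufferLocal, workBufferGlobal, kBufferLength, &
                   MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr)
! One call amortizes the message latency over the whole buffer instead of paying it per level or region.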
@vanroekel - I'll look at this today.
Force-pushed from 74f8f66 to a184f5a.
@pwolfram, communication is not in the loop; the line you point to is after the block loop, and communication happens after the nVertLevels loop. And the buffer size is now indeed reduced.
@vanroekel - hate to say this, but this really needs to be rewritten. @pwolfram already pointed out the stride-access issues: all the arrays should have the k index (iLevel) first and the level loop innermost. In addition, having the min/max/sum statistics as a separate function creates a lot of unnecessary storage and expensive allocations that would not be needed with an in-line calculation and an appropriate loop ordering. If you want, I can take a shot at the core routine(s) and could turn this around tomorrow.
@philipwjones if you are willing to take a stab at the rewrite, I would be very happy. I was trying not to change too much, hence the separate stats call and the slightly oddly ordered arrays, so if you want to make any and all changes, I'd be very grateful. This routine is a big hit for the high-res runs, taking 5% of the total ocean time with a compute interval of 1 day. For my own understanding, can you explain why k first instead of nDataFields first, which is what I have now done here?
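(To illustrate the k-first answer to this question, a minimal sketch with hypothetical array names, not the actual AM arrays: with the k index first and the level loop innermost, the long vertical loop walks contiguous memory, and the statistics can be accumulated in line without a separate stats call or extra work arrays.)

! Hypothetical shape: work(nVertLevels, nDefinedDataFields), k index first.
do iDataField = 1, nDefinedDataFields
   do iLevel = 1, nVertLevels   ! innermost loop over the first index: unit stride
      sumVal(iLevel, iDataField) = sumVal(iLevel, iDataField) + work(iLevel, iDataField)
      minVal(iLevel, iDataField) = min(minVal(iLevel, iDataField), work(iLevel, iDataField))
      maxVal(iLevel, iDataField) = max(maxVal(iLevel, iDataField), work(iLevel, iDataField))
   end do
end do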
@vanroekel and @philipwjones There is a COMPASS test that will be very helpful here. I just added a little commit to add more variables to the comparison; hope you don't mind, @vanroekel. You can set up this test case as follows, on a login node:
[setup commands not preserved]
I'm finding that the min and max are broken using the above test, for both Volume and Layer variables. The avg appears to be working. It was working on ocean/develop.
@mark-petersen thanks for testing this. I have only checked the layer averages thus far and hadn't had a chance to look at max/min; I will try to look soon. From the look of the numbers, it seems like perhaps I'm not dividing by the cell volume.
Actually, the regional max value is incorrect in both the original and @vanroekel's versions: the routine has a statement that takes the max of minimum values instead of the true maximum. Done with my rewrite and am now testing...
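(The quoted statement did not survive in this capture; a hedged reconstruction of the kind of bug described, reusing the work-array names from the diffs above — the exact line in the routine may differ:)

! Suspected form of the bug: the max buffer is updated from the min work array,
workBufferMax(kBuffer) = max(workBufferMax(kBuffer), workMin(iDataField))
! whereas the correct statement would use workMax:
! workBufferMax(kBuffer) = max(workBufferMax(kBuffer), workMax(iDataField))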
@philipwjones What is the status of this PR? @q-rai (Anne Berres) has volunteered to add regional mask computations (from a netcdf file) to this AM, but she is only at LANL for another week (@milenaveneziani, this will help us out). It is important that @q-rai start with code that is close to the final version of this PR, and she has time to work on it now. @philipwjones could you just push up whatever you have and tell us the status (passes or fails tests...)? @q-rai can start with that commit, and we can always rebase later. I bet any remaining changes for this PR will be small. Thanks!
@mark-petersen Will push at least one version up tonight. However, note that the region mask should really be an MPAS-ocean-wide capability and not limited to this AM. Maybe @q-rai should design/build a standalone module for this?
@philipwjones: the way the regional mask is implemented in this AM is similar to only two other AMs: the surfaceAvg and waterMassCensus AMs. I believe @q-rai already made changes to waterMassCensus (not sure about surfaceAvg), which should be similar to those to be made here.
@philipwjones @milenaveneziani On the topic of a standalone module: I won't be here long enough to design a new module, but I'll leave my two cents hoping they'll help someone else.
@q-rai and @milenaveneziani - Sorry if I was unclear. What I was suggesting is essentially what Anne is doing already - namely, replacing these hard-wired region masks with the common region masks, plus her suggestions for standardizing some of the naming (e.g., nOceanRegions). I was thinking more about the overall bookkeeping, grouping, and avoiding duplication among the AMs, which I believe is done now mostly through preprocessing of the mask file. Something for the future. BTW, I suspect we may want to change the index ordering on the masks - the versions of the layer avg code I sent earlier are meant to explore this to some extent.
@philipwjones I copied your three versions and the registry file into this repo, and made one commit for each for posterity; see above. All three compiled using gnu on grizzly. My trouble is, @q-rai has time to add the regional netcdf masks, but I don't have time to test these at high res because I'm committed to the ACME non-bfb block test until it's fixed. Anne is only here for one more week. @philipwjones could you make a best guess on which version to proceed with? Anne can start with that commit and add the netcdf region masks.
@mark-petersen @q-rai - the regional mask in all three is identical, so maybe start with the Fcolumn version, which is the most similar to the current one. If I get a chance after the threading stuff, I can try to evaluate these in a high-res case.
@philipwjones I noticed that you changed the hard-wired region dimensions down from 7 to 6. Completely unrelated: which version of the code were you basing your changes on? I'm asking because @vanroekel pulled the mask computation up into the init routine, which seemed like a good idea to me (I actually made the same change over here after seeing his solution). Is there a performance-related reason against it, or did you just start from a different codebase?
@q-rai Ok - I didn't catch the 7th, but it would probably be better to make that explicit rather than implied. I didn't see it in the var array in the registry (I may have missed it), so I just assumed there was an extra index. On masking - I did move the region mask up into init, but left the vertical mask in place, since future simulations with active ice sheets or inundation might have time-varying masks for vertical levels. So I computed the total mask as the product of an invariant region mask and a time-varying vertical mask. Due to some vectorization-related optimization later, we may end up having an MPAS-wide vertical mask rather than varying loop limits, so this may get pushed elsewhere.
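(A minimal sketch of the mask factorization described above, with hypothetical array names and shapes: the region mask is filled once at init, and the total mask is rebuilt each compute interval from the possibly time-varying vertical mask.)

! init, once: regionMask(iCell, iRegion) = 0.0_RKIND or 1.0_RKIND, time-invariant
! each compute interval:
do iRegion = 1, nOceanRegions
   do iCell = 1, nCellsSolve
      do iLevel = 1, nVertLevels   ! k first in memory, level loop innermost
         totalMask(iLevel, iCell, iRegion) = vertMask(iLevel, iCell) * regionMask(iCell, iRegion)
      end do
   end do
end do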
@philipwjones It's easy to miss; that's why I figured a comment stating it would help. I completely missed the mask part when skimming your code. I expected something of the form [expected snippet not preserved], because that's what I'd seen so far (and that's one big obvious block to see). Mark just explained to me that it doesn't actually have to be that way, because there's also a contains all the way at the top, before initializing.
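(For readers who, like me, expected one big obvious block: a minimal sketch of the layout Mark described, with hypothetical names — a single contains near the top of the module makes everything below it a module procedure, available from init onward, so the mask work need not sit in one block.)

module hypothetical_am_mod
   implicit none
   private
   public :: am_init, am_compute
contains   ! everything below this line is a module procedure
   subroutine am_init()
      ! region masks could be computed once here
   end subroutine am_init
   subroutine am_compute()
      ! per-interval work reuses the cached masks
   end subroutine am_compute
end module hypothetical_am_mod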
Force-pushed from c7aabaf to 8cde6b3.
I added region mask support. At first glance, I seem to get the same results as with hard-wired regions (using an equivalent region mask).
@q-rai thanks for your quick work on this!
Force-pushed from 0e08bf7 to 2f1d07c.
Force-pushed from 8cde6b3 to 1f66c6c.
Hard-wired regions and region masks can be turned on/off individually in the namelist.
Testing: [test links and status list not preserved]. The output looks promising, but there are some small issues that need fixing.
Here are some example images; commit numbers are in the captions. From left to right: ocean/develop hard-wired, Phil's hard-wired, mine hard-wired, mine region-mask.
[images and region-number key not preserved]
@vanroekel This had huge swaths of conflicts with the head. I'm postponing the merge, but we should do it next week; best to rebase after that.
@vanroekel and @mark-petersen, is this something we need to migrate to MPAS-Model?
These changes partially address slow performance in the layer-volume-weighted-average ocean AM. In the 18to6 case, this AM was nearly 60% of the time integration timer. With this PR, the time for this AM is reduced by 80%; it is now about 10% of the time integration timer.