-
Notifications
You must be signed in to change notification settings - Fork 374
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Redesigned "caar loop pre-boundary exchange", tuned for Frontier #6522
Conversation
This PR looks like it will be quite a bit of effort to merge in its current state. Perhaps it's premature to post it. Perhaps it would be better to discuss the branch on a Confluence page before finalizing a PR. Trey, when you have a chance, would you post full timer output for the two runs in which you see the 32x performance discrepancy? |
I created the following Confluence page for discussion. It includes the full timer output. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some preliminary comments. As Andrew said, this PR is quite big, and may be worth see if/how we can chunk it. But before then, we should find a way to make things work for Serial/OpenMP/Cuda backends.
du += dvvy[j] * t0.sv(x,j); | ||
dv += dvvx[j] * t1.sv(j,y); | ||
} | ||
if (rrdmd == 0) rrdmd = (1.0 / metdet) * scale_factor_inv; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to always be true. Maybe you could just do it in the constructor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Divisions are expensive on AMD GPUs. The attempt here is to do it only once per kernel, even if the kernel calls div
or fort
more than once, or both div
and vort
, and to do it zero times if the kernel never calls div
or vort
(but calls only grad
, for example).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. The F90 data structures do stored rmeted=1/metdet, but the C++ ones don't. I don't recall why we chose to do that. Perhaps we should do as in F90, since 1/metdet does not change across the run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Want me to try converting to a stored rmetdet
and see how performance compares?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for that. To avoid doing the division at every kernel call, we'd have to precompute the rmetdet view during init, and storing in the ElementsGeometry structure. I don't think that's a change for this PR. I was just thinking out loud :)
|
||
KOKKOS_INLINE_FUNCTION void vortInit(SphereBlockScratch &t0, SphereBlockScratch &t1, const Real v0, const Real v1) const | ||
{ | ||
t0.sv(x,y) = g.d(e,0,0,x,y) * v0 + g.d(e,0,1,x,y) * v1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed for dinv
you store a copy of the point-local 2x2 matrix in the class, while for d
you don't. Would it benefit to store d
locally too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Part of the tuning process was trying different combinations of local copies. A local copy of dinv
happened to show speedup, while a local copy of d
did not. I hypothesize that the tiny L1 cache happens to be big enough to reuse the required d
values without making an explicit local copy.
const ExecViewManaged<const Real *[NP][NP]> metdet; | ||
const Real scale_factor_inv; | ||
|
||
SphereGlobal(const SphereOperators &op): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since SphereGlobal stores stuff that is already in SphereOperators, what is the point of SphereGlobal objects? One could simply pass around a SphereOperator object, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When a lambda captures an object, it makes a full copy of that object, regardless of what members the lambda actually uses. The SphereGlobal
struct attempts to minimize the number of SphereOperators
members that get copied into the kernel. (When a Kokkos::View
gets copied, it copies only the array descriptor, not all the values of the array. But SphereOperators
still has more of them than are actually used by each kernel.)
struct SphereBlockOps; | ||
|
||
struct SphereBlockScratch { | ||
SphereBlockScratchView v; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the v
part of this type ever used. So maybe we only need (NP,NP) scratch views?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It provides the actual space for the sv
subviews.
sv(Kokkos::subview(v, b.z % SPHERE_BLOCK_LEV, Kokkos::ALL, Kokkos::ALL)) |
Each vertical level uses a separate (NP,NP) piece of the shared array.
const SphereGlobal &g; | ||
const Team &t; | ||
Real scale_factor_inv; | ||
Real metdet; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you think you need to store these local values (like dvvx, dvvy, metdet, ...)? Does it save on indexing time? I didn't expect this to matter much, tbh.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know the particular reason each local copy wins. It was all empirical. If a value was reused, I tried making a local copy. If the result was faster, I kept it.
Possible reasons are that it keeps the compiler from computing indices or loading values more than once. Integer calculation for array indices can be surprisingly expensive on AMD GPUs.
metdet = g.metdet(e,x,y); | ||
for (int i = 0; i < 2; i++) for (int j = 0; j < 2; j++) dinv[i][j] = g.dinv(e,i,j,x,y); | ||
for (int j = 0; j < NP; j++) { | ||
dvvx[j] = g.dvv(x,j); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you subview the first index, maybe you could just make dvvx/dvvy two kokkos subviews of g.dvv, to avoid a copy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because SphereBlockOps
objects are on the kernel stack, dvvx
and dvvy
only exist in vector registers. Each "copy" is a single load that is required regardless. And Kokkos subviews can be register hogs.
|
||
const Real v0 = state_v(b.e,data_n0,0,b.x,b.y,b.z) * state_dp3d(b.e,data_n0,b.x,b.y,b.z); | ||
const Real v1 = state_v(b.e,data_n0,1,b.x,b.y,b.z) * state_dp3d(b.e,data_n0,b.x,b.y,b.z); | ||
buffers_vdp(b.e,0,b.x,b.y,b.z) = v0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The buffers are allocated based on the number of concurrent slots needed, but with SphereBlockOps (as well as with SphereColOps and SphereScanOps) we access them with the element index. We need to polish this out, so that we can use "team_idx" instead of the element index for the buffers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use buffers to store values across kernels, so I need separate values for every element. I think you are saying that the buffers are actually allocated based on the number of teams, and the number of teams being the same as the number of local elements is a coincidence that I should not rely upon?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I did not notice that the epochs are now separate kernels (rather than separate parts of the same kernel). That's a big change. It reminds me of what we did in SCREAM for some physics parametrizations, where an implementation with many smaller kernels was better than a monolithic one. But we should definitely see if this impl is also good for CUDA.
That said, yes, the buffers are allocated based on the maximum number of concurrent teams. If the number of (rank-local) elements is large enough, even on GPU we can get num_buffer_slots<num_elements, which means those buffers may be smaller than num_elems. This is pretty much always the case on CPU, btw, so this impl needs to be adjusted to be portable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about the following to-do for me? Convert to using a separate set of buffers, created to be the required sizes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean make the new impl use different buffers from the original impl? That would work. But related to that, I am getting a bit concerned about how implementations for different backends may start do dramatically diverge. So yes, do the fix you suggest, but we need to start verification of other backends.
using Team = TeamPolicy::member_type; | ||
|
||
static constexpr int NPNP = NP * NP; | ||
static constexpr int WARP_SIZE = warpSize; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think warpSize is a HIP specific variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Cuda has it, but it's not a compile-time constant. I need a compile-time workaround. #ifdef
? Other suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also not defined for serial/openmp, which means we do need some sort of CPP guard. I guess #ifdef
would do.
const int ixy = tr / SPHERE_BLOCK_LEV; | ||
x = ixy / NP; | ||
y = ixy % NP; | ||
const int dz = tr % SPHERE_BLOCK_LEV; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This snippet below confuse me. I don't think team_rank() uniquely identifies a thread in a warp, but rather it identifies the warp as a whole. So I'm confused as of why this code can work correctly for all levels.
In Kokkos, team_rank() returns threadIdx.y, but a block (i.e., a team) is 2d.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For SphereBlockOps::policy
, the team_rank()
does uniquely identify a thread, because the Kokkos "vector" dimension is one (no third argument in the constructor):
return TeamPolicy(num_elems * SPHERE_BLOCKS_PER_COL, SPHERE_BLOCK). |
static constexpr int SPHERE_BLOCK = NPNP * SPHERE_BLOCK_LEV; |
Other team policies do have nonzero vector lengths, so their
team_rank()
values do not uniquely identify a thread. This is the case for the existing code and for SphereScanOps::policy
in the new code (three arguments in constructor):return TeamPolicy(num_elems, NPNP, WARP_SIZE); |
Is this ready? |
I doubt it. It has only been test on Frontier. |
How do you want to test this on other machines? @trey-ornl do you have perlmutter access? |
Sorry, no, I don't have Perlmutter access. Once I have E3SM-Project/scream#2969 through on |
@trey-ornl did you get a chance to try on perlmutter? |
Notes: need to check cpu performance. Look for less-invasive way to do it. |
Yes. It also sped up on |
@trey-ornl any updates? |
Still working on speeding up |
@trey-ornl any update on the CPU work? |
I'm going to close this. Please reopen when all concerns have been addressed. |
This pull request attempts to provide Frontier optimizations like those used in the 2023 Gordon Bell Climate runs, but with a software architecture that meets the requirements of https://acme-climate.atlassian.net/wiki/x/oICd6. Here is a summary of the changes.
struct
s inSphereOperators.hpp
to use for the new Caar pre-boundary exchange.CaarFunctorImpl.cpp
that implements the newcaar_compute
function and template functions with Kokkos loops.CaarFunctorImpl.hpp
with new functions, different number of buffers, and#if
to turn on/off the newcaar_compute
. If we adopt the new implementation permanently, significant code can be eliminated from this file.viewAsReal
functions inViewUtils.hpp
.LaunchBounds<512,1>
calls, which I think are incorrect for AMD GPUs, where the Kokkos teams sometimes use 1024 threads.ExecSpaceDefs.cpp
to use newer Kokkos command-line argument for MPI.cpe/23.12
and Rocmrocm/5.7.1
.frontier.cmake
for standalone Homme builds.I confirmed that the modified code passes the
caar_ut
unit test usingfrontier-bfb.cmake
, and I ran a single-node performance test of stand-alone Homme provided to me by @oksana, building with the newfrontier.cmake
. Here are thewalltotal
times recorded inHommeTime_stats
for"caar compute"
.Existing Code (
#if 0
): 153.6221 sNew Code (
#if 1
): 4.777188 sI feel like this improvement is "too much." I confirmed that the outputs of the performance runs matched within roundoff. And I confirmed the performance difference using
rocprof
.I then compiled
CaarFunctor
with-Rpass-analysis=kernel-resource-usage
. While all the kernels of the new code had a grand total of just one vector-register spill, the monolithic existing kernel had 156115 scalar-register spills and 2820 vector-register spills usingrocm/5.7.1
. Is this enough to explain the performance gulf? I wonder if current E3SM runs are seeing this slowdown.Here are a few caveats about this pull request.
TeamThreadRange
loops).