SHOC support for small kernels #1940
Conversation
// Need a second workspace for storing shcol local scalars. Need workspaces
// of size shcol, not nlev, to store temporary scalars for small kernels
const auto policy = ekat::ExeSpaceUtils<ExeSpace>::get_default_team_policy(shcol, 1);
ScalarWorkspaceMgr wsm_local(shcol, num_local_scalars, policy, 1.e10); // huge overprovision => no ws sharing
Since both the workspace size and type are different, I decided to make a new WSM for managing local scalars instead of trying to leverage the existing WSM.
@@ -192,6 +197,10 @@ set(SCREAM_LIB_ONLY ${DEFAULT_LIB_ONLY} CACHE BOOL "Only build libraries, no exe
set(NetCDF_Fortran_PATH ${DEFAULT_NetCDF_Fortran_PATH} CACHE FILEPATH "Path to netcdf fortran installation")
set(NetCDF_C_PATH ${DEFAULT_NetCDF_C_PATH} CACHE FILEPATH "Path to netcdf C installation")
set(SCREAM_MACHINE ${DEFAULT_SCREAM_MACHINE} CACHE STRING "The CIME/SCREAM name for the current machine")
set(SCREAM_MONOLITHIC_KERNELS ${DEFAULT_MONOLITHIC_KERNELS} CACHE STRING "Use monolithic kokkos kernels")
if (NOT SCREAM_MONOLITHIC_KERNELS)
  set(EKAT_DISABLE_WORKSPACE_SHARING TRUE CACHE STRING "")
I'm not 100% sure this is needed or even a good design choice. I think we should leave it to SCREAM to set up their WSMs properly. Any WSM that might be used for small kernels should have its overprovision factor set to a huge number. This will prevent any workspace sharing. The top-level functions of small kernels should assert that their WSMs have one slot per team.
I agree. I'm not 100% sure what this option would do in EKAT. But I feel like we don't want to set global ekat configurations just b/c a few use cases need it. If we can make SCREAM create WSMs differently, each with its own config/ctor params, I think that would be better.
EKAT's WSM should support a new option to make the workspace size proportional to league size (total number of physics columns) rather than use the overprovision factor. This will follow what Hommexx does.
@ambrad , that's easily doable. I agree that I am abusing the overprovision argument here.
# Add dispatch source files if monolithic kernels are off
if (NOT SCREAM_MONOLITHIC_KERNELS)
  list(APPEND SHOC_SRCS
    shoc_energy_integrals_disp.cpp
Not sure if you want to do it in this PR, and it might be a personal taste thing, but in case you agree: the shoc (and p3) folders are quite big, given the large number of _impl.cpp and .cpp files (96 files in the shoc folder). This makes navigating the repo a bit hard (imho). I would vote for adding subfolders: one impl subfolder for the impl files, one eti subfolder for the ETI cpp files, and one dispatch subfolder for the new files you are adding. That would also allow less experienced users to recognize the difference between eti and dispatch *.cpp files...
I agree. This directory now has 98 files and will grow to nearly 150 when this is done.
template<>
void Functions<Real,DefaultDevice>
::shoc_energy_fixer_disp(
My personal taste here would be for spelling out dispatch, but that's very subjective.
Those extra 4 keystrokes would give me carpal tunnel!
const Int& nadv,
const view_2d<const Spack>& zt_grid,
const view_2d<const Spack>& zi_grid,
const Int& se_b_slot,
Maybe we could group these slot indices in a verbose struct, like shoc::SmallKernelsSlotsIds or something.
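A minimal sketch of that grouping (the struct body is hypothetical; only se_b_slot appears in this diff, and the real set of fields would match whatever slots the dispatch functions end up needing):

struct SmallKernelsSlotsIds {
  Int se_b_slot;
  Int ke_b_slot;
  Int wv_b_slot;
  // ... one index per persistent workspace slot used by the small kernels
};

shoc_energy_fixer_disp and friends could then take a single const SmallKernelsSlotsIds& instead of a long list of loose Int arguments.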
@@ -611,6 +659,65 @@ struct Functions
    const uview_1d<Spack>& wqls_sec,
    const uview_1d<Spack>& brunt,
    const uview_1d<Spack>& isotropy);
#else
  static void shoc_main_internal(
Maybe we could give these functions two different names, like shoc_main_monolithic and shoc_main_multiple_kernels, or something.
I think overall this looks fine. My biggest concern is potential confusion from code duplication and bloating of files, which is why at some point (perhaps at the end of the refactor) I would prefer that we spend a tiny bit of effort on organizing code/files in a self-documenting way.
Good start. I'd like to offer two alternatives for handling persistent workspace slots.

(1) Rather than getting pointers (*_slot indices) and passing those around, make a setup/teardown pair of routines that just does the usual thing, e.g.

workspace.template take_many_and_reset<5>(
  {"rho_zt", "shoc_qv", "dz_zt", "dz_zi", "tkh"},
  {&rho_zt, &shoc_qv, &dz_zt, &dz_zi, &tkh});

and then releases it all at the end of the kernel. Each small kernel always calls this setup/teardown pair and thus always gets the right persistent memory for each variable. You could package the persistent arrays in a structure. This pattern would be useful in the case of more complicated partial-persistence patterns where the setup/teardown would change throughout the parameterization. SHOC doesn't seem to need that, though. But this pattern is what I had in mind in #1903, where I was thinking about the general case.

(2) Instead, this immediately suggests a simpler option for SHOC: in the case of nonmono kernels, just have one additional buffer struct for these arrays, and when requesting buffer space, include this additional buffer struct in the request. Then no extra workspace gymnastics are needed in SHOC. For the scalars, I recommend making an array to hold all scalars and then providing an enum in the new buffer struct that indexes this array, like this:

enum : int { se_b = 0, ke_b, wv_b, ... };
view1d<Scalar> scals;
// usage
b.scals[b.se_b]
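Slightly expanded, and with the per-column dimension made explicit (an assumption; the comment above leaves the layout and struct name open, and only se_b/ke_b/wv_b come from it), that sketch could look like:

struct SmallKernelsBuffer {
  // indices into scals; one entry per temporary scalar
  enum : int { se_b = 0, ke_b, wv_b, num_scalars };
  // one row per scalar, one column per SHOC column, carved from buffer memory at init
  view_2d<Scalar> scals;
};

// inside a dispatch kernel, team/column i would then read and write, e.g.,
//   buf.scals(SmallKernelsBuffer::se_b, i)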
I'm not a fan of the setup/teardown approach, which relies on the order of fields always being the same. It is a somewhat subtle assumption, which has the risk of being forgotten three years down the road when another developer is maybe adding a feature to shoc. Of course, inline documentation would help, but it would still be a bit fragile. Option 2 seems more attractive. If I understand correctly, you want to use "normal" buffer views for persistent data, without using WSM at all, right? Then, inside each top-level kernel, you would subview those buffers, rather than grab slots from the WSM. Imho, this is much simpler than hacking the WSM to get non-scratch memory.
@bartgol re: option 2, yes. Since it's all we need for SHOC, we can table discussion of more complicated temp-buffer use cases.
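As a rough illustration of that flow (a sketch, not the code in this PR; the view/type aliases follow the ones used elsewhere in this diff, and MemberType stands for the team policy's member type): a persistent 2d buffer view is allocated once at init, and each top-level dispatch kernel pulls out its per-column slice with ekat::subview instead of taking a WSM slot.

// allocated once at init time from buffer memory, shape (shcol, nlev_packs)
view_2d<Spack> rho_zt;

Kokkos::parallel_for(policy, KOKKOS_LAMBDA(const MemberType& team) {
  const Int i = team.league_rank();
  const auto rho_zt_i = ekat::subview(rho_zt, i);  // 1d view for column i
  // ... use rho_zt_i just as the monolithic kernel used its workspace-provided view
});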
  const view_2d<Spack>& isotropy)
{
  // Create space for temporary scalars
  view_1d<Scalar>
These views shouldn't be allocated each time SHOC is called. They should be part of the init-time allocation using buffer memory. I suggest using a struct to hold these.
@@ -2693,7 +2693,12 @@ int shoc_init_f(Int nlev, Real *pref_mid, Int nbot_shoc, Int ntop_shoc)
  ekat::host_to_device({pref_mid}, nlev, temp_d);
  view_1d pref_mid_d(temp_d[0]);

#ifndef SCREAM_MONOLITHIC_KERNELS
  SHF::SHOCTemporaries temporaries; // we won't keep these
This is a little hacky but fine IMO. The f90->cxx bridge is already slow so it doesn't matter if we reallocate temporaries for each shoc_main_f call.
shoc_history_output.shoc_mix, shoc_history_output.w_sec, shoc_history_output.thl_sec, shoc_history_output.qw_sec, shoc_history_output.qwthl_sec, // Diagnostic Output Variables
shoc_history_output.wthl_sec, shoc_history_output.wqw_sec, shoc_history_output.wtke_sec, shoc_history_output.uw_sec, shoc_history_output.vw_sec, // Diagnostic Output Variables
shoc_history_output.w3, shoc_history_output.wqls_sec, shoc_history_output.brunt, shoc_history_output.isotropy, // Diagnostic Output Variables
shoc_temporaries.se_b,
I followed the existing pattern of unpacking the struct before calling shoc_main_internal even though this function has a crazy number of arguments.
@@ -58,6 +63,33 @@ Int Functions<S,D>::shoc_init(
  const auto host_view = Kokkos::create_mirror_view(npbl_d);
  Kokkos::deep_copy(host_view, npbl_d);

  // Allocate temporaries if using small kernels
#ifndef SCREAM_MONOLITHIC_KERNELS
If I understand correctly, this part is not following the established pattern for processes to allocate workspace. I believe this code should be in SHOCMacrophysics::init_buffers and be allocated using the buffer_manager.get_memory() pointer, as is done for m_buffer. The reason is that then no process is holding onto a bunch of device memory that can't be used by any other process.
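In other words, a minimal sketch following the m_buffer carve-out code visible later in this diff (the se_b member and the exact pointer type returned by get_memory() are assumptions here):

Real* mem = reinterpret_cast<Real*>(buffer_manager.get_memory());
m_buffer.se_b = decltype(m_buffer.se_b)(mem, m_num_cols);  // one scalar per column
mem += m_buffer.se_b.size();
// ... repeat for the other small-kernel temporaries, then carve out the
// packed 1d/2d views as the existing init_buffers code already does.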
@ambrad , thanks, I was wondering if I should use that (Buffer). I have pushed a commit that does this.
m_buffer.zi_grid = decltype(m_buffer.zi_grid)(s_mem, m_num_cols, nlevi_packs);
s_mem += m_buffer.zi_grid.size();
using spack_2d_view_t = decltype(m_buffer.z_mid);
spack_2d_view_t* _2d_spack_mid_view_ptrs[Buffer::num_2d_vector_mid] = {
I took the opportunity to clean up some of the repetitive code in this method. The changes to the 1d code should be semantically identical to the previous impl. The 2d code does not behave exactly as before because the order of the views in the buffer is different with all the "mid" views and "int" views grouped separately now. I assume this order doesn't matter. If it does, I will refactor this to preserve the original order.
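For context, the pattern these pointer arrays enable is roughly the following loop (member names other than z_mid, and the nlev_packs extent, are placeholders):

spack_2d_view_t* mid_view_ptrs[Buffer::num_2d_vector_mid] = {
  &m_buffer.z_mid /* , ... the other "mid"-sized 2d views ... */
};
for (int i = 0; i < Buffer::num_2d_vector_mid; ++i) {
  *mid_view_ptrs[i] = spack_2d_view_t(s_mem, m_num_cols, nlev_packs);
  s_mem += mid_view_ptrs[i]->size();
}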
@jgfouca, after the last changes, this code looks great. Are you wanting to merge this PR, or are you looking for approval from everyone following our discussions above and then you'll continue working on this branch using the patterns you established?
@ambrad , thanks for the good feedback.
I'm going to continue working on this using the established pattern.
@bartgol this is an example of where a global var shared among all threads in a team will lead to an error:
Yes, that's a problem. We also have another += on the shared var in the view_reduction function, for Serialize=false. But for Serialize=true, all += on shared vars are only performed by team rank 0.
But note that the case of an auto var passed to view_reduction must also be handled.
For clarity, I tried to add
Also, to be clear, I'm fine with Jim's mod. I just want to understand where the shared memory was causing a problem with Serialize=true.
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING (SCREAM_PullRequest_Autotester_Mappy, Weaver, Blake).
Yes, that's the one I was referring to with "We also have another += on the shared var in the view_reduction function". But that code only happens if Serialize=false. If Serialize=true, none of the += on shared memory happens.
Ok, what about https://github.com/E3SM-Project/EKAT/blob/dbe3d1c2e3d6242ce998865b2d4676676d80bb69/src/ekat/kokkos/ekat_kokkos_utils.hpp#L133? However, this applies only if pack size > 1.
Yes, that is a problem, since all team members execute that line. But I tried to switch PerThread with PerTeam, adding a team.team_broadcast(result,0) right after the single, and still got non-bfb results.
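(For readers following along, the variant being described is roughly the following fragment; result and local_sum are stand-ins for the shared accumulator and the per-thread contribution in ekat's view_reduction, and this is not the code that was merged.)

Kokkos::single(Kokkos::PerTeam(team), [&] () {
  result += local_sum;           // only one thread in the team updates the shared accumulator
});
team.team_broadcast(result, 0);  // then broadcast the updated value from rank 0 to the whole team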
Ah, ok, maybe this somewhat more subtle situation: In
Maybe. I stuffed the function with plenty of team_barrier() calls though, so I am not really sure. I will raise my hands and give up. Though I think we should fix this in ekat, to avoid making the same mistake again somewhere else.
Uhm, I modified view_reduction to do
and called it passing
with the original code, no?
Luca, I don't think it's the same. I believe the thread race condition I mentioned is the key here. In the first case you quote,
Status Flag 'Pull Request AutoTester' - Jenkins Testing: all Jobs PASSED (SCREAM_PullRequest_Autotester_Mappy, Weaver, Blake).
Status Flag 'Pre-Merge Inspection' - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging.
All Jobs Finished; status = PASSED. However, Inspection must be performed before merge can occur...
Mmm, ok, but then adding a team_barrier() after
Anyway, I should probably stop worrying about this. I was just trying to emulate Jim's fix inside EKAT, so that we would not have this issue again in the future when calling
Edit: nvm, I was just doing
@bartgol, @ambrad, thanks for the deep dive on
In the meantime, can one of you approve this PR?
Status Flag 'Pre-Merge Inspection' - SUCCESS: The last commit to this Pull Request has been INSPECTED AND APPROVED by [ bartgol ]!
Status Flag 'Pull Request AutoTester' - Pull Request will be Automerged
Merge on Pull Request# 1940: IS A SUCCESS - Pull Request successfully merged
#1797 is using this new capability; see #1797 (comment) for a preliminary report of success.
This is a major refactor of SHOC to support small kernels, which are needed on Frontier/Crusher.
The only significant challenge so far has been how to maintain workspaces across kernels. This is a new challenge for us because all previous implementations were monolithic kernels. The approach that appears to be working best is to use integer slot ids as a handle to the workspace.
Another challenge is that temporary scalars in the monolithic kernel have to become views of shcol scalars for small kernels.
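To illustrate the scalar-to-view change (variable names taken from the diff above; the exact declarations in the PR may differ):

// monolithic kernel: a per-column temporary is just a local scalar
Scalar se_b = 0;

// small kernels: the same temporary has to persist across kernel launches,
// so it becomes one entry per column in a view of length shcol
view_1d<Scalar> se_b("se_b", shcol);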
I've only transitioned a couple of SHOC functions to support small kernels. I wanted to get approval for the pattern I'm using before doing the rest, since that's going to be a considerable amount of work.
There are a few minor changes needed in EKAT for this to build that you can't see here. The only interesting change was the addition of this WSM::Workspace method:
Fixes #1903