
Restructured dynamical core, variables array-agnostic and 3-dimensional, modular netCDF output #525

Merged: 120 commits, Sep 25, 2024

Conversation

@milankl (Member) commented Apr 25, 2024

A draft superseding #493 to restructure the prognostic variables.

Includes device and ArrayType in SpectralGrid

julia> spectral_grid
SpectralGrid:
├ Spectral:   T63 LowerTriangularMatrix{Complex{Float32}}, radius = 6.371e6 m
├ Grid:       96-ring OctahedralGaussianGrid{Float32}, 10944 grid points
├ Resolution: 216km (average)
├ Vertical:   2-level SigmaCoordinates
└ Device:     CPU using Array

The 3D/4D variables now live directly in the prognostic variables, no longer nested in .layers[1].timesteps[1] or .surface

julia> tree(progn)
PrognosticVariables{Float32, Vector{Float32}, OctahedralGaussianGrid{Float32}, Matrix{ComplexF32}, Array{ComplexF32, 3}, ShallowWater}
├ trunc
...
├ vor
├ div
├ temp
├ humid
├ pres
...

The type signature for PrognosticVariables, however, got a bit more complicated; maybe there are ways to simplify that

Base.@kwdef struct PrognosticVariables{
    NF<:AbstractFloat,
    ArrayType2D<:AbstractArray{NF, 1},                  # 2D real type (land + ocean)
    Grid<:AbstractGridArray{NF, 1, ArrayType2D},        # Grid for those
    ArrayTypeComplex3D<:AbstractArray{Complex{NF}, 2},  # 3D complex type (surface pressure)
    ArrayTypeComplex4D<:AbstractArray{Complex{NF}, 3},  # 4D complex type (layered variables)
    M<:ModelSetup,
} <: AbstractPrognosticVariables

@milankl added the labels structure 🏠 Internal structure of the code, array types 🔢 LowerTriangularMatrices and RingGrids, and dynamics 〰️ Affects the dynamical core on Apr 25, 2024
@milankl (Member Author) commented Apr 25, 2024

Problem: while this works (LowerTriangularMatrix on layer 1, time step 1)

julia> progn.vor[:,1,1]
65×64 LowerTriangularMatrix{ComplexF32}:
 0.0+0.0im  0.0+0.0im  0.0+0.0im  0.0+0.0im  0.0+0.0im  

this doesn't

julia> progn.vor[:,end,1]
ERROR: BoundsError: attempt to access 2144×2×2 Array{ComplexF32, 3} at index [1:2144, 64, 1]

the end gets translated to 64, which is size(progn.vor)[2] ...
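For context, a minimal illustration with a plain Array of why `end` resolves this way (the shapes are taken from the error above):

```julia
# Julia lowers A[:, end, 1] to A[:, lastindex(A, 2), 1], and for any AbstractArray
# lastindex(A, d) falls back to size(A, d). A LowerTriangularArray reports its size
# in matrix dimensions (64 columns here) while storing flat data (2144×2×2), so the
# lowered index 64 is out of bounds in the flat second dimension.
A = zeros(ComplexF32, 65, 64, 2)        # plain array with the matrix-style shape
lastindex(A, 2) == size(A, 2) == 64     # true: this is what `end` becomes
```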

@maximilian-gelbrecht (Member) commented May 2, 2024

This is a consequence of the LTA/LTM being a subtype of AbstractArray{T,4} in this case.

What we did with the two ways of indexing is a bit of a hack, to be fair; the LTA can't easily be a 3-dim and a 4-dim array at the same time. As far as I know we could redefine the behaviour of end by defining lastindex(::LowerTriangularArray), but then it wouldn't work properly anymore for the 4-dim case. I guess we have to decide whether we want end to work as expected for flat indexing (lm) or for matrix indexing (l, m).

@maximilian-gelbrecht (Member) commented May 2, 2024

Does this actually appear in the dynamical core? Because the 3d indexing with a leading colon does cause allocations, so it should be avoided anyway, or we can introduce a non-allocating view variant. That's possible as well.

In the end, I think we can just avoid using end there. We have the constant N_STEPS defined, which we can use in its place. Similarly, we'll always have easy access to the number of levels as well.

@maximilian-gelbrecht (Member) commented

Another alternative would be to do something a bit similar to EndpointRanges.jl and define a custom iend/lmend that works for the flat lm-indexing (and only there).
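A sketch of that idea (the names `LMEnd`/`lmend` and the toy `FlatLT` wrapper are hypothetical illustrations, not existing SpeedyWeather API):

```julia
# A custom end-marker type, resolved against the flat storage at indexing time,
# in the spirit of EndpointRanges.jl:
struct LMEnd end
const lmend = LMEnd()

struct FlatLT{T}                    # toy stand-in for flat lm-storage
    data::Vector{T}
end
Base.getindex(L::FlatLT, i::Integer) = L.data[i]
Base.getindex(L::FlatLT, ::LMEnd) = L.data[end]   # lmend → last flat index

L = FlatLT([1.0, 2.0, 3.0])
L[lmend]        # resolves against the flat length, not a matrix dimension
```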

@milankl (Member Author) commented May 2, 2024

> Another alternative would be to do something a bit similar to EndpointRanges.jl and define a custom iend/lmend that works for the flat lm-indexing (and only there).

Okay, that's interesting, thanks for looking into it. No, we wouldn't really need to use end inside the dynamical core, or we could easily replace it; it's more generally about using LowerTriangularArray intuitively. I guess if we've clearly defined what one can and cannot do with those then all good, but if I stumble across something that I expect to work and it doesn't, then I'd rather flag it for discussion than silently rewrite the code to just make it work.

I am generally not super sure about the whole two-ways-of-indexing approach for LowerTriangularArray. While it seems possible to make this convenient, I'm wondering whether we just keep getting hung up on cases where it doesn't work as expected, or on small edge-case issues, such that it's not worth it overall. E.g. I see several possibilities:

  1. support flat and matrix indexing throughout, but get hung up somewhat whenever this causes issues (what we have now)
  2. support only flat indexing and use ij2k manually inside the dynamical core whenever necessary
  3. support only matrix indexing by storing the upper triangle explicitly, but disallow setindex! there
  4. support matrix indexing only for LowerTriangularMatrix, in addition to flat indexing for ...Array; that way the more user-facing LowerTriangularMatrix stays intuitive but we index consistently within the dynamical core
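For illustration, the flat-index conversion option 2 refers to could look like this (a standard column-major lower-triangular packing formula, assumed here rather than copied from SpeedyWeather's actual ij2k):

```julia
# Map matrix indices (l, m), l ≥ m, of an lmax-row lower-triangular matrix to a
# flat column-major index k that skips the zero upper triangle:
ij2k(l, m, lmax) = (m - 1) * lmax - (m - 1) * (m - 2) ÷ 2 + (l - m + 1)

# e.g. for lmax = 3: column 1 → k = 1, 2, 3; column 2 → k = 4, 5; column 3 → k = 6
```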

I obviously don't want to keep changing things, but I see some appeal in 2-4 to make our life easier. I'm wondering, though, whether we should first reimplement the gradients and transforms for LowerTriangularArray, to make a better-informed decision on what's important for performance and GPU.

@maximilian-gelbrecht (Member) commented May 2, 2024

Let's not overthink it, I would say.

We defined LowerTriangularArray{T,N} to be a subtype of AbstractArray{T,N} but store an AbstractArray{T,N-1} underneath. All indexing with N indices (matrix indexing) works flawlessly thanks to the abstract array interface.

Additionally, we defined several functions so that flat indexing with N-1 indices works in almost all cases. I think that's fine, and in some way it's the same as if we only supported matrix indexing and used ij2k everywhere in the dynamical core. I think it's totally fine to keep it as it is and just be a bit aware that flat indexing is a bit limited (the docs should mention that!).

The alternative would be to swap the behaviour by making it a subtype of AbstractArray{T,N-1} and have the matrix indexing be somewhat limited.

And so far, the only limitations I can think of are those connected with size and axes. That's because, in the end, it can't be a 3-d and a 4-d array at the same time. Which is not that bad.

Edit: Another way to look at it: the way we currently have it implemented gives us matrix-style indexing plus some support for flat indexing. And when in doubt about the flat indexing, we can always index the .data object directly.
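The design described above can be sketched with a simplified toy type (illustration only, heavily reduced compared to the real LowerTriangularMatrix):

```julia
# Declared as an AbstractMatrix (2-dim) while storing only the lower triangle as
# a flat Vector (1-dim). Matrix indexing comes from the AbstractArray interface;
# flat indexing is the extra "hack" dispatch on a single integer.
struct ToyLT{T} <: AbstractMatrix{T}
    data::Vector{T}   # flat storage, column-major, upper triangle skipped
    n::Int            # matrix is n×n
end
Base.size(L::ToyLT) = (L.n, L.n)
flatindex(l, m, n) = (m - 1) * n - (m - 1) * (m - 2) ÷ 2 + (l - m + 1)
Base.getindex(L::ToyLT, l::Int, m::Int) =       # matrix indexing (l, m)
    l >= m ? L.data[flatindex(l, m, L.n)] : zero(eltype(L))
Base.getindex(L::ToyLT, k::Int) = L.data[k]     # flat lm-indexing

L = ToyLT([1, 2, 3], 2)
L[2, 1] == 2        # matrix indexing
L[3] == 3           # flat indexing
lastindex(L) == 4   # but `end` follows the declared 2×2 shape, not length(L.data)
```

The last line shows exactly the limitation discussed in this thread: `end` resolves against the declared (matrix) shape, so `L[end]` is out of bounds for the flat storage.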

@maximilian-gelbrecht (Member) commented

> Another alternative would be to do something a bit similar to EndpointRanges.jl and define a custom iend/lmend that works for the flat lm-indexing (and only there).

> Okay that's interesting, thanks for looking into it.

As far as I can see, EndpointRanges even directly allows defining those as subtypes of its types. It's a pretty small package, so it wouldn't be too bad to depend on it. I could have a look at it some day.

@maximilian-gelbrecht (Member) commented Sep 19, 2024

With closures there are also easy ways to do stuff like this in a really compact way:

Like:

set!(simulation, sea_surface_temperature=(λ, ϕ) -> interpolate(ϕ, λ, model.land_sea_mask.mask) < 0.1 ? 273.15 + 30*cosd(ϕ) + 5*sind(λ-30) : NaN)

or

set!(simulation, snow_depth=(λ, ϕ) -> interpolate(ϕ, λ, model.orography.orography) > 1000 ? 100 : 0)

Btw, we should be more consistent with the order of longitude and latitude for function calls (see above)

@milankl (Member Author) commented Sep 19, 2024

> we should be more consistent with the order of longitude and latitude for function calls

It's the interpolate functions that do lat-lon instead of lon-lat (what we should do, as that also matches Julia matrix ordering with how our grids are organised)?

test/set.jl (outdated)
u_grid = transform(u, model.spectral_transform)
v_grid = transform(v, model.spectral_transform)

set!(simulation, u=u_grid, v=v_grid, coslat_scaling_included=false)
@milankl (Member Author):

as lf is not set here, this uses lf=1 (default)

U = similar(u)
V = similar(v)

SpeedyTransforms.UV_from_vordiv!(U, V, prog_new.vor[lf], prog_new.div[lf], model.spectral_transform)
@milankl (Member Author):

however, this calculates U, V from lf which in this testset is set to lf=2 above

Comment on lines +86 to +87
u = randn(spectral_grid.SpectralVariable3D, trunc+2, trunc+1, nlayers)
v = randn(spectral_grid.SpectralVariable3D, trunc+2, trunc+1, nlayers)
@milankl (Member Author):

For any transform tests I usually start in spectral space, because grid space has more degrees of freedom: rand on some grid contains a lot of information (high-resolution noise) that is lost in the first transform, as not even the highest wavenumbers can hold it. You could probably circumvent this with a linear truncation (instead of the default quadratic), but because we always use quadratic, I'd just make sure to start any spec-grid uv-vor/div roundtrip with information that is surely representable in either space.
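A back-of-the-envelope version of this argument, using the T63 truncation and the 10944 grid points from the SpectralGrid printout earlier in this thread (the degree-of-freedom count is approximate):

```julia
# A triangular truncation TX has (X+1)(X+2)/2 complex spectral coefficients,
# i.e. roughly (X+1)(X+2) real degrees of freedom; the grid has far more points,
# so random grid-space noise cannot survive a transform roundtrip.
T = 63
spectral_dof = (T + 1) * (T + 2)     # ≈ real dof of the T63 truncation
grid_points = 10944                  # OctahedralGaussianGrid at this resolution
spectral_dof < grid_points           # true: grid space holds more information
```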

Comment on lines 398 to 402
curl!(vor, u_new, v_new, S; add)
divergence!(div, u_new, v_new, S; add)
else
curl!(vor, u_, v_, S; add)
divergence!(div, u_, v_, S; add)
@milankl (Member Author):

there's a division by the radius missing here. As discussed somewhere in the docs (sufficiently, I hope), all spectral gradient operators omit the radius scaling, as it's not needed in the dynamical core (hence we use vor, div scaled by the radius). I've therefore added a radius keyword (default 1) to a bunch of curl and divergence functions, which one needs to pass on. The SpectralTransform doesn't know about the radius (because it's invariant to the radius)

@milankl (Member Author) commented Sep 23, 2024

All spectral gradients, as of c3f7398, can now take an optional radius keyword argument. Otherwise radius=1 is assumed.

@milankl (Member Author) commented Sep 20, 2024

I still can't fully close the uv-vordiv roundtrip; there's something off, maybe with the coslat scaling, because the error largely projects onto the even zonal modes

[figure: roundtrip error, largely projecting onto the even zonal modes]

working on it

@milankl (Member Author) commented Sep 23, 2024

Regarding performance, a quick comparison between v0.10 and main on my MacBook Air:

| Model (default config), resolution | v0.10 | main | change |
|---|---|---|---|
| Barotropic T31 | 20k SYPD | 17k SYPD | -18% |
| Barotropic T85 | 880 SYPD | 700 SYPD | -25% |
| ShallowWater T31 | 2800 SYPD | 2300 SYPD | -22% |
| ShallowWater T85 | 125 SYPD | 100 SYPD | -25% |

@maximilian-gelbrecht (Member) commented Sep 24, 2024

I did some benchmarks of the transforms and gradients between main and mk/newvariables. Performance between the two branches is pretty much identical, so the slowdown has to come from somewhere else.

import Pkg
Pkg.activate(".")

using SpeedyWeather, BenchmarkTools

NF = Float32 
trunc = 63
Grid = FullGaussianGrid
#nlayers = 8 

SG = SpectralGrid(; NF, trunc, Grid)
S = SpectralTransform(SG, recompute_legendre=false)

grid = randn(SG.Grid{NF}, SG.nlat_half)
L1 = zeros(LowerTriangularMatrix{Complex{NF}}, trunc+2, trunc+1)
L2 = rand(LowerTriangularMatrix{Complex{NF}}, trunc+2, trunc+1)
L3 = zeros(LowerTriangularMatrix{Complex{NF}}, trunc+2, trunc+1)
L4 = zeros(LowerTriangularMatrix{Complex{NF}}, trunc+2, trunc+1)

@benchmark spectral!(L1, grid, S)
@benchmark gridded!(grid, L1, S)

#@benchmark transform!(L1, grid, S)
#@benchmark transform!(grid, L1, S)

@benchmark curl!(L3, L1, L2, S)
@benchmark divergence!(L4, L1, L2, S)

@benchmark SpeedyWeather.UV_from_vordiv!(L1, L2, L3, L4, S)

@benchmark ∇²!(L1, L4, S, inverse=false)
@benchmark ∇²!(L1, L4, S, inverse=true)

mk/newvariables branch:

@benchmark transform!(L1, grid, S)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  49.417 μs  101.292 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     50.417 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   50.637 μs ±   1.018 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▂▆█▇▅▂                                                 
  ▁▂▂▂▂▂▂██████▇▇▄▅▅▄▄▃▃▃▂▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  49.4 μs         Histogram: frequency by time         54.8 μs <

 Memory estimate: 12.38 KiB, allocs estimate: 250.

@benchmark transform!(grid, L1, S)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  47.500 μs  105.000 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     48.667 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   48.801 μs ±   1.289 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▅█▇▃      ▁▃▆▆▆▃▁▁                                          
  ▂▅██████▆▆▆▇████████▅▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▂▂▂▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  47.5 μs         Histogram: frequency by time         52.6 μs <

 Memory estimate: 12.41 KiB, allocs estimate: 251.

@benchmark curl!(L3, L1, L2, S)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min  max):  2.532 μs   3.722 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     2.727 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.733 μs ± 49.194 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                       ▂▇█▆▆▄▂ ▁                             ▂
  ▆▅▃▃▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▃██████████▆▆▅▃▅▆▅▅▅▄▄▅▅▄▃▅▄▃▄▄▃▃▁▃▄▃▄ █
  2.53 μs      Histogram: log(frequency) by time        3 μs <

 Memory estimate: 32 bytes, allocs estimate: 1.

 @benchmark divergence!(L4, L1, L2, S)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min  max):  2.505 μs   3.782 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     2.694 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.703 μs ± 57.143 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                     ▂▆▆██▇▅▂▁▁                              ▂
  ▄▄▆▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁██████████▇▇▅▆▆▅▄▄▆▆▅▅▅▄▃▅▅▃▅▅▄▃▅▁▄▄▄▄▄ █
  2.5 μs       Histogram: log(frequency) by time     2.99 μs <

 Memory estimate: 32 bytes, allocs estimate: 1.

@benchmark SpeedyWeather.UV_from_vordiv!(L1, L2, L3, L4, S)
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min  max):  3.146 μs   16.948 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     3.391 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.431 μs ± 307.718 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂▂            ▂█▆▁▂▃▁   ▃▁                                  ▁
  ██▅▃▃▅▅▃▂▂▃▄▃▃███████▇▆███▇▆▆▆▅▆▇▆▅▅▅▅▇▇▇▆▆▅▅▆▆▅▅▅▆▆▆▅▅▆▅▇▅ █
  3.15 μs      Histogram: log(frequency) by time      4.06 μs <

 Memory estimate: 64 bytes, allocs estimate: 1.

@benchmark ∇²!(L1, L4, S, inverse=false)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min  max):  1.446 μs   5.304 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     1.621 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.623 μs ± 70.167 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▂   ▂▂▃▁▂▁  ▂▆█▇▆▄▃▂                              ▂
  ▇▇▄▆▄▄▄▃▆██▆▃████████████████▇▇▇█▇▆▅▅▁▆▃▆▄▅▅▆▄▄▅▅▄▅▃▅▁▅▅▄▅ █
  1.45 μs      Histogram: log(frequency) by time     1.87 μs <

 Memory estimate: 32 bytes, allocs estimate: 1.

@benchmark ∇²!(L1, L4, S, inverse=true)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min  max):  1.446 μs   2.358 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     1.558 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.562 μs ± 22.520 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                      ▁ █▇                    
  ▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃█▁██▄▆▃▂▁▂▂▁▂▂▂▁▂▂▁▂▂▂ ▂
  1.45 μs        Histogram: frequency by time        1.62 μs <

 Memory estimate: 32 bytes, allocs estimate: 1.

Main branch:

@benchmark spectral!(L1, grid, S)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  48.125 μs  111.625 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     51.958 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   52.343 μs ±   1.526 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                      ▆██▆▆▆▆▆▅▅▄▄▃▂▂     ▁                    ▂
  ▄▃▃▃▁▁▃▃▁▁▁▁▁▁▁▁▁▁▁▆██████████████████▇█████▇██▇▆▇▇▆▆▇▇▇▆▆▆▆ █
  48.1 μs       Histogram: log(frequency) by time      57.9 μs <

 Memory estimate: 10.88 KiB, allocs estimate: 202.

@benchmark gridded!(grid, L1, S)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  46.792 μs  110.000 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     49.209 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   49.510 μs ±   1.329 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                 ▃▅█▂▁                                          
  ▂▁▁▁▁▁▁▁▂▁▂▃▅▅██████▇█▆▅▄▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  46.8 μs         Histogram: frequency by time         54.9 μs <

 Memory estimate: 10.91 KiB, allocs estimate: 203.

@benchmark curl!(L3, L1, L2, S)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min  max):  2.477 μs   4.042 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     2.671 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.679 μs ± 49.191 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                           ▆▃█                                
  ▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▄███▆▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂ ▂
  2.48 μs        Histogram: frequency by time         2.9 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

 @benchmark divergence!(L4, L1, L2, S)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min  max):  2.486 μs    5.880 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     2.685 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.707 μs ± 101.890 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                ▂▆█▇▅▄▂▂                                      ▂
  ▇▆▃▁▁▆▃▁▃▁▁▁▁▁████████▇▇▇▇▆▆▇▆▆▅▆▇▅▇▅▆▆▇▆▆█▇▇▆▆▆▆▆▇▇▆▇▆▇▆▇▇ █
  2.49 μs      Histogram: log(frequency) by time      3.19 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

@benchmark SpeedyWeather.UV_from_vordiv!(L1, L2, L3, L4, S)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min  max):  1.979 μs   3.183 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     2.125 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.132 μs ± 45.261 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                     ▁█▆▅▁ ▁▁                                ▁
  ▅▅▁▁▁▁▁▄▃▁▁▁▁▁▁▁▁▁▁█████▇██▆▆▆▄▇▅▄▆▅▄▄▅▄▄▅▄▄▄▁▄▅▃▄▄▄▄▅▃▅▃▄ █
  1.98 μs      Histogram: log(frequency) by time     2.38 μs <

 Memory estimate: 16 bytes, allocs estimate: 1.

@benchmark ∇²!(L1, L4, S, inverse=false)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min  max):  1.400 μs    7.075 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     1.508 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.522 μs ± 111.858 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                  ▅▇█                                          
  ▂▂▁▂▁▁▁▁▁▁▁▁▂▁▂▄███▄▃▃▂▃▂▃▂▂▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  1.4 μs          Histogram: frequency by time        1.75 μs <

 Memory estimate: 32 bytes, allocs estimate: 1.

@benchmark ∇²!(L1, L4, S, inverse=true)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min  max):  1.413 μs    8.100 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     1.517 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.528 μs ± 130.292 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                 █                                             
  ▂▂▂▂▁▁▁▁▁▁▁▁▁▂▇█▆█▇▇▇▃▂▂▂▄▅▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▁▂▂▂ ▃
  1.41 μs         Histogram: frequency by time        1.75 μs <

 Memory estimate: 32 bytes, allocs estimate: 1.

@milankl (Member Author) commented Sep 24, 2024

Okay, that's very good, because it means that all our new indexing doesn't cause any problems in practice. Yay! I bet it's some type instability; maybe it needs a function barrier or similar. Given the size of this pull request I don't recall all the changes we made in that direction 🙈

@maximilian-gelbrecht (Member) commented Sep 24, 2024

I was continuing the benchmarking a bit by dissecting dynamics_tendencies! for the PrimitiveWetModel.

There are differences in speed between multiple of the individual functions, but by far the biggest offender both in terms of absolute time and relative difference is vertical_advection!.

There's also e.g. a 4x difference in vertical_integration!, but the absolute difference is much smaller than for vertical_advection!.

Here are detailed results (benchmarked in blocks that are directly comparable):
mk/newvariables https://gist.github.com/maximilian-gelbrecht/d71181d39536336de7347c54f4515fd9
main https://gist.github.com/maximilian-gelbrecht/8f33757a00e3d7b5bcfb3071994339bd

@milankl (Member Author) commented Sep 24, 2024

Awesome! Yeah, the vertical integration and advection did change a bit in terms of structure without the vertical layering. Maybe I wasn't careful enough there.

@milankl (Member Author) commented Sep 24, 2024

I had removed all @inbounds in the dynamics tendencies; they are back in now. Let's see how much of a difference that makes.

@milankl (Member Author) commented Sep 24, 2024

Was (T31L8, with CenteredVerticalAdvection{NF, 1})

julia> @btime SpeedyWeather.vertical_advection!($diagn, $model)
  10.062 ms (202808 allocations: 15.47 MiB)

now

julia> @btime SpeedyWeather.vertical_advection!($diagn, $model)
  281.434 μs (24 allocations: 2.81 KiB)

40x faster! 🎉 The remaining allocations are Symbol creations, which could be replaced, but I don't think that's necessary. The reason was that to create the vertical stencils (1, 1, 2), (1, 2, 3), (2, 3, 4) we had the following line

k_stencil = max.(min.(nlayers, k-B:k+B), 1)

which allocates a vector, and the [2:end] then allocated another one... Previously this wasn't a problem, but with the reordering of the vertical dimension this line moved two loops down and got really expensive!
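The allocation pattern can be illustrated in isolation (a sketch of the issue, not the actual fix committed here):

```julia
# Broadcasting over a range returns a heap-allocated Vector, and slicing it with
# [2:end] allocates a second one. An NTuple-based stencil avoids both allocations.
nlayers, k, B = 8, 1, 1
stencil_vec = max.(min.(nlayers, k-B:k+B), 1)                        # allocates
stencil_tup = ntuple(i -> clamp(k - B + i - 1, 1, nlayers), 2B + 1)  # no Vector allocated
```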

@maximilian-gelbrecht (Member) commented Sep 25, 2024

The full benchmark suite:

Main (https://gist.github.com/maximilian-gelbrecht/97ad1e55c044a1df3ec8d26fb9e8a7d8):

| Model | T | L | Physics | Δt | SYPD | Memory |
|---|---|---|---|---|---|---|
| BarotropicModel | 31 | 1 | false | 1800 | 61467 | 1.26 MB |
| ShallowWaterModel | 31 | 1 | false | 1800 | 37730 | 1.28 MB |
| PrimitiveDryModel | 31 | 8 | true | 1800 | 1611 | 4.24 MB |
| PrimitiveWetModel | 31 | 8 | true | 1800 | 1239 | 4.59 MB |

mk/newvariables (https://gist.github.com/maximilian-gelbrecht/e712878dc3e8ff9b0a31c566cb605def):

| Model | T | L | Physics | Δt | SYPD | Memory |
|---|---|---|---|---|---|---|
| BarotropicModel | 31 | 1 | false | 1800 | 63138 | 1.17 MB |
| ShallowWaterModel | 31 | 1 | false | 1800 | 38006 | 1.19 MB |
| PrimitiveDryModel | 31 | 8 | true | 1800 | 1984 | 4.11 MB |
| PrimitiveWetModel | 31 | 8 | true | 1800 | 1431 | 4.47 MB |

So maybe we are even slightly faster now 🎉

@milankl (Member Author) commented Sep 25, 2024

While we're at it, a little performance overview of the calls in dynamics_tendencies! (T31L8 resolution, PrimitiveWetModel):

| Function | # transforms | time | share of total | memory allocations |
|---|---|---|---|---|
| pressure_gradient_flux! | 2 (2D only) | 93.034 μs | 3% | 260 allocations: 12.88 KiB |
| linear_virtual_temperature! | 0 | 1.762 μs | <0.1% | 0 allocations: 0 bytes |
| temperature_anomaly! | 0 | 3.897 μs | 0.1% | 0 allocations: 0 bytes |
| geopotential! | 0 | 8.988 μs | 0.3% | 0 allocations: 0 bytes |
| vertical_integration! | 0 | 25.179 μs | 1% | 0 allocations: 0 bytes |
| surface_pressure_tendency! | 1 (2D only) | 29.679 μs | 1% | 130 allocations: 6.44 KiB |
| vertical_velocity! | 0 | 7.813 μs | 0.2% | 0 allocations: 0 bytes |
| linear_pressure_gradient! | 0 | 1.454 μs | <0.1% | 0 allocations: 0 bytes |
| vertical_advection! | 0 | 294.082 μs | 10% | 24 allocations: 2.81 KiB |
| vordiv_tendencies! | 2 | 656.258 μs | 23% | 1940 allocations: 98.38 KiB |
| temperature_tendency! | 3 | 1.175 ms | 42% | 2910 allocations: 147.56 KiB |
| humidity_tendency! | 3 | 1.092 ms | 39% | 2910 allocations: 147.56 KiB |
| bernoulli_potential! | 1 | 272.455 μs | 10% | 970 allocations: 49.19 KiB |
| all | 9x3D + 2x2D | 2.8 ms | 100% | 9144 allocations: 464.81 KiB |

(The shares don't add up to 100% because it's faster to benchmark everything together...)
So the transforms very much dominate the right-hand side calculations; most other kernels (except vertical advection, at 10% now) are negligible, whether in grid or spectral space. @jackleland that also shows you how important the transforms are 🙈.
@maximilian-gelbrecht this looks like another benchmark we could run regularly.

Translated into a pie chart:

[pie chart: time share of the dynamics_tendencies! calls]

@maximilian-gelbrecht (Member) commented

Yes, I can add them to the benchmark suite.

@milankl merged commit 798f9bb into main on Sep 25, 2024; 6 checks passed.
@milankl deleted the mk/newvariables branch on October 7, 2024.