
(0.95.3) Reset model clock and skip time_step!ing if next actuation time is tiny #3606

Open · wants to merge 23 commits into main

Conversation

@tomchor (Collaborator) commented May 24, 2024

Closes #3593

This is somewhat of a hack to investigate (and eventually solve) #3593. For now I just want to check whether this breaks too much stuff in the code, but it does seem to solve the problem. Here's an MWE to demonstrate:

using Oceananigans

grid_base = RectilinearGrid(topology = (Bounded, Periodic, Bounded),
                            size = (16, 20, 4), extent = (800, 1000, 100),)
    
@inline east_wall(x, y, z) = x > 400
grid = ImmersedBoundaryGrid(grid_base, GridFittedBoundary(east_wall))

model = NonhydrostaticModel(grid = grid, timestepper = :RungeKutta3, buoyancy = BuoyancyTracer(), tracers = :b,)

N² = 6e-6
b∞(x, y, z) = N² * z
set!(model, b=b∞)
    
simulation = Simulation(model, Δt=25, stop_time=1e4,)

using Statistics: std
using Printf
progress_message(sim) = @printf("Iteration: %04d, time: %s, iteration×Δt: %s, std(pNHS) = %.2e\n",
                                iteration(sim), sim.model.clock.time, iteration(sim) * sim.Δt, std(model.pressures.pNHS))
add_callback!(simulation, progress_message, IterationInterval(1))

simulation.output_writers[:snaps] = NetCDFOutputWriter(model, (; model.pressures.pNHS,),
                                                       filename = "test_pressure.nc",
                                                       schedule = TimeInterval(100),
                                                       overwrite_existing = true,)
run!(simulation)

On main this produces stuff like:

Iteration: 0001, time: 25.0, iteration×Δt: 25.0, std(pNHS) = 6.02e-03
Iteration: 0002, time: 50.0, iteration×Δt: 50.0, std(pNHS) = 6.02e-03
Iteration: 0003, time: 75.0, iteration×Δt: 75.0, std(pNHS) = 6.02e-03
Iteration: 0004, time: 99.99999999999999, iteration×Δt: 100.0, std(pNHS) = 6.02e-03
Iteration: 0005, time: 100.0, iteration×Δt: 125.0, std(pNHS) = 2.72e+10

The last two lines are the notable ones: the clock went from time: 99.99999999999999 to time: 100.0, implying a very tiny time step, which produces a garbage pressure field, as quantified by the last output of the last line: std(pNHS) = 2.72e+10. Note that, because of this, time and iteration×Δt no longer match in the last line: time: 100.0 versus iteration×Δt: 125.0. This "misstep" happens many times throughout a run on main.

On this branch this doesn't happen anymore, and even after many time-steps things remain aligned (albeit with a very small round-off error):

Iteration: 0396, time: 9900.0, iteration×Δt: 9900.0, std(pNHS) = 5.99e-03
Iteration: 0397, time: 9925.000000000002, iteration×Δt: 9925.0, std(pNHS) = 5.99e-03
Iteration: 0398, time: 9950.000000000004, iteration×Δt: 9950.0, std(pNHS) = 5.99e-03
Iteration: 0399, time: 9975.000000000005, iteration×Δt: 9975.0, std(pNHS) = 5.99e-03

Ideally the way to really fix this would be to figure out a way to avoid round-off errors, but I haven't been able to do that yet.
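
For intuition on where the drift comes from (consistent with the RK3 discussion further down in this thread): the :RungeKutta3 timestepper advances the clock in three fractional substeps, and those fractions are not exactly representable in binary, so the clock accumulates round-off even though Δt = 25 itself is exact. A minimal sketch using the standard RK3 substep fractions (the exact drift value may vary):

function rk3_clock_drift(Δt, niter)
    t = 0.0
    for _ in 1:niter, frac in (8/15, 2/15, 1/3)   # RK3 substep fractions; they sum to 1
        t += frac * Δt
    end
    return t
end

rk3_clock_drift(25.0, 4)   # ≈ 99.99999999999999 instead of 100.0

When TimeInterval(100) then tries to align the next step to t = 100, the aligned Δt comes out as ~1e-14 instead of 25.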

    time_step!(sim.model, Δt, callbacks=model_callbacks)
else
    println("Skipping aligned time step, which is of ", Δt)
    sim.model.clock.time = next_actuation_time(sim)
Member:

Could this also be

Suggested change:
- sim.model.clock.time = next_actuation_time(sim)
+ sim.model.clock.time += Δt

?

Collaborator Author:

I don't think this works when we're almost at next_actuation_time since it would skip that actuation time, no?

Member:

What is sim.model.clock.time + Δt vs. next_actuation_time(sim)?

Member:

@tomchor still have this question

Collaborator Author:

Ah, sorry I missed this. It's been so long that I forgot the details, but looking at the code for both, it's possible they end up calculating the same thing, since Δt is calculated with aligned_time_step().

@glwagner (Member)

Nice work! I'm curious about the criterion. Should it be something like

dt = 10 * eps(dt) * sim.dt

? Or does it have to be larger than that (hence the factor 1e10).

It'd be nice not to have to define next_actuation_time for every schedule... it doesn't really make sense for WallTimeInterval either. Plus, we want users to be able to provide custom schedules (since they only need to be a function of model that returns true/false) so that people can trigger output / action using interesting custom criteria...

@tomchor (Collaborator Author) commented May 24, 2024

Nice work! I'm curious about the criterion. Should it be something like

dt = 10 * eps(dt) * sim.dt

? Or does it have to be larger than that (hence the factor 1e10).

I actually don't know what the proper criterion should be. With the one you proposed, the error doesn't go away in this example, since the tiny time-step is about 1e-12 but 10 * eps(dt) * sim.dt comes out to about 1e-13. If we use 100 * eps(dt) * sim.dt then it works. But I don't yet know how well this will generalize to other, more complex simulations. I still have to test these criteria on my own simulations to see what works.

It'd be nice not to have to define next_actuation_time for every schedule... it doesn't really make sense for WallTimeInterval either. Plus, we want users to be able to provide custom schedules (since they only need to be a function of model that returns true/false) so that people can trigger output / action using interesting custom criteria...

Yeah, agree. I'm not sure of a good workaround here though. Do you have suggestions?

For the time being we can just set a fallback method next_actuation_time(schedule) = Inf, I guess? (Similar to what I did for IterationInterval.)
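
A minimal sketch of that fallback idea (illustrative only; the actual schedule structs and signatures in Oceananigans may differ):

# Generic fallback: "this schedule never constrains the clock", so the
# skip logic becomes a no-op for WallTimeInterval, custom schedules, etc.
next_actuation_time(schedule, t) = Inf

# A TimeInterval-like schedule actuates at integer multiples of its interval:
struct EveryInterval   # hypothetical stand-in for TimeInterval
    interval :: Float64
end
next_actuation_time(sch::EveryInterval, t) = sch.interval * (floor(t / sch.interval) + 1)

Computing the actuation time as a multiple of the interval, rather than by accumulating additions, also sidesteps some of the round-off discussed above.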

Also, nice to see that tests pass and nothing is breaking :)

@tomchor (Collaborator Author) commented May 27, 2024

I did a few tests with some criteria for timestep-skipping with a couple of my own simulations in addition to the MWE included here. In summary:

  1. Criterion sim.Δt / 1e10: successfully gets rid of the problem in both the MWE and in my simulations.
  2. Criterion 10 * eps(sim.Δt) * sim.Δt: doesn't get rid of the problem in any simulation.
  3. Criterion 100 * eps(sim.Δt) * sim.Δt: fixes the problem in the MWE but not in my simulations, although it does make it occur noticeably less often.
  4. Criterion 1000 * eps(sim.Δt) * sim.Δt: fixes everything in all simulations I've tried.

So only options 1 and 4 fully fix the problem (at least in the simulations I've tried so far). Both rely on pretty arbitrary numbers though, so I'm not very happy with either. If we view the timestep-skipping as an approximation ($u^{n+1} \approx u^n$), then criterion 1 maybe makes more sense, although I'm not sure how it would behave for Float32 simulations.

I see three possible ways to go about it right now:

  1. Do what this PR is doing, and manually set the criterion to either option 1 or 4 above (see the sketch below). If it turns out that some simulations still have issues, we revisit.
  2. We add min_Δt as a property of NonhydrostaticModel (or maybe Simulation?). I think the minimum Δt for which time skipping will be necessary will vary significantly between simulations, so this solution deals with that by leaving the decision up to the user if they are interested in the pressure output.
  3. We try something that actually prevents these round-off errors instead of dealing with them. @glwagner suggested an Integer-based model clock, but there might be other options.
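
For concreteness, option 1 amounts to something like the following in the time-stepping loop (a sketch using the names this PR introduces, minimum_relative_step and next_actuation_time; the actual diff appears further down):

Δt = aligned_time_step(sim, sim.Δt)
if Δt < sim.minimum_relative_step * sim.Δt
    # The aligned step is negligibly small: skip time_step! entirely and
    # snap the clock to the actuation time the alignment was targeting.
    sim.model.clock.time = next_actuation_time(sim)
else
    time_step!(sim.model, Δt)
end

With minimum_relative_step = 1e-10 this is criterion 1 (skip whenever Δt < sim.Δt / 1e10).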

@glwagner (Member) commented May 28, 2024

I did a few tests with some criteria for timestep-skipping with a couple of my own simulations in addition to the MWE included here. [...] So only options 1 and 4 fully fix the problem (at least in the simulations I've tried so far). Both rely on pretty arbitrary numbers though, so I'm not very happy with either. [...]

Note that eps(sim.Δt) is similar to sim.Δt * eps(typeof(sim.Δt)). So Δt / 1e10 is pretty similar to 100000 * eps(sim.Δt). The only point of using eps is to avoid hard-coding Float64.
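
Putting numbers on that comparison (Float64, Δt = 25 as in the MWE):

Δt = 25.0
eps(Δt)               # ≈ 3.55e-15, the spacing of Float64 values near 25
Δt * eps(typeof(Δt))  # ≈ 5.55e-15, same order of magnitude
Δt / 1e10             # 2.5e-9 ≈ 7e5 * eps(Δt), so criterion 1 is by far the loosest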

@tomchor (Collaborator Author) commented Jun 5, 2024

@glwagner are you okay if I just add min_Δt as a property of NonhydrostaticModel and keep the strategy of skipping the time step if Δt is smaller than that? I think that's a reasonable and simple way to fix this.

@glwagner (Member) commented Jun 5, 2024

This isn't a problem of the nonhydrostatic model specifically. Why would we add it as a property there?

@tomchor (Collaborator Author) commented Jun 5, 2024

This isn't a problem of the nonhydrostatic model specifically. Why would we add it as a property there?

This PR is supposed to fix the output of the pressure, which can be unreasonably large if the time step is really small due to, I believe, the nonhydrostatic pressure correction, which is specific to the NonhydrostaticModel. Unless I'm missing something.

This wouldn't necessarily fix #3056, which may be what you're thinking of. I can also add min_Δt to Simulation, so that this can be used with other models.

@glwagner (Member) commented Jun 5, 2024

I think we should fix the problem once. Otherwise we'll end up with unnecessary code somewhere that has to be deleted.

@tomchor (Collaborator Author) commented Jun 5, 2024

I think we should fix the problem once. Otherwise we'll end up with unnecessary code somewhere that has to be deleted.

@glwagner Can you please be clearer? Does that mean adding min_Δt to Simulation is an acceptable solution? Or should we try to prevent these round-off errors from happening in the first place?

@glwagner (Member) commented Jun 6, 2024

I think the solution discussed here, where the time-step change associated with TimeInterval schedules is restricted by a sort of relative-tolerance criterion, is acceptable if we can't tease out the underlying issue (or if it's unsolvable).

If we could indeed solve the problem simply by eliminating round-off error, that would almost certainly be preferred, since it might be much simpler (e.g., just fixing a floating-point-unstable arithmetic operation by rearranging terms). That could be really easy.
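
As an example of that kind of rearrangement (illustrative, not the PR's code): computing an actuation time as a multiple of the interval involves a single rounding, whereas accumulating additions compounds error at every step:

function accumulate_time(interval, n)
    t = 0.0
    for _ in 1:n
        t += interval   # round-off compounds with every addition
    end
    return t
end

accumulate_time(0.1, 1000)   # ≈ 99.9999999999986
1000 * 0.1                   # 100.0 — a single correctly rounded multiplication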

@Sbozzolo might be able to help, because I believe they do something special to avoid round-off issues in ClimaAtmos.

I would hesitate to establish an absolute min_Δt that's independent of the units being used, unless the default is 0.

@glwagner (Member) commented Jun 6, 2024

I'll try to get going with the MWE here. Is the immersed boundary actually essential to the MWE (i.e., does the error not occur without it)? And the buoyancy?

@glwagner (Member) commented Jun 6, 2024

It seems that without the immersed boundary, there is still a problem of super small time-steps, but the pressure does not blow up.

@glwagner (Member) commented Jun 6, 2024

Ok getting closer maybe...

I think this problem is generic and cannot be solved in general for arbitrary time steps. Here's a few thoughts:

  • Reading about Kahan summation makes it clear that we simply cannot avoid errors if we would like to add a small floating point number (the time step) to a very large number (the model time).

  • I think the issue with the time-step is whether or not we can compute the RHS of the pressure Poisson equation accurately --- which is div(u') / Δt, where u' = u + Δt * Gu is the predictor velocity and div is the divergence. This is interesting, because I could not figure out why we would ever find large div(u') with small Δt, even in this MWE. But now I realize that, because of the status of the immersed Poisson solver, the velocity along the boundary is divergent, strongly so. So div(u') is large along the boundary, and when we divide by Δt we get something huge (see the sketch after this list). The magnitude of div(u') also somehow seems to depend on the time step (as does the magnitude of the spurious circulation). The correct solution to this case remains at rest, of course. (An aside: this problem could be avoided by separately computing the hydrostatic pressure and then using special horizontal gradient operators that avoid computing a hydrostatic pressure gradient across an immersed boundary. However, this would only be correct for no-flux boundary conditions on buoyancy on side walls.) Anyways, apparently because of this issue with the immersed pressure solver, it seems that div(u') is large (because div(u) is large) even when Δt = O(1e-14)...

  • As a result of all of this I am confused about whether this MWE is actually reliable for debugging the issue. I guess we should expect to see problems simply when Δt = O(eps), because this is when div(u') / Δt cannot be reliably computed, I think. This leads to a fairly simple criterion for the time step that's compatible with the pressure correction. But as noted in this PR, this is not enough to fix issues with the immersed boundary MWE... though whether or not that is because of problems with the setup itself, I'm not sure...

  • All of that said, taking @tomchor's suggestion to be more careful in updating the clock for RK3 actually does solve the MWE here. Obviously, this is again addressing the (in principle not entirely solvable) issue of error accumulation in clock.time, rather than the other issue of very small time-steps producing an ill-posed pressure correction. I think we should fix RK3 separately: since we cannot completely avoid accumulating error in clock.time, every little thing we do to make it more accurate is a good idea.
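
A back-of-the-envelope illustration of the amplification described in the second bullet (the numbers are made up for illustration, not taken from the MWE):

# The pressure Poisson RHS is div(u′) / Δt, with u′ = u + Δt * Gu:
div_u′ = 1e-9          # residual divergence along the immersed boundary
Δt     = 1e-14         # a "tiny" aligned time step
rhs    = div_u′ / Δt   # 1.0e5 — a huge Poisson source, hence the huge std(pNHS)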

@glwagner (Member) commented Jun 6, 2024

PR #3617 helps with this MWE

@Sbozzolo (Member) commented Jun 6, 2024

@Sbozzolo might be able to help because I believe they do something special to avoid round off issues in ClimaAtmos.

With regards to floating-point instabilities due to arithmetic with time and time intervals: ultimately, we will be solving this issue (and others) by moving away from a floating-point time in favor of an integer one (e.g., milliseconds from a start date). As a fun fact, if you are using Float32 and keep track of your time in seconds, t + dt is going to have numerical error after approximately one year of simulated time (which is not that much).
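
A minimal sketch of the integer-clock idea (illustrative only; not how ClimaAtmos implements it):

struct IntegerClock
    milliseconds :: Int64   # time since the start date, in integer milliseconds
end

advance(c::IntegerClock, Δt_seconds) =
    IntegerClock(c.milliseconds + round(Int64, 1000 * Δt_seconds))

time_seconds(c::IntegerClock) = c.milliseconds / 1000

function advance_n(c, Δt, n)
    for _ in 1:n
        c = advance(c, Δt)
    end
    return c
end

time_seconds(advance_n(IntegerClock(0), 25.0, 4)) == 100.0   # true: integer accumulation is exact

On the Float32 point: past t ≈ 2^24 s (about 194 days), adjacent Float32 values are 2 s apart, so adding a Δt of a second or less to t can be lost entirely.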

[Resolved review thread on src/Simulations/run.jl (outdated)]
@glwagner (Member)

Looks good, maybe bump the minor version

@tomchor (Collaborator Author) commented Dec 14, 2024

Looks good, maybe bump the minor version

Done! I wanna test this code in my actual simulation before sending it off for reviews though.

Also, I'm thinking of using the MWE at the top comment as a test. It's the most reliable way I could reproduce this issue. Thoughts?

@glwagner (Member)

Also, I'm thinking of using the MWE at the top comment as a test. It's the most reliable way I could reproduce this issue. Thoughts?

I would just put in a very simple test that time-stepping is skipped correctly, by manually taking two time-steps and changing dt in between.

I think the kind of test you are describing may not be appropriate for CI; it's more a validation test. I think if you can put a unit test to show the feature works correctly, you can later show that using the feature solves pressure solver issues and be happy that the unit test ensures it will continue to work as in the validation test.

@tomchor (Collaborator Author) commented Dec 15, 2024

I tested this branch on my simulation and things work fine. However, when trying to include a test I came across some weird behavior. Namely the snippet below fails:

using Oceananigans
using Test

grid  = RectilinearGrid(size=(1, 1, 1), extent=(1, 1, 1))
model = NonhydrostaticModel(; grid)

simulation = Simulation(model, Δt=3, stop_iteration=1)
run!(simulation)

stop_time = 1.0
simulation = Simulation(model, Δt=1; stop_time)
run!(simulation)

@test simulation.model.clock.time == stop_time

with

Test Failed at REPL[11]:1
  Expression: simulation.model.clock.time == stop_time
   Evaluated: 4.0 == 1.0

If I re-build model before re-creating a simulation, then it works:

using Oceananigans
using Test

grid  = RectilinearGrid(size=(1, 1, 1), extent=(1, 1, 1))
model = NonhydrostaticModel(; grid)

simulation = Simulation(model, Δt=3, stop_iteration=1)
run!(simulation)

stop_time = 1.0
model = NonhydrostaticModel(; grid)
simulation = Simulation(model, Δt=1; stop_time)
run!(simulation)

@test simulation.model.clock.time == stop_time

So there seems to be some attribute of model (presumably model.clock?) that's leading to different Simulation behavior. Is this expected? I couldn't quite figure out what was happening.

Comment on lines +121 to +123
Δt = aligned_time_step(sim, sim.Δt)
if Δt < sim.minimum_relative_step * sim.Δt
    next_time = next_actuation_time(sim)
Collaborator Author:

FYI, next_actuation_time(sim) and sim.model.clock.time + Δt are equivalent. I chose the former here to avoid the round-off error from the addition. But let me know if I should change it to just use the latter.

Member:

Doesn't it complicate the code to define next_actuation_time in addition to aligned_time_step? Things go wrong if you change one but not the other. It's usually best to have one "source" of reality / truth

@simone-silvestri (Collaborator)

So there seems to be some attribute of model (presumably model.clock?) that's leading to different Simulation behavior. Is this expected? I couldn't quite figure out what was happening.

Indeed, the clock is not reset to zero when the simulation is built, so you are restarting from model.clock.time == 3.0.

@tomchor (Collaborator Author) commented Dec 15, 2024

So there seems to be some attribute of model (presumably model.clock?) that's leading to different Simulation behavior. Is this expected? I couldn't quite figure out what was happening.

Indeed, the clock is not reset to zero when the simulation is built, so you are restarting from model.clock == 3.0.

Ah, true! I'm ashamed I forgot about that. Although, in this case shouldn't the simulation just not iterate and keep the clock at 3?

@tomchor tomchor marked this pull request as ready for review December 15, 2024 15:10
@tomchor tomchor changed the title Reset model clock and skip time_step!ing if next actuation time is tiny (0.95.3) Reset model clock and skip time_step!ing if next actuation time is tiny Dec 15, 2024
@@ -88,7 +93,8 @@ function Simulation(model; Δt,
0.0,
false,
false,
verbose)
verbose,
Float64(minimum_relative_step))
Collaborator:

why Float64?

Collaborator Author:

I was following the other floats in the Simulation constructor, which are also converted to Float64. I can't remember the PR where this was decided, but it minimizes errors in time-step alignment. The error that this PR solves is an example of this type of error.

Collaborator:

Probably we should make it optional: this will not work if we want to support Metal architectures, which do not support Float64.
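
One possible shape for that (a sketch, not part of this PR): take the float type from the grid instead of hard-coding Float64, so it would be Float32 on Metal:

FT = eltype(grid)   # the grid's float type, e.g. Float32 on Metal
minimum_relative_step = convert(FT, minimum_relative_step)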
