
(0.95.3) Reset model clock and skip time_step!ing if next actuation time is tiny #3606

Open · wants to merge 23 commits into main

Conversation

@tomchor (Collaborator) commented May 24, 2024

Closes #3593

This is somewhat of a hack to investigate (and eventually solve) #3593. For now I just want to check whether this breaks too much stuff in the code, but it does seem to solve the problem. Here's an MWE to demonstrate:

using Oceananigans

grid_base = RectilinearGrid(topology = (Bounded, Periodic, Bounded),
                            size = (16, 20, 4), extent = (800, 1000, 100),)
    
@inline east_wall(x, y, z) = x > 400
grid = ImmersedBoundaryGrid(grid_base, GridFittedBoundary(east_wall))

model = NonhydrostaticModel(grid = grid, timestepper = :RungeKutta3, buoyancy = BuoyancyTracer(), tracers = :b,)

N² = 6e-6
b∞(x, y, z) = N² * z
set!(model, b=b∞)
    
simulation = Simulation(model, Δt=25, stop_time=1e4,)

using Statistics: std
using Printf
progress_message(sim) = @printf("Iteration: %04d, time: %s, iteration×Δt: %s, std(pNHS) = %.2e\n",
                                iteration(sim), sim.model.clock.time, iteration(sim) * sim.Δt, std(model.pressures.pNHS))
add_callback!(simulation, progress_message, IterationInterval(1))

simulation.output_writers[:snaps] = NetCDFOutputWriter(model, (; model.pressures.pNHS,),
                                                       filename = "test_pressure.nc",
                                                       schedule = TimeInterval(100),
                                                       overwrite_existing = true,)
run!(simulation)

On main this produces stuff like:

Iteration: 0001, time: 25.0, iteration×Δt: 25.0, std(pNHS) = 6.02e-03
Iteration: 0002, time: 50.0, iteration×Δt: 50.0, std(pNHS) = 6.02e-03
Iteration: 0003, time: 75.0, iteration×Δt: 75.0, std(pNHS) = 6.02e-03
Iteration: 0004, time: 99.99999999999999, iteration×Δt: 100.0, std(pNHS) = 6.02e-03
Iteration: 0005, time: 100.0, iteration×Δt: 125.0, std(pNHS) = 2.72e+10

The last two lines are the notable ones: the clock went from time: 99.99999999999999 to time: 100.0, implying a very tiny time step, which produces a garbage pressure field, as quantified by the last output of the last line: std(pNHS) = 2.72e+10. Note that, because of this, time and iteration×Δt no longer match in the last line: time: 100.0 versus iteration×Δt: 125.0. This "misstep" happens many times throughout a run on main.

On this branch this doesn't happen anymore, and even after many time-steps things remain aligned (albeit with a very small round-off error):

Iteration: 0396, time: 9900.0, iteration×Δt: 9900.0, std(pNHS) = 5.99e-03
Iteration: 0397, time: 9925.000000000002, iteration×Δt: 9925.0, std(pNHS) = 5.99e-03
Iteration: 0398, time: 9950.000000000004, iteration×Δt: 9950.0, std(pNHS) = 5.99e-03
Iteration: 0399, time: 9975.000000000005, iteration×Δt: 9975.0, std(pNHS) = 5.99e-03

Ideally the way to really fix this would be to figure out a way to avoid round-off errors, but I haven't been able to do that yet.
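
For intuition on where the drift comes from (consistent with the RK3 discussion further down in this thread): the :RungeKutta3 timestepper advances the clock in three fractional substeps, and those fractions are not exactly representable in binary, so the clock accumulates round-off even though Δt = 25 itself is exact. A minimal sketch using the standard RK3 substep fractions (the exact drift value may vary):

function rk3_clock_drift(Δt, niter)
    t = 0.0
    for _ in 1:niter, frac in (8/15, 2/15, 1/3)   # RK3 substep fractions; they sum to 1
        t += frac * Δt
    end
    return t
end

rk3_clock_drift(25.0, 4)   # ≈ 99.99999999999999 instead of 100.0

When TimeInterval(100) then tries to align the next step to t = 100, the aligned Δt comes out as ~1e-14 instead of 25.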

    time_step!(sim.model, Δt, callbacks=model_callbacks)
else
    println("Skipping aligned time step, which is of ", Δt)
    sim.model.clock.time = next_actuation_time(sim)
Member:

Could this also be

Suggested change:
- sim.model.clock.time = next_actuation_time(sim)
+ sim.model.clock.time += Δt

?

Collaborator Author:

I don't think this works when we're almost at next_actuation_time since it would skip that actuation time, no?

Member:

What is sim.model.clock.time + Δt vs. next_actuation_time(sim)?

Member:

@tomchor still have this question

Collaborator Author:

Ah, sorry I missed this. It's been so long that I forgot the details, but looking at the code for both, it's possible they end up calculating the same thing, since Δt is calculated with aligned_time_step().

@glwagner (Member)

Nice work! I'm curious about the criterion. Should it be something like

dt = 10 * eps(dt) * sim.dt

? Or does it have to be larger than that (hence the factor 1e10).

It'd be nice not to have to define next_actuation_time for every schedule... it doesn't really make sense for WallTimeInterval either. Plus, we want users to be able to provide custom schedules (since they only need to be a function of model that returns true/false) so that people can trigger output / action using interesting custom criteria...

@tomchor (Collaborator Author) commented May 24, 2024

Nice work! I'm curious about the criterion. Should it be something like

dt = 10 * eps(dt) * sim.dt

? Or does it have to be larger than that (hence the factor 1e10).

I actually don't know what the proper criterion should be. With the one you proposed, the error doesn't go away in this example, since the tiny time-step is about 1e-12 but 10 * eps(dt) * sim.dt comes out to about 1e-13. If we use 100 * eps(dt) * sim.dt then it works. But I don't yet know how well this will generalize to other, more complex simulations. I still have to test these criteria on my own simulations to see what works.

It'd be nice not to have to define next_actuation_time for every schedule... it doesn't really make sense for WallTimeInterval either. Plus, we want users to be able to provide custom schedules (since they only need to be a function of model that returns true/false) so that people can trigger output / action using interesting custom criteria...

Yeah, agree. I'm not sure of a good workaround here though. Do you have suggestions?

For the time being we can just set a fallback method next_actuation_time(schedule) = Inf, I guess? (Similar to what I did for IterationInterval.)
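
A minimal sketch of that fallback idea (illustrative only; the actual schedule structs and signatures in Oceananigans may differ):

# Generic fallback: "this schedule never constrains the clock", so the
# skip logic becomes a no-op for WallTimeInterval, custom schedules, etc.
next_actuation_time(schedule, t) = Inf

# A TimeInterval-like schedule actuates at integer multiples of its interval:
struct EveryInterval   # hypothetical stand-in for TimeInterval
    interval :: Float64
end
next_actuation_time(sch::EveryInterval, t) = sch.interval * (floor(t / sch.interval) + 1)

Computing the actuation time as a multiple of the interval, rather than by accumulating additions, also sidesteps some of the round-off discussed above.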

Also, nice to see that tests pass and nothing is breaking :)

@tomchor (Collaborator Author) commented May 27, 2024

I did a few tests with some criteria for timestep-skipping with a couple of my own simulations in addition to the MWE included here. In summary:

  1. Criterion sim.Δt / 1e10: successfully gets rid of the problem in both the MWE and in my simulations.
  2. Criterion 10 * eps(sim.Δt) * sim.Δt: doesn't get rid of the problem in any simulation.
  3. Criterion 100 * eps(sim.Δt) * sim.Δt: fixes the problem in the MWE but not in my simulations, although it does make it occur noticeably less often.
  4. Criterion 1000 * eps(sim.Δt) * sim.Δt: fixes everything in all simulations I've tried.

So only options 1 and 4 fully fix the problem (at least in the simulations I've tried so far). Both rely on pretty arbitrary numbers though, so I'm not very happy with either. If we view the timestep-skipping as an approximation ($u^{n+1} \approx u^n$), then criterion 1 maybe makes more sense, although I'm not sure how it would behave for Float32 simulations.

I see three possible ways to go about it right now:

  1. Do what this PR is doing, and manually set the criterion to either option 1 or 4 above (see the sketch below). If it turns out that some simulations still have issues, we revisit.
  2. We add min_Δt as a property of NonhydrostaticModel (or maybe Simulation?). I think the minimum Δt for which time skipping will be necessary will vary significantly between simulations, so this solution deals with that by leaving the decision up to the user if they are interested in the pressure output.
  3. We try something that actually prevents these round-off errors instead of dealing with them. @glwagner suggested an Integer-based model clock, but there might be other options.
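
For concreteness, option 1 amounts to something like the following in the time-stepping loop (a sketch using the names this PR introduces, minimum_relative_step and next_actuation_time; the actual diff appears further down):

Δt = aligned_time_step(sim, sim.Δt)
if Δt < sim.minimum_relative_step * sim.Δt
    # The aligned step is negligibly small: skip time_step! entirely and
    # snap the clock to the actuation time the alignment was targeting.
    sim.model.clock.time = next_actuation_time(sim)
else
    time_step!(sim.model, Δt)
end

With minimum_relative_step = 1e-10 this is criterion 1 (skip whenever Δt < sim.Δt / 1e10).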

@glwagner (Member) commented May 28, 2024

I did a few tests with some criteria for timestep-skipping with a couple of my own simulations in addition to the MWE included here. [...] So only options 1 and 4 fully fix the problem (at least in the simulations I've tried so far). Both rely on pretty arbitrary numbers though, so I'm not very happy with either. [...]

Note that eps(sim.Δt) is similar to sim.Δt * eps(typeof(sim.Δt)). So Δt / 1e10 is pretty similar to 100000 * eps(sim.Δt). The only point of using eps is to avoid hard-coding Float64.
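
Putting numbers on that comparison (Float64, Δt = 25 as in the MWE):

Δt = 25.0
eps(Δt)               # ≈ 3.55e-15, the spacing of Float64 values near 25
Δt * eps(typeof(Δt))  # ≈ 5.55e-15, same order of magnitude
Δt / 1e10             # 2.5e-9 ≈ 7e5 * eps(Δt), so criterion 1 is by far the loosest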

@tomchor (Collaborator Author) commented Jun 5, 2024

@glwagner are you okay if I just add min_Δt as a property of NonhydrostaticModel and keep the strategy of skipping the time step if Δt is smaller than that? I think that's a reasonable and simple way to fix this.

@glwagner (Member) commented Jun 5, 2024

This isn't a problem of the nonhydrostatic model specifically. Why would we add it as a property there?

@tomchor (Collaborator Author) commented Jun 5, 2024

This isn't a problem of the nonhydrostatic model specifically. Why would we add it as a property there?

This PR is supposed to fix the output of the pressure, which can be unreasonably large if the time step is really small due to, I believe, the nonhydrostatic pressure correction, which is specific to the NonhydrostaticModel. Unless I'm missing something.

This wouldn't necessarily fix #3056, which may be what you're thinking of. I can also add min_Δt to Simulation, so that this can be used with other models.

@glwagner (Member) commented Jun 5, 2024

I think we should fix the problem once. Otherwise we'll end up with unnecessary code somewhere that has to be deleted.

@tomchor (Collaborator Author) commented Jun 5, 2024

I think we should fix the problem once. Otherwise we'll end up with unnecessary code somewhere that has to be deleted.

@glwagner Can you please be clearer? Does that mean adding min_Δt to Simulation is an acceptable solution? Or should we try to prevent these round-off errors from happening in the first place?

@glwagner (Member) commented Jun 6, 2024

I think the solution discussed here, where the time-step change associated with TimeInterval schedules is restricted by a sort of relative-tolerance criterion, is acceptable if we can't tease out the underlying issue (or if it's unsolvable).

If we could indeed solve the problem simply by eliminating round-off error, that would almost certainly be preferred, since it might be much simpler (e.g., just fixing a floating-point-unstable arithmetic operation by rearranging terms). That could be really easy.
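
As an example of that kind of rearrangement (illustrative, not the PR's code): computing an actuation time as a multiple of the interval involves a single rounding, whereas accumulating additions compounds error at every step:

function accumulate_time(interval, n)
    t = 0.0
    for _ in 1:n
        t += interval   # round-off compounds with every addition
    end
    return t
end

accumulate_time(0.1, 1000)   # ≈ 99.9999999999986
1000 * 0.1                   # 100.0 — a single correctly rounded multiplication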

@Sbozzolo might be able to help, because I believe they do something special to avoid round-off issues in ClimaAtmos.

I would hesitate to establish an absolute min_Δt that's independent of the units being used, unless the default is 0.

@glwagner (Member) commented Jun 6, 2024

I'll try to get going with the MWE here. Is the immersed boundary actually essential to the MWE (i.e., does the error not occur without it)? And the buoyancy?

@glwagner (Member) commented Jun 6, 2024

It seems that without the immersed boundary, there is still a problem of super small time-steps, but the pressure does not blow up.

@glwagner (Member) commented Jun 6, 2024

Ok getting closer maybe...

I think this problem is generic and cannot be solved in general for arbitrary time steps. Here's a few thoughts:

  • Reading about Kahan summation makes it clear that we simply cannot avoid errors if we would like to add a small floating point number (the time step) to a very large number (the model time).

  • I think the issue with the time-step is whether or not we can compute the RHS of the pressure Poisson equation accurately --- which is div(u') / Δt, where u' = u + Δt * Gu is the predictor velocity and div is the divergence. This is interesting, because I could not figure out why we would ever find large div(u') with small Δt, even in this MWE. But now I realize that, because of the status of the immersed Poisson solver, the velocity along the boundary is divergent, strongly so. So div(u') is large along the boundary, and when we divide by Δt we get something huge (see the sketch after this list). The magnitude of div(u') also somehow seems to depend on the time step (as does the magnitude of the spurious circulation). The correct solution to this case remains at rest, of course. (An aside: this problem could be avoided by separately computing the hydrostatic pressure and then using special horizontal gradient operators that avoid computing a hydrostatic pressure gradient across an immersed boundary. However, this would only be correct for no-flux boundary conditions on buoyancy on side walls.) Anyways, apparently because of this issue with the immersed pressure solver, it seems that div(u') is large (because div(u) is large) even when Δt = O(1e-14)...

  • As a result of all of this I am confused about whether this MWE is actually reliable for debugging the issue. I guess we should expect to see problems simply when Δt = O(eps), because this is when div(u') / Δt cannot be reliably computed, I think. This leads to a fairly simple criterion for the time step that's compatible with the pressure correction. But as noted in this PR, this is not enough to fix issues with the immersed boundary MWE... though whether or not that is because of problems with the setup itself, I'm not sure...

  • All of that said, taking @tomchor's suggestion to be more careful in updating the clock for RK3 actually does solve the MWE here. Obviously, this is again addressing the (in principle not entirely solvable) issue of error accumulation in clock.time, rather than the other issue of very small time-steps producing an ill-posed pressure correction. I think we should fix RK3 separately: since we cannot completely avoid accumulating error in clock.time, every little thing we do to make it more accurate is a good idea.
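
A back-of-the-envelope illustration of the amplification described in the second bullet (the numbers are made up for illustration, not taken from the MWE):

# The pressure Poisson RHS is div(u′) / Δt, with u′ = u + Δt * Gu:
div_u′ = 1e-9          # residual divergence along the immersed boundary
Δt     = 1e-14         # a "tiny" aligned time step
rhs    = div_u′ / Δt   # 1.0e5 — a huge Poisson source, hence the huge std(pNHS)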

@glwagner (Member) commented Jun 6, 2024

PR #3617 helps with this MWE

@Sbozzolo (Member) commented Jun 6, 2024

@Sbozzolo might be able to help because I believe they do something special to avoid round off issues in ClimaAtmos.

With regards to floating-point instabilities due to arithmetic with time and time intervals: ultimately, we will be solving this issue (and others) by moving away from a floating-point time in favor of an integer one (e.g., milliseconds from a start date). As a fun fact, if you are using Float32 and keep track of your time in seconds, t + dt is going to have numerical error after approximately one year of simulated time (which is not that much).
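
A minimal sketch of the integer-clock idea (illustrative only; not how ClimaAtmos implements it):

struct IntegerClock
    milliseconds :: Int64   # time since the start date, in integer milliseconds
end

advance(c::IntegerClock, Δt_seconds) =
    IntegerClock(c.milliseconds + round(Int64, 1000 * Δt_seconds))

time_seconds(c::IntegerClock) = c.milliseconds / 1000

function advance_n(c, Δt, n)
    for _ in 1:n
        c = advance(c, Δt)
    end
    return c
end

time_seconds(advance_n(IntegerClock(0), 25.0, 4)) == 100.0   # true: integer accumulation is exact

On the Float32 point: past t ≈ 2^24 s (about 194 days), adjacent Float32 values are 2 s apart, so adding a Δt of a second or less to t can be lost entirely.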

[Resolved review thread on src/Simulations/run.jl (outdated)]
@glwagner (Member)

Looks good, maybe bump the minor version

@tomchor (Collaborator Author) commented Dec 14, 2024

Looks good, maybe bump the minor version

Done! I wanna test this code in my actual simulation before sending it off for reviews though.

Also, I'm thinking of using the MWE at the top comment as a test. It's the most reliable way I could reproduce this issue. Thoughts?

@glwagner (Member)

Also, I'm thinking of using the MWE at the top comment as a test. It's the most reliable way I could reproduce this issue. Thoughts?

I would just put in a very simple test that time-stepping is skipped correctly, by manually taking two time-steps and changing dt in between.

I think the kind of test you are describing may not be appropriate for CI; it's more a validation test. I think if you can put a unit test to show the feature works correctly, you can later show that using the feature solves pressure solver issues and be happy that the unit test ensures it will continue to work as in the validation test.

@tomchor (Collaborator Author) commented Dec 15, 2024

I tested this branch on my simulation and things work fine. However, when trying to include a test I came across some weird behavior. Namely the snippet below fails:

using Oceananigans
using Test

grid  = RectilinearGrid(size=(1, 1, 1), extent=(1, 1, 1))
model = NonhydrostaticModel(; grid)

simulation = Simulation(model, Δt=3, stop_iteration=1)
run!(simulation)

stop_time = 1.0
simulation = Simulation(model, Δt=1; stop_time)
run!(simulation)

@test simulation.model.clock.time == stop_time

with

Test Failed at REPL[11]:1
  Expression: simulation.model.clock.time == stop_time
   Evaluated: 4.0 == 1.0

If I re-build model before re-creating a simulation, then it works:

using Oceananigans
using Test

grid  = RectilinearGrid(size=(1, 1, 1), extent=(1, 1, 1))
model = NonhydrostaticModel(; grid)

simulation = Simulation(model, Δt=3, stop_iteration=1)
run!(simulation)

stop_time = 1.0
model = NonhydrostaticModel(; grid)
simulation = Simulation(model, Δt=1; stop_time)
run!(simulation)

@test simulation.model.clock.time == stop_time

So there seems to be some attribute of model (presumably model.clock?) that's leading to different Simulation behavior. Is this expected? I couldn't quite figure out what was happening.

Comment on lines +121 to +123
Δt = aligned_time_step(sim, sim.Δt)
if Δt < sim.minimum_relative_step * sim.Δt
    next_time = next_actuation_time(sim)
Collaborator Author:

FYI, next_actuation_time(sim) and sim.model.clock.time + Δt are equivalent. I chose the former here to avoid the round-off error from the addition. But let me know if I should change it to just use the latter.

Member:

Doesn't it complicate the code to define next_actuation_time in addition to aligned_time_step? Things go wrong if you change one but not the other. It's usually best to have one "source" of reality / truth

@simone-silvestri (Collaborator)

So there seems to be some attribute of model (presumably model.clock?) that's leading to different Simulation behavior. Is this expected? I couldn't quite figure out what was happening.

Indeed, the clock is not reset to zero when the simulation is built, so you are restarting from model.clock.time == 3.0.

@tomchor (Collaborator Author) commented Dec 15, 2024

So there seems to be some attribute of model (presumably model.clock?) that's leading to different Simulation behavior. Is this expected? I couldn't quite figure out what was happening.

Indeed, the clock is not reset to zero when the simulation is built, so you are restarting from model.clock == 3.0.

Ah, true! I'm ashamed I forgot about that. Although, in this case shouldn't the simulation just not iterate and keep the clock at 3?

@tomchor tomchor marked this pull request as ready for review December 15, 2024 15:10
@tomchor tomchor changed the title Reset model clock and skip time_step!ing if next actuation time is tiny (0.95.3) Reset model clock and skip time_step!ing if next actuation time is tiny Dec 15, 2024
@@ -88,7 +93,8 @@ function Simulation(model; Δt,
0.0,
false,
false,
verbose)
verbose,
Float64(minimum_relative_step))
Collaborator:

why Float64?

Collaborator Author:

I was following the other floats in the Simulation constructor, which are also converted to Float64. I can't remember the PR where this was decided, but it minimizes errors in time-step alignment. The error that this PR solves is an example of this type of error.

Collaborator:

Probably we should make it optional: this will not work if we want to support Metal architectures, which do not support Float64.
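
One possible shape for that (a sketch, not part of this PR): take the float type from the grid instead of hard-coding Float64, so it would be Float32 on Metal:

FT = eltype(grid)   # the grid's float type, e.g. Float32 on Metal
minimum_relative_step = convert(FT, minimum_relative_step)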
