Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Principled forward model for uniformly converged cost to go #766

Closed
dannyopts opened this issue Jul 25, 2024 · 3 comments
Closed

Principled forward model for uniformly converged cost to go #766

dannyopts opened this issue Jul 25, 2024 · 3 comments

Comments

@dannyopts
Copy link

Hi,

I am interested in understanding the cost to go function across a broad range of different states, not just on chain.

With the default forward pass we end up only exploring the state space in a restricted domain which is visitied when the system is controlled close to optimally and the model can perform poorly "off-chain".

For example I have an energy system with a number of stores and I want to understand the marginal value of additional storage in different states on the converged model, so uniformly converging the store state components of the CTG function is important.

One option would be to uniformly randomly sample from the store state in the forward pass, but I'm curious if there is a more principled way to select the next set of points to explore (forward path) in order to maximise the expected learning.

At first I thought about if bayesian optimisation or similar could be used to sample from the state space like in hyper parameter optimisation however the big difference here is that the "black box" function we want to minimise is something like a combination of expected optimal policy improvement and the expected off policy CTG error improvement which is continually changing.

Once we have visited an off-chain point, it seems reasonable to assume that the expected off policy performance improvement from visiting that point again will be significantly reduced since we have corrected the first order errors. So a reasonable approximation of expected off policy performance improvement would be the time since we evaluated near that point.

One way to do this would be to have a modified forward pass which maintains a tabu list and includes the L1 distance from each of these points as a form of regularisation.

Does this make sense? Is there a better way of approaching this problem?

@odow
Copy link
Owner

odow commented Jul 25, 2024

You are entirely correct that the forward pass does not need to "optimize" anything and that we can select the trial points somewhat arbitrarily.

I don't have any suggestions for what the right approach is, and I'm not aware of any literature on this, but it shouldn't be too hard to implement as a new forward pass:
https://github.com/odow/SDDP.jl/blob/master/src/plugins/forward_passes.jl

We do a form of upper confidence bound learning in the duality handlers:

function BanditDuality()
return BanditDuality(ContinuousConicDuality(), StrengthenedConicDuality())
end
function _choose_best_arm(handler::BanditDuality)
_, index = findmax(
map(handler.arms) do arm
return Statistics.mean(arm.rewards) + Statistics.std(arm.rewards)
end,
)
handler.last_arm_index = index
return handler.arms[index]
end
function _update_rewards(handler::BanditDuality, log::Vector{Log})
# The bound is monotonic, so instead of worring about whether we are
# maximizing or minimizing, let's just compute:
# |bound_t - bound_{t-1}|
# reward = -----------------------
# time_t - time_{t-1}
t, t′ = log[end], log[end-1]
reward = abs(t.bound - t′.bound) / (t.time - t′.time)
# This check is needed because we should probably keep using the first
# handler until we start to improve the bound. This can take quite a few
# iterations in some models. (Until we start to improve, the reward will be
# zero, so we'd never revisit it.
const_bound = isapprox(log[1].bound, log[end].bound; atol = 1e-6)
# To start with, we should add the reward to all arms to construct a prior
# distribution for the arms. The 10 is somewhat arbitrary.
if length(log) < 10 || const_bound
for arm in handler.arms
push!(arm.rewards, reward)
end
else
push!(handler.arms[handler.last_arm_index].rewards, reward)
end
return
end

We also do a form of the tabu list/L2-distance for cyclical graphs where we terminate after each cycle:

# ===== Begin: starting state for infinite horizon =====
starting_states = options.starting_states[node_index]
if length(starting_states) > 0
# There is at least one other possible starting state. If our
# incoming state is more than δ away from the other states, add it
# as a possible starting state.
if distance(starting_states, incoming_state_value) >
options.cycle_discretization_delta
push!(starting_states, incoming_state_value)
end
# TODO(odow):
# - A better way of randomly sampling a starting state.
# - Is is bad that we splice! here instead of just sampling? For
# convergence it is probably bad, since our list of possible
# starting states keeps changing, but from a computational
# perspective, we don't want to keep a list of discretized points
# in the state-space δ distance apart...
incoming_state_value =
splice!(starting_states, rand(1:length(starting_states)))

The closest thing is Regan's problem-child algorithm, but this selects only the realization of the random variable that leads to be the greatest learning: https://optimization-online.org/wp-content/uploads/2018/02/6449.pdf

@odow
Copy link
Owner

odow commented Oct 14, 2024

A few other comments on this:

  • I don't have plans to implement anything similar to this
  • A key feature of SDDP is precisely that it has poor approximations of the value function "off chain"
  • Making an accurate approximation only at the points that we need to visit is how we reduce the curse of dimensionality
  • If your out-of-sample forward model is very different to the forward model used in training, then you need to improve the forward model used during training. Note that the training forward pass does not need to be convex, use the same dynamics as the backward pass, etc.
  • A learner that picked the UCB for the "biggest improvement" would likely sample lots of unlikely places in the state space.
  • Regan's algorithm is nice in theory, but I don't know if anyone has an efficient implementation of it.

@odow
Copy link
Owner

odow commented Jan 13, 2025

I'm going to close this issue because I don't have plans to implement it. A pretty fundamental design decision of SDDP is that we purposefully don't do well "off-chain". The suggestion would be to use SDDP in some sort of rolling horizon scheme where you re-solve (or re-train) the policy as you step forward in time and realize different values.

If anyone does implement it and wants to contribute it back, please open a pull request.

@odow odow closed this as not planned Won't fix, can't repro, duplicate, stale Jan 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants