-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Principled forward model for uniformly converged cost to go #766
Comments
You are entirely correct that the forward pass does not need to "optimize" anything and that we can select the trial points somewhat arbitrarily. I don't have any suggestions for what the right approach is, and I'm not aware of any literature on this, but it shouldn't be too hard to implement as a new forward pass: We do a form of upper confidence bound learning in the duality handlers: SDDP.jl/src/plugins/duality_handlers.jl Lines 351 to 388 in f5d24b7
We also do a form of the tabu list/L2-distance for cyclical graphs where we terminate after each cycle: SDDP.jl/src/plugins/forward_passes.jl Lines 76 to 94 in f5d24b7
The closest thing is Regan's problem-child algorithm, but this selects only the realization of the random variable that leads to be the greatest learning: https://optimization-online.org/wp-content/uploads/2018/02/6449.pdf |
A few other comments on this:
|
I'm going to close this issue because I don't have plans to implement it. A pretty fundamental design decision of SDDP is that we purposefully don't do well "off-chain". The suggestion would be to use SDDP in some sort of rolling horizon scheme where you re-solve (or re-train) the policy as you step forward in time and realize different values. If anyone does implement it and wants to contribute it back, please open a pull request. |
Hi,
I am interested in understanding the cost to go function across a broad range of different states, not just on chain.
With the default forward pass we end up only exploring the state space in a restricted domain which is visitied when the system is controlled close to optimally and the model can perform poorly "off-chain".
For example I have an energy system with a number of stores and I want to understand the marginal value of additional storage in different states on the converged model, so uniformly converging the store state components of the CTG function is important.
One option would be to uniformly randomly sample from the store state in the forward pass, but I'm curious if there is a more principled way to select the next set of points to explore (forward path) in order to maximise the expected learning.
At first I thought about if bayesian optimisation or similar could be used to sample from the state space like in hyper parameter optimisation however the big difference here is that the "black box" function we want to minimise is something like a combination of expected optimal policy improvement and the expected off policy CTG error improvement which is continually changing.
Once we have visited an off-chain point, it seems reasonable to assume that the expected off policy performance improvement from visiting that point again will be significantly reduced since we have corrected the first order errors. So a reasonable approximation of expected off policy performance improvement would be the time since we evaluated near that point.
One way to do this would be to have a modified forward pass which maintains a tabu list and includes the L1 distance from each of these points as a form of regularisation.
Does this make sense? Is there a better way of approaching this problem?
The text was updated successfully, but these errors were encountered: