
Eventually: add flexibility for the exploration criterion #14

Open
zsunberg opened this issue Apr 9, 2016 · 4 comments


zsunberg commented Apr 9, 2016

Right now, the UCB exploration criterion is hard-coded into the solver. We should eventually make this flexible.
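For context, the hard-coded criterion is standard UCB1: each child action is scored by its value estimate plus an exploration bonus scaled by the `c` parameter. A minimal sketch (function name and toy numbers are illustrative, not the solver's actual code):

```julia
# UCB1 score for a child action node with value estimate q and visit count n,
# under a state node with total visit count N and exploration constant c.
# Unvisited actions get an infinite score so they are always tried first.
ucb_score(q, n, N, c) = n == 0 ? Inf : q + c * sqrt(log(N) / n)

# Example: three candidate actions under a state node with N = 30 visits.
scores = [ucb_score(0.5, 10, 30, 1.0),  # well-visited, moderate value
          ucb_score(0.7, 2, 30, 1.0),   # rarely visited, high value
          ucb_score(0.0, 0, 30, 1.0)]   # never visited => Inf
argmax(scores)  # the unvisited action wins
```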

@zsunberg (Member Author)

This should be done similarly to how it is done in POMCPOW: https://github.com/JuliaPOMDP/POMCPOW.jl/blob/master/src/criteria.jl

The interface may need some thought, e.g. should select_best return an action or a node index?

@zsunberg (Member Author)

Also, it will be annoying to deprecate the c keyword argument :(


rcnlee commented Aug 8, 2018

I'm interested in this for DPW / continuous actions; I need it for my research. For me, the criterion should look as much like a continuous bandit problem as possible. In the best case, it could be tested independently as a bandit (or even live in a separate bandit package).

This might mean the interface should pass s, a, and r. It might also need two functions: one for selecting an action and another for updating after observing r. Let me know what you think on the design side, and I can implement it. It would be nice to have something shareable and generic. I need it soon, though, so I might have to hack something together and clean it up later.


zsunberg commented Aug 9, 2018

I think select_best(criterion, snode::StateNode, rng) should be adequate for most cases (similar to how it is done in POMCPOW.jl, linked above). It should return the state-action node to try (we need to make a state-action node type for DPW). The criterion object is passed by the user to the solver, so it can be arbitrarily customized. Whoever implements select_best will have access to Q, N, etc. for each child action node.

But it sounds like you want to do something different: use r instead of Q for your bandit. You could, of course, call generate_sr as much as you want within select_best, but it sounds like you want to get r from the simulations MCTS has already performed. That might mean we need something like an update!(criterion, snode, anode, r, spnode) that is called at line 166 in dpw.jl.
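The interface discussed above might look something like the following. This is only a sketch: the type names, accessor fields, and function signatures are placeholders for discussion, not the actual MCTS.jl types.

```julia
# Hypothetical pluggable exploration criterion interface (names are not final).
abstract type ExplorationCriterion end

struct MaxUCB <: ExplorationCriterion
    c::Float64   # exploration constant, replacing the solver's `c` keyword
end

# Toy stand-in for a state-action node; the real DPW tree would store these fields.
struct SANode
    q::Float64   # value estimate Q(s, a)
    n::Int       # visit count N(s, a)
end

# select_best returns the index of the state-action node to try next.
function select_best(crit::MaxUCB, sanodes::Vector{SANode}, total_n::Int)
    scores = [sa.n == 0 ? Inf : sa.q + crit.c * sqrt(log(total_n) / sa.n)
              for sa in sanodes]
    return argmax(scores)
end

# Optional hook for bandit-style criteria that want the immediate reward r
# after each simulation step; the default is a no-op, so criteria that only
# use Q and N (like MaxUCB) can ignore it.
update!(crit::ExplorationCriterion, sanode, r) = nothing
```

With this shape, a user-defined bandit criterion would subtype ExplorationCriterion, implement select_best, and override update! only if it needs per-simulation rewards:

```julia
select_best(MaxUCB(1.0), [SANode(0.5, 10), SANode(0.0, 0)], 10)  # picks the unvisited node, index 2
```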

I am a little hesitant to add the update! function because it seems pretty esoteric. Doesn't it make more sense to use Q instead of r for the bandit? That was the whole idea behind UCB in the first place.
