
Eventually: add flexibility for the exploration criterion #14

Open
zsunberg opened this issue Apr 9, 2016 · 4 comments


zsunberg commented Apr 9, 2016

Right now, the UCB exploration criterion is hard-coded into the solver. We should eventually make this flexible.
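For context, the hard-coded criterion is standard UCB1: each child action is scored by its value estimate plus an exploration bonus scaled by the `c` parameter. A minimal sketch (function name and toy numbers are illustrative, not the solver's actual code):

```julia
# UCB1 score for a child action node with value estimate q and visit count n,
# under a state node with total visit count N and exploration constant c.
# Unvisited actions get an infinite score so they are always tried first.
ucb_score(q, n, N, c) = n == 0 ? Inf : q + c * sqrt(log(N) / n)

# Example: three candidate actions under a state node with N = 30 visits.
scores = [ucb_score(0.5, 10, 30, 1.0),  # well-visited, moderate value
          ucb_score(0.7, 2, 30, 1.0),   # rarely visited, high value
          ucb_score(0.0, 0, 30, 1.0)]   # never visited => Inf
argmax(scores)  # the unvisited action wins
```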

@zsunberg (Member Author)

This should be done similarly to how it is done in POMCPOW: https://github.com/JuliaPOMDP/POMCPOW.jl/blob/master/src/criteria.jl

The interface may need some thought, e.g. should select_best return an action or a node index?

@zsunberg (Member Author)

Also, it will be annoying to deprecate the c keyword argument :(


rcnlee commented Aug 8, 2018

I'm interested in this for DPW / continuous actions; I need it for my research. For me, the criterion should look as much like a continuous bandit problem as possible. In the best case, it could be tested independently as a bandit (or even live in a separate bandit package).

This might mean the interface should pass s, a, and r. It might also need two functions: one for selecting an action and another for updating after observing r. Let me know what you think on the design side, and I can implement it. It would be nice to have something shareable and generic. I need it soon, though, so I might have to hack something together and clean it up later.


zsunberg commented Aug 9, 2018

I think select_best(criterion, snode::StateNode, rng) should be adequate for most cases (similar to how it is done in POMCPOW.jl, linked above). It should return the state-action node to try (we need to make a state-action node type for DPW). The criterion object is passed by the user to the solver, so it can be arbitrarily customized. Whoever implements select_best will have access to Q, N, etc. for each child action node.

But it sounds like you want to do something different: use r instead of Q for your bandit. You could, of course, call generate_sr as much as you want within select_best, but it sounds like you want to get r from the simulations MCTS has already performed. That might mean we need something like an update!(criterion, snode, anode, r, spnode) that is called at line 166 in dpw.jl.
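The interface discussed above might look something like the following. This is only a sketch: the type names, accessor fields, and function signatures are placeholders for discussion, not the actual MCTS.jl types.

```julia
# Hypothetical pluggable exploration criterion interface (names are not final).
abstract type ExplorationCriterion end

struct MaxUCB <: ExplorationCriterion
    c::Float64   # exploration constant, replacing the solver's `c` keyword
end

# Toy stand-in for a state-action node; the real DPW tree would store these fields.
struct SANode
    q::Float64   # value estimate Q(s, a)
    n::Int       # visit count N(s, a)
end

# select_best returns the index of the state-action node to try next.
function select_best(crit::MaxUCB, sanodes::Vector{SANode}, total_n::Int)
    scores = [sa.n == 0 ? Inf : sa.q + crit.c * sqrt(log(total_n) / sa.n)
              for sa in sanodes]
    return argmax(scores)
end

# Optional hook for bandit-style criteria that want the immediate reward r
# after each simulation step; the default is a no-op, so criteria that only
# use Q and N (like MaxUCB) can ignore it.
update!(crit::ExplorationCriterion, sanode, r) = nothing
```

With this shape, a user-defined bandit criterion would subtype ExplorationCriterion, implement select_best, and override update! only if it needs per-simulation rewards:

```julia
select_best(MaxUCB(1.0), [SANode(0.5, 10), SANode(0.0, 0)], 10)  # picks the unvisited node, index 2
```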

I am a little hesitant to add the update! function because it seems pretty esoteric. Doesn't it make more sense to use Q instead of r for the bandit? That was the whole idea behind UCB in the first place.
