Rewards and Auxiliary Tasks Discussion #85
@JacobHanouna,
IMO no, it hasn't: unlike 'maze exploration'-like tasks, where it is common for the agent to wander for a long time without any hint until the 'prize' is found and a reward is obtained, a 'trade' agent with an open position has PnL dynamics at hand, making reward shaping natural and efficient;
Value Replay - definitely yes; this task can be seen as a sort of unbiased off-policy training and is domain-agnostic; my experiments with the other two didn't bring any noticeable improvement;
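As a rough illustration of what a Value Replay auxiliary loss amounts to, here is a minimal sketch; `value_fn` and `replay_buffer.sample` are hypothetical interfaces used only for illustration, not btgym's or UNREAL's actual code:

```python
import numpy as np

def value_replay_loss(value_fn, replay_buffer, batch_size=32, gamma=0.99):
    """Extra value-function regression on sequences resampled from a replay buffer.

    value_fn(state) -> scalar value estimate (hypothetical callable);
    replay_buffer.sample(n) -> iterable of (states, rewards, bootstrap_value) sequences.
    """
    losses = []
    for states, rewards, bootstrap_value in replay_buffer.sample(batch_size):
        # Compute n-step discounted returns backwards from the bootstrap value.
        returns, R = [], bootstrap_value
        for r in reversed(rewards):
            R = r + gamma * R
            returns.append(R)
        returns.reverse()
        predicted = np.array([value_fn(s) for s in states])
        # Mean squared error between predicted values and replayed returns.
        losses.append(np.mean((predicted - np.array(returns)) ** 2))
    return float(np.mean(losses))
```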
Ok, thanks. I found the paper Learning by Playing – Solving Sparse Reward Tasks from Scratch, which has an interesting mechanism that adds auxiliary reward tasks to give the agent small achievements along the way. From your answer it is probably not necessary, but it is still an interesting read :)
@Kismuz, on #23 you replied:
Did you apply any form of solution for the 'spoils' at the end of the episode scenario? Even without the end-of-episode problem, I was wondering whether the agent has a different perspective on its expected rewards as it progresses through the episode, and/or even during the 'time embedding period'. Trying to imagine things from the agent's perspective: at the start of the window it can take actions that later turn into profit/loss and become observable for the agent to learn from, but toward the end it never gets to see the consequences of its actions. The question is, does this affect its learning? Something like starting with behavior A at the beginning and then drifting to behavior B at the end?
@JacobHanouna,
Yes, in the sense that the logic now is to force-close positions two steps ahead of the end of data, ignoring the agent's actions. Those two steps are meant to let the agent observe the reward signal for that last closed position.
Changes in the reward function made a month or so ago significantly reduced that reward vs. final-value gap. One can correctly think of the agent as trying to maximize not the entire episode's PnL but the PnL of every single deal opened. My attempts to bind agent performance to episode-level results didn't yield any positive results (in that formulation the reward is really sparse).
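A minimal sketch of the kind of end-of-data handling described above; the function and constant names are assumptions for illustration, not btgym's actual implementation:

```python
FORCE_CLOSE_MARGIN = 2  # steps before the end of data at which positions are closed

def maybe_force_close(action, current_step, data_length, close_action):
    """Within the last FORCE_CLOSE_MARGIN steps the agent's action is ignored and
    the position is closed, so the closing PnL still arrives as a reward signal."""
    steps_left = data_length - current_step
    if steps_left <= FORCE_CLOSE_MARGIN:
        return close_action
    return action
```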
@Kismuz, I'm revisiting a few of the key elements of the framework, and my current focus is on the reward function. The issue I'm tackling now is the model's 'sensitivity' to different levels of leverage. The reason I use leverage on the simple 'sin' data is that the difference between the bottom and top values is very small (which fits well for simulating forex data), and the current reward function seems to need a substantial difference for the model to learn the sine pattern. Leverage in this case effectively amplifies the sine pattern so the model can learn its shape (just to complete the argument: with no leverage my models fail to find a good trading policy). From a human perspective, trading on a sine is trivial and the decisions are independent of the sine amplitude. Any thoughts on the subject?
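To make the amplitude argument concrete, here is a small, purely illustrative calculation; all numbers (amplitude, stake, leverage values) are assumptions, not taken from the actual experiments:

```python
import numpy as np

# Small-amplitude 'forex-like' sine price series (numbers are illustrative only).
price = 1.0 + 0.001 * np.sin(np.linspace(0, 4 * np.pi, 200))
peak_to_trough = price.max() - price.min()      # about 0.002 in price units

stake = 100.0                                   # hypothetical position size
for leverage in (1, 10, 100):
    best_case_pnl = peak_to_trough * stake * leverage
    print(f"leverage {leverage:>3}: best-case per-trade PnL ~ {best_case_pnl:.2f}")
# Without leverage the best-case per-trade PnL (and hence the reward) stays around 0.2,
# which may be too small for stable gradients unless rewards are rescaled instead.
```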
@JacobHanouna ,
reward scaling is a far better alternative to modifying environment properties in an attempt to achieve proper gradient amplitudes; see:
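As an illustration of the idea only (not the material the 'see:' above refers to), a minimal sketch of reward scaling as an environment wrapper; the class and the `scale` parameter are assumptions, not btgym's actual API:

```python
class ScaledRewardEnv:
    """Rescale rewards instead of changing leverage or other environment properties.

    `env` is any gym-style environment; `scale` plays the role of the
    reward-scaling parameter discussed above (names assumed).
    """

    def __init__(self, env, scale=10.0):
        self.env = env
        self.scale = scale

    def reset(self, **kwargs):
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Only the reward magnitude changes; environment dynamics stay untouched.
        return obs, reward * self.scale, done, info
```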
@Kismuz ,
Until now I didn't pay much attention to the 'reward scaling' parameter, but because my models are based on your example models, it seems they have all been working with the 'reward scaling' parameter in addition to leverage. So, to test the effect 'reward scaling' has on my models, I ran a few test experiments.
The results were interesting:
In many RL research fields, 'hard exploration' is a big problem: the agent needs to take many steps before it sees a reward, which in turn cripples its ability to learn efficiently. One of the ways researchers in other fields tackle this problem (apart from developing a good reward-shaping function) is by introducing various auxiliary tasks that help the agent achieve the main goal (as in the case of UNREAL).
@Kismuz, I have a couple of questions/thoughts that maybe, from your experience and knowledge, you can answer.
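For reference, one of the UNREAL auxiliary tasks mentioned above (reward prediction) can be sketched roughly as follows; the replay-buffer sampler and transition fields here are hypothetical, not UNREAL's or btgym's actual code:

```python
import numpy as np

def reward_prediction_batch(replay_buffer, seq_len=3, batch_size=32):
    """Build targets for UNREAL-style reward prediction: given a short sequence of
    past observations, classify the sign of the next reward (zero/positive/negative).

    replay_buffer.sample_skewed(batch_size, length) is a hypothetical sampler that
    over-represents non-zero-reward transitions, as suggested in the UNREAL paper;
    each sampled item exposes .observation and .reward.
    """
    obs_seqs, labels = [], []
    for transitions in replay_buffer.sample_skewed(batch_size, seq_len + 1):
        frames = [t.observation for t in transitions[:seq_len]]
        next_reward = transitions[seq_len].reward
        # Class encoding: 0 = zero reward, 1 = positive, 2 = negative.
        label = 0 if next_reward == 0 else (1 if next_reward > 0 else 2)
        obs_seqs.append(np.stack(frames))
        labels.append(label)
    return np.stack(obs_seqs), np.array(labels)
```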