Rewards and Auxiliary Tasks Discussion #85
@JacobHanouna,
IMO no, it hasn't: unlike 'maze exploration'-like tasks, where it is common for the agent to wander for a long time without any hint until the 'prize' is found and a reward is obtained, a 'trade' agent with an open position has PnL dynamics at hand, making reward shaping natural and efficient;
Value Replay - definitely yes; this task can be seen as a sort of unbiased off-policy training and is domain-agnostic; my experiments with the other two didn't bring any noticeable improvement;
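As a rough illustration of what a Value Replay auxiliary loss amounts to, here is a minimal sketch; `value_fn` and `replay_buffer.sample` are hypothetical interfaces used only for illustration, not btgym's or UNREAL's actual code:

```python
import numpy as np

def value_replay_loss(value_fn, replay_buffer, batch_size=32, gamma=0.99):
    """Extra value-function regression on sequences resampled from a replay buffer.

    value_fn(state) -> scalar value estimate (hypothetical callable);
    replay_buffer.sample(n) -> iterable of (states, rewards, bootstrap_value) sequences.
    """
    losses = []
    for states, rewards, bootstrap_value in replay_buffer.sample(batch_size):
        # Compute n-step discounted returns backwards from the bootstrap value.
        returns, R = [], bootstrap_value
        for r in reversed(rewards):
            R = r + gamma * R
            returns.append(R)
        returns.reverse()
        predicted = np.array([value_fn(s) for s in states])
        # Mean squared error between predicted values and replayed returns.
        losses.append(np.mean((predicted - np.array(returns)) ** 2))
    return float(np.mean(losses))
```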
Ok, thanks. I found the paper Learning by Playing – Solving Sparse Reward Tasks from Scratch, which has an interesting mechanism that adds auxiliary reward tasks to give the agent small achievements along the way. From your answer it is probably not necessary, but it is still an interesting read :)
@Kismuz, on #23 you replied:
Did you apply any form of solution for the 'spoils' at the end of the episode scenario? Even without the end-of-episode problem, I was wondering whether the agent has a different perspective on its expected rewards as it progresses through the episode, and/or even during the 'time embedding period'. Trying to imagine things from the agent's perspective: at the start of the window it can take actions that later turn into profit/loss and become observable for the agent to learn from, but toward the end it never gets to see the consequences of its actions. The question is, does this affect its learning? Something like starting with behavior A at the beginning and then drifting to behavior B at the end?
@JacobHanouna,
Yes, in the sense that the logic now is to force-close positions two steps ahead of the end of data, ignoring the agent's actions. Those two steps are meant to let the agent observe the reward signal for that last closed position.
Changes in the reward function made a month or so ago significantly reduced that reward vs. final-value gap. One can correctly think of the agent as trying to maximize not the entire episode's PnL but the PnL of every single deal opened. My attempts to bind agent performance to episode-level results didn't yield any positive results (in that formulation the reward is really sparse).
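A minimal sketch of the kind of end-of-data handling described above; the function and constant names are assumptions for illustration, not btgym's actual implementation:

```python
FORCE_CLOSE_MARGIN = 2  # steps before the end of data at which positions are closed

def maybe_force_close(action, current_step, data_length, close_action):
    """Within the last FORCE_CLOSE_MARGIN steps the agent's action is ignored and
    the position is closed, so the closing PnL still arrives as a reward signal."""
    steps_left = data_length - current_step
    if steps_left <= FORCE_CLOSE_MARGIN:
        return close_action
    return action
```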
@Kismuz, I'm revisiting a few of the key elements of the framework, and my current focus is on the reward function. The issue I'm tackling now is the model's 'sensitivity' to different levels of leverage. The reason I use leverage on the simple 'sin' data is that the difference between the bottom and top values is very small (which fits well for simulating forex data), and the current reward function seems to need a substantial difference for the model to learn the sine pattern. Leverage in this case effectively amplifies the sine pattern so the model can learn its shape (just to complete the argument: with no leverage my models fail to find a good trading policy). From a human perspective, trading on a sine is trivial and the decisions are independent of the sine amplitude. Any thoughts on the subject?
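To make the amplitude argument concrete, here is a small, purely illustrative calculation; all numbers (amplitude, stake, leverage values) are assumptions, not taken from the actual experiments:

```python
import numpy as np

# Small-amplitude 'forex-like' sine price series (numbers are illustrative only).
price = 1.0 + 0.001 * np.sin(np.linspace(0, 4 * np.pi, 200))
peak_to_trough = price.max() - price.min()      # about 0.002 in price units

stake = 100.0                                   # hypothetical position size
for leverage in (1, 10, 100):
    best_case_pnl = peak_to_trough * stake * leverage
    print(f"leverage {leverage:>3}: best-case per-trade PnL ~ {best_case_pnl:.2f}")
# Without leverage the best-case per-trade PnL (and hence the reward) stays around 0.2,
# which may be too small for stable gradients unless rewards are rescaled instead.
```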
@JacobHanouna ,
reward scaling is a far better alternative to modifying environment properties in an attempt to achieve proper gradient amplitudes; see:
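As an illustration of the idea only (not the material the 'see:' above refers to), a minimal sketch of reward scaling as an environment wrapper; the class and the `scale` parameter are assumptions, not btgym's actual API:

```python
class ScaledRewardEnv:
    """Rescale rewards instead of changing leverage or other environment properties.

    `env` is any gym-style environment; `scale` plays the role of the
    reward-scaling parameter discussed above (names assumed).
    """

    def __init__(self, env, scale=10.0):
        self.env = env
        self.scale = scale

    def reset(self, **kwargs):
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Only the reward magnitude changes; environment dynamics stay untouched.
        return obs, reward * self.scale, done, info
```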
@Kismuz ,
Until now I didn't pay much attention to the 'reward scaling' parameter, but because my models are based on your example models, it seems they have all been working with the 'reward scaling' parameter in addition to leverage. So, to test the effect 'reward scaling' has on my models, I ran a few test experiments.
The results were interesting:
In many RL research fields, 'hard exploration' is a big problem: the agent needs to take many steps before it sees a reward, which in turn cripples its ability to learn efficiently. One of the ways researchers in other fields tackle this problem (apart from developing a good reward-shaping function) is by introducing various auxiliary tasks that help the agent achieve the main goal (as in the case of UNREAL).
@Kismuz, I have a couple of questions/thoughts that maybe, from your experience and knowledge, you can answer.
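For reference, one of the UNREAL auxiliary tasks mentioned above (reward prediction) can be sketched roughly as follows; the replay-buffer sampler and transition fields here are hypothetical, not UNREAL's or btgym's actual code:

```python
import numpy as np

def reward_prediction_batch(replay_buffer, seq_len=3, batch_size=32):
    """Build targets for UNREAL-style reward prediction: given a short sequence of
    past observations, classify the sign of the next reward (zero/positive/negative).

    replay_buffer.sample_skewed(batch_size, length) is a hypothetical sampler that
    over-represents non-zero-reward transitions, as suggested in the UNREAL paper;
    each sampled item exposes .observation and .reward.
    """
    obs_seqs, labels = [], []
    for transitions in replay_buffer.sample_skewed(batch_size, seq_len + 1):
        frames = [t.observation for t in transitions[:seq_len]]
        next_reward = transitions[seq_len].reward
        # Class encoding: 0 = zero reward, 1 = positive, 2 = negative.
        label = 0 if next_reward == 0 else (1 if next_reward > 0 else 2)
        obs_seqs.append(np.stack(frames))
        labels.append(label)
    return np.stack(obs_seqs), np.array(labels)
```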