- RL typically relies on a carefully engineered reward function that is extrinsic to the agent
- How to model curiosity: a type of intrinsic reward function which uses prediction error as a reward signal
- First large-scale study of curiosity-driven learning, without any extrinsic rewards
- Examples of intrinsic rewards include "curiosity" or "visitation counts" (discouraging the agent from revisiting the same state too many times)
- Idea is that intrinsic rewards will bridge the gaps between sparse extrinsic rewards
- How to deal with scenarios with no extrinsic reward at all?
- Like babies: employ goal-less exploration to learn skills that will be useful later in life
- Minecraft
- Visiting your local zoo
- Tested across 54 environments: video games, physics-engine simulations, and 3D navigation tasks
- Predicting the future state in a high-dimensional raw observation space is a challenging problem
- Learning dynamics in an auxiliary space leads to improved results
- Notation: observations x_t, actions a_t, next state x_{t+1}, rewards r_t
- We want to provide a reward based on how informative the transition was
- Exploration bonus:
- Embed observations into representations
- Forward dynamics network to predict representation of the next state
- Exploration reward defined as: r_t = -log p(φ(x_{t+1}) | x_t, a_t)
- Also called the surprisal
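A minimal sketch of this bonus in PyTorch (module and variable names are illustrative, not from the paper's code): embed both observations, predict the next embedding with a forward-dynamics network, and use the prediction error as the intrinsic reward. Assuming a fixed-variance Gaussian prediction model, -log p(φ(x_{t+1}) | x_t, a_t) reduces to a scaled squared error in feature space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForwardDynamics(nn.Module):
    """Predicts phi(x_{t+1}) from phi(x_t) and a one-hot encoding of a_t."""
    def __init__(self, feat_dim: int, n_actions: int, hidden: int = 256):
        super().__init__()
        self.n_actions = n_actions
        self.net = nn.Sequential(
            nn.Linear(feat_dim + n_actions, hidden),
            nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, phi_t, a_t):
        # a_t: LongTensor of action indices
        a_onehot = F.one_hot(a_t, self.n_actions).float()
        return self.net(torch.cat([phi_t, a_onehot], dim=-1))

def curiosity_bonus(embed, fwd, x_t, a_t, x_tp1):
    """r_t = -log p(phi(x_{t+1}) | x_t, a_t); under a fixed-variance Gaussian
    this is the squared prediction error in feature space (up to a constant)."""
    with torch.no_grad():
        phi_t = embed(x_t)
        phi_tp1 = embed(x_tp1)
        pred = fwd(phi_t, a_t)
        return 0.5 * (pred - phi_tp1).pow(2).mean(dim=-1)  # one bonus per transition
```

The forward model itself is trained (without the no_grad guard) to minimize the same prediction error on collected transitions; `embed` stands for whichever feature space is chosen below (pixels, random features, VAE, or IDF).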
- Feature spaces for forward dynamics:
- Should be compact, low dimensional, filter irrelevant parts of the observation space
- Sufficient: contain all of the important information
- Stable: non-stationary rewards make learning difficult. Because the forward-dynamics model evolves over time, this reward is inherently non-stationary, but we try to minimize that instability
- Efficacy of a few candidate feature spaces:
- Pixels: sufficient and stable, but too high-dimensional
- Random features: take the embedding network and freeze it after random initialization; the features are compact and stable, but may not be sufficient
- Variational Autoencoders (VAE): an inference network q(z|x) gives a low-dimensional latent code, but it may also encode irrelevant details (noise) of the observations
- Inverse Dynamics Features (IDF): given a transition (x_t, x_{t+1}), the inverse-dynamics task is to predict the action a_t from the two states; the resulting features might not be sufficient
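A rough sketch of two of these feature spaces (PyTorch; the CNN architecture is an assumption in the style of the standard Atari encoder, not taken from the paper): the same encoder either trained through an inverse-dynamics head (IDF) or frozen after random initialization (random features).

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps stacked 84x84 grayscale frames to a compact feature vector phi(x)."""
    def __init__(self, feat_dim: int = 512, in_channels: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, feat_dim),  # 7x7 assumes 84x84 inputs
        )

    def forward(self, x):
        return self.net(x)

class InverseDynamics(nn.Module):
    """IDF: predict a_t from (phi(x_t), phi(x_{t+1})).
    Training this classifier is what shapes the feature space."""
    def __init__(self, encoder: Encoder, feat_dim: int, n_actions: int):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, x_t, x_tp1):
        phi = torch.cat([self.encoder(x_t), self.encoder(x_tp1)], dim=-1)
        return self.head(phi)  # action logits, trained with cross-entropy

def random_feature_encoder(feat_dim: int = 512) -> Encoder:
    """Random features: the same encoder, frozen after random initialization."""
    enc = Encoder(feat_dim)
    for p in enc.parameters():
        p.requires_grad_(False)
    return enc
```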
- PPO for learning
- Reward normalization: divide the rewards by a running estimate of the standard deviation of the sum of discounted rewards (see the sketch after this list)
- Advantage normalization
- Observation normalization
- More actors: running more actors (parallel copies of the environment) to collect experience stabilizes training
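A minimal sketch of the reward-normalization trick in plain Python (class and method names are illustrative): maintain a running discounted return, estimate its standard deviation online, and divide each reward by that estimate so the scale of the non-stationary intrinsic bonus stays roughly constant.

```python
class RewardNormalizer:
    """Divide rewards by a running estimate of the std of the discounted return."""
    def __init__(self, gamma: float = 0.99, eps: float = 1e-8):
        self.gamma = gamma
        self.eps = eps
        self.ret = 0.0    # running discounted return
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0     # sum of squared deviations (Welford's algorithm)

    def normalize(self, reward: float) -> float:
        self.ret = self.gamma * self.ret + reward
        # Welford update of the variance of the discounted return
        self.count += 1
        delta = self.ret - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (self.ret - self.mean)
        if self.count < 2:
            return reward  # not enough data for a variance estimate yet
        std = (self.m2 / (self.count - 1)) ** 0.5
        return reward / (std + self.eps)
```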
- In a game environment, a game over or 'done' signal can leak too much info about the true reward function
- If this signal is not removed, the simple strategy of just keeping the agent alive (+1 for every timestep the agent is alive, -1 on death) is enough to produce a good-looking policy
- Death should become another transition to the agent - to be avoided only if it's boring
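One way to hide the 'done' signal is a small environment wrapper along these lines (written for the classic Gym step API; the wrapper is a sketch, not the paper's implementation):

```python
import gym

class NonEpisodicWrapper(gym.Wrapper):
    """Hides the 'done' signal from the agent: when the underlying episode
    ends, the environment resets internally and the transition is reported
    as a normal, non-terminal one."""
    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if done:
            # Reset silently; the agent never observes the terminal flag.
            obs = self.env.reset()
        return obs, reward, False, info
```

Equivalently, the flag can be kept for logging but ignored when bootstrapping the value function, which has the same effect: death becomes just another transition.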
- Both the policy and the embedding network work directly from pixels
- Curiosity-driven learning without extrinsic rewards
- In Atari games, we measured the extrinsic reward the agent achieved (used only for evaluation, never for training)
- With intrinsic rewards alone, the extrinsic evaluation score also went up, in some cases reaching scores comparable to training directly on extrinsic rewards
- Might be because game designers are good at guiding the players to finish the game:
- e.g: in Breakout, the more bricks you destroy, the more interesting the patterns become, making the bot curious just to see what is new and what is not new
- To test this, we pre-train the agent using curiosity only on Super Mario Bros level 1-1
- Transferring from 1-1 to 1-2 generalizes reasonably well; however, 1-2's darker background makes generalization harder than staying within the source level
- Random features (RF) seem to transfer more weakly than IDF-learned features
- Curiosity also helps when extrinsic rewards are very sparse, e.g. a maze where the only reward comes from finding the exit
- Trained on a maze in Unity 3D with 9 rooms