Reinforcement Learning

Parameter Space Noise for Exploration [arXiv 2017]

Evolution Strategies as a Scalable Alternative to Reinforcement Learning [arXiv 2017]

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks [ICML 2017]

  • The objective function is the loss after one task-specific gradient step; the initial weights are what gets optimized (objective written out below)
  • Applicable to all models using gradient descent: experiments include sinusoid regression, one-shot image classification, and fast adaptation in RL environments
  • The (meta-)learned initial weights are highly sensitive to task-specific gradients, so a few adaptation steps suffice for a new task
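
A minimal form of that meta-objective, assuming a single inner gradient step (alpha is the inner-loop step size, L_{T_i} the loss of task T_i sampled from the task distribution p(T)):

```latex
% MAML meta-objective with one inner step: the initialization \theta is
% optimized so that the loss *after* a task-specific update is small.
\min_{\theta} \; \sum_{\mathcal{T}_i \sim p(\mathcal{T})}
  \mathcal{L}_{\mathcal{T}_i}\!\bigl(\theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(\theta)\bigr)
```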

Neural Episodic Control [ICML 2017]

  • Stores (state embedding, Q-value) pairs in a Differentiable Neural Dictionary (DND), one per action; Q-values are estimated by kernel-weighted lookups over nearest neighbours
  • Writing new experiences directly into the memory gives much faster early learning on Atari than purely gradient-based methods such as DQN

Learning to Play in a Day: Faster Deep Reinforcement Learning by Optimality Tightening [ICLR 2017]

  • Adds n-step optimality bounds to the Q-learning loss (lower bound sketched below)
  • Results in faster information propagation between different Q-values
  • Learns to play Atari games in ~24 hours (~10x speedup compared to DQN)
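
A rough sketch of the lower bound being added, with indexing as in a standard n-step return; violations of such bounds are penalized with quadratic terms added to the DQN loss, and an analogous upper bound is obtained from past transitions:

```latex
% Any partial rollout plus a bootstrapped tail lower-bounds the optimal
% Q-value, which tightens the plain one-step Q-learning target.
L_{t,j} = \sum_{i=0}^{j} \gamma^{i} r_{t+i} + \gamma^{j+1} \max_{a} Q(s_{t+j+1}, a),
\qquad
Q(s_t, a_t) \;\ge\; \max_{j} L_{t,j}
```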

PGQ: Combining policy gradient and Q-learning [ICLR 2017]

  • Q can be estimated from the policy and the value function: add a Q layer on top of an Actor-Critic network and run Q-learning on that layer in addition to the policy gradient update (estimate written out below)
  • Shows the parallel between the Dueling Network and entropy-regularized Actor-Critic
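
The Q estimate in question, roughly in the paper's notation (alpha is the entropy-regularization coefficient, H^pi(s) the policy entropy at s):

```latex
% Q-values recovered from an entropy-regularized policy plus a value
% estimate; PGQ performs Q-learning on this quantity alongside the
% policy gradient update.
\tilde{Q}^{\pi}(s, a) = \alpha \bigl( \log \pi(a \mid s) + H^{\pi}(s) \bigr) + V(s)
```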

Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic [ICLR 2017]

  • Partitions the policy gradient of Actor-Critic into an off-policy part (DPG-style, from the critic) and a residual REINFORCE gradient (see the sketch below)
  • Derived by using the linearization of Q as a control variate
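
A sketch of the resulting estimator (notation approximate; Q_w is the off-policy critic, mu_theta(s) the policy mean, A-hat the Monte Carlo advantage, and A-bar_w its first-order Taylor expansion around a = mu_theta(s), used as the control variate):

```latex
% Analytic DPG-style term from the critic plus a REINFORCE residual on
% whatever the control variate fails to capture.
\nabla_{\theta} J(\theta) \approx
\mathbb{E}_{\pi}\!\left[ \nabla_{\theta} \log \pi_{\theta}(a \mid s)\,
  \bigl( \hat{A}(s, a) - \bar{A}_w(s, a) \bigr) \right]
+ \mathbb{E}\!\left[ \nabla_{a} Q_w(s, a)\big|_{a = \mu_{\theta}(s)}\,
  \nabla_{\theta} \mu_{\theta}(s) \right]
```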

Sample Efficient Actor-Critic with Experience Replay [ICLR 2017]

  • Partitions the policy gradient of Actor-Critic into a stable off-policy part (Retrace) and its on-policy residual
  • Achieves better sample efficiency by using experience replay on the off-policy component

Unifying Count-Based Exploration and Intrinsic Motivation [NIPS 2016]

  • Introduces the 'pseudo-count': an approximation of visit counts derived from an arbitrary density model (formula below)
  • Pseudo-counts let count-based exploration algorithms be applied to MDPs with large or continuous state spaces
  • Explores Montezuma's Revenge effectively (!)
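
The pseudo-count formula, with rho_n(x) the density model's probability of x after n observations and rho'_n(x) the 'recoding' probability after observing x one more time:

```latex
% A density model stands in for a visit counter; the exploration bonus
% is then computed from \hat{N}_n(x) as in count-based methods.
\hat{N}_n(x) = \frac{\rho_n(x)\,\bigl(1 - \rho'_n(x)\bigr)}{\rho'_n(x) - \rho_n(x)}
```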

Mastering the game of Go with deep neural networks and tree search [Nature 2016]

  • Uses RL to play Go
  • Core algorithm is Monte Carlo Tree Search using a trained policy network to get action probabilities
  • Leaf nodes are evaluated using both a fast rollout policy network and a trained value network

Asynchronous Methods for Deep Reinforcement Learning [ICML 2016]

  • Suggests A3C (Asynchronous Advantage Actor-Critic), which is the standard Actor-Critic algorithm run by many actor-learner instances in parallel
  • Surpasses the state of the art in less wall-clock time, using only CPUs

Dueling Network Architectures for Deep Reinforcement Learning [ICML 2016]

  • Two-stream DQN, with one stream representing V (the value function) and the other A (the advantage function); the two are combined as shown below
  • Eliminates the instability of adding two numbers of different scale (V is usually much larger than A)
  • By updating multiple Q values on each observation, effectively updates more frequently than a single-stream DQN
  • Implicitly splits the credit assignment problem into a recursive binary problem of "now or later"
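
The aggregation used to combine the two streams (mean-subtracted advantage, which also resolves the V/A identifiability issue):

```latex
% Combining the value and advantage streams; subtracting the mean
% advantage keeps the decomposition identifiable and the scales matched.
Q(s, a) = V(s) + \Bigl( A(s, a) - \tfrac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') \Bigr)
```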

Deep Exploration via Bootstrapped DQN [NIPS 2016]

  • Bootstraps DQN heads with shared lower layers
  • Results in more consistent ('deep') exploration

Continuous Control with Deep Reinforcement Learning [ICLR 2016]

High-Dimensional Continuous Control Using Generalized Advantage Estimation [ICLR 2016]

  • Derives a class of advantage estimators (GAE), parameterized by two numbers, gamma and lambda, both in [0, 1] (see below)
  • Empirical performance of TRPO+GAE is better than plain TRPO in some tasks
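
The estimator itself, written with the TD residual delta_t; lambda = 0 recovers the one-step TD advantage and lambda = 1 the Monte Carlo return minus a baseline:

```latex
% Exponentially weighted sum of TD residuals; \lambda trades bias
% (small \lambda) against variance (large \lambda).
\hat{A}_t^{\mathrm{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^{l}\, \delta_{t+l},
\qquad
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
```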

Connecting Generative Adversarial Networks and Actor-Critic Methods [arXiv 2016]

  • Constructs an Actor-Critic setup that is equivalent to a GAN: the state is randomly chosen to be either a real image or an image generated by the actor, the action sets every pixel of an image, and the reward is 1 for a real image and 0 for a synthetic one
  • Cross-examines the approaches used to stabilize GANs and AC architectures

Deep Reinforcement Learning with Double Q-Learning [AAAI 2016]

  • Points out that once DQN overestimates a Q-value, the overestimation 'spills over' to the states that precede it
  • Uses different networks for action selection and action evaluation (target shown below)
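
The Double DQN target, with the online network theta selecting the action and the target network theta^- evaluating it:

```latex
% Decoupling selection from evaluation removes the max-operator
% overestimation bias of the standard DQN target.
y_t = r_t + \gamma\, Q_{\theta^{-}}\!\bigl(s_{t+1}, \arg\max_{a} Q_{\theta}(s_{t+1}, a)\bigr)
```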

Action-Conditional Video Prediction using Deep Networks in Atari Games [NIPS 2015]

  • Constructs a NN that predicts the next frame of an Atari game given the current frame and action
  • Exploration using predicted frames marginally increases score

Human-level Control Through Deep Reinforcement Learning [Nature 2015]

  • Proposes DQN: a deep neural network trained with the Bellman backup as its regression target (minimal update sketch below)
  • Uses three tricks for stability: separate prediction/target networks, experience replay, reward clipping
  • Points out that a scheme similar to experience replay happens in the hippocampus of the mammalian brain
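
A minimal sketch of the update with the two tricks that show up directly in code, a separate target network and experience replay (reward clipping would be applied when transitions are stored). Assumes PyTorch; the network size, hyperparameters, and dummy transitions are illustrative only:

```python
import random
from collections import deque

import torch
import torch.nn as nn

def make_q_net(n_obs, n_actions):
    return nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))

n_obs, n_actions, gamma = 4, 2, 0.99
q_net = make_q_net(n_obs, n_actions)
target_net = make_q_net(n_obs, n_actions)
target_net.load_state_dict(q_net.state_dict())    # periodically re-synced copy
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)
replay = deque(maxlen=100_000)                     # experience replay buffer

def train_step(batch_size=32):
    s, a, r, s2, done = map(torch.stack, zip(*random.sample(replay, batch_size)))
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():                          # Bellman backup as the regression target
        target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Fill the buffer with dummy transitions just so the sketch runs end to end.
for _ in range(1000):
    replay.append((torch.randn(n_obs), torch.tensor(random.randrange(n_actions)),
                   torch.randn(()), torch.randn(n_obs), torch.zeros(())))
train_step()
```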

Trust Region Policy Optimization [ICML 2015]

  • Builds on Approximately Optimal Approximate Reinforcement Learning
  • Instead of using a linear mixture, TRPO uses an average KL-divergence constraint to ensure that the next policy stays sufficiently close to the current one (in practice, the surrogate L is approximated linearly and the KL term quadratically); the resulting subproblem is written out below
  • The natural policy gradient has the same direction as TRPO; the difference is that TRPO chooses a step size based on the trust region defined by KL divergence
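
The trust-region subproblem, roughly in the paper's notation (delta is the KL radius):

```latex
% Maximize the importance-sampled surrogate subject to an average-KL
% constraint; solved with conjugate gradient plus a line search after
% linearizing L and taking a quadratic approximation of the KL term.
\max_{\theta}\; L_{\theta_{\mathrm{old}}}(\theta)
  = \mathbb{E}\!\left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\,
      A^{\pi_{\theta_{\mathrm{old}}}}(s, a) \right]
\quad \text{s.t.} \quad
\bar{D}_{\mathrm{KL}}\bigl(\pi_{\theta_{\mathrm{old}}} \,\|\, \pi_{\theta}\bigr) \le \delta
```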

Deterministic Policy Gradient Algorithms [ICML 2014]

  • Derives a gradient for deterministic policies (assuming a continuous action space); see the expression below
  • Proves (under mild regularity conditions) that all the machinery previously developed for stochastic policy gradients (i.e. compatible function approximation, actor-critic, natural gradients, and episodic/batch methods) carries over to deterministic policy gradients
  • All proofs are in a separate document
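
The deterministic policy gradient theorem, with mu_theta the deterministic policy and rho^mu its discounted state distribution:

```latex
% The gradient flows through the critic's action-gradient at the action
% the policy would take, so no integration over actions is needed.
\nabla_{\theta} J(\mu_{\theta})
  = \mathbb{E}_{s \sim \rho^{\mu}}\!\left[
      \nabla_{\theta} \mu_{\theta}(s)\,
      \nabla_{a} Q^{\mu}(s, a)\big|_{a = \mu_{\theta}(s)} \right]
```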

Reinforcement learning of motor skills with policy gradients [Neural Networks 2008]

  • An extensive survey of policy gradient methods
  • Covers naive finite-difference methods, REINFORCE, and NAC (Natural Actor-Critic)
  • NAC is shown to be the state of the art

Natural Actor-Critic [Neurocomputing 2008]

  • Proves that the weight vector discussed in A Natural Policy Gradient is actually the natural gradient, rather than just a gradient defined by an average of point Fisher information matrices
  • Suggests an Actor-Critic style algorithm using the natural gradient

A Natural Policy Gradient [NIPS 2002]

  • Parameterizes Q(s,a) as a linear function of the gradient of log p(a|s) with respect to the policy parameters (compatible function approximation), as written below
  • Proves that the resulting weight vector is the steepest-ascent direction of expected return under the metric given by the expected Fisher information matrix (the natural policy gradient)
  • Suggests a REINFORCE-style algorithm using the natural gradient
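
A compact statement of the setup (rho^pi is the state distribution, F the average Fisher information, w the least-squares weights of the compatible approximator); the paper's result is that the natural gradient F(theta)^{-1} grad J(theta) equals exactly this w:

```latex
% Compatible function approximation and the average Fisher information
% matrix that defines the natural-gradient metric.
f_w(s, a) = w^{\top} \nabla_{\theta} \log \pi_{\theta}(a \mid s), \qquad
F(\theta) = \mathbb{E}_{s \sim \rho^{\pi}} \mathbb{E}_{a \sim \pi_{\theta}}\!\left[
  \nabla_{\theta} \log \pi_{\theta}(a \mid s)\, \nabla_{\theta} \log \pi_{\theta}(a \mid s)^{\top} \right]
```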

Approximately Optimal Approximate Reinforcement Learning [ICML 2002]

  • Points out the inefficiency of naive policy gradients using two example MDPs (section 3.2)
  • Derives a conservative policy iteration scheme that finds an almost-optimal policy (within epsilon) in time polynomial in 1/epsilon
  • The key idea is that a provably improved policy can be obtained from a linear mixture of the current policy and the greedily improved policy (mixture update below)
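
The mixture update, with pi' the greedy policy with respect to the current advantage estimates; for a suitably small alpha the paper's lower bound guarantees improvement:

```latex
% Conservative policy iteration: move only part of the way toward the
% greedy policy (the step that TRPO later replaces with a KL constraint).
\pi_{\mathrm{new}}(a \mid s) = (1 - \alpha)\,\pi(a \mid s) + \alpha\,\pi'(a \mid s)
```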

Convergence of Stochastic Iterative Dynamic Programming Algorithms [NIPS 1994]

  • Proves the convergence of Q-Learning to the optimal Q-values under mild regularity conditions (update rule below)
  • Gives a similar proof for TD(lambda)
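
The update whose convergence is proved; the conditions amount to step sizes alpha_t with divergent sum and convergent sum of squares, plus every state-action pair being visited infinitely often:

```latex
% Tabular Q-learning as a stochastic approximation of the Bellman
% optimality operator.
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha_t \Bigl( r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \Bigr)
```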