Parameter Space Noise for Exploration [arXiv 2017]
- [Accompanying blog post](https://blog.openai.com/better-exploration-with-parameter-noise/)
- Add noise to parameters instead of actions, fix noise for episode: more consistent exploration
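- A minimal numpy sketch of the idea, assuming a gym-style `env` and a placeholder `act(params, obs)` policy function (the paper additionally adapts the noise scale so the induced change in actions stays near a target):

```python
import numpy as np

def run_episode_with_param_noise(env, params, act, noise_std=0.1):
    """Perturb the policy weights once, then keep the perturbed weights fixed
    for the whole episode (temporally consistent exploration).
    `env`, `params` (dict of weight arrays) and `act(params, obs)` are
    placeholders for whatever policy/environment interface is in use."""
    noisy = {k: w + noise_std * np.random.randn(*w.shape) for k, w in params.items()}
    obs, done, episode_return = env.reset(), False, 0.0
    while not done:
        obs, reward, done, _ = env.step(act(noisy, obs))
        episode_return += reward
    return episode_return
```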
Evolution Strategies as a Scalable Alternative to Reinforcement Learning [arXiv 2017]
- [Accompanying blog post](https://blog.openai.com/evolution-strategies/)
- Score Function Gradient (REINFORCE) directly on parameters
- Ignores all time structure in envs
- Very parallelizable: many CPUs = fast training
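- A self-contained toy sketch of the core estimator on a made-up "return" function (the paper adds antithetic sampling, rank-based fitness shaping, and massive parallelization across workers):

```python
import numpy as np

def evolution_strategies(return_fn, theta, sigma=0.1, lr=0.02, iters=200, pop=50):
    """Score-function gradient on parameters:
    grad J(theta) ~= 1/(pop*sigma) * sum_i R(theta + sigma*eps_i) * eps_i."""
    for _ in range(iters):
        eps = np.random.randn(pop, theta.size)
        returns = np.array([return_fn(theta + sigma * e) for e in eps])
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalize returns
        theta = theta + lr / (pop * sigma) * eps.T @ returns
    return theta

# toy 'return': negative squared distance to an arbitrary target parameter vector
target = np.array([1.0, -2.0, 0.5])
theta = evolution_strategies(lambda th: -np.sum((th - target) ** 2), np.zeros(3))
```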
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks [ICML 2017]
- Objective function is loss after one gradient step; initial weights are optimized
- Applicable to all models using gradient descent: experiments include sinusoid regression, one-shot image classification, and fast adaptation in RL environments
- The (meta-)learned initialization is highly sensitive to task-specific gradients, so a single update step adapts it well to a new task
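- A first-order toy sketch of the meta-training loop (plain numpy, with a hypothetical linear-regression task family instead of the paper's sinusoid/classification/RL experiments; the full method also backpropagates through the inner update):

```python
import numpy as np

def mse_grad(w, x, y):
    """Gradient of mean squared error for the linear model y_hat = w[0] + w[1]*x."""
    err = (w[0] + w[1] * x) - y
    return np.array([2 * err.mean(), 2 * (err * x).mean()])

def first_order_maml(meta_iters=2000, inner_lr=0.01, meta_lr=0.001, k_shot=10):
    w = np.zeros(2)                                  # meta-learned initialization
    for _ in range(meta_iters):
        a, b = np.random.uniform(-2, 2, size=2)      # sample a task: y = a*x + b
        x_tr, x_te = np.random.uniform(-5, 5, (2, k_shot))
        y_tr, y_te = a * x_tr + b, a * x_te + b
        w_task = w - inner_lr * mse_grad(w, x_tr, y_tr)   # one inner adaptation step
        # meta-objective: loss of the adapted weights on held-out task data
        # (first-order approximation: ignore the Jacobian of the inner step)
        w = w - meta_lr * mse_grad(w_task, x_te, y_te)
    return w
```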
Neural Episodic Control [ICML 2017]
- Stores experience in a differentiable neural dictionary per action: keys are learned state embeddings, values are n-step Q estimates
- Reading from and writing to this semi-tabular memory is fast, giving much better data efficiency on Atari than DQN
Learning to Play in a Day: Faster Deep Reinforcement Learning by Optimality Tightening [ICLR 2017]
- Adds n-step optimality bounds to the Q-learning loss
- Results in faster information propagation between different Q-values
- Learns to play Atari games in ~24 hours(~10x speedup compared to DQN)
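- Roughly, for a sampled transition (s_j, a_j) the following n-step bounds are enforced by adding quadratic penalties to the DQN loss whenever the current estimate violates them (stated from memory; the paper derives both bounds from the Bellman optimality equation):

```latex
\sum_{i=0}^{k} \gamma^{i} r_{j+i} + \gamma^{k+1} \max_{a'} Q(s_{j+k+1}, a')
\;\le\; Q(s_j, a_j) \;\le\;
\gamma^{-(k+1)} \Big( Q(s_{j-k-1}, a_{j-k-1}) - \sum_{i=0}^{k} \gamma^{i} r_{j-k-1+i} \Big)
```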
PGQ: Combining policy gradient and Q-learning [ICLR 2017]
- We can estimate Q from policy and value function. Add a Q layer on an Actor-Critic NN, and run Q-Learning on the top layer in addition to the policy gradient update
- Shows the parallel between the Dueling Network and entropy-regularized Actor-Critic
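- The Q estimate recovered from the actor-critic quantities (stated from memory: alpha is the entropy-regularization weight and H^pi(s) the policy entropy at s); Q-learning is then run on this estimate:

```latex
\tilde{Q}^{\pi}(s,a) = \alpha\big(\log \pi(a|s) + H^{\pi}(s)\big) + V(s)
```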
Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic [ICLR 2017]
- Partitions the policy gradient of Actor-Critic into an off-policy part(DPG) and a residual REINFORCE gradient
- Derived by using the first-order Taylor expansion (linearization) of the critic Q about the policy mean as a control variate
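- The resulting estimator, roughly (mu_theta is the stochastic policy's mean, Q_w the off-policy critic; the paper's conservative/aggressive variants further weight the control variate; stated from memory):

```latex
\nabla_\theta J(\theta) \approx
\mathbb{E}_{\rho_\pi,\pi}\!\Big[\nabla_\theta \log \pi_\theta(a|s)\,\big(\hat{A}(s,a) - \bar{A}_w(s,a)\big)\Big]
+ \mathbb{E}_{\rho_\pi}\!\Big[\nabla_a Q_w(s,a)\big|_{a=\mu_\theta(s)}\, \nabla_\theta \mu_\theta(s)\Big],
\qquad
\bar{A}_w(s,a) = \nabla_a Q_w(s,a)\big|_{a=\mu_\theta(s)}\,\big(a - \mu_\theta(s)\big)
```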
Sample Efficient Actor-Critic with Experience Replay [ICLR 2017]
- Partitions the policy gradient of Actor-Critic into a stable off-policy part(Retrace) and its on-policy residual
- Achieves better sample efficiency by using experience replay on the off-policy component
Unifying Count-Based Exploration and Intrinsic Motivation [NIPS 2016]
- Introduces the 'pseudo-count': an approximation to the visit count derived from an arbitrary density model
- Pseudo-counts enable count-based exploration algorithms on MDPs with large or continuous state spaces
- Explores Montezuma's Revenge effectively(!)
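- Concretely, the pseudo-count is recovered from the density model's probability rho_n(x) of a state x and its "recoding" probability rho'_n(x) after observing x one more time; an exploration bonus that shrinks with the pseudo-count is then added to the reward:

```latex
\hat{N}_n(x) = \frac{\rho_n(x)\,\big(1 - \rho'_n(x)\big)}{\rho'_n(x) - \rho_n(x)}
```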
Mastering the game of Go with deep neural networks and tree search [Nature 2016]
- Use RL to play Go
- Core algorithm is Monte Carlo Tree Search using a trained policy network to get action probabilities
- Leaf nodes are evaluated using both a fast rollout policy network and a trained value network
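- Rough form of the search (P is the policy network's prior, N the visit count, v_theta the value network, z_L the fast-rollout outcome from the leaf, lambda a mixing weight):

```latex
a_t = \arg\max_a \big( Q(s_t,a) + u(s_t,a) \big), \quad u(s,a) \propto \frac{P(s,a)}{1 + N(s,a)},
\qquad V(s_L) = (1-\lambda)\, v_\theta(s_L) + \lambda\, z_L
```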
Asynchronous Methods for Deep Reinforcement Learning [ICML 2016]
- Suggests A3C(Asynchronous Advantage Actor-Critic), which is the standard Actor-Critic algorithm run by many instances in parallel
- Surpasses the previous state of the art in half the training time, using a multi-core CPU instead of a GPU
Dueling Network Architectures for Deep Reinforcement Learning [ICML 2016]
- Two-stream DQN, each stream representing V (value function) and A (advantage function)
- Eliminates the instability of adding two numbers of different scale(V is usually much larger than A)
- By updating multiple Q values on each observation, effectively updates more frequently than a single-stream DQN
- Implicitly splits the credit assignment problem into a recursive binary problem of "now or later"
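- A minimal sketch of the aggregation step at the head of the network (subtracting the mean advantage keeps the V/A decomposition identifiable):

```python
import numpy as np

def dueling_aggregate(value, advantages):
    """Combine the V stream (one scalar per state) and the A stream (one value
    per action) into Q values: Q = V + A - mean(A)."""
    return value + advantages - advantages.mean(axis=-1, keepdims=True)

# batch of one state, three actions
q = dueling_aggregate(np.array([[10.0]]), np.array([[0.2, -0.1, -0.1]]))
```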
Deep Exploration via Bootstrapped DQN [NIPS 2016]
- Bootstraps DQN heads with shared lower layers
- Results in more consistent('deep') exploration
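- A minimal sketch of the two sampling choices that drive the exploration (the heads themselves share a torso network, omitted here):

```python
import numpy as np

def choose_head(num_heads):
    """Sample one head at the start of each episode and follow it greedily for
    the whole episode; disagreement between heads yields 'deep' exploration."""
    return np.random.randint(num_heads)

def bootstrap_mask(num_heads, p=0.5):
    """Bernoulli mask over heads for each stored transition: a head trains only
    on the transitions its mask selects (an approximate bootstrap)."""
    return np.random.binomial(1, p, size=num_heads)
```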
Continuous Control with Deep Reinforcement Learning [ICLR 2016]
- Videos available here
- Suggests DDPG, which improves the actor-critic algorithm in Deterministic Policy Gradient Algorithms by using a DQN as the critic
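- The critic's regression target uses slowly moving target networks (critic theta', actor phi'), which are updated softly with a small tau:

```latex
y = r + \gamma\, Q_{\theta'}\!\big(s', \mu_{\phi'}(s')\big), \qquad
\theta' \leftarrow \tau\theta + (1-\tau)\theta', \quad
\phi' \leftarrow \tau\phi + (1-\tau)\phi'
```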
High-Dimensional Continuous Control Using Generalized Advantage Estimation [ICLR 2016]
- Derives a class of estimators(GAE) of the advantage function, parameterized by two real numbers gamma, lambda in [0,1]
- Empirical performance of TRPO+GAE is better than TRPO in some tasks
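- A minimal numpy sketch of the GAE(gamma, lambda) computation from TD residuals (the last entry of `values` is the bootstrap value for the state after the final step):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """A_t = sum_l (gamma*lam)^l * delta_{t+l},
    where delta_t = r_t + gamma*V(s_{t+1}) - V(s_t)."""
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

adv = gae(np.array([1.0, 0.0, 1.0]), np.array([0.5, 0.4, 0.6, 0.0]))
```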
Connecting Generative Adversarial Networks and Actor-Critic Methods [arXiv 2016]
- Constructs an Actor-Critic setup that is equivalent to a GAN: the state is either a real image or an image generated by the actor(chosen at random), the action is setting every pixel of the image, and the reward is 1 for real images and 0 for synthetic ones
- Cross-examines the approaches used to stabilize GANs and AC architectures
Deep Reinforcement Learning with Double Q-Learning [AAAI 2016]
- Points out that once DQN overestimates a Q value, the overestimation 'spills over' to states that precede it
- Uses different Q networks for action selection and evaluation
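- A minimal numpy sketch of the target computation (batched; `done` is 1.0 for terminal transitions):

```python
import numpy as np

def double_dqn_target(reward, gamma, q_online_next, q_target_next, done):
    """y = r + gamma * Q_target(s', argmax_a Q_online(s', a)):
    the online network selects the action, the target network evaluates it."""
    best_action = np.argmax(q_online_next, axis=-1)
    bootstrap = q_target_next[np.arange(len(best_action)), best_action]
    return reward + gamma * (1.0 - done) * bootstrap

y = double_dqn_target(np.array([1.0]), 0.99,
                      np.array([[0.2, 0.5]]), np.array([[0.3, 0.4]]),
                      np.array([0.0]))
```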
Action-Conditional Video Prediction using Deep Networks in Atari Games [NIPS 2015]
- Constructs a NN that predicts the next frame of an Atari game given the current frame and action
- Exploration using predicted frames marginally increases score
Human-level Control Through Deep Reinforcement Learning [Nature 2015]
- Proposes DQN: a deep neural network trained with the Bellman backup as its regression target
- Uses three tricks for stability: separate prediction/target networks, experience replay, reward clipping
- Points out that a scheme similar to experience replay happens in the hippocampus of the mammalian brain
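- The core regression loss, with theta^- the periodically copied target network and D the replay buffer:

```latex
L(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\Big[\big(r + \gamma \max_{a'} Q_{\theta^{-}}(s',a') - Q_{\theta}(s,a)\big)^{2}\Big]
```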
Trust Region Policy Optimization [ICML 2015]
- Builds on Approximately Optimal Approximate Reinforcement Learning
- Instead of a linear mixture of policies, TRPO constrains the average KL divergence to keep the next policy sufficiently close to the current one(in practice, the surrogate objective L is approximated linearly and the KL divergence quadratically)
- The natural policy gradient has the same direction as TRPO; the difference is that TRPO chooses a step size based on the trust region defined by KL divergence
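- The per-iteration problem: maximize the surrogate objective subject to a trust-region constraint of size delta on the average KL divergence:

```latex
\max_\theta \;\; \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}},\, a \sim \pi_{\theta_{\text{old}}}}\!\left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}\, A_{\theta_{\text{old}}}(s,a) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{s}\!\left[ D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot|s) \,\|\, \pi_\theta(\cdot|s)\big) \right] \le \delta
```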
Deterministic Policy Gradient Algorithms [ICML 2014]
- Derives a gradient for deterministic policies(assuming a continuous action space)
- Proves(under mild regularity conditions) that all previously developed machinery for stochastic policy gradients(i.e. compatible function approximation, actor-critic, natural gradients, and episodic/batch methods) applies to deterministic policy gradients
- All proofs are in a separate document
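- The deterministic policy gradient for an actor mu_theta under the discounted state distribution rho^mu:

```latex
\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\!\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s,a)\big|_{a=\mu_\theta(s)} \right]
```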
Reinforcement learning of motor skills with policy gradients [Neural Networks 2008]
- An extensive survey of policy gradient methods
- Covers naive finite-difference methods, REINFORCE, NAC(Natural Actor-Critic)
- NAC is shown to be the state of the art
Natural Actor-Critic [Neurocomputing 2008]
- Proves that the weight vector discussed in A Natural Policy Gradient is actually the natural gradient, rather than just a gradient defined by an average of point Fisher information matrices
- Suggests an Actor-Critic style algorithm using the natural gradient
A Natural Policy Gradient [NIPS 2002]
- Parameterizes Q(s,a) as a weighted sum of the components of the score function ∇θ log π(a|s), i.e. the compatible function approximation
- Proves that the weight vector(above) is the direction of steepest descent w.r.t. the expectation of the Fisher information matrix(natural policy gradient)
- Suggests a REINFORCE-style algorithm using the natural gradient
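- The natural gradient direction, with F the Fisher information averaged over the state distribution; with the compatible approximation f_w(s,a) = w^T ∇θ log π(a|s), this direction is exactly w:

```latex
\tilde{\nabla}_\theta J = F(\theta)^{-1} \nabla_\theta J(\theta), \qquad
F(\theta) = \mathbb{E}_{s \sim \rho^{\pi}}\!\left[ \mathbb{E}_{a \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s)^{\top} \right] \right]
```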
Approximately Optimal Approximate Reinforcement Learning [ICML 2002]
- Points out the inefficiency of policy gradients using two example MDPs(section 3.2)
- Derives a conservative policy iteration scheme that finds an almost optimal policy(within epsilon) in time polynomial in 1/epsilon
- The key idea is that we can get a provably improved policy by using a linear mixture between the current policy and the greedily improved policy
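- The conservative update, with pi' the greedy policy w.r.t. the current policy's value estimates and alpha chosen small enough to guarantee improvement:

```latex
\pi_{\text{new}}(a|s) = (1-\alpha)\,\pi(a|s) + \alpha\,\pi'(a|s)
```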
Convergence of Stochastic Iterative Dynamic Programming Algorithms [NIPS 1994]
- Proves the convergence of Q-Learning to the optimal Q values given some mild regularity conditions
- Gives a similar proof for TD(lambda)