A TensorFlow implementation of a Distributed Distributional Deep Deterministic Policy Gradients (D4PG) network for continuous control.
D4PG builds on the Deep Deterministic Policy Gradients (DDPG) approach (paper, code), making several improvements including the introduction of a distributional critic, the use of distributed agents running on multiple threads to collect experiences, prioritised experience replay (PER), and N-step returns.
Trained on OpenAI Gym environments.
This implementation has been successfully trained and tested on the Pendulum-v0, BipedalWalker-v2 and LunarLanderContinuous-v2 environments. The code can, however, be run on any environment with a low-dimensional (non-image) state space and a continuous action space.
This implementation currently holds the high score for the Pendulum-v0 environment on the OpenAI leaderboard.
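As a rough illustration of one of these components, an N-step return used to bootstrap the critic can be sketched as follows. This is a hedged, standalone example (the function name and arguments are made up for illustration), not code taken from this repository:

```python
# Illustrative sketch of an N-step return; not code from this repository.
def n_step_return(rewards, gamma, bootstrap_value, terminal):
    """rewards: the N rewards observed after the current state.
    bootstrap_value: the critic's value estimate for the state reached after N steps.
    terminal: whether the episode ended within these N steps."""
    ret = 0.0
    for i, r in enumerate(rewards):
        ret += (gamma ** i) * r          # discounted sum of the N observed rewards
    if not terminal:
        ret += (gamma ** len(rewards)) * bootstrap_value   # bootstrap from the critic
    return ret
```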
Note: The versions stated are the versions I used; however, this will likely still work with other versions.
- Ubuntu 16.04 (most non-Atari environments will also work on Windows)
- python 3.5
- OpenAI Gym 0.10.8 (See link for installation instructions + dependencies)
- tensorflow-gpu 1.5.0
- numpy 1.15.2
- scipy 1.1.0
- opencv-python 3.4.0
- imageio 2.4.1 (requires pillow)
- inotify-tools 3.14
The default environment is 'Pendulum-v0'. To use a different environment, simply change the ENV parameter in params.py before running the following files.
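For example, the relevant setting in params.py might look like the following (illustrative only; the actual layout of params.py may differ):

```python
# In params.py -- illustrative only; the actual file layout may differ.
class train_params:
    ENV = 'LunarLanderContinuous-v2'   # default is 'Pendulum-v0'
```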
To train the D4PG network, run
$ python train.py
This will train the network on the specified environment and periodically save checkpoints to the /ckpts folder.
To test the saved checkpoints during training, run
$ python test_every_new_ckpt.py
This should be run alongside the training script, allowing the latest checkpoints to be tested periodically as the network trains. This script invokes the run_every_new_ckpt.sh shell script, which monitors the given checkpoint directory and runs the test.py script on the latest checkpoint every time a new checkpoint is saved. Test results are saved to a text file in the /test_results folder (optional).
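The repository's run_every_new_ckpt.sh relies on inotify-tools for this monitoring. Purely to illustrate the idea, an equivalent polling loop could look like the Python sketch below (the directory path and interval are hypothetical; this is not the repo's method):

```python
# Polling sketch of the checkpoint-watching idea. The repo itself uses
# run_every_new_ckpt.sh with inotify-tools rather than this script.
import glob
import subprocess
import time

CKPT_DIR = './ckpts/Pendulum-v0'   # hypothetical checkpoint directory
seen = set()

while True:
    for index_file in sorted(glob.glob(CKPT_DIR + '/*.index')):
        if index_file not in seen:
            seen.add(index_file)
            subprocess.run(['python', 'test.py'])   # test the newly saved checkpoint
    time.sleep(30)   # poll every 30 seconds
```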
Once we have a trained network, we can visualise its performance in the environment by running
$ python play.py
This will play the environment on screen using the trained network and save a GIF (optional).
Note: To reproduce the best 100-episode performance of -123.11 +/- 6.86 that achieved the top score on the 'Pendulum-v0' OpenAI leaderboard, run
$ python test.py
specifying the train_params.ENV and test_params.CKPT_FILE parameters in params.py as 'Pendulum-v0' and 'Pendulum-v0.ckpt-660000' respectively.
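In params.py this would correspond to something like the following (illustrative; the file's actual structure may differ):

```python
# Illustrative settings in params.py; adjust to the file's actual structure.
class train_params:
    ENV = 'Pendulum-v0'

class test_params:
    CKPT_FILE = 'Pendulum-v0.ckpt-660000'
```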
Result of training the D4PG on the 'Pendulum-v0' environment:
Result of training the D4PG on the 'LunarLanderContinuous-v2' environment:
Result of training the D4PG on the 'BipedalWalker-v2' environment:
Result of training the D4PG on the 'BipedalWalkerHardcore-v2' environment:
Environment | Best 100-episode performance | Ckpt file |
---|---|---|
Pendulum-v0 | -123.11 +/- 6.86 | ckpt-660000 |
LunarLanderContinuous-v2 | 290.87 +/- 2.00 | ckpt-320000 |
BipedalWalker-v2 | 304.62 +/- 0.13 | ckpt-940000 |
BipedalWalkerHardcore-v2 | 256.29 +/- 7.08 | ckpt-8130000 |
All checkpoints for the above results are saved in the ckpts folder, and the results can be reproduced by running python test.py and specifying the train_params.ENV and test_params.CKPT_FILE parameters in params.py for the desired environment and checkpoint file.
- Train/test on further environments, including MuJoCo
- A Distributional Perspective on Reinforcement Learning
- Distributed Distributional Deterministic Policy Gradients
- OpenAI Baselines - Prioritised Experience Replay implementation
- OpenAI Baselines - Segment Tree implementation
- DeepMind TRFL Library - L2 Projection
MIT License