Reinforcement Learning with Random Delays

PyTorch implementation of our paper Reinforcement Learning with Random Delays (ICLR 2021) [arXiv]

Getting Started

This repository can be pip-installed via:

pip install git+https://github.com/rmst/rlrd.git
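If the installation succeeded, the package should be importable (a quick sanity check):

python -c "import rlrd"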

DC/AC can be run on a simple 1-step delayed Pendulum-v0 task via:

python -m rlrd run rlrd:DcacTraining Env.id=Pendulum-v0

Hyperparameters can be set from the command line, e.g.:

python -m rlrd run rlrd:DcacTraining \
  Env.id=Pendulum-v0 \
  Env.min_observation_delay=0 \
  Env.sup_observation_delay=2 \
  Env.min_action_delay=0 \
  Env.sup_action_delay=3 \
  Agent.batchsize=128 \
  Agent.memory_size=1000000 \
  Agent.lr=0.0003 \
  Agent.discount=0.99 \
  Agent.target_update=0.005 \
  Agent.reward_scale=5.0 \
  Agent.entropy_scale=1.0 \
  Agent.start_training=10000 \
  Agent.device=cuda \
  Agent.training_steps=1.0 \
  Agent.loss_alpha=0.2 \
  Agent.Model.hidden_units=256 \
  Agent.Model.num_critics=2
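The Env.* / Agent.* syntax maps dotted key=value pairs onto a nested configuration. As a rough illustration of this convention (not the actual rlrd parsing code), such overrides could be folded into a nested dict like this:

# Hypothetical illustration of dotted key=value overrides (not rlrd's parser).
def parse_overrides(args):
    config = {}
    for arg in args:
        key, value = arg.split("=", 1)
        node = config
        *path, leaf = key.split(".")
        for part in path:
            node = node.setdefault(part, {})
        node[leaf] = value  # values stay strings here; the library converts types itself
    return config

print(parse_overrides(["Env.id=Pendulum-v0", "Agent.lr=0.0003"]))
# {'Env': {'id': 'Pendulum-v0'}, 'Agent': {'lr': '0.0003'}}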

Note that our gym wrapper adds a constant 1-step delay to the action delay: Env.min_action_delay=0 actually means that the minimum action delay is 1, whereas Env.min_observation_delay=0 means that the minimum observation delay is 0 (we assume that the action delay cannot be less than 1 time-step, e.g. to account for action inference). For instance (see also the sketch after this list):

  • Env.min_observation_delay=0 Env.sup_observation_delay=2 means that the observation delay is randomly 0 or 1.
  • Env.min_action_delay=0 Env.sup_action_delay=2 means that the action delay is randomly 1 or 2.
  • Env.min_observation_delay=1 Env.sup_observation_delay=2 means that the observation delay is always 1.
  • Env.min_observation_delay=0 Env.sup_observation_delay=3 means that the observation delay is randomly 0, 1 or 2.
  • etc.
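The convention above can be summarized in a few lines of Python (an illustration of the documented semantics, not code from the library):

# Possible delays under the wrapper's convention (illustration only).
def possible_observation_delays(min_delay, sup_delay):
    return list(range(min_delay, sup_delay))              # sup is exclusive

def possible_action_delays(min_delay, sup_delay):
    return [d + 1 for d in range(min_delay, sup_delay)]   # wrapper adds a constant 1-step delay

print(possible_observation_delays(0, 2))  # [0, 1]
print(possible_action_delays(0, 2))       # [1, 2]
print(possible_observation_delays(1, 2))  # [1]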

MuJoCo Experiments

To install MuJoCo, follow the instructions at openai/gym. The paper's experiments use standard MuJoCo environments such as HalfCheetah-v2 and Ant-v2:

To train DC/AC on a 1-step delayed version of HalfCheetah-v2, run:

python -m rlrd run rlrd:DcacTraining Env.id=HalfCheetah-v2

To train SAC on a 1-step delayed version of Ant-v2, run:

python -m rlrd run rlrd:DelayedSacTraining Env.id=Ant-v2
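To sweep over several environments, the same CLI can be driven from a short script (a convenience sketch; the env ids and entry point are the ones used above):

# Launch one DC/AC training run per environment, sequentially (sketch).
import subprocess

for env_id in ["HalfCheetah-v2", "Ant-v2"]:
    subprocess.run(
        ["python", "-m", "rlrd", "run", "rlrd:DcacTraining", f"Env.id={env_id}"],
        check=True,  # stop the sweep if a run fails
    )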

Weights and Biases API

Your training curves can be exported directly to the Weights and Biases (wandb) website by using run-wandb. For example, to run DC/AC on Pendulum with a 1-step delay and export the curves to your wandb project:

python -m rlrd run-wandb \
  yourWandbID \
  yourWandbProjectName \
  aNameForTheWandbRun \
  aFileNameForLocalCheckpoints \
  rlrd:DcacTraining Env.id=Pendulum-v0

Use the optional hyperparameters described above to experiment with more meaningful delays.

Contribute / known issues

Contributions are welcome. Please submit a PR with your name in the contributors list.

We have not yet optimized our Python implementation of DC/AC; this is the most important thing to do right now, as it is quite slow.

In particular, a lot of time is wasted artificially re-creating a batched tensor to compute the value estimates in one forward pass, and the replay buffer is inefficient. See the #FIXME comments in dcac.py.
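As a generic illustration of the replay-buffer issue (not the actual dcac.py code), sampling from a preallocated tensor avoids rebuilding a batched tensor from a Python list on every training step:

# Generic contrast between the slow and the fast pattern (illustration only).
import torch

capacity, obs_dim, batch_size = 100000, 8, 128

# Slow: stacking a fresh batch from a Python list of tensors every step.
slow_buffer = [torch.randn(obs_dim) for _ in range(capacity)]
idx = torch.randint(capacity, (batch_size,))
batch = torch.stack([slow_buffer[i] for i in idx])  # re-creates the batched tensor

# Faster: index into one preallocated tensor; no per-step re-creation.
fast_buffer = torch.randn(capacity, obs_dim)
batch = fast_buffer[idx]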