
Agent

DDPG Algorithm

The DDPG (Deep Deterministic Policy Gradient) algorithm builds on Deep Q-Learning. Like DQNs, it uses a replay buffer as well as fixed Q-targets to help the networks converge. Its actor also picks the best action deterministically, rather than sampling stochastically as traditional policy-gradient methods do. Where DDPG differs from Deep Q-Learning is that it generalizes to continuous action spaces, something Deep Q-Learning cannot handle.

The DDPG algorithm is a (self-proclaimed) actor-critic method. It has four networks: an actor network, a critic network, and a fixed target network for each of them. The actor network takes a state as input and deterministically outputs what it believes to be the best action. The critic takes both the state and the action as input and outputs the Q-value for that state-action pair. Similarly to DQNs, the critic's loss is computed between its predicted Q-value for the current state-action pair and the TD target formed from the reward and the target networks' estimate at the next state; this loss is backpropagated to train the critic, while the actor is trained to maximize the critic's estimate of its actions. The soft update used for the fixed target networks is slightly different from Deep Q-Learning's: it shifts each target network only a small fraction of the way (0.1%) toward its local counterpart at each update, which keeps learning more stable.
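A minimal PyTorch sketch of this update is shown below. The function, network, and optimizer names (ddpg_learn, actor_local, critic_target, and so on) are assumptions chosen for illustration, not necessarily the identifiers used in this repository.

import torch
import torch.nn.functional as F

def ddpg_learn(experiences, actor_local, actor_target, critic_local, critic_target,
               actor_opt, critic_opt, gamma=0.99, tau=1e-3):
    """One learning step: critic TD update, actor policy update, soft target update."""
    states, actions, rewards, next_states, dones = experiences

    # Critic: minimize MSE between predicted Q-values and the TD target
    # r + gamma * Q_target(s', actor_target(s')).
    with torch.no_grad():
        next_actions = actor_target(next_states)
        q_targets = rewards + gamma * critic_target(next_states, next_actions) * (1 - dones)
    critic_loss = F.mse_loss(critic_local(states, actions), q_targets)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: maximize the critic's estimate of the actor's chosen actions.
    actor_loss = -critic_local(states, actor_local(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update: move each target network a small fraction tau (0.1%) toward its local copy.
    for target, local in ((actor_target, actor_local), (critic_target, critic_local)):
        for t_param, l_param in zip(target.parameters(), local.parameters()):
            t_param.data.copy_(tau * l_param.data + (1.0 - tau) * t_param.data)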

DDPG was chosen due to its ability to perform well on tasks with continuous action spaces. It is a direct extension of Deep Q-Learning, which has proven to be exceptional for multiple reinforcement learning tasks. A replay buffer was used and shared between both agents. To encourage exploration, Ornstein–Uhlenbeck (OU) noise was generated and added to the actions of each agent.
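A short sketch of the OU noise process is given below; the parameter values (mu, theta, sigma) are common defaults and may not match the ones used in this project.

import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated noise added to actions."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        # Start each episode at the long-run mean.
        self.state = self.mu.copy()

    def sample(self):
        # dx = theta * (mu - x) + sigma * N(0, 1): drift back toward mu plus Gaussian noise.
        dx = self.theta * (self.mu - self.state) + self.sigma * self.rng.standard_normal(len(self.state))
        self.state = self.state + dx
        return self.state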

Implementation

Agent

  • Actor (policy) network and its fixed target
  • Critic (value) network and its fixed target
  • One replay buffer (shared by both agents)
  • Act method: uses the local actor network together with the noise process to output the next action
  • Step method: stores the experience tuple in the replay buffer and triggers learning every UPDATE_EVERY time steps (see the sketch after this list)
  • Learn method: updates the local actor and critic networks with gradient descent through loss.backward and optimizer.step, and soft-updates the targets for both networks
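The act and step methods might look roughly like the following sketch, which assumes the hyperparameter names listed later in this report and a replay buffer with add and sample methods; the repository's actual Agent class may be organized differently.

import numpy as np
import torch

class Agent:
    # __init__ (not shown) builds the actor/critic networks, optimizers,
    # the shared replay buffer (self.memory), and the OU noise process (self.noise).

    def act(self, state, add_noise=True):
        """Return the local actor's action for this state, plus exploration noise."""
        state = torch.from_numpy(state).float().unsqueeze(0)
        self.actor_local.eval()
        with torch.no_grad():
            action = self.actor_local(state).squeeze(0).numpy()
        self.actor_local.train()
        if add_noise:
            action += self.noise.sample()
        return np.clip(action, -1, 1)  # actions are bounded, matching the tanh output layer

    def step(self, state, action, reward, next_state, done):
        """Store the experience, then learn every UPDATE_EVERY steps, UPDATE_TIMES times."""
        self.memory.add(state, action, reward, next_state, done)
        self.t_step = (self.t_step + 1) % UPDATE_EVERY
        if self.t_step == 0 and len(self.memory) > BATCH_SIZE:
            for _ in range(UPDATE_TIMES):
                self.learn(self.memory.sample(), GAMMA)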

Training:

  • Loop over episodes (a code sketch of this loop follows the list):
    • Reset environment and observe initial state
    • Loop over time steps
      • Select action (Agent act method)
      • Execute selected action
      • Observe reward, next state, and done
      • Pass the (state, action, reward, next_state, done) tuple to the Agent (Agent step method)
        • Agent stores tuple in replay buffer
      • After UPDATE_EVERY iterations, sample batch of experiences from replay buffer and update both networks (Agent learn method) UPDATE_TIMES times
        • Update target networks after a certain number of updates
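A high-level sketch of this loop is shown below. The environment calls follow the Unity ML-Agents (unityagents) API that projects like this typically use, and the 0.5 target score is inferred from the solved message later in this report; both are assumptions about the surrounding code.

from collections import deque
import numpy as np

# env, brain_name, num_agents, agent, and n_episodes are assumed to be set up beforehand.
scores_window = deque(maxlen=100)                          # last 100 episode scores
for episode in range(1, n_episodes + 1):
    env_info = env.reset(train_mode=True)[brain_name]      # reset environment
    states = env_info.vector_observations                  # observe initial states
    scores = np.zeros(num_agents)
    agent.reset()                                          # reset the OU noise process
    while True:
        actions = np.vstack([agent.act(s) for s in states])  # select actions (Agent act method)
        env_info = env.step(actions)[brain_name]            # execute selected actions
        next_states = env_info.vector_observations          # observe next states
        rewards = env_info.rewards                          # observe rewards
        dones = env_info.local_done                         # observe done flags
        for s, a, r, ns, d in zip(states, actions, rewards, next_states, dones):
            agent.step(s, a, r, ns, d)                      # store tuple; learn every UPDATE_EVERY steps
        states = next_states
        scores += rewards
        if np.any(dones):
            break
    scores_window.append(np.max(scores))                    # episode score: best agent (assumed convention)
    if np.mean(scores_window) >= 0.5:                       # solved threshold inferred from the report
        print(f"Environment solved in {episode} episodes!")
        break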

Hyperparameters

These hyperparameters were mostly adopted from the DDPG paper. UPDATE_EVERY and UPDATE_TIMES were the result of experimentation, as they proved to facilitate the quickest and most stable learning.

BUFFER_SIZE = int(1e6)  # replay buffer size
BATCH_SIZE = 128        # minibatch size
GAMMA = 0.99            # discount factor
TAU = 1e-3              # for soft update of target parameters
LR_ACTOR = 1e-4         # learning rate of the actor 
LR_CRITIC = 1e-3        # learning rate of the critic
WEIGHT_DECAY = 0        # L2 weight decay
UPDATE_EVERY = 20       # how often to update the network
UPDATE_TIMES = 20       # how many times to update the network

Actor and Critic Network Architecture

The architecture was chosen through deliberation and experimentation. Hidden layers of size 128 for both the actor and the critic seemed to work best. Smaller layers may have been able to learn more quickly but might not be as accurate, while larger layers may be prone to overfitting. Code sketches of both networks follow their layer lists below.

Actor

  1. Fully Connected: 33 → 128 (Input layer)
  2. ReLU (Activation)
  3. Batch Norm (Normalization)
  4. Fully Connected: 128 → 128 (Hidden layer)
  5. ReLU (Activation)
  6. Fully Connected: 128 → 4 (Output layer)
  7. Tanh (Activation)
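A possible PyTorch realization of this actor is sketched below; it follows the layer list above but is illustrative rather than the repository's exact module.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Maps a 33-dimensional state to a 4-dimensional action in [-1, 1]."""

    def __init__(self, state_size=33, action_size=4, hidden=128):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden)
        self.bn1 = nn.BatchNorm1d(hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, action_size)

    def forward(self, state):
        x = self.bn1(F.relu(self.fc1(state)))  # input layer -> ReLU -> batch norm
        x = F.relu(self.fc2(x))                # hidden layer -> ReLU
        return torch.tanh(self.fc3(x))         # tanh keeps each action in [-1, 1]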

Critic

  1. Fully Connected: 33 → 128 (Input layer)
  2. ReLU (Activation)
  3. Batch Norm (Normalization)
  4. Concatenation: 128 → 132 (state features concatenated with the 4 action values)
  5. Fully Connected: 132 → 128 (Hidden layer)
  6. ReLU (Activation)
  7. Fully Connected: 128 → 1 (Output layer)
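And a matching sketch for the critic, which concatenates the action with the state features after the first layer; again this is illustrative, not necessarily the exact module used here.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Critic(nn.Module):
    """Maps a (state, action) pair to a scalar Q-value."""

    def __init__(self, state_size=33, action_size=4, hidden=128):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden)
        self.bn1 = nn.BatchNorm1d(hidden)
        self.fc2 = nn.Linear(hidden + action_size, hidden)  # 128 state features + 4 actions = 132
        self.fc3 = nn.Linear(hidden, 1)

    def forward(self, state, action):
        x = self.bn1(F.relu(self.fc1(state)))  # input layer -> ReLU -> batch norm
        x = torch.cat((x, action), dim=1)      # concatenate state features with the action
        x = F.relu(self.fc2(x))                # hidden layer -> ReLU
        return self.fc3(x)                     # scalar Q-value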

Plot of Rewards

[Plot of rewards per episode]

Environment solved in 611 episodes! Average Score: 0.51

Ideas for Future Work

To improve performance, other algorithms such as PPO, D4PG, and A3C could be implemented, along with techniques such as GAE (Generalized Advantage Estimation). Additionally, more experimentation could be done with hyperparameters, network layer sizes, and training epochs. Finally, a MADDPG agent, which uses centralized training with decentralized execution, could be used to further improve performance.