diff --git a/Practical_4_Reinforcement_Learning.ipynb b/Practical_4_Reinforcement_Learning.ipynb index 8e76493..8cb1290 100644 --- a/Practical_4_Reinforcement_Learning.ipynb +++ b/Practical_4_Reinforcement_Learning.ipynb @@ -1,937 +1,1039 @@ { - "nbformat": 4, - "nbformat_minor": 0, - "metadata": { + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "Eou_PrPyXAgT" + }, + "source": [ + "# Practical 4: Reinforcement Learning" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "QTOkLq7QSxaw" + }, + "source": [ + "## Introduction\n", + "In this practical we introduce the idea of reinforcement learning, discuss how it differs from supervised and unsupervised learning and then build an agent that learns to play a simple game called \"Catcher\".\n", + "\n", + "## Learning Objectives\n", + "* Understand the relationship between the **environment** and the **agent** \n", + "* Understand how a **policy** is used by an agent to select an action\n", + "* Describe how to implement a **run-loop** that controls the interaction between environement and agent.\n", + "* Understand how the **state**, **action** and **reward** are communicated between the agent and the environment. \n", + "* Be able to implement the a simple **policy-gradient** RL algorithm call **REINFORCE**\n", + "* Discover at least one potential issue with the REINFORCE algorithm." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "V3Yu5en6YvbB" + }, + "outputs": [], + "source": [ + "#@title [RUN ME!] Install pre-requisites. { display-mode: \"form\" }\n", + "import os\n", + "import sys\n", + "import math\t\n", + "\n", + "!git clone https://github.com/ntasfi/PyGame-Learning-Environment.git\n", + "os.chdir('PyGame-Learning-Environment')\n", + "!pip -q install -e .\n", + "!pip -q install pygame\n", + "# os.chdir('/content') # is this a collab thing?\n", + "\n", + "sys.path.append('/PyGame-Learning-Environment')\n", + "os.environ[\"SDL_VIDEODRIVER\"] = \"dummy\" # prevent trying to open a window\n", + "\n", + "!pip -q install moviepy\n", + "\n", + "print('Installed pre-requisites...')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "LejsQxNCB2fM" + }, + "outputs": [], + "source": [ + "#@title [RUN ME!] Imports { display-mode: \"form\" }\n", + "\n", + "from __future__ import absolute_import\n", + "from __future__ import division\n", + "from __future__ import print_function\n", + "from __future__ import unicode_literals\n", + "\n", + "import os\n", + "import moviepy.editor as mpy\n", + "from ple import PLE\n", + "from ple.games import pong\n", + "from ple.games import pixelcopter\n", + "from ple.games import flappybird\n", + "from ple.games import catcher\n", + "from IPython import display\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "from collections import deque\n", + "import seaborn as sns\n", + "\n", + "import tensorflow as tf\n", + "\n", + "try:\n", + " tf.enable_eager_execution()\n", + " print('Running Eagerly')\n", + "except ValueError:\n", + " print('Already running Eagerly')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "o8kIIBytaQqH" + }, + "outputs": [], + "source": [ + "#@title [RUN ME!] 
Helper Functions { display-mode: \"form\" }\n", + "\n", + "def make_animation(images, fps=60, true_image=False):\n", + " duration = len(images) / fps\n", + "\n", + " def make_frame(t):\n", + " try:\n", + " x = images[int(len(images) / duration * t)]\n", + " except:\n", + " x = images[-1]\n", + "\n", + " if true_image:\n", + " return x.astype(np.uint8)\n", + " else:\n", + " return ((x + 1) / 2 * 255).astype(np.uint8)\n", + "\n", + " clip = mpy.VideoClip(make_frame, duration=duration)\n", + " clip.fps = fps\n", + " return clip\n", + "\n", + "def progress(value, max=100, message=''):\n", + " return display.HTML(\"\"\"\n", + " \n", + "
{message}
\n", + " \"\"\".format(value=value, max=max, message=message))\n", + "\n", + "def plot_rolling_returns(rolling_returns): \n", + " sns.tsplot(rolling_returns)\n", + " plt.title('Rolling Returns')\n", + " plt.xlabel('# Epsiodes')\n", + " plt.ylabel('Rolling Return')\n", + " \n", + "def state_to_buckets(state, bucket_width=0.25):\n", + " return tuple(math.ceil(s / bucket_width)-1 for s in state)\n", + "\n", + "def visualize_policy(ax, policy):\n", + " # plot the policy (this gives us a more visual indication that it is a\n", + " # probability distribution over available actions)\n", + " action_strings = [\"right\", \"left\", \"None\"]\n", + " \n", + " ax.bar(action_strings, policy.numpy())\n", + " ax.set_title(\"Distribution over actions\")\n", + " ax.set_xlabel(\"Actions\")\n", + " ax.set_ylabel(\"Probability\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "2wZMmj4Toy8e" + }, + "source": [ + "## Reinforcement Learning\n", + "So far we have encountered **supervised learning**, where we have an input and a target value or class that we want to predict. We have also encountered **unsupervised learning** where we are only given an input and look for patterns in that input. In this practical, we look into **reinforcement learning** which can loosely be defined as training an **agent** to maximise a numerical **reward** it obtains through interaction with an **environment**. \n", + "\n", + "The environment defines a set of **actions** that an agent can take. The agent observes the current **state** of the environment, tries actions and *learns* a **policy** which is a distribution over the possible actions given a state of the environment. \n", + "\n", + "The following diagram illustrates the interaction between the agent and environment. We will explore each of the terms in more detail throughout this practical. \n", + "\n", + "\n", + "![Interaction of Agent and Environment](https://github.com/deep-learning-indaba/indaba-2018/blob/master/images/rl_diagram1.png?raw=true)\n", + "\n", + "\n", + "## Outline\n", + "We will train an agent to play a very simple game called \"Catcher\" which is often used as a test bed for RL algorithms. In the process we will set up all the necessary framework to explore variations of the algorithm or switch to more advanced games! In particular, the steps we will follow in this practical are as follows:\n", + "\n", + "1. Introduce the game environment, explore the states and actions available. \n", + "2. Create a simple agent that takes random actions\n", + "3. Write a run-loop which controls the interaction and manages the communication between the agent and environment\n", + "4. Implement a policy as a feed-forward neural network\n", + "5. Explain and implement the REINFORCE algorithm to learn how to play the game" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "RnNYUFu7V1I2" + }, + "source": [ + "## The Environment\n", + "The environment we consider is the game Catcher from the PyGame Learning Environment (PLE) library. The player (agent) controls a paddle that it must use to catch a falling fruit. Each time the game runs, the fruit falls from from the top to the bottom of the screen, starting at a different X coordinate (which doesn't change during the episode) and the paddle starts in a random location along the bottom of the screen. The player wins if they manage to catch the fruit and loses if it falls to the ground. 
\n", + "\n", + "![Catcher game illustration](https://pygame-learning-environment.readthedocs.io/en/latest/_images/catcher.gif)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "9vTOQs5eQZoE" + }, + "source": [ + "### Create the environment using PLE\n", + "For this game, we'll set things up so that the reward will be non-zero only at the end of each episode, and will be +1 for catching the fruit and -1 for missing it. For all other frames the reward will be zero.\n", + "\n", + "**Note**: PLE has a number of games, which it wraps in a generic \"environment\". \n", + "So, in this case, both ```evironment``` and ```game``` constitute our environment. ```environment``` allows us to perform actions \n", + "and returns states and rewards, while ```game``` handles the specifics of Catcher (or whichever other game we decide to use)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "L_awOQZJC-d_" + }, + "outputs": [], + "source": [ + "# Create an instance of the catcher game\n", + "game = catcher.Catcher(init_lives=1)\n", + "game_height = game.height\n", + "game_width = game.width\n", + "\n", + "frame_skip = 3 # Skip 3 frames at each step to speed up the game\n", + "\n", + "# Wrap the game in a PLE environment and configure the rewards\n", + "environment = PLE(game, display_screen=False, force_fps=True, \n", + " reward_values={'win': 1.0, 'loss': -1.0, 'negative': 0.0}, \n", + " frame_skip=frame_skip) \n", + "\n", + "# The reward_values dictionary above allows us to override the default reward structure provided by PLE. \n", + "# win and loss specify the rewards for winning or losing an episode, while positive and negative \n", + "# specify the rewards received for positive or negative events that can occur during the game.\n", + "# If you change the game, start by *removing* the overrides and see what the default is before deciding if\n", + "# you want to modify it. \n", + "\n", + "# Initialise the environment\n", + "environment.init()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "36vk0-0qST83" + }, + "source": [ + "### What does the state look like?\n", + "The environment provides a rendering of the virtual 'screen' of the game to an RGB image, by using the method ```getScreenRGB```. For this practical, in the interests of simplicity and quick trianing time, we use the *game state* directly, which provides a summary (in a dictionary) of important peices of information making up the current state of the game. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { "colab": { - "name": "Practical 4: Reinforcement Learning", - "version": "0.3.2", - "provenance": [], - "collapsed_sections": [] - }, - "kernelspec": { - "name": "python2", - "display_name": "Python 2" - }, - "accelerator": "GPU" - }, - "cells": [ - { - "metadata": { - "id": "Eou_PrPyXAgT", - "colab_type": "text" - }, - "cell_type": "markdown", - "source": [ - "# Practical 4: Reinforcement Learning" - ] - }, - { - "metadata": { - "id": "QTOkLq7QSxaw", - "colab_type": "text" - }, - "cell_type": "markdown", - "source": [ - "## Introduction\n", - "In this practical we introduce the idea of reinforcement learning, discuss how it differs from supervised and unsupervised learning and then build an agent that learns to play a simple game called \"Catcher\".\n", - "\n", - "## Learning Objectives\n", - "* Understand the relationship between the **environment** and the **agent** \n", - "* Understand how a **policy** is used by an agent to select an action\n", - "* Describe how to implement a **run-loop** that controls the interaction between environement and agent.\n", - "* Understand how the **state**, **action** and **reward** are communicated between the agent and the environment. \n", - "* Be able to implement the a simple **policy-gradient** RL algorithm call **REINFORCE**\n", - "* Discover at least one potential issue with the REINFORCE algorithm." - ] - }, - { - "metadata": { - "id": "V3Yu5en6YvbB", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "#@title [RUN ME!] Install pre-requisites. { display-mode: \"form\" }\n", - "import os\n", - "import sys\n", - "import math\t\n", - "\n", - "!git clone https://github.com/ntasfi/PyGame-Learning-Environment.git\n", - "os.chdir('PyGame-Learning-Environment')\n", - "!pip -q install -e .\n", - "!pip -q install pygame\n", - "os.chdir('/content')\n", - "\n", - "sys.path.append('/content/PyGame-Learning-Environment')\n", - "os.environ[\"SDL_VIDEODRIVER\"] = \"dummy\" # prevent trying to open a window\n", - "\n", - "!pip -q install moviepy\n", - "\n", - "print('Installed pre-requisites...')" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "LejsQxNCB2fM", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "#@title [RUN ME!] Imports { display-mode: \"form\" }\n", - "\n", - "from __future__ import absolute_import\n", - "from __future__ import division\n", - "from __future__ import print_function\n", - "from __future__ import unicode_literals\n", - "\n", - "import os\n", - "import moviepy.editor as mpy\n", - "from ple import PLE\n", - "from ple.games import pong\n", - "from ple.games import pixelcopter\n", - "from ple.games import flappybird\n", - "from ple.games import catcher\n", - "from IPython import display\n", - "import numpy as np\n", - "import matplotlib.pyplot as plt\n", - "from collections import deque\n", - "import seaborn as sns\n", - "\n", - "import tensorflow as tf\n", - "try:\n", - " tf.enable_eager_execution()\n", - " print('Running Eagerly')\n", - "except ValueError:\n", - " print('Already running Eagerly')" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "o8kIIBytaQqH", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "#@title [RUN ME!] 
Helper Functions { display-mode: \"form\" }\n", - "\n", - "def make_animation(images, fps=60, true_image=False):\n", - " duration = len(images) / fps\n", - "\n", - " def make_frame(t):\n", - " try:\n", - " x = images[int(len(images) / duration * t)]\n", - " except:\n", - " x = images[-1]\n", - "\n", - " if true_image:\n", - " return x.astype(np.uint8)\n", - " else:\n", - " return ((x + 1) / 2 * 255).astype(np.uint8)\n", - "\n", - " clip = mpy.VideoClip(make_frame, duration=duration)\n", - " clip.fps = fps\n", - " return clip\n", - "\n", - "def progress(value, max=100, message=''):\n", - " return display.HTML(\"\"\"\n", - " \n", - "{message}
\n", - " \"\"\".format(value=value, max=max, message=message))\n", - "\n", - "def plot_rolling_returns(rolling_returns): \n", - " sns.tsplot(rolling_returns)\n", - " plt.title('Rolling Returns')\n", - " plt.xlabel('# Epsiodes')\n", - " plt.ylabel('Rolling Return')\n", - " \n", - "def state_to_buckets(state, bucket_width=0.25):\n", - " return tuple(math.ceil(s / bucket_width)-1 for s in state)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "2wZMmj4Toy8e", - "colab_type": "text" - }, - "cell_type": "markdown", - "source": [ - "## Reinforcement Learning\n", - "So far we have encountered **supervised learning**, where we have an input and a target value or class that we want to predict. We have also encountered **unsupervised learning** where we are only given an input and look for patterns in that input. In this practical, we look into **reinforcement learning** which can loosely be defined as training an **agent** to maximise a numerical **reward** it obtains through interaction with an **environment**. \n", - "\n", - "The environment defines a set of **actions** that an agent can take. The agent observes the current **state** of the environment, tries actions and *learns* a **policy** which is a distribution over the possible actions given a state of the environment. \n", - "\n", - "The following diagram illustrates the interaction between the agent and environment. We will explore each of the terms in more detail throughout this practical. \n", - "\n", - "\n", - "![Interaction of Agent and Environment](https://github.com/deep-learning-indaba/indaba-2018/blob/master/images/rl_diagram1.png?raw=true)\n", - "\n", - "\n", - "## Outline\n", - "We will train an agent to play a very simple game called \"Catcher\" which is often used as a test bed for RL algorithms. In the process we will set up all the necessary framework to explore variations of the algorithm or switch to more advanced games! In particular, the steps we will follow in this practical are as follows:\n", - "\n", - "1. Introduce the game environment, explore the states and actions available. \n", - "2. Create a simple agent that takes random actions\n", - "3. Write a run-loop which controls the interaction and manages the communication between the agent and environment\n", - "4. Implement a policy as a feed-forward neural network\n", - "5. Explain and implement the REINFORCE algorithm to learn how to play the game" - ] - }, - { - "metadata": { - "id": "RnNYUFu7V1I2", - "colab_type": "text" - }, - "cell_type": "markdown", - "source": [ - "## The Environment\n", - "The environment we consider is the game Catcher from the PyGame Learning Environment (PLE) library. The player (agent) controls a paddle that it must use to catch a falling fruit. Each time the game runs, the fruit falls from from the top to the bottom of the screen, starting at a different X coordinate (which doesn't change during the episode) and the paddle starts in a random location along the bottom of the screen. The player wins if they manage to catch the fruit and loses if it falls to the ground. 
\n", - "\n", - "![Catcher game illustration](https://pygame-learning-environment.readthedocs.io/en/latest/_images/catcher.gif)" - ] - }, - { - "metadata": { - "id": "9vTOQs5eQZoE", - "colab_type": "text" - }, - "cell_type": "markdown", - "source": [ - "### Create the environment using PLE\n", - "For this game, we'll set things up so that the reward will be non-zero only at the end of each episode, and will be +1 for catching the fruit and -1 for missing it. For all other frames the reward will be zero.\n", - "\n", - "**Note**: PLE has a number of games, which it wraps in a generic \"environment\". \n", - "So, in this case, both ```evironment``` and ```game``` constitute our environment. ```environment``` allows us to perform actions \n", - "and returns states and rewards, while ```game``` handles the specifics of Catcher (or whichever other game we decide to use)" - ] - }, - { - "metadata": { - "id": "L_awOQZJC-d_", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "# Create an instance of the catcher game\n", - "game = catcher.Catcher(init_lives=1)\n", - "game_height = game.height\n", - "game_width = game.width\n", - "\n", - "frame_skip = 3 # Skip 3 frames at each step to speed up the game\n", - "\n", - "# Wrap the game in a PLE environment and configure the rewards\n", - "environment = PLE(game, display_screen=False, force_fps=True, \n", - " reward_values={'win': 1.0, 'loss': -1.0, 'negative': 0.0}, \n", - " frame_skip=frame_skip) \n", - "\n", - "# The reward_values dictionary above allows us to override the default reward structure provided by PLE. \n", - "# win and loss specify the rewards for winning or losing an episode, while positive and negative \n", - "# specify the rewards received for positive or negative events that can occur during the game.\n", - "# If you change the game, start by *removing* the overrides and see what the default is before deciding if\n", - "# you want to modify it. \n", - "\n", - "# Initialise the environment\n", - "environment.init()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "36vk0-0qST83", - "colab_type": "text" - }, - "cell_type": "markdown", - "source": [ - "### What does the state look like?\n", - "The environment provides a rendering of the virtual 'screen' of the game to an RGB image, by using the method ```getScreenRGB```. For this practical, in the interests of simplicity and quick trianing time, we use the *game state* directly, which provides a summary (in a dictionary) of important peices of information making up the current state of the game. " - ] - }, - { - "metadata": { - "id": "kcGwYfj6SWl2", - "colab_type": "code", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 34 - }, - "outputId": "95dc4e2c-25ee-485d-a4d5-9b8e0fd2372f" - }, - "cell_type": "code", - "source": [ - "print('Current game state:', game.getGameState())" - ], - "execution_count": 6, - "outputs": [ - { - "output_type": "stream", - "text": [ - "Current game state: {'fruit_x': 8, 'player_vel': 0.0, 'player_x': 26, 'fruit_y': -8}\n" - ], - "name": "stdout" - } - ] - }, - { - "metadata": { - "id": "CTH3IMzp_IUl", - "colab_type": "text" - }, - "cell_type": "markdown", - "source": [ - "### What actions are available?\n", - "The following cell prints the actions that are available in the current game, which are represented with numerical codes. The PLE environment wrapper also adds an additional ```None``` action which means \"do nothing\". 
" - ] - }, - { - "metadata": { - "id": "WawJ92pOCD4D", - "colab_type": "code", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 51 - }, - "outputId": "7d286730-bcc5-418a-c603-cf45e4c89690" - }, - "cell_type": "code", - "source": [ - "print('Game actions:', game.actions)\n", - "print('Environment actions:', environment.getActionSet())" - ], - "execution_count": 7, - "outputs": [ - { - "output_type": "stream", - "text": [ - "Game actions: {'right': 100, 'left': 97}\n", - "Environment actions: [100, 97, None]\n" - ], - "name": "stdout" - } - ] - }, - { - "metadata": { - "id": "RElMlRA-O-WC", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "#@title [RUN ME!] Setup for the next section { display-mode: \"form\" }\n", - "# Maintain some variables for the next task\n", - "environment.reset_game()\n", - "observed_states = []\n", - "observed_actions = []\n", - "observed_rewards = []\n", - "observed_states.append(game.getGameState())" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "pwl9qQmlURgR", - "colab_type": "text" - }, - "cell_type": "markdown", - "source": [ - "### Exploratory Task\n", - "Run the cell below which defines a function that takes a given action in the environment, using the environment's ```act``` method and then renders both the previous and current game screen. Change the action in the drop-down to the right of the 2nd code cell, which calls this function with the chosen action. Observe what happens to the environment(game) state and reward. By running the cell multiple times (and changing the action) until the game comes to an end (when you either win or lose), you will manually create an **episode**, which is a sequence of states, actions and rewards until a termination condition is reached. If an episode $i$ consists of $T_i$ steps and we denote the state, action and reward at step $t$ in episode $i$ respectively as $s_{i, t}$, $a_{i,t}$ and $r_{i,t}$, then in this task, we create a *trajectory* $\\tau_i = (s_{i, 1}, a_{i, 1}, r_{i, 1}, ..., a_{i, T_i-1}, r_{i, T_i-1}, s_{i, T_i})$\n", - "\n", - "#### Question\n", - "Notice how the paddle sometimes moves even if you take the \"None\" action? Can you think of why this happens? \n", - "\n", - "#### Notes\n", - "* If you want to run another episode, re-run the code cell above titled \"Setup for the next section\" to reset the environment\n", - "* This particular game returns a reward of $0$ at each step and a final reward of $-1$ or $1$ at the end of the episode depending on whether you lose or win. Other games may have different reward structures! 
" - ] - }, - { - "metadata": { - "id": "cmbraWuttEI0", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "previous_frame = environment.getScreenRGB().transpose([1, 0, 2])\n", - "\n", - "def take_action(action):\n", - " global previous_frame\n", - " \n", - " # Look up the action code from the description\n", - " action_code = None if action == 'None' else game.actions[action]\n", - "\n", - " # Take the selected action in the environment\n", - " print('Taking action: {} ({})'.format(action, action_code))\n", - " reward = environment.act(action_code)\n", - "\n", - " observed_actions.append(action)\n", - " observed_rewards.append(reward)\n", - "\n", - " # Print and display the current state and reward\n", - " state = game.getGameState()\n", - " print('Game state:', state)\n", - " print('Reward received: ', reward)\n", - "\n", - " observed_states.append(state)\n", - "\n", - " if reward > 0 or environment.game_over():\n", - " print('Game over, you', 'WON' if reward > 0 else 'LOST')\n", - " print('The episode trajectory was:')\n", - " for s, a, r in zip(observed_states, observed_actions, observed_rewards):\n", - " print('State:', s, 'Action:', a, 'Reward:', r)\n", - " print('Terminal state:', observed_states[-1])\n", - " \n", - " current_frame = environment.getScreenRGB().transpose([1, 0, 2])\n", - " \n", - " fig = plt.figure(figsize=(10, 20))\n", - " \n", - " ax = plt.subplot(1, 2, 1)\n", - " plt.imshow(previous_frame)\n", - " ax.grid(False)\n", - " ax.set_title('PREVIOUS FRAME')\n", - " \n", - " ax = plt.subplot(1, 2, 2)\n", - " plt.imshow(current_frame)\n", - " ax.grid(False)\n", - " ax.set_title('CURRENT FRAME')\n", - " \n", - " previous_frame = current_frame" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "vdo_a8VBU9Sm", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "action = \"None\" #@param ['right', 'left', 'None']\n", - "take_action(action)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "ebckEImi-_k0", - "colab_type": "text" - }, - "cell_type": "markdown", - "source": [ - "## The Agent\n", - "We now turn to the agent. An agent receives the current state and (previous) reward from the environment, then uses an internal policy to determine an action to take. We implement an agent as a Python [**class**](https://en.wikibooks.org/wiki/A_Beginner%27s_Python_Tutorial/Classes), which is just a logical wrapper of variables and methods (functions) that operate on those variables. The methods our agent will have are the following:\n", - "\n", - "* **Initialisation** (```__init__```): Initialises the agent the first time it's created. \n", - "* **policy**: The policy is a function that returns a *distribution* over the possible actions, given the current state.\n", - "* **step**: This takes as input, the current state and previous reward from the environment, then uses the internal policy to determine which action to take. Specifically, it does some pre-processing of the state, then samples a single action from the distribution over possible actions obtained from the policy. \n", - "* **reset**: This is called to reset the agent's variables before running it on a new episode\n", - "* **end_episode**: This method signals to the agent that the current episode has come to an end. The agent may do some learning, or clean-up at the end of an episode. 
\n", - "\n", - "### The Random Agent\n", - "\n", - "To get a feel for an agent and the methods it has, we implement an agent that just takes a *random* action at every step. For demonstration purposes, we also calculate the **episode return** at the end of an episode. The episode return is the sum of the (discounted) rewards obtained during the episode. If the returns for episode $i$, with trajectory $\\tau_{i}$ are denoted $r_{i, t}$, and the **discount factor** is $\\gamma$, then the episode return is calcuated as: $r(\\tau_i) = \\sum_{t=1}^{T_i} \\gamma^t r_{i,t}$. The discount factor allows us to increase the importance of rewards received quickly and decrease the importance of rewards that take long to receive. It is especially important in environments that could have episodes that are infinitely long. In our particular environment where every episode is of the same length and the only non-zero reward is received at the end of the game, the discount factor doesn't make much difference and so we will ignore it (effectively set it to $1$) for the remainder of this practical. " - ] - }, - { - "metadata": { - "id": "IRPa2lmldQoW", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "class RandomAgent(object):\n", - " \n", - " def __init__(self, actions, state_size, seed):\n", - " # When initializing, we let the agent know what actions are available in the \n", - " # environment, how large the state is (not used in the RandomAgent) and the \n", - " # current seed to use (also not used in the RandomAgent)\n", - " self._actions = actions\n", - " self._rewards = []\n", - " self._taken_actions = []\n", - " self._observed_states = []\n", - " \n", - " def policy(self, state): \n", - " # The policy is an internal function that takes a state and returns a distribution over the possible actions. \n", - " # The random agent just returns a uniform distribution over the actions. 
\n", - " n = len(self._actions) # The number of actions\n", - " return tf.fill([n], 1./n) # This returns a vector of length n, with each entry being 1/n\n", - " \n", - " def step(self, state_dict, reward):\n", - " \n", - " # Pre-process the state to extract the numerical values we're interested in from the state dictionary\n", - " state = np.array([\n", - " state_dict['fruit_x'] / game_width, # Divide by width(or height) to normalise the value to lie between 0 and 1.\n", - " state_dict['player_x'] / game_width,\n", - " state_dict['fruit_y'] / (game_height+1),\n", - " state_dict['player_vel'] / game_width\n", - " ], dtype=np.float32)\n", - " \n", - " self._observed_states.append(state) # Record that the state was observed during the episode\n", - " self._rewards.append(reward) # Keep track of the rewards we've received along the way\n", - " action_distribution = self.policy(state) # Use the policy to get the distribution over actions\n", - " \n", - " # Sample a single action according to the distribution over actions\n", - " action = np.random.choice(self._actions, p=action_distribution.numpy()) \n", - " \n", - " self._taken_actions.append(action) # Record that the action was taken during the episode\n", - " \n", - " return action\n", - " \n", - " def reset(self):\n", - " # This method is called when a new episode starts, we need to clear the \n", - " # states, actions and rewards that we tracked during the last episode.\n", - " self._rewards = [] \n", - " self._taken_actions = [] \n", - " self._observed_states = [] \n", - " \n", - " def end_episode(self, final_reward):\n", - " # We just calculate the episode return\n", - " episode_return = sum(self._rewards) + final_reward\n", - " return episode_return" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "-vF0d4FORBYK", - "colab_type": "text" - }, - "cell_type": "markdown", - "source": [ - "## The Run-Loop\n", - "Now that we have an environment and a simple agent, we need a way of controlling the interaction between the agent and environment over multiple episodes. We do so in a **run-loop**. In this simple run-loop, the agent and environment run in lock-step. For each game frame, we get the state from the environment and pass it, along with the previous reward, to the agent. The agent selects an action that it wants to take given the game state. The action is taken in the environment and any reward received is recorded. We run the loop for multiple episodes, each time being careful to reset the game and agent (because they're starting a new game from scratch)." 
- ] - }, - { - "metadata": { - "id": "ZmzYRPUi9ZGi", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "def run_loop(agent_class, # Which agent to use\n", - " num_episodes=1, # How many episodes to run for\n", - " record_every=1, # How many episodes to record\n", - " seed=1234, # The random seed used\n", - " rolling_return_frequency=100, # The window size used to track the rolling episode return\n", - " state_size=4): # The size of the state\n", - " \n", - " # Set the random seeds\n", - " tf.set_random_seed(seed)\n", - " np.random.seed(seed)\n", - " \n", - " # Initialise the environment\n", - " environment.init()\n", - " \n", - " # Create an agent (this runs the agent's __init__ method)\n", - " agent = agent_class(environment.getActionSet(), state_size, seed)\n", - " \n", - " progress_out = display.display(progress(0, num_episodes), display_id=True) # Create a progress-bar\n", - " \n", - " # Create data structures to store metrics\n", - " windowed_return = deque()\n", - " rolling_returns = []\n", - " frames = []\n", - " \n", - " for episode in range(num_episodes):\n", - " environment.reset_game() # reset the environment\n", - " agent.reset() # reset the agent\n", - " reward = 0\n", - "\n", - " while reward == 0 and not environment.game_over(): # Loop until the episode terminates \n", - " state = game.getGameState() # Get the current game state\n", - " action = agent.step(state, reward) # Pass the current game state and previous reward to the agent, get the action it wants to take\n", - " reward = environment.act(action) # Pass the action to the environment and get the reward.\n", - "\n", - " if episode % record_every == 0:\n", - " # Store the frames for display later, every `record_every` episodes\n", - " frames.append(environment.getScreenRGB())\n", - " \n", - " info = agent.end_episode(reward) # Signal to the agent that the episode has come to and end\n", - " \n", - " # Store the episode return in the window (in this case, with no discounting, the episode return is the same as the environment's score)\n", - " windowed_return.append(environment.score())\n", - " if len(windowed_return) > rolling_return_frequency:\n", - " windowed_return.popleft()\n", - " \n", - " rolling_return = sum(windowed_return) / len(windowed_return)\n", - " rolling_returns.append(rolling_return)\n", - " \n", - " # Update the progress-bar\n", - " message = 'Episode {}/{} ended with score {}, Rolling Return: {}, {}'.format(\n", - " episode+1, num_episodes, environment.score(), \n", - " rolling_return,\n", - " info if info is not None else '')\n", - " progress_out.update(progress(episode+1, num_episodes, message))\n", - " \n", - " message = 'Finished training, rendering video...'\n", - " progress_out.update(progress(episode+1, num_episodes, message))\n", - " \n", - " # Render a video\n", - " clip = make_animation(frames, fps=30, true_image=True).rotate(-90)\n", - " display.display(clip.ipython_display(fps=30, center=False, autoplay=False, loop=False, height=320, width=240, max_duration=1000))\n", - " \n", - " message = 'Done...'\n", - " progress_out.update(progress(episode+1, num_episodes, message))\n", - " \n", - " return rolling_returns" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "wkiokNVeg167", - "colab_type": "text" - }, - "cell_type": "markdown", - "source": [ - "We now run our RandomAgent with the run-loop for 100 episodes to check that everything is working so far. 
(Note: the blue progress bar shows how many of the ```num_episodes``` episodes we've completed. The small black progress bar is for the video rendering, ignore that one!)" - ] - }, - { - "metadata": { - "id": "6-TgMf2bcVJ0", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "rolling_returns = run_loop(RandomAgent, num_episodes=100, record_every=5, rolling_return_frequency=5)\n", - "plot_rolling_returns(rolling_returns)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "s_2-dlEqvNLe", - "colab_type": "text" - }, - "cell_type": "markdown", - "source": [ - "## A Policy Network\n", - "Remember, the policy is a distribution over the possible actions the agent can take in the environment given the current state of the environment, denoted $\\pi(a|s)$. In a Deep RL agent, the policy is represented by a neural network with parameters $\\theta$, so we have $\\pi_\\theta(a|s) = NN(s; \\theta)$, where $NN(s; \\theta)$ is some potentially complex function represented by a neural network with parameters $\\theta$. In other words, our neural network takes in the state as input and outputs the appropriate distribution over actions. Let us implement an agent who's policy is defined by a simple feed-forward neural network. We will name the class 'FixedAgent' because this agent will do no learning. As a result the policy network's weights will be fixed and the agent will take random actions as before.\n", - "\n", - "The ```reset```, ```step``` and ```end_episode``` methods of our fixed agent will be identical to the RandomAgent we built earlier. We'll only change the ```__init__``` and ```policy``` methods. To avoid having to rewrite all that code, we will use Python's **inheritance** to reuse all the methods in RandomAgent except for policy which we *override* here." 
- ] - }, - { - "metadata": { - "id": "x5LkKobJhkV3", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "# Lets build a fixed agent\n", - "class FixedAgent(RandomAgent): # Inherit all the methods of RandomAgent\n", - " \n", - " def __init__(self, actions, state_size, seed):\n", - " super(FixedAgent, self).__init__(actions, state_size, seed)\n", - " \n", - " # Define the policy network in the initialize method (constructor) because it should persist\n", - " # through multiple usages over multiple episodes of the agent.\n", - " # (We change the default weight initialiser to truncated random normal which \n", - " # works better for the RL algorithm we'll use in this practical.)\n", - " self._policy_network = tf.keras.Sequential([\n", - " # Add a hidden layer with 64 neurons\n", - " tf.keras.layers.Dense(64, input_shape=[state_size], activation=tf.nn.relu, \n", - " kernel_initializer=tf.truncated_normal_initializer(seed=seed)),\n", - " # Add a hidden layer with 32 neurons\n", - " tf.keras.layers.Dense(32, activation=tf.nn.relu, \n", - " kernel_initializer=tf.truncated_normal_initializer(seed=seed)),\n", - " # Add an output layer with action-many neurons and a softmax activation function\n", - " tf.keras.layers.Dense(len(actions), activation='softmax'),\n", - " ])\n", - " \n", - " # Override the policy\n", - " def policy(self, state):\n", - " layer_input = tf.expand_dims(state, axis=0) # Add a dummy batch dimension\n", - " action_distribution = self._policy_network(layer_input) # Get the distribution over actions from the policy network\n", - " action_distribution = tf.squeeze(action_distribution, axis=0) # Remove the dummy batch dimension\n", - " \n", - " return action_distribution" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "_9LtvT6d3xsZ", - "colab_type": "text" - }, - "cell_type": "markdown", - "source": [ - "Let's test our FixedAgent, this is just to see that it runs, as we don't expect it to perform any better than the RandomAgent because it isn't learning anything yet! " - ] - }, - { - "metadata": { - "id": "6kAPun6f6nzS", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "# Our fixed-weight agent\n", - "rolling_returns = run_loop(FixedAgent, num_episodes=100, record_every=5, rolling_return_frequency=5)\n", - "plot_rolling_returns(rolling_returns)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "PPesBe8q3621", - "colab_type": "text" - }, - "cell_type": "markdown", - "source": [ - "## Learning with Policy Gradients\n", - "Finally, let's give our agent some intelligence by making it learn from its experience in interacting with the environment. In order to learn, we need a loss function or *objective*. In RL, the objective is to maximise the expected episode return (rewards) by taking actions in the environment. The actions our agent takes are determined by the policy $\\pi_\\theta(a|s)$, which are in turn determined by the neural network parameters $\\theta$. So, we want to find the neural network parameters $\\theta$ that maximise \n", - "\n", - "$J(\\theta) = \\mathbb{E}_{\\tau}[r(\\tau)]$\n", - "\n", - "**Note:** If the maths in the next section looks intimidating, feel free to skip over it, read the intuition and code and come back to it later! \n", - "\n", - "\n", - "### The derivative of the objective\n", - "We now turn to our usual tool of stochastic gradient descent to optimise the objective, but there are two complications. 
Firstly, the term $\\pi_\\theta(a|s)$ represented by our neural network doesn't appear in the equation (or does it?). Secondly, how do we deal with the expectation?\n", - "\n", - "The first thing is to realise that our trajectories $\\tau$ depend on the policy $\\pi_\\theta(a|s)$ (**Question:** Why?) So we can (informally) write:\n", - "\n", - "\\begin{align}\n", - "J(\\theta) &= \\mathbb{E}_{\\tau \\sim \\pi_\\theta(\\tau)} [r(\\tau)] & \\\\\n", - "&= \\int \\pi_\\theta (\\tau) r(\\tau)d\\tau & (\\text{Definition of expectation}) \\\\\n", - "\\end{align}\n", - "\n", - "Then the gradient is:\n", - "\n", - "\\begin{align}\n", - "\\nabla_\\theta J(\\theta) &= \\nabla_\\theta \\int \\pi_\\theta (\\tau) r(\\tau)d\\tau & \\\\\n", - "&= \\int \\pi_\\theta (\\tau) \\nabla_\\theta log \\pi_\\theta (\\tau) r(\\tau)d\\tau & (\\text{\"Log derivative trick\"}) \\\\\n", - "&= \\mathbb{E}_{\\tau \\sim \\pi_\\theta(\\tau)}[\\nabla_\\theta log \\pi_\\theta (\\tau) r(\\tau)]\n", - "\\end{align}\n", - "\n", - "Finally, since we don't know the true distribution of $\\tau$, we can approximate the expectation using a *monte-carlo* approximation, where the sample trajectories come from $N$ episodes of interaction with the environment. \n", - "\n", - "\\begin{align}\n", - "\\nabla_\\theta J(\\theta) &= \\frac{1}{N} \\sum_{i=1}^N \\nabla_\\theta log \\pi_\\theta (\\tau_i) r(\\tau_i)\n", - "\\end{align}\n", - "\n", - "Expanding this out (and considering that episode $i$ has $T_i$ steps) gives:\n", - "\n", - "\\begin{align}\n", - "\\nabla_\\theta J(\\theta) &= \\frac{1}{N} \\sum_{i=1}^N (\\sum_{t=1}^{T_i} \\nabla_\\theta log(\\pi_\\theta(a_{i,t} | s_{i, t})) \\sum_{t=1}^{T_i} \\gamma^t r_{i,t} )\n", - "\\end{align}\n", - "\n", - "We skipped a few steps in the maths here for brevity (see chapter 13 of [Sutton and Barto](https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view) for all the details if you're interested!). If the maths looks intimidating, don't worry! The important things to realise are the following:\n", - "* We define an objective $J(\\theta)$ that is exactly what we want to do with RL, maximise the expected return.\n", - "* We can run our agent in the environment to generate *trajectories* for multiple episodes\n", - "* When an episode comes to an end and we know the return and trajectory, we can compute a term in (the Monte-carlo approximation to) the objective function. \n", - "* We can use Tensorflow to compute the gradient of an individual term in the monte-carlo approximation and apply it to the parameters of our neural network. To do this, we define the *loss* to minimise as follows (where the sums can be represented by loops and we set $\\gamma = 1$ for simplicity):\n", - "\n", - "\\begin{align}\n", - "L(\\theta) &= -\\sum_{t=1}^{T_i} log(\\pi_\\theta(a_{i,t} | s_{i, t})) \\sum_{t=1}^{T_i} r_{i,t} \n", - "\\end{align}\n", - "\n", - "The name \"policy gradient\" comes from the fact that we're directly taking the gradient of the policy, rather than the alternative, value-based RL, which uses iterative update rules to calculate the expected return assocated with a state. The particular flavour of policy gradient which uses the loss function above, along with the Monte-carlo approximation of the objective is known as the **REINFORCE** algorithm. 
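To make the per-episode loss concrete, here is a toy, framework-free sketch (not part of the practical; the probabilities and rewards are made up) that evaluates $L(\theta)$ for a single two-step episode with $\gamma = 1$:

```python
import math

# Hypothetical two-step episode: the probabilities the policy assigned to the
# actions that were actually taken, and the reward received after each action.
taken_action_probs = [0.4, 0.7]   # pi_theta(a_t | s_t) for the chosen actions
rewards = [0.0, 1.0]              # only the final step gives a non-zero reward

episode_return = sum(rewards)     # with gamma = 1 the return is just the sum of rewards

# L(theta) = -(sum of log-probabilities of the chosen actions) * (episode return)
loss = -sum(math.log(p) for p in taken_action_probs) * episode_return
print(loss)  # ~1.27; minimising this increases the chosen actions' log-probs when the return is positive
```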
\n", - "\n" - ] + "base_uri": "https://localhost:8080/", + "height": 34 }, - { - "metadata": { - "id": "42i6ZNe1T6ME", - "colab_type": "text" - }, - "cell_type": "markdown", - "source": [ - "Finally, we implement the REINFORCE algorithm to optimize the parameters of the neural network. The only things we're changing, compared to our FixedAgent are the ```__init__``` and ```end_episode``` methods, so we use *inheritance* again to automatically \"copy\" all the methods from ```FixedAgent``` and ```RandomAgent```. \n", - "\n", - "**Note:** ```ReinforceAgent``` indirectly inherits the methods from ```RandomAgent``` through ```FixedAgent``` (which directly inherits from ```RandomAgent```). Both ```RandomAgent``` and ```FixedAgent``` have a ```policy``` method, but the one that gets \"copied\" to ```ReinforceAgent``` is the one from ```FixedAgent``` because it appeared later in the chain." - ] - }, - { - "metadata": { - "id": "hQldYOWuu9RO", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "class ReinforceAgent(FixedAgent):\n", - " \n", - " # Override the initialization method because this agent also needs an optimizer\n", - " # and a variable to track the step\n", - " def __init__(self, actions, state_size, seed):\n", - " super(ReinforceAgent, self).__init__(actions, state_size, seed)\n", - " self._optimizer = tf.train.RMSPropOptimizer(learning_rate=0.001) \n", - " self._step_counter = tf.train.get_or_create_global_step()\n", - " \n", - " def end_episode(self, final_reward): \n", - " \"\"\"At the end of an episode, we compute the loss for the episode and take a \n", - " step in parameter speace in the direction of the gradients.\"\"\"\n", - " \n", - " # Compute the return (cumulative discounted reward) for the episode\n", - " episode_return = sum(self._rewards) + final_reward # Assuming \\gamma = 1\n", - "\n", - " with tf.GradientTape() as tape:\n", - " # Loop over the states and actions making up the episode trajectory\n", - " loss = 0\n", - " for state, action in zip(self._observed_states, self._taken_actions): \n", - " # Get the probabilities assigned to the actions given the state by the policy\n", - " action_distribution = self.policy(state) \n", - " action_index = self._actions.index(action)\n", - " # Get the log probability of the chosen action under the policy\n", - " log_action = tf.log(action_distribution[action_index])\n", - " # Add to the running total for the episode\n", - " loss -= log_action * episode_return # Add your baseline value for TASK 4 here. \n", - "\n", - " # Compute the gradient of the loss with respect to the variables in the model\n", - " grads = tape.gradient(loss, self._policy_network.variables) \n", - " \n", - " # Use the optimizer to apply the gradient\n", - " self._optimizer.apply_gradients(\n", - " zip(grads, self._policy_network.variables), global_step=self._step_counter)\n", - " \n", - " return 'Loss: {}'.format(loss)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "BchVaPCm36TY", - "colab_type": "text" - }, - "cell_type": "markdown", - "source": [ - "Notice that during the episode we run only the forward-pass of the policy network (inference). At the end of the episode, we replay the states that occured during the episode and run both the forward and backward pass of the policy network (notice the gradient tape!) because we can only compute the loss once we have the episode return at the end of the episode. If the policy network is very complex, this could be inefficient. 
In that case you could run both the forward an backward pass during the episode and store intermediate gradients/partial derivatives to use in the update at the end of the episode." - ] - }, - { - "metadata": { - "id": "0jS05xq2PuZp", - "colab_type": "text" - }, - "cell_type": "markdown", - "source": [ - "And finally we train our **REINFORCE** agent and plot the resulting rolling episode returns (over a window of 100 episodes)." - ] - }, - { - "metadata": { - "id": "EG4H4PsjPyVD", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "rolling_returns = run_loop(ReinforceAgent, num_episodes=2400, record_every=30)\n", - "plot_rolling_returns(rolling_returns)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "metadata": { - "id": "3yvh_FF0djwu", - "colab_type": "text" - }, - "cell_type": "markdown", - "source": [ - "## Your Tasks\n", - "### Task 1: Learning Objectives [ALL]\n", - "Review the learning objectives and ensure that you understand how the code in this practical relates to them. Ask your tutors if you don't understand anything!\n", - "\n", - "### Task 2: Network Architecture [ALL]\n", - "Experiment with different network architectures and parameters and see how this affects the performance of the agent. What do you notice? Do you think the algorithm is sensitive to the network parameters? \n", - "\n", - "**HINT**: Modify the code for the policy network in the ```__init__``` method of the ```FixedAgent``` class. \n", - "\n", - "### Task 3: Seed Variance **[ALL]** \n", - "Reveal and run the code in the cells below. This code will run the entire training procedure of the REINFORCE agent 10 times (using only 1000 episodes per run to save some time). It will then plot a chart that shows the mean of the rolling returns over the multiple runs along with an estimated *confidence interval* for the mean. You should notice that the confidence interval is fairly wide given that only the random seed is changing. This illustrates a problem with the REINFORCE algorithm: it has **high variance**. (It is however an **unbiased estimator** of the policy gradient!)\n", - "\n", - "### Task 4: Variance Reduction with a Basline **[INTERMEDIATE]** \n", - "Read about value functions and how to approximate them using *Monte-Carlo* methods in [Slides 5 to 7 Here](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MC-TD.pdf). Then read slides [80, 84 and 85 here](http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture14.pdf). The goal of this task is to use a simple value function $V(s)$ to implement **REINFORCE with a Baseline**, where the loss function per episode changes to: \n", - "\n", - "\\begin{align}\n", - "L(\\theta) &= -\\sum_{t=1}^{T_i} log(\\pi_\\theta(a_{i,t} | s_{i, t})) \\sum_{t=1}^{T_i} [r_{i,t} - V(s_{i, t})]\n", - "\\end{align}\n", - "\n", - "To do this, add code to the ```end_episode``` method of the ```ReinforceAgent``` class to estimate the value function. Subtract the value estimate for the state from the episode return at each step in the loop that computes the log_action_sum. \n", - "\n", - "Re-run Task 3's code to plot the confidence interval around the mean episode returns to check what effect it has on the variance. \n", - "\n", - "**HINT**: You will need to *discretise* the state-space. We've provided a very crude function called ```state_to_buckets``` that you can use to do this, or implement your own! 
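If you are unsure where to start, one possible (but by no means the only) shape for the baseline is a dictionary of Monte-Carlo averages keyed on the discretised state; the sketch below assumes you reuse the `state_to_buckets` helper defined earlier:

```python
from collections import defaultdict

# Hypothetical tabular Monte-Carlo value estimate for Task 4 (one option among many).
value_sums = defaultdict(float)
value_counts = defaultdict(int)

def update_value_estimates(observed_states, episode_return):
    """After an episode, credit the episode return to every visited (discretised) state."""
    for state in observed_states:
        key = state_to_buckets(state)      # helper defined in the notebook's helper cell
        value_sums[key] += episode_return
        value_counts[key] += 1

def value_estimate(state):
    """V(s): the average return observed so far from this discretised state (0 if unseen)."""
    key = state_to_buckets(state)
    return value_sums[key] / value_counts[key] if value_counts[key] else 0.0

# In ReinforceAgent.end_episode you could then replace the loss term with
#   loss -= log_action * (episode_return - value_estimate(state))
# and call update_value_estimates(self._observed_states, episode_return) afterwards.
```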
\n", - "\n", - "**Further Reading (Optional)**: See the section on how to introduce a baseline [here](https://danieltakeshi.github.io/2017/03/28/going-deeper-into-reinforcement-learning-fundamentals-of-policy-gradients/) for more details about how and why this works!\n", - "\n", - "**Further Reading (Optional)**: This [blog post](https://flyyufelix.github.io/2017/10/12/dqn-vs-pg.html) contrasts policy gradient methods with an alternative value-based approach to RL called Deep Q-Networks. It also discusses some approaches to reducing the variance of the policy gradient estimator. \n", - "\n", - "### Task 5: Learning from pixels **[OPTIONAL]**\n", - "The agent we implemented in this practical uses a simple numerical representation of the state of the environment. In many cases such a representation would not be available. Change the run-loop to instead pass the array of pixel values to the agent. (which you can get by calling ```environment.getScreenRGB()```). Change the agent's policy network to cater for this image-representation of the state. \n", - "\n", - "\n", - "\n", - "### Task 6: Other games **[OPTIONAL]**\n", - "The PyGame Learning Environment (PLE) has [a number of games built-in](https://pygame-learning-environment.readthedocs.io/en/latest/user/games.html). Change the code in this practical to run on a different game and learn either from pixels or from the state representation provided by PLE. One interesting game you could try is FlappyBird! Remember to remove the reward overrides we set when trying a new game! " - ] - }, - { - "metadata": { - "id": "loTWt0eMoVEA", - "colab_type": "text" - }, - "cell_type": "markdown", - "source": [ - "## Additional Code for Task 3\n", - "This might take some time to run, continue reading for Task 4 and ask any questions you have while waiting!" - ] + "colab_type": "code", + "id": "kcGwYfj6SWl2", + "outputId": "95dc4e2c-25ee-485d-a4d5-9b8e0fd2372f" + }, + "outputs": [], + "source": [ + "print('Current game state:', game.getGameState())" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "CTH3IMzp_IUl" + }, + "source": [ + "### What actions are available?\n", + "The following cell prints the actions that are available in the current game, which are represented with numerical codes. The PLE environment wrapper also adds an additional ```None``` action which means \"do nothing\". " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 }, - { - "metadata": { - "id": "TFh26u1eoUad", - "colab_type": "code", - "colab": {} - }, - "cell_type": "code", - "source": [ - "all_rolling_returns = []\n", - "\n", - "for i in range(10):\n", - " rolling_returns = run_loop(ReinforceAgent, num_episodes=1000, record_every=25, seed=np.random.randint(100000))\n", - " all_rolling_returns.append(rolling_returns)\n", - "\n", - "plot_rolling_returns(np.array(all_rolling_returns))" - ], - "execution_count": 0, - "outputs": [] - } - ] -} \ No newline at end of file + "colab_type": "code", + "id": "WawJ92pOCD4D", + "outputId": "7d286730-bcc5-418a-c603-cf45e4c89690" + }, + "outputs": [], + "source": [ + "print('Game actions:', game.actions)\n", + "print('Environment actions:', environment.getActionSet())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "RElMlRA-O-WC" + }, + "outputs": [], + "source": [ + "#@title [RUN ME!] 
Setup for the next section { display-mode: \"form\" }\n", + "# Maintain some variables for the next task\n", + "environment.reset_game()\n", + "observed_states = []\n", + "observed_actions = []\n", + "observed_rewards = []\n", + "observed_states.append(game.getGameState())" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "pwl9qQmlURgR" + }, + "source": [ + "### Exploratory Task\n", + "Run the cell below which defines a function that takes a given action in the environment, using the environment's ```act``` method and then renders both the previous and current game screen. Change the action in the drop-down to the right of the 2nd code cell, which calls this function with the chosen action. Observe what happens to the environment(game) state and reward. By running the cell multiple times (and changing the action) until the game comes to an end (when you either win or lose), you will manually create an **episode**, which is a sequence of states, actions and rewards until a termination condition is reached. If an episode $i$ consists of $T_i$ steps and we denote the state, action and reward at step $t$ in episode $i$ respectively as $s_{i, t}$, $a_{i,t}$ and $r_{i,t}$, then in this task, we create a *trajectory* $\\tau_i = (s_{i, 1}, a_{i, 1}, r_{i, 1}, ..., a_{i, T_i-1}, r_{i, T_i-1}, s_{i, T_i})$\n", + "\n", + "#### Question\n", + "Notice how the paddle sometimes moves even if you take the \"None\" action? Can you think of why this happens? \n", + "\n", + "#### Notes\n", + "* If you want to run another episode, re-run the code cell above titled \"Setup for the next section\" to reset the environment\n", + "* This particular game returns a reward of $0$ at each step and a final reward of $-1$ or $1$ at the end of the episode depending on whether you lose or win. Other games may have different reward structures! 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "cmbraWuttEI0" + }, + "outputs": [], + "source": [ + "previous_frame = environment.getScreenRGB().transpose([1, 0, 2])\n", + "\n", + "def take_action(action, agent=None):\n", + " global previous_frame\n", + " \n", + " # Look up the action code from the description\n", + " action_code = None if action == 'None' else game.actions[action]\n", + "\n", + " # Take the selected action in the environment\n", + " print('Taking action: {} ({})'.format(action, action_code))\n", + " reward = environment.act(action_code)\n", + "\n", + " observed_actions.append(action)\n", + " observed_rewards.append(reward)\n", + "\n", + " # Print and display the current state and reward\n", + " state = game.getGameState()\n", + " print('Game state:', state)\n", + " print('Reward received: ', reward)\n", + "\n", + " observed_states.append(state)\n", + "\n", + " if reward > 0 or environment.game_over():\n", + " print(\"-------------------------------------------------------\")\n", + " print('Game over, you', 'WON' if reward > 0 else 'LOST')\n", + " print(\"-------------------------------------------------------\")\n", + " print('The episode trajectory was:')\n", + " for s, a, r in zip(observed_states, observed_actions, observed_rewards):\n", + " print('State:', s, 'Action:', a, 'Reward:', r)\n", + " print('Terminal state:', observed_states[-1])\n", + " \n", + " current_frame = environment.getScreenRGB().transpose([1, 0, 2])\n", + " \n", + " if agent is None:\n", + " fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))\n", + " ax1.imshow(previous_frame)\n", + " ax1.grid(False)\n", + " ax1.set_title('PREVIOUS FRAME')\n", + " \n", + " ax2.imshow(current_frame)\n", + " ax2.grid(False)\n", + " ax2.set_title('CURRENT FRAME')\n", + " else:\n", + "# fig = plt.figure(figsize=(10, 30))\n", + " \n", + " fig, [ax1, ax2, ax3] = plt.subplots(1, 3, figsize=(15, 5))\n", + " ax1.imshow(previous_frame)\n", + " ax1.grid(False)\n", + " ax1.set_title('PREVIOUS FRAME')\n", + " \n", + " ax2.imshow(current_frame)\n", + " ax2.grid(False)\n", + " ax2.set_title('CURRENT FRAME')\n", + " \n", + " state = game.getGameState()\n", + "\n", + " # Pre-process the state to extract the numerical values we're interested in from the state dictionary\n", + " number_state = np.array([\n", + " state['fruit_x'] / game_width, # Divide by width(or height) to normalise the value to lie between 0 and 1.\n", + " state['player_x'] / game_width,\n", + " state['fruit_y'] / (game_height+1),\n", + " state['player_vel'] / game_width\n", + " ], dtype=np.float32)\n", + "\n", + " # evaluate the policy of the trained agent (see what move the agent would make)\n", + " policy = agent.policy(number_state)\n", + " # plot the policy (this gives us a more visual indication that it is a\n", + " # probability distribution over available actions)\n", + " \n", + " visualize_policy(ax3, policy)\n", + " \n", + " plt.tight_layout()\n", + " plt.show()\n", + " \n", + " previous_frame = current_frame" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "vdo_a8VBU9Sm" + }, + "outputs": [], + "source": [ + "action = \"None\" #@param ['right', 'left', 'None']\n", + "take_action(action)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "ebckEImi-_k0" + }, + "source": [ + "## The Agent\n", + "We now turn to the agent. 
An agent receives the current state and (previous) reward from the environment, then uses an internal policy to determine an action to take. We implement an agent as a Python [**class**](https://en.wikibooks.org/wiki/A_Beginner%27s_Python_Tutorial/Classes), which is just a logical wrapper of variables and methods (functions) that operate on those variables. The methods our agent will have are the following:\n", + "\n", + "* **Initialisation** (```__init__```): Initialises the agent the first time it's created. \n", + "* **policy**: The policy is a function that returns a *distribution* over the possible actions, given the current state.\n", + "* **step**: This takes as input, the current state and previous reward from the environment, then uses the internal policy to determine which action to take. Specifically, it does some pre-processing of the state, then samples a single action from the distribution over possible actions obtained from the policy. \n", + "* **reset**: This is called to reset the agent's variables before running it on a new episode\n", + "* **end_episode**: This method signals to the agent that the current episode has come to an end. The agent may do some learning, or clean-up at the end of an episode. \n", + "\n", + "### The Random Agent\n", + "\n", + "To get a feel for an agent and the methods it has, we implement an agent that just takes a *random* action at every step. For demonstration purposes, we also calculate the **episode return** at the end of an episode. The episode return is the sum of the (discounted) rewards obtained during the episode. If the returns for episode $i$, with trajectory $\\tau_{i}$ are denoted $r_{i, t}$, and the **discount factor** is $\\gamma$, then the episode return is calcuated as: $r(\\tau_i) = \\sum_{t=1}^{T_i} \\gamma^t r_{i,t}$. The discount factor allows us to increase the importance of rewards received quickly and decrease the importance of rewards that take long to receive. It is especially important in environments that could have episodes that are infinitely long. In our particular environment where every episode is of the same length and the only non-zero reward is received at the end of the game, the discount factor doesn't make much difference and so we will ignore it (effectively set it to $1$) for the remainder of this practical. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "IRPa2lmldQoW" + }, + "outputs": [], + "source": [ + "class RandomAgent(object):\n", + " \n", + " def __init__(self, actions, state_size, seed):\n", + " # When initializing, we let the agent know what actions are available in the \n", + " # environment, how large the state is (not used in the RandomAgent) and the \n", + " # current seed to use (also not used in the RandomAgent)\n", + " self._actions = actions\n", + " self._rewards = []\n", + " self._taken_actions = []\n", + " self._observed_states = []\n", + " \n", + " def policy(self, state): \n", + " # The policy is an internal function that takes a state and returns a distribution over the possible actions. \n", + " # The random agent just returns a uniform distribution over the actions. 
\n", + " n = len(self._actions) # The number of actions\n", + " return tf.fill([n], 1./n) # This returns a vector of length n, with each entry being 1/n\n", + " \n", + " def step(self, state_dict, reward):\n", + " \n", + " # Pre-process the state to extract the numerical values we're interested in from the state dictionary\n", + " state = np.array([\n", + " state_dict['fruit_x'] / game_width, # Divide by width(or height) to normalise the value to lie between 0 and 1.\n", + " state_dict['player_x'] / game_width,\n", + " state_dict['fruit_y'] / (game_height+1),\n", + " state_dict['player_vel'] / game_width\n", + " ], dtype=np.float32)\n", + " \n", + " self._observed_states.append(state) # Record that the state was observed during the episode\n", + " self._rewards.append(reward) # Keep track of the rewards we've received along the way\n", + " action_distribution = self.policy(state) # Use the policy to get the distribution over actions\n", + " \n", + " # Sample a single action according to the distribution over actions\n", + " action = np.random.choice(self._actions, p=action_distribution.numpy()) \n", + " \n", + " self._taken_actions.append(action) # Record that the action was taken during the episode\n", + " \n", + " return action\n", + " \n", + " def reset(self):\n", + " # This method is called when a new episode starts, we need to clear the \n", + " # states, actions and rewards that we tracked during the last episode.\n", + " self._rewards = [] \n", + " self._taken_actions = [] \n", + " self._observed_states = [] \n", + " \n", + " def end_episode(self, final_reward):\n", + " # We just calculate the episode return\n", + " episode_return = sum(self._rewards) + final_reward\n", + " return episode_return" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "-vF0d4FORBYK" + }, + "source": [ + "## The Run-Loop\n", + "Now that we have an environment and a simple agent, we need a way of controlling the interaction between the agent and environment over multiple episodes. We do so in a **run-loop**. In this simple run-loop, the agent and environment run in lock-step. For each game frame, we get the state from the environment and pass it, along with the previous reward, to the agent. The agent selects an action that it wants to take given the game state. The action is taken in the environment and any reward received is recorded. We run the loop for multiple episodes, each time being careful to reset the game and agent (because they're starting a new game from scratch)." 
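+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Before diving into the full implementation below (which also tracks rolling returns, renders a video and shows a progress bar), here is a stripped-down sketch of just the interaction loop described above. It is not meant to replace ```run_loop```; it assumes the ```environment``` and ```game``` objects from earlier and an already-constructed ```agent```, and it skips seeding, recording and metrics."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# A minimal sketch of the agent-environment interaction described above.\n",
+    "# Assumes `environment` and `game` exist (defined earlier) and `agent` is an\n",
+    "# already-constructed agent with reset/step/end_episode methods.\n",
+    "def minimal_run_loop(agent, num_episodes=5):\n",
+    "    environment.init()\n",
+    "    for episode in range(num_episodes):\n",
+    "        environment.reset_game()  # start a fresh game\n",
+    "        agent.reset()             # clear the agent's per-episode memory\n",
+    "        reward = 0\n",
+    "        while reward == 0 and not environment.game_over():\n",
+    "            state = game.getGameState()         # 1. get the current state\n",
+    "            action = agent.step(state, reward)  # 2. agent picks an action using its policy\n",
+    "            reward = environment.act(action)    # 3. environment applies it and returns a reward\n",
+    "        agent.end_episode(reward)               # 4. let the agent wrap up the episode"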
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "ZmzYRPUi9ZGi" + }, + "outputs": [], + "source": [ + "def run_loop(agent_class, # Which agent to use\n", + " num_episodes=1, # How many episodes to run for\n", + " record_every=1, # How many episodes to record\n", + " seed=1234, # The random seed used\n", + " rolling_return_frequency=100, # The window size used to track the rolling episode return\n", + " state_size=4,\n", + " return_agent=False\n", + " ): # The size of the state\n", + " \n", + " # Set the random seeds\n", + " tf.set_random_seed(seed)\n", + " np.random.seed(seed)\n", + " \n", + " # Initialise the environment\n", + " environment.init()\n", + " \n", + " # Create an agent (this runs the agent's __init__ method)\n", + " agent = agent_class(environment.getActionSet(), state_size, seed)\n", + " \n", + " progress_out = display.display(progress(0, num_episodes), display_id=True) # Create a progress-bar\n", + " \n", + " # Create data structures to store metrics\n", + " windowed_return = deque()\n", + " rolling_returns = []\n", + " frames = []\n", + " \n", + " for episode in range(num_episodes):\n", + " environment.reset_game() # reset the environment\n", + " agent.reset() # reset the agent\n", + " reward = 0\n", + "\n", + " while reward == 0 and not environment.game_over(): # Loop until the episode terminates \n", + " state = game.getGameState() # Get the current game state\n", + " action = agent.step(state, reward) # Pass the current game state and previous reward to the agent, get the action it wants to take\n", + " reward = environment.act(action) # Pass the action to the environment and get the reward.\n", + "\n", + " if episode % record_every == 0:\n", + " # Store the frames for display later, every `record_every` episodes\n", + " frames.append(environment.getScreenRGB())\n", + " \n", + " info = agent.end_episode(reward) # Signal to the agent that the episode has come to and end\n", + " \n", + " # Store the episode return in the window (in this case, with no discounting, the episode return is the same as the environment's score)\n", + " windowed_return.append(environment.score())\n", + " if len(windowed_return) > rolling_return_frequency:\n", + " windowed_return.popleft()\n", + " \n", + " rolling_return = sum(windowed_return) / len(windowed_return)\n", + " rolling_returns.append(rolling_return)\n", + " \n", + " # Update the progress-bar\n", + " message = 'Episode {}/{} ended with score {}, Rolling Return: {}, {}'.format(\n", + " episode+1, num_episodes, environment.score(), \n", + " rolling_return,\n", + " info if info is not None else '')\n", + " progress_out.update(progress(episode+1, num_episodes, message))\n", + " \n", + " message = 'Finished training, rendering video...'\n", + " progress_out.update(progress(episode+1, num_episodes, message))\n", + " \n", + " # Render a video\n", + " clip = make_animation(frames, fps=30, true_image=True).rotate(-90)\n", + " display.display(clip.ipython_display(fps=30, center=False, autoplay=False, loop=False, height=320, width=240, max_duration=1000))\n", + " \n", + " message = 'Done...'\n", + " progress_out.update(progress(episode+1, num_episodes, message))\n", + " \n", + " if return_agent:\n", + " return rolling_returns, agent\n", + " else:\n", + " return rolling_returns" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "wkiokNVeg167" + }, + "source": [ + "We now run our RandomAgent with the run-loop for 100 episodes to check 
that everything is working so far. (Note: the blue progress bar shows how many of the ```num_episodes``` episodes we've completed. The small black progress bar is for the video rendering; ignore that one!)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {},
+    "colab_type": "code",
+    "id": "6-TgMf2bcVJ0"
+   },
+   "outputs": [],
+   "source": [
+    "rolling_returns = run_loop(RandomAgent, num_episodes=100, record_every=5, rolling_return_frequency=5)\n",
+    "plot_rolling_returns(rolling_returns)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "s_2-dlEqvNLe"
+   },
+   "source": [
+    "## A Policy Network\n",
+    "Remember, the policy is a distribution over the possible actions the agent can take in the environment given the current state of the environment, denoted $\\pi(a|s)$. In a Deep RL agent, the policy is represented by a neural network with parameters $\\theta$, so we have $\\pi_\\theta(a|s) = NN(s; \\theta)$, where $NN(s; \\theta)$ is some potentially complex function represented by a neural network with parameters $\\theta$. In other words, our neural network takes the state as input and outputs a distribution over actions. Let us implement an agent whose policy is defined by a simple feed-forward neural network. We will name the class 'FixedAgent' because this agent will do no learning. As a result, the policy network's weights will stay fixed and the agent will take random actions, much as before.\n",
+    "\n",
+    "The ```reset```, ```step``` and ```end_episode``` methods of our fixed agent will be identical to the RandomAgent we built earlier. We'll only change the ```__init__``` and ```policy``` methods. To avoid having to rewrite all that code, we will use Python's **inheritance** to reuse all the methods in RandomAgent except for ```policy```, which we *override* here."
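+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Before wrapping a network in an agent class, the short sketch below shows concretely what such a policy network produces: a throwaway two-layer network with a softmax output is evaluated on a random 4-dimensional state, and the outputs are non-negative and sum to one, so they can be read as $\\pi_\\theta(a|s)$. The layer sizes here are arbitrary and unrelated to the FixedAgent defined next."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import tensorflow as tf\n",
+    "\n",
+    "# Throwaway policy network: 4 state inputs -> probabilities over 3 actions.\n",
+    "demo_network = tf.keras.Sequential([\n",
+    "    tf.keras.layers.Dense(16, input_shape=[4], activation=tf.nn.relu),\n",
+    "    tf.keras.layers.Dense(3, activation='softmax'),  # softmax => outputs form a distribution\n",
+    "])\n",
+    "\n",
+    "dummy_state = np.random.uniform(size=(1, 4)).astype(np.float32)  # a batch of one fake state\n",
+    "action_probs = demo_network(dummy_state).numpy()[0]\n",
+    "\n",
+    "print('pi(a|s) =', action_probs)\n",
+    "print('sum of probabilities:', action_probs.sum())  # ~1.0 by construction"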
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "x5LkKobJhkV3" + }, + "outputs": [], + "source": [ + "# Lets build a fixed agent\n", + "class FixedAgent(RandomAgent): # Inherit all the methods of RandomAgent\n", + " \n", + " def __init__(self, actions, state_size, seed):\n", + " super(FixedAgent, self).__init__(actions, state_size, seed)\n", + " \n", + " # Define the policy network in the initialize method (constructor) because it should persist\n", + " # through multiple usages over multiple episodes of the agent.\n", + " # (We change the default weight initialiser to truncated random normal which \n", + " # works better for the RL algorithm we'll use in this practical.)\n", + " self._policy_network = tf.keras.Sequential([\n", + " # Add a hidden layer with 64 neurons\n", + " tf.keras.layers.Dense(64, input_shape=[state_size], activation=tf.nn.relu, \n", + " kernel_initializer=tf.truncated_normal_initializer(seed=seed)),\n", + " # Add a hidden layer with 32 neurons\n", + " tf.keras.layers.Dense(32, activation=tf.nn.relu, \n", + " kernel_initializer=tf.truncated_normal_initializer(seed=seed)),\n", + " # Add an output layer with action-many neurons and a softmax activation function\n", + " tf.keras.layers.Dense(len(actions), activation='softmax'),\n", + " ])\n", + " \n", + " # Override the policy\n", + " def policy(self, state):\n", + " layer_input = tf.expand_dims(state, axis=0) # Add a dummy batch dimension\n", + " action_distribution = self._policy_network(layer_input) # Get the distribution over actions from the policy network\n", + " action_distribution = tf.squeeze(action_distribution, axis=0) # Remove the dummy batch dimension\n", + " \n", + " return action_distribution" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "_9LtvT6d3xsZ" + }, + "source": [ + "Let's test our FixedAgent, this is just to see that it runs, as we don't expect it to perform any better than the RandomAgent because it isn't learning anything yet! " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "6kAPun6f6nzS" + }, + "outputs": [], + "source": [ + "# Our fixed-weight agent\n", + "rolling_returns = run_loop(FixedAgent, num_episodes=100, record_every=5, rolling_return_frequency=5)\n", + "plot_rolling_returns(rolling_returns)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To illustrate the idea of the policy visually we plot it.\n", + "Remember, the policy is the probability distribution over the actions.\n", + "Thus, for our random agent we expect to see a uniform distribution over the possible actions the agent can take (we expect the probability of the agent taking each action to be the same)." 
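+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The next cell plots the policy. As a complementary (and entirely optional) numerical check, the sketch below samples a few thousand actions from a uniform three-way distribution, the same way ```RandomAgent.step``` samples with ```np.random.choice```, and prints the empirical frequencies, which should all be close to one third."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "\n",
+    "# Sample repeatedly from a uniform policy over three actions, mirroring the\n",
+    "# np.random.choice call inside RandomAgent.step, and count how often each comes up.\n",
+    "action_labels = ['left', 'right', 'None']  # placeholder labels, not the real action codes\n",
+    "probs = np.full(len(action_labels), 1.0 / len(action_labels))\n",
+    "\n",
+    "samples = np.random.choice(len(action_labels), size=3000, p=probs)\n",
+    "frequencies = np.bincount(samples, minlength=len(action_labels)) / float(len(samples))\n",
+    "\n",
+    "for label, freq in zip(action_labels, frequencies):\n",
+    "    print('action {}: empirical frequency {:.3f}'.format(label, freq))"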
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# init a dummy random agent\n", + "random_agent = RandomAgent(environment.getActionSet(), 4, 1234)\n", + "# give it some randomly generated state and evaluate the policy\n", + "policy = random_agent.policy(np.random.normal(loc=0, scale=2, size=(4,)))\n", + "\n", + "fig, ax = plt.subplots(1, 1)\n", + "visualize_policy(ax, policy)\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "PPesBe8q3621" + }, + "source": [ + "## Learning with Policy Gradients\n", + "Finally, let's give our agent some intelligence by making it learn from its experience in interacting with the environment. In order to learn, we need a loss function or *objective*. In RL, the objective is to maximise the expected episode return (rewards) by taking actions in the environment. The actions our agent takes are determined by the policy $\\pi_\\theta(a|s)$, which are in turn determined by the neural network parameters $\\theta$. So, we want to find the neural network parameters $\\theta$ that maximise \n", + "\n", + "$J(\\theta) = \\mathbb{E}_{\\tau}[r(\\tau)]$\n", + "\n", + "**Note:** If the maths in the next section looks intimidating, feel free to skip over it, read the intuition and code and come back to it later! \n", + "\n", + "\n", + "### The derivative of the objective\n", + "We now turn to our usual tool of stochastic gradient descent to optimise the objective, but there are two complications. Firstly, the term $\\pi_\\theta(a|s)$ represented by our neural network doesn't appear in the equation (or does it?). Secondly, how do we deal with the expectation?\n", + "\n", + "The first thing is to realise that our trajectories $\\tau$ depend on the policy $\\pi_\\theta(a|s)$ (**Question:** Why?) So we can (informally) write:\n", + "\n", + "\\begin{align}\n", + "J(\\theta) &= \\mathbb{E}_{\\tau \\sim \\pi_\\theta(\\tau)} [r(\\tau)] & \\\\\n", + "&= \\int \\pi_\\theta (\\tau) r(\\tau)d\\tau & (\\text{Definition of expectation}) \\\\\n", + "\\end{align}\n", + "\n", + "Then the gradient is:\n", + "\n", + "\\begin{align}\n", + "\\nabla_\\theta J(\\theta) &= \\nabla_\\theta \\int \\pi_\\theta (\\tau) r(\\tau)d\\tau & \\\\\n", + "&= \\int \\pi_\\theta (\\tau) \\nabla_\\theta log \\pi_\\theta (\\tau) r(\\tau)d\\tau & (\\text{\"Log derivative trick\"}) \\\\\n", + "&= \\mathbb{E}_{\\tau \\sim \\pi_\\theta(\\tau)}[\\nabla_\\theta log \\pi_\\theta (\\tau) r(\\tau)]\n", + "\\end{align}\n", + "\n", + "Finally, since we don't know the true distribution of $\\tau$, we can approximate the expectation using a *monte-carlo* approximation, where the sample trajectories come from $N$ episodes of interaction with the environment. \n", + "\n", + "\\begin{align}\n", + "\\nabla_\\theta J(\\theta) &= \\frac{1}{N} \\sum_{i=1}^N \\nabla_\\theta log \\pi_\\theta (\\tau_i) r(\\tau_i)\n", + "\\end{align}\n", + "\n", + "Expanding this out (and considering that episode $i$ has $T_i$ steps) gives:\n", + "\n", + "\\begin{align}\n", + "\\nabla_\\theta J(\\theta) &= \\frac{1}{N} \\sum_{i=1}^N (\\sum_{t=1}^{T_i} \\nabla_\\theta log(\\pi_\\theta(a_{i,t} | s_{i, t})) \\sum_{t=1}^{T_i} \\gamma^t r_{i,t} )\n", + "\\end{align}\n", + "\n", + "We skipped a few steps in the maths here for brevity (see chapter 13 of [Sutton and Barto](https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view) for all the details if you're interested!). If the maths looks intimidating, don't worry! 
The important things to realise are the following:\n",
+    "* We define an objective $J(\\theta)$ that captures exactly what we want RL to do: maximise the expected return.\n",
+    "* We can run our agent in the environment to generate *trajectories* for multiple episodes.\n",
+    "* When an episode comes to an end and we know the return and trajectory, we can compute a term in (the Monte-Carlo approximation to) the objective function. \n",
+    "* We can use TensorFlow to compute the gradient of an individual term in the Monte-Carlo approximation and apply it to the parameters of our neural network. To do this, we define the *loss* to minimise as follows (where the sums can be represented by loops and we set $\\gamma = 1$ for simplicity):\n",
+    "\n",
+    "\\begin{align}\n",
+    "L(\\theta) &= -\\sum_{t=1}^{T_i} log(\\pi_\\theta(a_{i,t} | s_{i, t})) \\sum_{t=1}^{T_i} r_{i,t} \n",
+    "\\end{align}\n",
+    "\n",
+    "The name \"policy gradient\" comes from the fact that we take the gradient of the policy directly, rather than the alternative, value-based RL, which uses iterative update rules to calculate the expected return associated with a state. The particular flavour of policy gradient that uses the loss function above, together with the Monte-Carlo approximation of the objective, is known as the **REINFORCE** algorithm. \n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "42i6ZNe1T6ME"
+   },
+   "source": [
+    "Finally, we implement the REINFORCE algorithm to optimize the parameters of the neural network. The only things we're changing, compared to our FixedAgent, are the ```__init__``` and ```end_episode``` methods, so we use *inheritance* again to automatically \"copy\" all the methods from ```FixedAgent``` and ```RandomAgent```. \n",
+    "\n",
+    "**Note:** ```ReinforceAgent``` indirectly inherits the methods from ```RandomAgent``` through ```FixedAgent``` (which directly inherits from ```RandomAgent```). Both ```RandomAgent``` and ```FixedAgent``` have a ```policy``` method, but the one that ```ReinforceAgent``` ends up with is the one from ```FixedAgent```, because it is the closest ancestor in the chain."
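+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If this inheritance chain is unfamiliar, the toy classes below (unrelated to our agents) show the same behaviour in isolation: the class closest to the bottom of the chain that defines a method is the one whose version gets used."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Toy illustration of the inheritance behaviour described above.\n",
+    "class Base(object):\n",
+    "    def greet(self):\n",
+    "        return 'hello from Base'\n",
+    "\n",
+    "class Middle(Base):  # overrides greet\n",
+    "    def greet(self):\n",
+    "        return 'hello from Middle'\n",
+    "\n",
+    "class Leaf(Middle):  # defines no greet of its own\n",
+    "    pass\n",
+    "\n",
+    "print(Leaf().greet())  # -> 'hello from Middle', i.e. the closest override wins\n",
+    "print([cls.__name__ for cls in Leaf.__mro__])  # method resolution order: Leaf, Middle, Base, object"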
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "hQldYOWuu9RO" + }, + "outputs": [], + "source": [ + "class ReinforceAgent(FixedAgent):\n", + " \n", + " # Override the initialization method because this agent also needs an optimizer\n", + " # and a variable to track the step\n", + " def __init__(self, actions, state_size, seed):\n", + " super(ReinforceAgent, self).__init__(actions, state_size, seed)\n", + " self._optimizer = tf.train.RMSPropOptimizer(learning_rate=0.001) \n", + " self._step_counter = tf.train.get_or_create_global_step()\n", + " \n", + " def end_episode(self, final_reward): \n", + " \"\"\"At the end of an episode, we compute the loss for the episode and take a \n", + " step in parameter speace in the direction of the gradients.\"\"\"\n", + " \n", + " # Compute the return (cumulative discounted reward) for the episode\n", + " episode_return = sum(self._rewards) + final_reward # Assuming \\gamma = 1\n", + "\n", + " with tf.GradientTape() as tape:\n", + " # Loop over the states and actions making up the episode trajectory\n", + " loss = 0\n", + " for state, action in zip(self._observed_states, self._taken_actions): \n", + " # Get the probabilities assigned to the actions given the state by the policy\n", + " action_distribution = self.policy(state) \n", + " action_index = self._actions.index(action)\n", + " # Get the log probability of the chosen action under the policy\n", + " log_action = tf.log(action_distribution[action_index])\n", + " # Add to the running total for the episode\n", + " loss -= log_action * episode_return # Add your baseline value for TASK 4 here. \n", + "\n", + " # Compute the gradient of the loss with respect to the variables in the model\n", + " grads = tape.gradient(loss, self._policy_network.variables) \n", + " \n", + " # Use the optimizer to apply the gradient\n", + " self._optimizer.apply_gradients(\n", + " zip(grads, self._policy_network.variables), global_step=self._step_counter)\n", + " \n", + " return 'Loss: {}'.format(loss)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "BchVaPCm36TY" + }, + "source": [ + "Notice that during the episode we run only the forward-pass of the policy network (inference). At the end of the episode, we replay the states that occured during the episode and run both the forward and backward pass of the policy network (notice the gradient tape!) because we can only compute the loss once we have the episode return at the end of the episode. If the policy network is very complex, this could be inefficient. In that case you could run both the forward an backward pass during the episode and store intermediate gradients/partial derivatives to use in the update at the end of the episode." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "0jS05xq2PuZp" + }, + "source": [ + "And finally we train our **REINFORCE** agent and plot the resulting rolling episode returns (over a window of 100 episodes)." 
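+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Before kicking off the full training run below, here is a small standalone sketch of the update that ```end_episode``` performs, applied to a throwaway network and a made-up two-step episode (the states, actions and return are fake). It uses the same TF 1.x eager APIs as the agent code above."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import tensorflow as tf\n",
+    "\n",
+    "# Toy version of the REINFORCE update in ReinforceAgent.end_episode, on fake data.\n",
+    "toy_network = tf.keras.Sequential([\n",
+    "    tf.keras.layers.Dense(8, input_shape=[4], activation=tf.nn.relu),\n",
+    "    tf.keras.layers.Dense(3, activation='softmax'),\n",
+    "])\n",
+    "toy_optimizer = tf.train.RMSPropOptimizer(learning_rate=0.001)\n",
+    "\n",
+    "fake_states = [np.random.uniform(size=(4,)).astype(np.float32) for _ in range(2)]\n",
+    "fake_action_indices = [0, 2]  # pretend these actions were sampled during the episode\n",
+    "fake_return = 1.0             # pretend the episode was won\n",
+    "\n",
+    "with tf.GradientTape() as tape:\n",
+    "    loss = 0.0\n",
+    "    for state, action_index in zip(fake_states, fake_action_indices):\n",
+    "        probs = toy_network(tf.expand_dims(state, axis=0))[0]\n",
+    "        loss -= tf.log(probs[action_index]) * fake_return  # -log pi(a|s) * return\n",
+    "\n",
+    "grads = tape.gradient(loss, toy_network.variables)\n",
+    "toy_optimizer.apply_gradients(zip(grads, toy_network.variables))\n",
+    "print('toy loss:', loss.numpy())"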
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "EG4H4PsjPyVD" + }, + "outputs": [], + "source": [ + "rolling_returns, trained_agent = run_loop(ReinforceAgent, num_episodes=2400, record_every=30, return_agent=True)\n", + "plot_rolling_returns(rolling_returns)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that we have a trained agent with a decent policy, let's do the exploratory task again and see what actions the agent recommends we take.\n", + "To the right of the current frame we can see the trained agent's policy and we can decide whether we agree with the action it suggests we perform, or not." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#@title [RUN ME!] Setup for the next section { display-mode: \"form\" }\n", + "# Maintain some variables for the next task\n", + "environment.reset_game()\n", + "observed_states = []\n", + "observed_actions = []\n", + "observed_rewards = []\n", + "observed_states.append(game.getGameState())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "action = \"None\" #@param ['right', 'left', 'None']\n", + "\n", + "take_action(action, agent=trained_agent)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "3yvh_FF0djwu" + }, + "source": [ + "## Your Tasks\n", + "### Task 1: Learning Objectives [ALL]\n", + "Review the learning objectives and ensure that you understand how the code in this practical relates to them. Ask your tutors if you don't understand anything!\n", + "\n", + "### Task 2: Network Architecture [ALL]\n", + "Experiment with different network architectures and parameters and see how this affects the performance of the agent. What do you notice? Do you think the algorithm is sensitive to the network parameters? \n", + "\n", + "**HINT**: Modify the code for the policy network in the ```__init__``` method of the ```FixedAgent``` class. \n", + "\n", + "### Task 3: Seed Variance **[ALL]** \n", + "Reveal and run the code in the cells below. This code will run the entire training procedure of the REINFORCE agent 10 times (using only 1000 episodes per run to save some time). It will then plot a chart that shows the mean of the rolling returns over the multiple runs along with an estimated *confidence interval* for the mean. You should notice that the confidence interval is fairly wide given that only the random seed is changing. This illustrates a problem with the REINFORCE algorithm: it has **high variance**. (It is however an **unbiased estimator** of the policy gradient!)\n", + "\n", + "### Task 4: Variance Reduction with a Basline **[INTERMEDIATE]** \n", + "Read about value functions and how to approximate them using *Monte-Carlo* methods in [Slides 5 to 7 Here](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MC-TD.pdf). Then read slides [80, 84 and 85 here](http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture14.pdf). 
The goal of this task is to use a simple value function $V(s)$ to implement **REINFORCE with a Baseline**, where the loss function per episode changes to: \n", + "\n", + "\\begin{align}\n", + "L(\\theta) &= -\\sum_{t=1}^{T_i} log(\\pi_\\theta(a_{i,t} | s_{i, t})) \\sum_{t=1}^{T_i} [r_{i,t} - V(s_{i, t})]\n", + "\\end{align}\n", + "\n", + "To do this, add code to the ```end_episode``` method of the ```ReinforceAgent``` class to estimate the value function. Subtract the value estimate for the state from the episode return at each step in the loop that computes the log_action_sum. \n", + "\n", + "Re-run Task 3's code to plot the confidence interval around the mean episode returns to check what effect it has on the variance. \n", + "\n", + "**HINT**: You will need to *discretise* the state-space. We've provided a very crude function called ```state_to_buckets``` that you can use to do this, or implement your own! \n", + "\n", + "**Further Reading (Optional)**: See the section on how to introduce a baseline [here](https://danieltakeshi.github.io/2017/03/28/going-deeper-into-reinforcement-learning-fundamentals-of-policy-gradients/) for more details about how and why this works!\n", + "\n", + "**Further Reading (Optional)**: This [blog post](https://flyyufelix.github.io/2017/10/12/dqn-vs-pg.html) contrasts policy gradient methods with an alternative value-based approach to RL called Deep Q-Networks. It also discusses some approaches to reducing the variance of the policy gradient estimator. \n", + "\n", + "### Task 5: Learning from pixels **[OPTIONAL]**\n", + "The agent we implemented in this practical uses a simple numerical representation of the state of the environment. In many cases such a representation would not be available. Change the run-loop to instead pass the array of pixel values to the agent. (which you can get by calling ```environment.getScreenRGB()```). Change the agent's policy network to cater for this image-representation of the state. \n", + "\n", + "\n", + "\n", + "### Task 6: Other games **[OPTIONAL]**\n", + "The PyGame Learning Environment (PLE) has [a number of games built-in](https://pygame-learning-environment.readthedocs.io/en/latest/user/games.html). Change the code in this practical to run on a different game and learn either from pixels or from the state representation provided by PLE. One interesting game you could try is FlappyBird! Remember to remove the reward overrides we set when trying a new game! " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "loTWt0eMoVEA" + }, + "source": [ + "## Additional Code for Task 3\n", + "This might take some time to run, continue reading for Task 4 and ask any questions you have while waiting!" 
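+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "While the Task 3 runs below are going, here is one possible shape for the Task 4 baseline: a tabular Monte-Carlo value estimate that keeps a running average of the returns observed from each discretised state. It assumes a ```state_to_buckets(state)``` helper that maps a state to a hashable key (the crude helper mentioned in Task 4, or your own), and it is only a sketch of the bookkeeping you would fold into ```end_episode```, not a finished solution."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from collections import defaultdict\n",
+    "\n",
+    "# Sketch of a tabular Monte-Carlo value estimate for the Task 4 baseline.\n",
+    "# `state_to_buckets` is assumed to map a (pre-processed) state to a hashable key.\n",
+    "value_sums = defaultdict(float)  # sum of returns observed from each bucketed state\n",
+    "value_counts = defaultdict(int)  # number of times each bucketed state was visited\n",
+    "\n",
+    "def update_value_estimates(states, episode_return):\n",
+    "    # After an episode, credit the episode return to every state visited in it.\n",
+    "    for state in states:\n",
+    "        key = state_to_buckets(state)\n",
+    "        value_sums[key] += episode_return\n",
+    "        value_counts[key] += 1\n",
+    "\n",
+    "def value_estimate(state):\n",
+    "    # Average return observed so far from this bucketed state (0 if never seen).\n",
+    "    key = state_to_buckets(state)\n",
+    "    if value_counts[key] == 0:\n",
+    "        return 0.0\n",
+    "    return value_sums[key] / value_counts[key]"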
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "TFh26u1eoUad" + }, + "outputs": [], + "source": [ + "all_rolling_returns = []\n", + "\n", + "for i in range(10):\n", + " rolling_returns = run_loop(ReinforceAgent, num_episodes=1000, record_every=25, seed=np.random.randint(100000))\n", + " all_rolling_returns.append(rolling_returns)\n", + "\n", + "plot_rolling_returns(np.array(all_rolling_returns))" + ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "collapsed_sections": [], + "name": "Practical 4: Reinforcement Learning", + "provenance": [], + "version": "0.3.2" + }, + "kernelspec": { + "display_name": "Python [conda env:indaba]", + "language": "python", + "name": "conda-env-indaba-py" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.15" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +}