This project was associated with "[MAT-DSAM3A] Advanced Data Assimilation and Modeling A: Reinforcement Learning", Summer Semester 2021, for my Master of Science in Data Science at the University of Potsdam, Germany.
The folder "Smart Cab RL" contains four subfolders: "Smart Cab Q-learning", "Smart Cab Q-Learning REWARD=0", "Smart Cab SARSA", and "Smart Cab SARSA REWARD=0". Each subfolder contains the files for running or training the smart cab with either the Q-learning or the SARSA algorithm and a different reward function.
To run/train the smart cab:
- Open a command prompt (cmd).
- Change the directory to the subfolder of the variant you want to run/train.
- Type "python RL.py" and hit Enter.
- It will then ask for your input: "Enter 'TRAIN' to train or 'RUN' to run the game:"
- Type 'TRAIN' or 'RUN' and it will perform the requested task.
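For example, a training session with the Q-learning variant looks like this (assuming the folder layout described above):

    cd "Smart Cab RL\Smart Cab Q-learning"
    python RL.py
    Enter 'TRAIN' to train or 'RUN' to run the game: TRAIN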
Objectives and approach:
• To study and compare the Q-learning and SARSA algorithms.
• To contrast model-free and model-based RL algorithms (Q-learning and SARSA are both model-free).
• Approach: temporal-difference (TD) learning, which learns from experience how to predict a quantity that depends on future values of a given signal.
• Temporal-difference update step:
  NewEstimate ← OldEstimate + StepSize × (Target − OldEstimate)
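In code, this update is a one-liner. A minimal sketch (names are illustrative, not taken from RL.py):

    def td_update(old_estimate, target, step_size=0.1):
        """Move the current estimate a fraction of the way toward the target."""
        return old_estimate + step_size * (target - old_estimate)

    # e.g. nudging an estimate of 0.0 toward an observed target of 10:
    # 0.0 + 0.1 * (10 - 0.0) = 1.0

Both SARSA and Q-learning below are instances of this rule; they differ only in how the target is built.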
Environment:
• Inspired by the OpenAI Gym environments.
• A 2D grid of 5×5 cells.
• Agent: the cab.
• Pick-up and drop-off locations.
• Objective of the game:
- Pick up the passenger.
- Drop off the passenger at the right location.
- Take as little time as possible.
• Coordinate system: see the screenshot.
• Pickup positions: [0, 0], [0, 4], [4, 0] and [4, 3].
• Dropoff positions: [0, 0], [0, 4], [4, 0] and [4, 3].
• Rules (a minimal code sketch follows this list):
- The drop-off location must not equal the pick-up location within one episode.
- The cab cannot pass through walls.
- The cab can move "UP", "DOWN", "LEFT", and "RIGHT"; no diagonal moves.
- The cab cannot move beyond the outermost rows and columns.
• Total number of states = 52 + 336 = 388.
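To make the rules concrete, here is a minimal sketch of the grid logic described above. It is an illustration under stated assumptions (coordinates as (row, column); interior walls omitted because their layout is only shown in the screenshot), not the code from RL.py:

    import random

    LOCATIONS = [(0, 0), (0, 4), (4, 0), (4, 3)]   # the four pick-up/drop-off cells
    MOVES = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}
    GRID = 5                                        # 5x5 grid

    def new_episode():
        """Pick distinct pick-up and drop-off cells (they must differ in an episode)."""
        pickup, dropoff = random.sample(LOCATIONS, 2)
        return pickup, dropoff

    def step(pos, action):
        """Move one cell; clamp at the outermost rows/columns. Walls are omitted."""
        dr, dc = MOVES[action]
        row = min(max(pos[0] + dr, 0), GRID - 1)
        col = min(max(pos[1] + dc, 0), GRID - 1)
        return (row, col)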
SARSA:
• On-policy learning (update rule sketched after this list).
• Learning rate α = 0.1.
• Discount factor γ = 1.
• ε-greedy action selection with ε = 0.4 (slightly more exploitation than exploration).
• Balances exploitation and exploration.
• Tries to visit every state.
• Trained for 500,000 episodes.
• Total average cumulative reward: 25.12, with reward = 0 for picking up from the right location.
• Total average cumulative reward: 5.72, with reward = 30 for picking up from the right location.
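The SARSA update is the TD rule above applied on-policy: the target bootstraps from the action the ε-greedy policy actually takes in the next state, Q(s,a) ← Q(s,a) + α[r + γ·Q(s',a') − Q(s,a)]. A minimal sketch with the hyperparameters listed above (the tabular Q dictionary and names are illustrative, not taken from RL.py):

    import random
    from collections import defaultdict

    ACTIONS = ["UP", "DOWN", "LEFT", "RIGHT"]
    ALPHA, GAMMA, EPSILON = 0.1, 1.0, 0.4

    Q = defaultdict(float)              # Q[(state, action)] -> estimated value

    def epsilon_greedy(state):
        """With probability EPSILON explore; otherwise take the best known action."""
        if random.random() < EPSILON:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: Q[(state, a)])

    def sarsa_update(s, a, reward, s_next, a_next):
        """On-policy: the target uses the next action a_next actually taken."""
        target = reward + GAMMA * Q[(s_next, a_next)]
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])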
Q-learning:
• Off-policy learning (update rule sketched after this list).
• Learning rate α = 0.1.
• Discount factor γ = 1.
• Trained for 500,000 episodes.
• Total average cumulative reward: 10.92, with reward = 0 for picking up from the right location.
• Total average cumulative reward: 0.9812, with reward = 30 for picking up from the right location.
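Q-learning differs from SARSA only in the target: being off-policy, it bootstraps from the greedy (maximum) next-state value regardless of which action the ε-greedy behaviour policy takes next, Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') − Q(s,a)]. A minimal self-contained sketch (names are illustrative, not taken from RL.py):

    from collections import defaultdict

    ACTIONS = ["UP", "DOWN", "LEFT", "RIGHT"]
    ALPHA, GAMMA = 0.1, 1.0
    Q = defaultdict(float)              # Q[(state, action)] -> estimated value

    def q_learning_update(s, a, reward, s_next):
        """Off-policy: the target uses the best next action, not the one taken."""
        best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
        Q[(s, a)] += ALPHA * (reward + GAMMA * best_next - Q[(s, a)])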