Inverted-Pendulum-Robot (Furuta Pendulum)

Introduction

A Furuta pendulum is a classic control problem: a free-rotating pendulum arm (swinging in the vertical plane) is attached to the end of a driven arm that rotates in the horizontal plane. The goal of the robot is to swing the pendulum up and balance it in the upright position. This is a more advanced adaptation of the classic cart-pole problem, as it also includes swing-up (bringing the pendulum to the upright position in the first place).

Hardware Implementation

The following parts were procured for the robot assembly:

  1. NEMA 17 Stepper Motor 2.8V
  2. 600 PPR Photoelectric Incremental Rotary Encoder
  3. Motor Driver Board
  4. 12V 2A Power Supply Plug Charger
  5. DRV8825 Stepper Motor Driver
  6. Arduino Nano

The pendulum and encoder housing were designed in Fusion 360 and produced with FDM printing. The .step and .f3d files can be found in /robot/CAD.

The pendulum was attached to the encoder with a rigid shaft coupler. The encoder housing was attached to the stepper motor with a modified M6 bolt in a T-nut fitted in another rigid shaft coupler.

The Arduino communicates with the Python script over a bi-directional serial link (a sketch of the PC side follows this list), where it:

  1. Reads the motor acceleration command from the PC and applies it.
  2. Writes back the motor position and pendulum position used to build the model's observation.
  3. Implements reset functions for resetting the motor position and encoder value.
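
A minimal sketch of the PC side of this exchange, assuming pyserial and a hypothetical comma-separated message format (the actual protocol lives in /robot/arduino/main/main.ino and the Python sources):

```python
import serial

# Hypothetical port and baud rate; the real values are configured in conf/config.yaml.
ser = serial.Serial("/dev/ttyUSB0", baudrate=115200, timeout=1)

def send_acceleration(accel_steps_per_s2: float) -> None:
    """Send the commanded motor acceleration to the Arduino."""
    ser.write(f"{accel_steps_per_s2}\n".encode())

def read_state() -> tuple[int, int]:
    """Read the motor position (steps) and pendulum encoder count.

    Assumes the Arduino replies with a line formatted as "<motor_steps>,<encoder_count>".
    """
    line = ser.readline().decode().strip()
    motor_steps, encoder_count = (int(v) for v in line.split(","))
    return motor_steps, encoder_count
```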

Pendulum Env

The environment follows the standard Gymnasium class format. The observation space, action space, and reward function used for the experiment are defined below, built from the following state variables:

  • $\theta$: angular position of the pendulum. Zero at the top and normalised to $[-\pi,\pi]$.
  • $\dot{\theta}$: angular velocity of the pendulum. Experimentally bounded to $[-10,10]$, then normalised to $[-2, 2]$.
  • $\alpha$: motor position (measured in steps rather than angle). The physical range is limited to 90° left and right of centre, i.e. a $[-200, 200]$ step range; the observation space spans $[-300, 300]$ to account for the motor slightly exceeding this limit, and is normalised to $[-3,3]$.
  • $\dot{\alpha}$: motor velocity (steps per second). Experimentally bounded to $[-4, 4]$, then normalised to $[-1, 1]$.
  • $\ddot{\alpha}$: motor acceleration (steps per second squared); the control input to the system. Bounded to $[-20000,20000]$ and normalised to $[-2, 2]$.

Note that all values are continuous.
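
As an illustration only, the raw readings could be rescaled with simple linear maps like the sketch below; the limits match the ranges listed above, while the helper and example values are hypothetical:

```python
import numpy as np

def normalise(value: float, raw_limit: float, scaled_limit: float) -> float:
    """Clip to [-raw_limit, raw_limit], then linearly map onto [-scaled_limit, scaled_limit]."""
    return float(np.clip(value, -raw_limit, raw_limit)) * scaled_limit / raw_limit

# Example raw readings (hypothetical values):
theta_dot = normalise(5.0, 10.0, 2.0)         # pendulum velocity  -> [-2, 2]
alpha = normalise(150.0, 300.0, 3.0)          # motor position     -> [-3, 3]
alpha_dot = normalise(-2.0, 4.0, 1.0)         # motor velocity     -> [-1, 1]
alpha_ddot = normalise(8000.0, 20000.0, 2.0)  # motor acceleration -> [-2, 2]
```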

Observation Space

$\left[\cos{\theta}, \sin{\theta}, \dot{\theta}, \alpha, \dot{\alpha}\right]$

An ndarray of shape (5,) containing 5 continuous observation values. Using the $\cos$ and $\sin$ of $\theta$ experimentally gave better convergence rates than using $\theta$ directly.

Action Space

$\left[\ddot{\alpha}\right]$

An ndarray of shape (1,) containing the motor acceleration value. The action space is continuous.
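
A minimal sketch of how these two spaces could be declared with Gymnasium, using the normalised bounds listed above (the project's actual environment class defines its own):

```python
import numpy as np
from gymnasium import spaces

# Observation: [cos(theta), sin(theta), theta_dot, alpha, alpha_dot], all normalised.
observation_space = spaces.Box(
    low=np.array([-1.0, -1.0, -2.0, -3.0, -1.0], dtype=np.float32),
    high=np.array([1.0, 1.0, 2.0, 3.0, 1.0], dtype=np.float32),
    dtype=np.float32,
)

# Action: normalised motor acceleration.
action_space = spaces.Box(low=-2.0, high=2.0, shape=(1,), dtype=np.float32)
```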

Reward Function

$\gamma-\left(\theta^2+C_1\times\dot{\theta}^2+C_2\times\alpha^2+C_3\times\dot{\alpha}^2+C_4\times\ddot{\alpha}^2\right)$

$\gamma$: reward offset to ensure that the reward is always positive.

If the reward ranged from $-\infty$ to $0$, episodes that terminate early would accumulate less negative reward and therefore score higher, resulting in a faulty reward signal; the offset prevents this. The constants $C_1$ to $C_4$ weight the terms of the reward function and are defined in /conf/config.yaml.
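
A direct sketch of this reward in Python; the offset and weights below are placeholders for the values in /conf/config.yaml:

```python
def compute_reward(theta, theta_dot, alpha, alpha_dot, alpha_ddot,
                   gamma=10.0, c1=0.1, c2=0.01, c3=0.01, c4=1e-4):
    """Offset minus a weighted quadratic penalty on state and control."""
    penalty = (theta ** 2
               + c1 * theta_dot ** 2
               + c2 * alpha ** 2
               + c3 * alpha_dot ** 2
               + c4 * alpha_ddot ** 2)
    return gamma - penalty
```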

Termination

The environment terminates when the stepper motor exceeds the allowed range of motion (more than 90° left or right from the zero position).

Truncation

The environment is truncated after 500 timesteps, but this can be adjusted in /conf/mode/train.yaml.
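
A sketch of how the two flags could be produced inside the environment's step(), following the Gymnasium termination/truncation convention and the limits described above:

```python
STEP_LIMIT = 200         # +/- 90 degrees from the zero position, in motor steps
MAX_EPISODE_STEPS = 500  # episode length, configurable in /conf/mode/train.yaml

def check_episode_end(motor_steps: int, elapsed_steps: int) -> tuple[bool, bool]:
    """Return (terminated, truncated) for the current timestep."""
    terminated = abs(motor_steps) > STEP_LIMIT       # motor left its allowed range
    truncated = elapsed_steps >= MAX_EPISODE_STEPS   # time limit reached
    return terminated, truncated
```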

Usage

Configuration

The script provides a high level of flexibility for training and evaluating the model. Configurations are stored in the conf folder, organised in the following manner (a loading sketch follows the list):

  1. config.yaml
    • model selection (PPO or SAC)
    • mode selection (train or eval)
    • serial communication configuration between the PC and the Arduino
    • action and observation space configuration
    • reward function weights
    • toggle logging with TensorBoard
  2. /mode/train.yaml:
    • new model file configuration
    • device config (cuda or cpu)
    • training total timesteps and episode length
  3. /mode/eval.yaml:
    • device config (cuda or cpu)
    • episode length set to -1 for infinite episode length
  4. /model (PPO.yaml and SAC.yaml) config files:
    • model weights
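
As a sketch of how such a layout can be consumed (assuming OmegaConf/Hydra-style configs; the actual entry point is src/main.py):

```python
from omegaconf import OmegaConf

# Load the top-level config and one mode-specific config, then merge them.
base_cfg = OmegaConf.load("conf/config.yaml")
mode_cfg = OmegaConf.load("conf/mode/train.yaml")
cfg = OmegaConf.merge(base_cfg, mode_cfg)

print(OmegaConf.to_yaml(cfg))  # inspect the resolved configuration
```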

Install and Run

  1. To run the code, first install the dependencies from the root directory with: pip install -e .

  2. Upload the Arduino code in /robot/arduino/main/main.ino to the Arduino board.

  3. Connect the robot to the PC and run the script, remembering to set the mode to train: python src/main.py

  4. The trained model will then be stored in /model, where it can be trained further or evaluated.

Credits

We largely referred to the following resources for guidance:

  1. Armandpl's video and repository on building a similar project.
  2. Inspiration for this project was taken from the Quanser QUBE design; the reward function was adapted from Quanser's code.
  3. Stable Baselines' guide on custom environment creation.
  4. Farama's guide on handling termination vs. truncation scenarios when designing the environment.
