This playground records some results and tips from my experiments.
The name "Reinforcement Learning" is quite confusing. I see "learning" as the infrastructure, which does not necessarily involve a neural network.
- Deep "learning" is the simple case.
- Reinforcement Learning is a more complex case, which combines learning with an HMM.
- Self-supervised learning uses a multi-task trick to enhance the model's ability.
Based on this insight, my implementation consists of several modules:
- Main: "Break_out_TD_A2C", the most important part. It defines how the model is trained, and the training procedure is exactly the "algorithm" of Reinforcement Learning (a minimal sketch of such a loop follows this list).
- Model: "RL_model", the neural network implementation. It can be as simple as a few CNN layers, or contain a more complex curiosity-driven unit.
- Module: e.g. "octave_module". Handy helper implementations live here.
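As a rough illustration of what the Main part does, here is a minimal sketch of a one-step (TD) A2C update loop. This is not the code from "Break_out_TD_A2C"; it assumes a PyTorch model that returns action logits and a state value, and an environment with the classic `reset()`/`step()` interface.

```python
import torch
import torch.nn.functional as F

def train_episode(env, model, optimizer, gamma=0.99):
    """One episode of one-step (TD) advantage actor-critic training (illustrative sketch)."""
    state = env.reset()
    done = False
    episode_reward = 0.0
    while not done:
        obs = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        logits, value = model(obs)                        # policy logits and V(s)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()

        # Classic gym-style step(); adapt if your env returns a 5-tuple.
        next_state, reward, done, _ = env.step(action.item())
        episode_reward += reward

        # TD target: bootstrap from the critic unless the game just ended.
        with torch.no_grad():
            next_obs = torch.as_tensor(next_state, dtype=torch.float32).unsqueeze(0)
            _, next_value = model(next_obs)
            target = reward + gamma * next_value.squeeze() * (1.0 - float(done))

        advantage = target - value.squeeze()              # TD error used as the advantage
        actor_loss = -(dist.log_prob(action) * advantage.detach()).mean()
        critic_loss = F.mse_loss(value.squeeze(), target)
        loss = actor_loss + 0.5 * critic_loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        state = next_state
    return episode_reward
```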
The hardest part is computing the reward correctly: the design has to handle the reward itself, the timing of the training updates, and the end of the game properly (a sketch of the return computation follows). The second hardest part is the policy gradient loss, which is very different from the losses we use every day.
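To make the end-of-game handling concrete, here is a hedged sketch (not the repo's actual code) of computing discounted returns over a collected trajectory; the running return is reset whenever an episode ends, so no reward leaks across games.

```python
def discounted_returns(rewards, dones, gamma=0.99, bootstrap_value=0.0):
    """Discounted returns for per-step reward/done lists (illustrative sketch)."""
    returns = []
    running = bootstrap_value
    for reward, done in zip(reversed(rewards), reversed(dones)):
        if done:
            running = 0.0          # no reward flows across the end of a game
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

# Example: a reward of 1 at the last step of a 3-step episode
# discounted_returns([0, 0, 1], [False, False, True], gamma=0.9)
# -> [0.81, 0.9, 1.0]
```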
Here's an explanation of my implementation, shown as an image:
For now, I am trying a curiosity model to accelerate the training process:
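The curiosity idea can be sketched roughly as follows. This is an illustrative, ICM-style forward model, not the implementation in "RL_model": the intrinsic reward is the error of predicting the next state's features, and it is added on top of the environment reward.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForwardCuriosity(nn.Module):
    """Curiosity-style intrinsic reward sketch: reward = forward-model prediction error."""
    def __init__(self, feature_dim, n_actions, hidden=256):
        super().__init__()
        # Predicts the next state's features from the current features + action.
        self.forward_model = nn.Sequential(
            nn.Linear(feature_dim + n_actions, hidden),
            nn.ReLU(),
            nn.Linear(hidden, feature_dim),
        )
        self.n_actions = n_actions

    def intrinsic_reward(self, features, next_features, action):
        one_hot = F.one_hot(action, self.n_actions).float()
        predicted = self.forward_model(torch.cat([features, one_hot], dim=-1))
        # Per-sample prediction error = curiosity bonus, added to the env reward.
        return 0.5 * (predicted - next_features).pow(2).mean(dim=-1)
```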
However, even if all of the above is implemented correctly, I still can't promise it will work perfectly in every case. Here are several tips for dealing with situations where the model breaks down.
- If the action probabilities end up stuck at the same values for every frame: decrease the learning rate.
- How to know whether the hyperparameters are set properly: observe how the probabilities change frame by frame.
- How to know that it is converging: the episode reward should go up. But the process is extremely slow; it took me a month to train. So if it seems not to converge for days, be patient.
- The actor and critic losses by themselves don't tell you much.
- The advantage should sometimes be positive and sometimes negative.
- Every term in the loss function should be given a proper coefficient. Watch carefully how each loss changes and adjust them; in my experience, trial and error is the only way (a sketch of such a weighted loss follows this list).
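As a hedged illustration of that coefficient tuning, a combination like the following is typical; the default values below are common starting points, not the ones used in this repo, and usually need adjusting by trial and error.

```python
import torch

def total_loss(actor_loss, critic_loss, entropy, curiosity_loss=torch.tensor(0.0),
               actor_coef=1.0, critic_coef=0.5, entropy_coef=0.01, curiosity_coef=0.1):
    """Weighted sum of the A2C loss terms (illustrative coefficients only)."""
    # The entropy term is subtracted so that higher entropy lowers the loss,
    # discouraging the policy from collapsing onto one action too early.
    return (actor_coef * actor_loss
            + critic_coef * critic_loss
            - entropy_coef * entropy
            + curiosity_coef * curiosity_loss)
```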
With all these tips and a correct implementation, it took me a month to train. The result is not particularly good, but the agent did learn how to catch the ball.
Here's the result of the A2C model: