
logs13: Understand Policy Gradient

Higepon Taro Minowa edited this page May 11, 2018 · 14 revisions


1: What specific output am I working on right now?

  • What is policy gradient?
  • We multiply the loss value by the reward, but what does that actually mean?
  • How should we initialize the model based on the policy?
  • Write a blog post about Policy Gradient (blog link)
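To make the second bullet concrete, here is a minimal sketch (my own illustration, not code from this project) of what "multiply the loss by the reward" means: the loss is the negative log-probability of the action actually taken, scaled by the reward that followed it. The linear policy, the state vector, and the learning rate below are all made up for the example.

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def pg_loss_and_grad(w, state, action, reward):
    """Toy linear policy: logits = w @ state, softmax over actions."""
    probs = softmax(w @ state)
    # loss = -log pi(action|state) * reward: a positive reward pushes the
    # probability of the chosen action up, a negative reward pushes it down
    loss = -np.log(probs[action]) * reward
    # gradient of cross-entropy w.r.t. the logits is (probs - onehot),
    # scaled by the same reward
    dlogits = probs.copy()
    dlogits[action] -= 1.0
    grad = np.outer(dlogits * reward, state)
    return loss, grad

rng = np.random.default_rng(0)
w = rng.normal(size=(2, 3)) * 0.1          # 2 actions, 3 state features
state = np.array([1.0, -0.5, 0.2])
loss, grad = pg_loss_and_grad(w, state, action=1, reward=+1.0)

# one gradient-descent step should make the rewarded action more likely
w2 = w - 0.1 * grad
```

So the reward is acting as a per-sample weight on an ordinary cross-entropy loss: with reward +1 this is exactly supervised learning toward the taken action, and with reward -1 it is supervised learning away from it.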
TO READ in this order
  1. DONE Deep Reinforcement Learning: Pong from Pixels
    1. Teaching
      1. DONE RL Course by David Silver - Lecture 1: Introduction to Reinforcement Learning - YouTube
      2. DONE RL Course by David Silver - Lecture 2: Markov Decision Process
    2. John Schulman 2: Deep Reinforcement Learning - YouTube
  2. DONE Policy gradients for reinforcement learning in TensorFlow (OpenAI gym CartPole environment) SHOULD REVISIT
  3. Simple Reinforcement Learning with Tensorflow: Part 2 - Policy-based Agents
  4. https://github.com/williamFalcon/DeepRLHacks
  5. http://joschu.net/docs/nuts-and-bolts.pdf
  6. https://www.alexirpan.com/2018/02/14/rl-hard.html
  7. https://blog.openai.com/deep-reinforcement-learning-from-human-preferences/

Questions to Answer

  • Should rewards be both positive and negative?
  • Should rewards be normalized?
  • We sum up the rewards and multiply, but does that make sense?
    • Should each action (= each reply) get its own reward?
  • Cross-entropy method - Wikipedia. Andrej said this is the first method we should try.
  • Make matching charts
    • Pong case: batch, reward, policy
    • seq2seq case: batch, reward, policy
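One common answer to the reward questions above, sketched below (this follows the scheme from Karpathy's Pong post, not this project's code): give each action the *discounted* sum of future rewards, then normalize the returns to zero mean and unit variance, so roughly half the actions get a positive signal and half a negative one. The episode rewards and gamma are example values.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Return for each timestep: r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def normalize(returns):
    # zero-mean, unit-variance: actions better than average are encouraged,
    # actions worse than average are discouraged
    return (returns - returns.mean()) / (returns.std() + 1e-8)

rewards = [0.0, 0.0, 0.0, 1.0]   # sparse reward arriving at the end of an episode
r = discounted_returns(rewards)
rs = normalize(r)
```

This gives every action (every reply, in the seq2seq case) its own credit instead of one flat episode reward, and the normalization answers the first two questions at once: yes to negative values, yes to normalizing.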

2: Thinking out loud - e.g. hypotheses about the current problem, what to work on next, how can I verify

3: A record of currently ongoing runs along with a short reminder of what question each run is supposed to answer

  • run1: title

4: Results of runs (TensorBoard graphs, any other significant observations), separated by type of run (e.g. by the environment the agent is being trained in)
