logs7: Test RL with small model
- Log 1: what specific output am I working on right now?
- I'm trying to verify that the RL framework is working.
- If avg_len goes up and the result is reproducible, it's working. (A minimal sketch of this check follows right after this entry.)
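A minimal sketch of the check above, assuming avg_len is recorded once per training step; the helper names and window sizes are illustrative, not taken from the actual training code:

```python
import numpy as np

def avg_len_increasing(avg_len_per_step, head=50, tail=50):
    """Rough trend check: compare the mean avg_len of the first
    `head` steps with the mean of the last `tail` steps."""
    xs = np.asarray(avg_len_per_step, dtype=np.float64)
    return xs[-tail:].mean() > xs[:head].mean()

def runs_reproducible(run_a, run_b, tol=1e-6):
    """Two runs with the same seed should produce (near-)identical curves."""
    a, b = np.asarray(run_a), np.asarray(run_b)
    return a.shape == b.shape and np.allclose(a, b, atol=tol)

# The RL framework counts as "working" if both hold for two seeded runs:
#   avg_len_increasing(run_a) and runs_reproducible(run_a, run_b)
```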
- Log 2: thinking out loud - e.g. hypotheses about the current problem, what to work on next
- Run and observe the graphs.
- The loss starts at 20; oops, I forgot to copy the learned model.
- Here are the results from small-data RL initialized from the normal seq2seq model.
- Observations
- valid_loss seems okay; it starts from almost zero and goes down.
- reward and reply_len don't change.
- For each input, the model returns the exact same reward.
- Things to do
- Check if we can reproduce -> Yes we could.
- Observations
- The reward doesn't change.
- What does it mean?
- Are the replies from seq2seq the same length? Let's confirm -> YES.
- The reward still doesn't change. Sample output:
```
replies [[ 6  7  5 20  1 27 28 29  4  1]
         [23 24 25  4 26 27 28 29  4  1]
         [30 31  9 32 33  5  4  1  1  1]
         [ 6  7  5 20  1 27 28 29  4  1]
         [23 24 25  4 26 27 28 29  4  1]
         [30 31  9 32 33  5  4  1  1  1]]
length = [5, 10, 8, 5, 10, 8]
reward = 0.19166666666666668
```
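For reference, the lengths printed above are consistent with treating token id 1 as the EOS/padding symbol. A minimal sketch of that length computation (the EOS id and helper function are assumptions for illustration, not taken from the training code):

```python
import numpy as np

# Batch of sampled reply token ids copied from the log output above.
replies = np.array([
    [ 6,  7,  5, 20,  1, 27, 28, 29,  4,  1],
    [23, 24, 25,  4, 26, 27, 28, 29,  4,  1],
    [30, 31,  9, 32, 33,  5,  4,  1,  1,  1],
    [ 6,  7,  5, 20,  1, 27, 28, 29,  4,  1],
    [23, 24, 25,  4, 26, 27, 28, 29,  4,  1],
    [30, 31,  9, 32, 33,  5,  4,  1,  1,  1],
])

EOS_ID = 1  # assumption: id 1 marks end-of-sequence / padding

def reply_length(tokens, eos_id=EOS_ID):
    """Length up to and including the first EOS token."""
    eos_positions = np.where(tokens == eos_id)[0]
    return int(eos_positions[0]) + 1 if eos_positions.size else len(tokens)

print([reply_length(r) for r in replies])   # [5, 10, 8, 5, 10, 8] -- matches the log
print(len({tuple(r) for r in replies}))     # 3 unique replies in a batch of 6
```

Note that the last three rows repeat the first three exactly, so any deterministic reward computed from these replies will come out identical batch after batch, which matches the flat reward curve.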
- My hypothesis: the seq2seq model has already converged, and it's too late to move it out of its local optimum.
- How to confirm: stop the seq2seq training at 120 steps, then compare with the previous result.
- Previous result: avg len 19.0, validation loss 0.0037353924, reward = 0.19166666666666668
- Result: avg len 18.5, validation loss 0.016834794, reward = 0.19166666666666668
- The results tell us almost nothing.
- We can't say for sure whether RL worked, because it eventually converges to loss = 0 (the model is too big).
- Conclusion: just testing with a small model and small data didn't work. We should explore the entropy method described in the blog (a rough sketch of a typical entropy bonus follows below).
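The blog's exact formulation isn't reproduced here; the following is only a rough sketch of a generic entropy bonus added to a policy-gradient loss (plain NumPy, with made-up variable names), to illustrate the general idea of discouraging the policy from collapsing onto a single deterministic reply:

```python
import numpy as np

def entropy_bonus_loss(log_probs, actions, advantages, beta=0.01):
    """Policy-gradient loss with an entropy bonus.

    log_probs:  (batch, steps, vocab) log-probabilities from the decoder
    actions:    (batch, steps) sampled token ids
    advantages: (batch,) reward minus baseline for each sampled reply
    beta:       weight of the entropy term (hyperparameter)
    """
    # Log-probability of the tokens that were actually sampled.
    picked = np.take_along_axis(log_probs, actions[..., None], axis=-1).squeeze(-1)
    pg_loss = -np.mean(picked.sum(axis=1) * advantages)

    # Mean per-step entropy of the output distribution; higher entropy
    # means the decoder keeps exploring instead of repeating one reply.
    probs = np.exp(log_probs)
    entropy = -np.sum(probs * log_probs, axis=-1).mean()

    # Subtracting the entropy term rewards keeping the policy stochastic.
    return pg_loss - beta * entropy
```

The design point is that the entropy term is subtracted from the loss, so the optimizer trades off higher reward against keeping the sampled replies diverse.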
- Log 3: record of currently ongoing runs, along with a short reminder of what question each run is supposed to answer
- Log 4: results of runs (TensorBoard graphs, any other significant observations), separated by type of run (e.g. by the environment the agent is being trained in)