Test error does not decrease #3

Open
jiwoongim opened this issue Sep 9, 2020 · 3 comments
jiwoongim commented Sep 9, 2020

Hi Peter,

I set offline_test_mode='cold_test' and n_test_steps=1000, then ran demo_uoro_abnb.py.
I notice that the training error (the online recent error) decreases over time, but the test error does not; it stays fixed at Test: 1.73.
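
Concretely, the run was something like this (just a sketch of the settings above; the entry point and how the arguments are passed are my guesses, only offline_test_mode and n_test_steps are the real parameter names):

    # Sketch only: the entry point and call signature are assumptions;
    # offline_test_mode and n_test_steps are the settings I actually changed.
    from demo_uoro_abnb import demo_anbn_prediction  # assumed import

    demo_anbn_prediction(
        offline_test_mode='cold_test',  # run periodic offline tests from a cold (initial) state
        n_test_steps=1000,              # number of steps per offline test
    )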

INFO:artemis:Saving Result for Experiment "2020.09.08T21.02.19.597706-demo_anbn_prediction.insane.uoro"
    Progress: 61%.  2812.4s Elapsed, 1766.4s Remaining, 4578.9s Total.  Iteration 614220 of 1000000. Online recent Error: 0.2555997261282156  614221 calls averaging 2.2e+02 calls/s
    Progress: 65%.  3012.4s Elapsed, 1555.9s Remaining, 4568.3s Total.  Iteration 659424 of 1000000. Online recent Error: 0.2405024844314412  659425 calls averaging 2.2e+02 calls/s
    Iteration 666666 of 1000000: Test: 1.73
    Yielding Result at 696755 iterations.
INFO:artemis:Saving Result for Experiment "2020.09.08T21.02.19.597706-demo_anbn_prediction.insane.uoro"
    Progress: 70%.  3212.4s Elapsed, 1367.4s Remaining, 4579.8s Total.  Iteration 701436 of 1000000. Online recent Error: 0.2685162282956262  701437 calls averaging 2.2e+02 calls/s
    Progress: 74%.  3412.4s Elapsed, 1157.7s Remaining, 4570.2s Total.  Iteration 746680 of 1000000. Online recent Error: 0.24070033847752287  746681 calls averaging 2.2e+02 calls/s
    Iteration 777777 of 1000000: Test: 1.73
    Progress: 78%.  3612.4s Elapsed, 964.8s Remaining, 4577.2s Total.  Iteration 789226 of 1000000. Online recent Error: 0.24179974138683633  789227 calls averaging 2.2e+02 calls/s
    Yielding Result at 789656 iterations.
INFO:artemis:Saving Result for Experiment "2020.09.08T21.02.19.597706-demo_anbn_prediction.insane.uoro"
    Progress: 82%.  3812.5s Elapsed, 785.3s Remaining, 4597.8s Total.  Iteration 829196 of 1000000. Online recent Error: 0.24460602741139306  829197 calls averaging 2.2e+02 calls/s
    Progress: 87%.  4012.5s Elapsed, 587.8s Remaining, 4600.3s Total.  Iteration 872223 of 1000000. Online recent Error: 0.2729340723322489  872224 calls averaging 2.2e+02 calls/s
    Iteration 888888 of 1000000: Test: 1.73
    Yielding Result at 894169 iterations.

Have you noticed the same?
Is it supposed to work like this?

Thanks

petered (Owner) commented Sep 9, 2020

I looked into it and I believe the culprit was the line model.set_state(initial_state) in training.py. For some reason it was setting the network back to its initial state before running each offline test. It looks like I hadn't used the offline-testing part of the code in a while and had let it rot. By commenting out that line (and making the other changes needed to enable offline testing), I get a falling offline test error. See the branch here: #4

INFO:artemis:========== Running Experiment: demo_anbn_prediction.insane.uoro ==========
    /home/peter.oconnor/projects/uoro-demo/uoro_demo/torch_utils/variable_workshop.py:91: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
      merged_var = torch.tensor(merged_var.data, requires_grad=requires_grad)
    Progress: 0%.  0.0s Elapsed, nans Remaining, nans Total.  Iteration 0 of 10000000. Online recent Error: 1.646039366722107  1 calls averaging 2.7e+02 calls/s
    Iteration 1000 of 10000000: Test: 0.594
    Yielding Result at 1000 iterations.
INFO:artemis:Saving Result for Experiment "2020.09.09T06.45.17.666388-demo_anbn_prediction.insane.uoro"
    Iteration 2200 of 10000000: Test: 0.475
    Yielding Result at 2200 iterations.
INFO:artemis:Saving Result for Experiment "2020.09.09T06.45.17.666388-demo_anbn_prediction.insane.uoro"
    Progress: 0%.  5.0s Elapsed, 20651.8s Remaining, 20656.9s Total.  Iteration 2423 of 10000000. Online recent Error: 0.3272710291003789  2424 calls averaging 4.8e+02 calls/s
    Iteration 3631 of 10000000: Test: 0.433
    Yielding Result at 3631 iterations.
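
In code terms, the change is roughly the following (a paraphrased sketch, not the actual diff in #4; only the model.set_state(initial_state) call is the real line from training.py, the surrounding names are placeholders):

    # Paraphrased sketch of the offline-test path in training.py; only the
    # model.set_state(initial_state) line is real, the rest are placeholders.
    def offline_test(model, initial_state, test_data, loss_fn):
        # model.set_state(initial_state)   # commented out: this was resetting the
        #                                  # network state before every offline test
        errors = [loss_fn(model.predict(x), y) for x, y in test_data]
        return sum(errors) / len(errors)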

petered (Owner) commented Sep 9, 2020

Hmm, actually maybe that line should be there. The "state" is just the recurrent state, not the parameters.

The real problem might be that the network never learns to get out of a "zero" initial state (because it only does so once in training - at the very beginning).

petered (Owner) commented Sep 9, 2020

Well, no time to look into it now, but yeah I think it's something to do with the way the state of the model is reset before the offline test. The full model state has two parts - (1) the weights and biases, and (2) the "online" recurrent activations. Only part (2) should be reset before the offline test (and set back again after).
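
Something along these lines (a minimal sketch, assuming the model exposes get_state()/set_state() for the recurrent activations; only set_state appears in the code discussed above, the rest of the names are assumptions):

    import torch

    def offline_test(model, initial_state, test_data, loss_fn):
        # Part (1), the weights and biases, is never touched here.
        saved_state = model.get_state()    # stash the online recurrent activations (part 2)
        model.set_state(initial_state)     # reset only the recurrent state for the cold test
        with torch.no_grad():              # the offline test should not update anything
            errors = [loss_fn(model.predict(x), y) for x, y in test_data]
        model.set_state(saved_state)       # put the online state back before training resumes
        return sum(errors) / len(errors)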
