
Pre-training using GPUs is strange #21

tomohideshibata opened this issue Dec 12, 2019 · 4 comments

@tomohideshibata

I am trying to pre-train from scratch on Japanese data using GPUs, but the pre-training seems strange.
In the following log, masked_lm_accuracy and sentence_order_accuracy suddenly drop.

..
I1211 00:37:45.981178 139995264022336 model_training_utils.py:346] Train Step: 45595/273570  / loss = 0.8961147665977478  masked_lm_accuracy = 0.397345  lm_example_loss = 2.636538  sentence_order_accuracy = 0.772450  sentence_order_mean_loss = 0.425534
I1211 14:28:47.512063 139995264022336 model_training_utils.py:346] Train Step: 91190/273570  / loss = 0.7142021656036377  masked_lm_accuracy = 0.454914  lm_example_loss = 2.074183  sentence_order_accuracy = 0.810986  sentence_order_mean_loss = 0.372746
I1212 04:19:05.215945 139995264022336 model_training_utils.py:346] Train Step: 136785/273570  / loss = 1.9355322122573853  masked_lm_accuracy = 0.062883  lm_example_loss = 5.900585  sentence_order_accuracy = 0.572066  sentence_order_mean_loss = 0.668080
..

Has anyone succeeded in pre-training from scratch?
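
For reference, a minimal sketch (not from this repository) of how one could locate exactly where masked_lm_accuracy collapses, assuming the log format printed by model_training_utils.py above; the file name train.log is a placeholder:

import re

# Matches the metric fields in the log lines shown above.
PATTERN = re.compile(
    r"Train Step: (\d+)/\d+.*?masked_lm_accuracy = ([\d.]+)"
    r".*?sentence_order_accuracy = ([\d.]+)"
)

with open("train.log") as f:  # placeholder log file
    for line in f:
        m = PATTERN.search(line)
        if not m:
            continue
        step, mlm_acc, sop_acc = int(m.group(1)), float(m.group(2)), float(m.group(3))
        # Flag steps where the masked LM accuracy has collapsed.
        if mlm_acc < 0.1:
            print(f"step {step}: masked_lm_accuracy={mlm_acc}, sentence_order_accuracy={sop_acc}")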

@steindor

steindor commented Jan 5, 2020

Are you pre-training from scratch or initializing from the .h5 file?

I've been pre-training initialized from the .h5 file, and the loss appears to be unchanged between epochs:

Epoch 1:
Train Step: 32256/32256,
loss = 6.186943054199219  
masked_lm_accuracy = 0.117702  
lm_example_loss = 5.309477  
sentence_order_accuracy = 0.550855  
sentence_order_mean_loss = 0.689294

Epoch 2:
Train Step: 64512/64512  
loss = 6.207996845245361  
masked_lm_accuracy = 0.114809  
lm_example_loss = 5.329027  
sentence_order_accuracy = 0.546185  
sentence_order_mean_loss = 0.689931

Going to try from scratch to see if it makes a difference.
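
One quick check for the from-.h5 case is whether loading the .h5 file actually changes the weights, since an unchanged loss can also come from a silently failed initialization. A minimal sketch, assuming a Keras model; build_model() and the .h5 path are placeholders, not names from this repository:

import numpy as np

model = build_model()                 # placeholder for however the repo builds its Keras model
before = [w.numpy().copy() for w in model.weights]
model.load_weights("albert_base.h5")  # placeholder path to the .h5 file
after = [w.numpy() for w in model.weights]

# Count how many weight tensors were actually overwritten by the .h5 file.
changed = sum(not np.array_equal(b, a) for b, a in zip(before, after))
print(f"{changed}/{len(before)} weight tensors changed after load_weights")
print(f"trainable variables: {len(model.trainable_variables)}")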

@tomohideshibata
Author

From scratch.

@steindor

I trained from scratch and saw no difference. I reduced the dataset to only 10,000 sentences to make it easier to debug and perhaps make the model overfit the data, but the loss doesn't change from epoch to epoch. So I'm still not able to pre-train from scratch, but it appears we aren't dealing with the same problem. It would be good to know if anyone has succeeded in pre-training from scratch.
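
As a standalone version of that overfit check, a minimal sketch of training repeatedly on one fixed batch; model, loss_fn, and batch are placeholders for whatever the pre-training script builds, not names from this repository. On a single batch the loss should drop towards zero within a few hundred steps; if it stays flat while the gradient norm is zero, the loss is disconnected from the trainable weights:

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

@tf.function
def train_step(inputs, labels):
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(inputs, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    grads_and_vars = [(g, v) for g, v in zip(grads, model.trainable_variables) if g is not None]
    # If the global gradient norm is zero, no learning can happen.
    grad_norm = tf.linalg.global_norm([g for g, _ in grads_and_vars])
    optimizer.apply_gradients(grads_and_vars)
    return loss, grad_norm

inputs, labels = batch  # one small, fixed batch
for step in range(200):
    loss, grad_norm = train_step(inputs, labels)
    if step % 20 == 0:
        print(step, float(loss), float(grad_norm))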

@tomohideshibata
Author

I think most of the code comes from the following official Google TF 2.0 BERT code:
https://github.com/tensorflow/models/tree/master/official/nlp/bert

I also tried pre-training with the above repository, but it failed.

I posted an issue in the official Google repository: tensorflow/models#7903 (comment).
Parameters such as predictions/output_bias and seq_relationship/output_weights are not saved in the checkpoint. I am not sure whether the pre-training failure arises from this, but there may be a problem in the pre-training code.
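
To check that on a concrete run, a minimal sketch that lists the variables stored in a pre-training checkpoint and flags the MLM/SOP head variables; the model_dir path is a placeholder:

import tensorflow as tf

# Prints (name, shape) for every variable in the latest checkpoint under model_dir.
for name, shape in tf.train.list_variables("model_dir"):
    marker = " <-- head variable" if "predictions" in name or "seq_relationship" in name else ""
    print(name, shape, marker)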
