
Question about training stability #14

Open
meme-virus opened this issue Oct 25, 2023 · 3 comments
Comments

@meme-virus

Thanks for your work!
I'm currently trying to reproduce your paper, but I'm running into some difficulties. When I train the language modeling task with the default parameters from the README, training becomes unstable. The details are as follows:
0/200 [train] loss=5.945 [val] loss=5.917, pp=371.43, acc=0.185491 [time per itr] 1403.69ms [lr] 0.00003
0/400 [train] loss=5.655 [val] loss=5.477, pp=239.12, acc=0.196609 [time per itr] 1223.88ms [lr] 0.00005
0/600 [train] loss=5.285 [val] loss=5.259, pp=192.26, acc=0.200577 [time per itr] 1213.27ms [lr] 0.00010
0/800 [train] loss=5.326 [val] loss=5.250, pp=190.52, acc=0.197866 [time per itr] 1206.03ms [lr] 0.00015
0/1000 [train] loss=4.970 [val] loss=5.168, pp=175.63, acc=0.202474 [time per itr] 1197.92ms [lr] 0.00022
0/1200 [train] loss=5.088 [val] loss=5.093, pp=162.88, acc=0.206467 [time per itr] 1198.33ms [lr] 0.00031
0/1400 [train] loss=nan [val] loss=nan, pp=nan, acc=0.000956 [time per itr] 1183.48ms [lr] 0.00041
0/1600 [train] loss=nan [val] loss=nan, pp=nan, acc=0.001068 [time per itr] 1086.21ms [lr] 0.00052
0/1800 [train] loss=nan [val] loss=nan, pp=nan, acc=0.000971 [time per itr] 1090.91ms [lr] 0.00063
0/2000 [train] loss=nan [val] loss=nan, pp=nan, acc=0.001470 [time per itr] 1092.99ms [lr] 0.00075
0/2200 [train] loss=nan [val] loss=nan, pp=nan, acc=0.001216 [time per itr] 1090.09ms [lr] 0.00088
0/2400 [train] loss=nan [val] loss=nan, pp=nan, acc=0.001114 [time per itr] 1089.34ms [lr] 0.00101
0/2600 [train] loss=nan [val] loss=nan, pp=nan, acc=0.001083 [time per itr] 1090.32ms [lr] 0.00114
0/2800 [train] loss=nan [val] loss=nan, pp=nan, acc=0.001165 [time per itr] 1089.56ms [lr] 0.00127
0/3000 [train] loss=nan [val] loss=nan, pp=nan, acc=0.001149 [time per itr] 1085.54ms [lr] 0.00139
0/3200 [train] loss=nan [val] loss=nan, pp=nan, acc=0.001246 [time per itr] 1087.70ms [lr] 0.00151
You can see that the learning rate keeps climbing and eventually settles at 0.0200. Because of the company firewall, I couldn't access the Internet while running the code, so instead of using the tiktoken library for the tokenizer, I used the GPT2Tokenizer provided by the transformers library. (I downloaded the vocab.json and merges.txt files locally and uploaded them to the server.) The modified code (in /root/xxx/landmark-attention/lm_benchmark/data/pg19/prepare.py) is as follows:
import os

from transformers import GPT2Tokenizer

# Load the GPT-2 BPE files from local disk instead of downloading them
vocab_file_path = "/root/xxx/landmark-attention/lm_benchmark/data/pg19/vocab.json"
merges_file_path = "/root/xxx/landmark-attention/lm_benchmark/data/pg19/merges.txt"
gpt2_tokenizer = GPT2Tokenizer(
    vocab_file=vocab_file_path,
    merges_file=merges_file_path,
)

def _read_directory(path):
    ...  # rest of the function kept the same
    with open(os.path.join(path, filename), 'r') as f:
        texts.extend(gpt2_tokenizer.encode(f.read()))
        texts.append(gpt2_tokenizer.eos_token_id)
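
For what it's worth, here is a quick sanity check I could run on a machine with network access to confirm the two tokenizers agree (illustrative only; tiktoken downloads its BPE files on first use, so it won't work behind the firewall):

# Illustrative sanity check: the offline GPT2Tokenizer should produce the
# same ids as tiktoken's "gpt2" encoding, and both use 50256 as the
# end-of-text id.
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # needs network access on first run
sample = "The quick brown fox jumps over the lazy dog."
assert enc.encode(sample) == gpt2_tokenizer.encode(sample)
assert enc.eot_token == gpt2_tokenizer.eos_token_id == 50256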

I want to know whether this change could be related to the training instability, since I haven't made any other modifications. Thank you very much for your reply!

@PennyPaetow

I'm having the same problem! Have you found a solution?

@mkrima
Collaborator

mkrima commented Mar 17, 2024

If you are using float16 for training, consider using bfloat16 instead. We experienced some instabilities with float16.
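
For reference, a minimal self-contained sketch of switching the autocast dtype to bfloat16 (not taken from this repo's training script; bfloat16 keeps float32's exponent range, so it avoids the overflow-to-NaN failures float16 is prone to and does not need a GradScaler):

# Illustrative sketch: mixed-precision step with bfloat16 autocast.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 16).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

x = torch.randn(4, 16, device=device)
# bfloat16 has the same dynamic range as float32, so no GradScaler is needed
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    loss = model(x).square().mean()
loss.backward()
opt.step()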

@PennyPaetow

Thank you very much for your reply! I'm using float32 for training. The problem turned out to be torch.compile: training is stable without torch.compile.
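
For anyone hitting the same thing, a rough sketch of gating compilation behind a flag so it can be toggled off while debugging NaNs (the flag name is hypothetical, not the repo's actual argument):

# Illustrative sketch: make torch.compile opt-in so it can be disabled.
import torch
import torch.nn as nn

use_compile = False  # hypothetical toggle; training was stable with this off
model = nn.Linear(16, 16)
if use_compile:
    model = torch.compile(model)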
