Congratulations on the excellent work!
When training large language models, we generally adopt gradient checkpointing to keep activation memory manageable. Could you please help me turn this technique on in your code?
Thanks a lot!
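For context, in a stock PyTorch / Hugging Face setup this is usually a single switch on the model. The snippet below is only a sketch of what I have in mind; the checkpoint path and API calls assume a transformers-style causal LM and are not your actual code.

```python
# Sketch only: how gradient checkpointing is typically enabled on a
# Hugging Face-style causal LM before training (not this repo's code).
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("path/to/tinyllama")  # placeholder checkpoint path

model.gradient_checkpointing_enable()  # recompute activations in the backward pass to save memory
model.config.use_cache = False         # the KV cache is incompatible with checkpointing during training
```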
I tested the modified code on an RTX 3090 with the tinyllama_opt.json config, batch size 16, and gradient checkpointing enabled. The training loss over the first 20 steps is consistent with the run without gradient checkpointing at batch size 4 and gradient accumulation 4, as expected: both give an effective batch size of 16, and checkpointing only changes memory/compute, not the gradients.
I hope this helps. If it works, I'd appreciate it if you could open a simple PR so that more people can benefit from the gradient checkpointing feature. A rough sketch of how the switch could be exposed through the training arguments follows.
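This assumes the training loop is built on transformers.Trainer (my assumption; the repo may use its own loop). The argument names come from the transformers API, the values mirror the test above, and output_dir is a placeholder.

```python
# Sketch assuming a transformers.Trainer-based loop; values mirror the
# RTX 3090 test above (one batch of 16 instead of 4 x 4 accumulation).
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",                  # placeholder output directory
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,       # trade extra forward compute for lower activation memory
)
```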