Kilka74/BoutiqueLM_HW

This repository contains code for training a small language model on a TinyStories-style dataset, following the TinyStories paper. The model is a pre-norm Transformer decoder that uses only masked self-attention (as in the original paper), extended with Rotary Position Embeddings (RoPE) and RMSNorm. Gradient accumulation is set to 4, and the model was trained on 5 billion tokens. For generating stories I implemented nucleus (top-p) sampling (see train.py). Minimal sketches of these components follow.
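
A minimal sketch of one pre-norm decoder block with RMSNorm and rotary embeddings, assuming PyTorch >= 2.0 (for `F.scaled_dot_product_attention`); the class and function names are illustrative, not the repo's actual code:

```python
# A sketch of a pre-norm decoder block with RMSNorm and RoPE; names are
# illustrative, not taken from this repo.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root-mean-square normalization: no mean subtraction, no bias."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


def rope(x, base: float = 10000.0):
    """Rotate (even, odd) channel pairs by position-dependent angles.

    x: (batch, heads, seq, head_dim)
    """
    t, d = x.shape[-2], x.shape[-1]
    pos = torch.arange(t, device=x.device).float()
    freqs = base ** (-torch.arange(0, d, 2, device=x.device).float() / d)
    angles = pos[:, None] * freqs[None, :]           # (seq, head_dim / 2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class DecoderBlock(nn.Module):
    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.norm1, self.norm2 = RMSNorm(dim), RMSNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        b, t, d = x.shape
        # Pre-norm: normalize before attention, add the residual after.
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
                   for z in (q, k, v))
        q, k = rope(q), rope(k)                      # rotary positions on q, k
        att = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.proj(att.transpose(1, 2).reshape(b, t, d))
        x = x + self.mlp(self.norm2(x))              # pre-norm MLP sub-layer
        return x
```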
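
The training step with gradient accumulation over 4 micro-batches could look like this; `model` and `loader` are placeholders, the AdamW betas come from the hyperparameter list below, and the learning rate is not specified in this README:

```python
# A sketch of gradient accumulation: 4 backward passes per optimizer step.
# `model` and `loader` are placeholders; the learning rate is NOT given here.
import torch
import torch.nn.functional as F

accum_steps = 4
optimizer = torch.optim.AdamW(model.parameters(), betas=(0.9, 0.95))

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    logits = model(inputs)                           # (batch, seq, vocab)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    (loss / accum_steps).backward()   # average gradients over micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Scaling the loss by `accum_steps` makes the accumulated gradient equal to the gradient of one large batch averaged over all 4 micro-batches.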
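
For decoding, a generic nucleus (top-p) sampling implementation looks like the sketch below; the repo's own version lives in train.py and may differ in details:

```python
# A generic nucleus (top-p) sampling sketch; see train.py for the repo's version.
import torch


def nucleus_sample(logits: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    """Sample a token id from the smallest set of tokens whose cumulative
    probability exceeds p. logits: (vocab_size,)."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Drop tokens whose preceding cumulative mass already reaches p;
    # the top-1 token is always kept.
    sorted_probs[cumulative - sorted_probs >= p] = 0.0
    sorted_probs /= sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[choice]
```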

I used the following hyperparameters (also collected in a config sketch after the table):

| Hyperparameter | Value |
| --- | --- |
| batch size | 512 |
| embedding dim | 512 |
| num heads | 8 |
| num layers | 8 |
| sequence length | 256 |
| tokenization | BPE |
| vocab size | 5000 |
| AdamW beta1 | 0.9 |
| AdamW beta2 | 0.95 |
| parameters (final model) | 30,285,824 |
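
The same hyperparameters gathered into a config sketch; the dataclass and field names are illustrative, not the repo's:

```python
# Hyperparameters from the table above; names are illustrative.
from dataclasses import dataclass


@dataclass
class TrainConfig:
    batch_size: int = 512
    embed_dim: int = 512
    num_heads: int = 8
    num_layers: int = 8
    seq_len: int = 256
    vocab_size: int = 5000                   # BPE tokenizer
    adamw_betas: tuple = (0.9, 0.95)
    grad_accum_steps: int = 4
    train_tokens: int = 5_000_000_000
```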

My report and wandb logs are available here (in Russian).
