This repository contains code for training a small language model on a dataset like the one used in the TinyStories paper. I implement a pre-norm Transformer decoder with masked self-attention only (as in the original paper), and add rotary position embeddings (RoPE) and RMSNorm. Gradient accumulation is set to 4, and the model was trained on 5 billion tokens. For generating stories I've implemented nucleus sampling (see train.py).
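The actual sampler lives in train.py; below is only a minimal sketch of how nucleus (top-p) sampling can be done in PyTorch, assuming the model returns a 1-D tensor of logits over the vocabulary. The function name and the cutoff `p=0.9` are illustrative and not taken from this repo.

```python
import torch

def nucleus_sample(logits: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    """Sample one token id from `logits` (shape [vocab_size]) with top-p filtering."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep the smallest set of tokens whose cumulative probability reaches p;
    # the highest-probability token is always kept, so the set is never empty.
    drop = cumulative - sorted_probs > p
    sorted_probs[drop] = 0.0
    sorted_probs /= sorted_probs.sum()
    next_in_sorted = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[next_in_sorted]
```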
I used the following hyperparameters:
| Hyperparameter | Value |
|---|---|
| batch size | 512 |
| embed dim | 512 |
| num heads | 8 |
| num layers | 8 |
| sequence length | 256 |
| tokenization | BPE |
| vocab size | 5000 |
| AdamW beta1 | 0.9 |
| AdamW beta2 | 0.95 |
| number of parameters (final model) | 30,285,824 |
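For reference, the table above can be collected into a single config object. This is a sketch only: the class and field names are illustrative and may not match the ones used in train.py.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Values copied from the hyperparameter table above; names are illustrative.
    batch_size: int = 512
    embed_dim: int = 512
    num_heads: int = 8
    num_layers: int = 8
    seq_len: int = 256
    vocab_size: int = 5000                       # BPE tokenizer
    adamw_betas: tuple = (0.9, 0.95)
    grad_accum_steps: int = 4                    # gradient accumulation, as noted above
    total_train_tokens: int = 5_000_000_000      # total tokens seen during training
```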
My report and wandb logs are available here (in Russian).