llama2 & mamba training configs (#113)
* Llama 2 example scripts
* Mamba example scripts
* Added all training config sizes: an example Llama2 base, Mamba base, and all sizes in their respective folders
* Moving and reformatting configs
* Llama 2 example scripts
* Mamba example scripts
* Cleaning extra files
* remove src/llama2c
* moved/renamed configs
* bos/eos token ids
* updated configs following the meeting
* static stuff
* updated base configs and simplified config structure
* grad_acc_steps & batch_size fix
* add @beartype to config classes that don't have it yet
* don't ignore incorrect config keys
* gradient_accumulation_steps fix
* fix broken test config
* Updating test configs to work with recent changes
* config testing and fixes
* re-adding accidentally deleted test config
* fix tests that broke after recent changes
* remove minibatch divisibility requirement
* estimate_loss returns float, not tensor
* log train/validation dataset size when training starts
* fix incorrect train split default
* Updating log spaced checkpoints and checkpointing intervals
* update llama2 configs
* stories cfgs: checkpoints, evals, bos & eos

---------

Co-authored-by: Jett <[email protected]>
Co-authored-by: JaiDhyani <[email protected]>
Co-authored-by: Jannik Brinkmann <[email protected]>
- Loading branch information
1 parent ad2936f · commit 262972b · 69 changed files with 531 additions and 516 deletions
@@ -0,0 +1,9 @@
{
  "model_config": {
    "hidden_size": 12,
    "intermediate_size": 48,
    "num_attention_heads": 2,
    "num_hidden_layers": 1,
    "num_key_value_heads": 1
  }
}
@@ -0,0 +1,9 @@
{
  "model_config": {
    "hidden_size": 332,
    "intermediate_size": 896,
    "num_attention_heads": 12,
    "num_hidden_layers": 6,
    "num_key_value_heads": 6
  }
}
@@ -0,0 +1,9 @@
{
  "model_config": {
    "hidden_size": 84,
    "intermediate_size": 256,
    "num_attention_heads": 8,
    "num_hidden_layers": 4,
    "num_key_value_heads": 4
  }
}
@@ -0,0 +1,9 @@
{
  "model_config": {
    "hidden_size": 168,
    "intermediate_size": 384,
    "num_attention_heads": 8,
    "num_hidden_layers": 4,
    "num_key_value_heads": 4
  }
}
@@ -0,0 +1,9 @@
{
  "model_config": {
    "hidden_size": 28,
    "intermediate_size": 96,
    "num_attention_heads": 4,
    "num_hidden_layers": 2,
    "num_key_value_heads": 2
  }
}
@@ -0,0 +1,9 @@
{
  "model_config": {
    "hidden_size": 484,
    "intermediate_size": 1332,
    "num_attention_heads": 16,
    "num_hidden_layers": 8,
    "num_key_value_heads": 8
  }
}
@@ -0,0 +1,9 @@
{
  "model_config": {
    "hidden_size": 52,
    "intermediate_size": 184,
    "num_attention_heads": 4,
    "num_hidden_layers": 2,
    "num_key_value_heads": 2
  }
}
@@ -0,0 +1,9 @@
{
  "model_config": {
    "hidden_size": 6,
    "intermediate_size": 24,
    "num_attention_heads": 2,
    "num_hidden_layers": 1,
    "num_key_value_heads": 1
  }
}
@@ -0,0 +1,9 @@
{
  "model_config": {
    "hidden_size": 708,
    "intermediate_size": 1896,
    "num_attention_heads": 16,
    "num_hidden_layers": 8,
    "num_key_value_heads": 8
  }
}
@@ -0,0 +1,9 @@
{
  "model_config": {
    "hidden_size": 232,
    "intermediate_size": 512,
    "num_attention_heads": 12,
    "num_hidden_layers": 6,
    "num_key_value_heads": 6
  }
}
@@ -0,0 +1,7 @@
not using padding, so pad_token_id is not set
use_cache - using the default
pretraining_tp - experimental parallelization we're not using (off is the default)
tie_word_embeddings - llama2 used False, and untied embeddings are better for interpretability; note that llama2.c uses True by default, which is probably a more efficient use of parameters for very small models
rope settings are widely used defaults
attention_bias - no biases on QKV and the output projection is the default, and that's what we're using
attention_dropout - this is the only dropout llama2 can use; it's set to prob=0 by default, and that's what we're using
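For concreteness, here is a minimal sketch of how these notes map onto Hugging Face transformers' LlamaConfig. It is illustrative only, not the repo's actual loading code; the sizes come from one of the fragments above, and vocab_size and token ids from the base config below.

    from transformers import LlamaConfig, LlamaForCausalLM

    # Sizes from the hidden_size=84 fragment; shared settings per the notes above.
    config = LlamaConfig(
        vocab_size=4096,
        hidden_size=84,
        intermediate_size=256,
        num_attention_heads=8,
        num_hidden_layers=4,
        num_key_value_heads=4,
        bos_token_id=0,
        eos_token_id=1,
        tie_word_embeddings=False,  # untied: better for interpretability
        attention_bias=False,       # no QKV/output-projection biases (the default)
        attention_dropout=0.0,      # the only dropout llama2 has; off by default
        rope_theta=10000.0,         # widely used RoPE defaults
        rope_scaling=None,
        # pad_token_id left unset: no padding is used
    )
    model = LlamaForCausalLM(config)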
@@ -0,0 +1,52 @@
{
  "model_config": {
    "model_class": "LlamaForCausalLM",
    "vocab_size": 4096,
    "hidden_act": "silu",
    "max_position_embeddings": 512,
    "initializer_range": 0.02,
    "rms_norm_eps": 1e-06,
    "bos_token_id": 0,
    "eos_token_id": 1,
    "tie_word_embeddings": false,
    "rope_theta": 10000.0,
    "rope_scaling": null,
    "attention_bias": false,
    "attention_dropout": 0.0
  },
  "max_seq_len": 512,
  "device": "auto",
  "checkpoint_interval": 400,
  "extra_checkpoint_iters": [
    1,
    2,
    4,
    8,
    16,
    32,
    64,
    128,
    256,
    512
  ],
  "log_interval": 40,
  "eval_iters": 10,
  "batch_size": 256,
  "max_epochs": 10,
  "grad_clip": 1.0,
  "gradient_accumulation_steps": 1,
  "adam": {
    "learning_rate": 0.0005,
    "weight_decay": 0.1,
    "beta1": 0.9,
    "beta2": 0.95,
    "decay_lr": true,
    "warmup_iters": 1000,
    "min_lr": 0.0
  },
  "batch_ordering_seed": 1337,
  "torch_seed": 42,
  "dataset": {
    "name": "delphi-suite/stories-tokenized"
  }
}
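The size-specific files above contain only a model_config fragment. A plausible way they combine with this base config is a simple recursive dict merge; the sketch below assumes that, and the file paths are hypothetical (the actual merge logic of the training script is not shown in this diff).

    import json

    def deep_merge(base: dict, override: dict) -> dict:
        # Recursively merge `override` into `base`, returning a new dict.
        merged = dict(base)
        for key, value in override.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = deep_merge(merged[key], value)
            else:
                merged[key] = value
        return merged

    # Hypothetical paths, for illustration only.
    with open("llama2/base.json") as f:
        base_cfg = json.load(f)
    with open("llama2/size_fragment.json") as f:
        size_cfg = json.load(f)

    cfg = deep_merge(base_cfg, size_cfg)
    print(cfg["model_config"]["hidden_size"])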
@@ -0,0 +1,6 @@
{
  "model_config": {
    "hidden_size": 24,
    "num_hidden_layers": 2
  }
}
@@ -0,0 +1,6 @@
{
  "model_config": {
    "hidden_size": 400,
    "num_hidden_layers": 8
  }
}
@@ -0,0 +1,6 @@
{
  "model_config": {
    "hidden_size": 112,
    "num_hidden_layers": 6
  }
}
@@ -0,0 +1,6 @@
{
  "model_config": {
    "hidden_size": 204,
    "num_hidden_layers": 6
  }
}
@@ -0,0 +1,6 @@
{
  "model_config": {
    "hidden_size": 36,
    "num_hidden_layers": 4
  }
}
@@ -0,0 +1,6 @@
{
  "model_config": {
    "hidden_size": 664,
    "num_hidden_layers": 8
  }
}
@@ -0,0 +1,6 @@
{
  "model_config": {
    "hidden_size": 76,
    "num_hidden_layers": 4
  }
}
@@ -0,0 +1,6 @@
{
  "model_config": {
    "hidden_size": 12,
    "num_hidden_layers": 2
  }
}
@@ -0,0 +1,6 @@
{
  "model_config": {
    "hidden_size": 952,
    "num_hidden_layers": 8
  }
}
@@ -0,0 +1,6 @@
{
  "model_config": {
    "hidden_size": 308,
    "num_hidden_layers": 6
  }
}
@@ -0,0 +1,10 @@
pad_token_id - we're not using pad tokens, so we don't set it
layer_norm_eps - different from the rms norm eps in llama
initializer_range - different in mamba & llama
residual_in_fp32 - mamba-specific parameter
time_step_* - mamba-specific, sane defaults
there is no way to untie embeddings and unembeddings in mamba; they're tied by default
https://github.com/huggingface/transformers/blob/v4.40.0/src/transformers/models/mamba/modeling_mamba.py#L602-L610
rescale_prenorm_residual was True in the original paper, so we set it to True, despite the HF default being False
using the default for use_cache
state_size is the default
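A minimal sketch of instantiating one of the Mamba configs above with the Hugging Face API these notes reference (transformers v4.40-era). The sizes come from the smallest fragment above; vocab_size is an assumption carried over from the llama2 base config; everything else follows the notes.

    from transformers import MambaConfig, MambaForCausalLM

    config = MambaConfig(
        vocab_size=4096,                # assumption: same tokenized dataset as the llama2 configs
        hidden_size=24,                 # from the smallest size fragment above
        num_hidden_layers=2,
        rescale_prenorm_residual=True,  # True in the original paper, despite the HF default (False)
        # residual_in_fp32, time_step_*, state_size, use_cache: HF defaults, per the notes.
        # Embeddings and unembeddings are tied; MambaConfig has no way to untie them.
    )
    model = MambaForCausalLM(config)
    print(sum(p.numel() for p in model.parameters()))  # quick parameter count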