From 4c72efc5de154d4be823506e9690fd4d0dc4c57f Mon Sep 17 00:00:00 2001
From: Jett
Date: Sat, 25 May 2024 12:04:31 +0200
Subject: [PATCH] README update

---
 README.md                        | 30 +++++++++++++++++++++++++++++++++++++-
 configs/stories/llama2/README.md | 13 ++++++-------
 configs/stories/mamba/README.md  | 18 ++++++++----------
 3 files changed, 43 insertions(+), 18 deletions(-)

diff --git a/README.md b/README.md
index b2c38fe0..f8bf01d3 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,8 @@
+# delphi
+
+delphi is a set of tools for standardized and (mostly) reproducible training of small language models. You can use delphi to train a custom tokenizer, tokenize your dataset, and train your model. We build on top of HuggingFace, supporting every `CausalLM` architecture. Datasets, tokenizers, and models (including checkpoints!) can be downloaded from and uploaded to HuggingFace automatically, with no need to manage local files.
+
+
 # Setup
 
 1. Clone the repo
@@ -155,7 +160,7 @@ options:
   -s, --silent          Silence all logging. Mutually exclusive with --verbose, --loglevel
 ```
 
-You can specify primary config and secondary config, which is useful if you're training a suite of models that only differ in a few parameters. Additionally, you can override specific fields using the `--overrides` flag. If you don't want to push the model and its checkpoints to HF, you need to explicitly set `out_repo=""`. If you don't want to log to W&B, you need to set `wandb=""`.
+You can specify a primary config and a secondary config, which is useful if you're training a suite of models that differ only in a few parameters. Additionally, you can override specific fields using the `--overrides` flag. If you don't want to push the model and its checkpoints to HF, you need to explicitly set `out_repo=""`. If you don't want to log to W&B, you need to set `wandb=""`. Please note that by default we save the optimizer state (2x the model size) with every checkpoint.
 
 Here is how we trained our `stories-mamba-100k` model
 ```
@@ -166,3 +171,26 @@
   out_repo="delphi-suite/stories-mamba-100k" \
   wandb="delphi-suite/delphi"
 ```
+
+# Development
+
+1. Install the `dev` and `notebooks` dependencies: `pip install -e ."[dev,notebooks]"`.
+2. Run the tests: `pytest`.
+3. Install pre-commit: `pre-commit install`.
+4. Install the recommended vscode extensions.
+
+When you save a file, vscode should automatically format it. Otherwise, pre-commit will do that, but you will need to add the changes and commit again.
+
+# Citation
+
+If you use delphi in your research, please cite it using the following BibTeX entry
+
+```bibtex
+@software{delphi,
+  title = {delphi: small language models training made easy},
+  author = {Jett Janiak and Jai Dhyani and Jannik Brinkmann and Gonçalo Paulo and Joshua Wendland and Víctor Abia Alonso and Siwei Li and Phan Anh Duong and Alice Rigg},
+  year = 2024,
+  url = {https://github.com/delphi-suite/delphi},
+  license = {apache-2.0}
+}
+```
\ No newline at end of file
diff --git a/configs/stories/llama2/README.md b/configs/stories/llama2/README.md
index be1f976e..6192ef38 100644
--- a/configs/stories/llama2/README.md
+++ b/configs/stories/llama2/README.md
@@ -1,7 +1,6 @@
-not using padding, so pad_token_id not set
-use_cache - using default
-pretraining_tp - experimental parallelization we're not using, which is the default
-tie_word_embeddings - llama2 used False and this is better for interpretability, note that llama2.c is using True by default, which is probably more efficient use of parameters for very small models
-rope settings are widely used defaults
-attention_bias - no biases on QKV and output projection is the default and that's what we're using
-attention_dropout - this is the only dropout llama2 can use, it's set to prob=0 by default and that's what we're using
\ No newline at end of file
+- use_cache - using default
+- pretraining_tp - experimental parallelization we're not using, which is the default
+- tie_word_embeddings - llama2 used False and this is better for interpretability; note that llama2.c uses True by default, which is probably a more efficient use of parameters for very small models
+- rope settings are widely used defaults
+- attention_bias - no biases on QKV and output projection is the default, and that's what we're using
+- attention_dropout - this is the only dropout llama2 can use; it's set to prob=0 by default, and that's what we're using
\ No newline at end of file
diff --git a/configs/stories/mamba/README.md b/configs/stories/mamba/README.md
index 3e83bccc..7e30ceb7 100644
--- a/configs/stories/mamba/README.md
+++ b/configs/stories/mamba/README.md
@@ -1,10 +1,8 @@
-pad_token_id - we're not using pad tokens, do we don't set it
-layer_norm_eps - different than rms norm eps in mamba
-initializer_range - different in mamba & llama
-residual_in_fp32 - mamba specific parameter
-time_step_* - mamba specific, sane defaults
-there is no way to untie embeddings and unembeddings in mamba, they're tied by default
-https://github.com/huggingface/transformers/blob/v4.40.0/src/transformers/models/mamba/modeling_mamba.py#L602-L610
-rescale_prenorm_residual was True in original paper, so we set it to True, despite HF default being false
-using default for use_cache
-state_size is default
\ No newline at end of file
+- layer_norm_eps - different from the rms norm eps in llama
+- initializer_range - different in mamba & llama
+- residual_in_fp32 - mamba-specific parameter
+- time_step_* - mamba-specific, sane defaults
+- there is no way to untie embeddings and unembeddings in mamba; they're tied by default, see https://github.com/huggingface/transformers/blob/v4.40.0/src/transformers/models/mamba/modeling_mamba.py#L602-L610
+- rescale_prenorm_residual was True in the original paper, so we set it to True, despite the HF default being False
+- using the default for use_cache
+- state_size is default
\ No newline at end of file
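
For reviewers, here is a minimal sketch of how the parameter notes in the two config READMEs above translate into Hugging Face config objects. It is not part of the patch: the class names come from `transformers` (the library the mamba note links to), and the model dimensions below are hypothetical placeholders, not the values used in the delphi configs.

```python
# Illustrative only: maps the README notes above onto transformers config classes.
# All sizes/depths are hypothetical placeholders, not delphi's actual config values.
from transformers import LlamaConfig, MambaConfig

llama_config = LlamaConfig(
    hidden_size=48,             # placeholder size
    intermediate_size=128,      # placeholder size
    num_hidden_layers=2,        # placeholder depth
    num_attention_heads=2,      # placeholder head count
    tie_word_embeddings=False,  # untied, as the llama2 note prefers for interpretability
    # use_cache, pretraining_tp, the rope settings, attention_bias and
    # attention_dropout are all left at their defaults, per the notes above.
)

mamba_config = MambaConfig(
    hidden_size=48,                 # placeholder size
    num_hidden_layers=2,            # placeholder depth
    rescale_prenorm_residual=True,  # True as in the original paper; HF default is False
    # state_size, use_cache, residual_in_fp32 and the time_step_* settings stay at
    # their defaults; embeddings and unembeddings are tied by the Mamba code itself.
)
```

A trained checkpoint pushed to the `out_repo` shown above should then load with the standard API, e.g. `AutoModelForCausalLM.from_pretrained("delphi-suite/stories-mamba-100k")`.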