
Training script #31

Merged: 150 commits merged into main from the training-script branch on Mar 19, 2024
Conversation

jettjaniak (Contributor)

No description provided.

Review threads on scripts/train.py (outdated, resolved)

jettjaniak (Contributor, Author):

This should use argparse and take some arguments?
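
For what it's worth, a minimal argparse-based entry point for scripts/train.py might look something like the sketch below; the flag names and defaults are illustrative only, not what the script actually ended up using.

```python
# Hypothetical sketch of an argparse CLI for scripts/train.py.
# Flag names and defaults are made up for illustration.
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Train a delphi model")
    parser.add_argument("--config", type=str, required=True, help="path to a training config file")
    parser.add_argument("--output-dir", type=str, default="out", help="where to write checkpoints")
    parser.add_argument("--resume", action="store_true", help="resume from the latest checkpoint")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    print(f"config={args.config} output_dir={args.output_dir} resume={args.resume}")
```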

Further review threads (outdated, resolved) on: scripts/upload_tokens.py, src/delphi/llama2.py, src/delphi/mamba.py, scripts/train.py, src/delphi/train/utils.py
jettjaniak (Contributor, Author) commented on Feb 16, 2024

float16 vs bfloat16 vs float32: can we just train everything on float32 to ensure reproducibility across all devices?
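
For reference, pinning everything to float32 in PyTorch is mostly a matter of setting the default dtype and not using autocast/GradScaler; a small sketch of the float32-everywhere option (not code from this PR):

```python
# Sketch: float32 end to end in PyTorch, for cross-device reproducibility.
# Illustrative only; not taken from the training script in this PR.
import torch

torch.set_default_dtype(torch.float32)          # new tensors and parameters default to fp32
torch.backends.cuda.matmul.allow_tf32 = False   # keep matmuls in true fp32 on Ampere+ GPUs
torch.backends.cudnn.allow_tf32 = False

model = torch.nn.Linear(512, 512)               # parameters are created in float32
x = torch.randn(8, 512)                         # float32 by default
loss = model(x).pow(2).mean()                   # no autocast / GradScaler anywhere
loss.backward()
```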

SrGonao (Contributor) commented on Feb 26, 2024

What can I do to finish this? Do you want me to take over, @jannik-brinkmann?

jettjaniak force-pushed the training-script branch 2 times, most recently from bc335cb to d67d6a1 on February 26, 2024 at 18:56
jannik-brinkmann (Contributor)

@SrGonao it would be great if someone could pick up the refactoring of the training - the functionality should be fine. @jettjaniak depending on your feedback, I can already start the training runs on the un-refactored version of the code, to provide new models to the evals team. If you want to wait for the refactoring to be done, I could work on it on Thursday, if no one else can do it before then :)

jettjaniak (Contributor, Author) commented on Feb 26, 2024

All the comments need to be resolved, but below are some of the top priorities:

  • make training independent of the tokenizer, PretokDataset etc. - just use a tokenized HF dataset
  • revamp TrainingConfig (waiting for "try hydra for training configs" #48)
  • remove things related to model compilation, DDP and anything else we don't need
  • remove AMP; everything should train on float32
  • define a function performing a single training step, with wandb kept outside of it (a rough sketch follows below)
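
To make the last point concrete, below is a rough sketch of what such a step function could look like. The names (train_step, TrainStepResult) and the assumption that the model returns raw logits are illustrative, not the PR's actual API.

```python
# Hedged sketch of a single-training-step function; wandb and other logging
# stay outside. Assumes the model maps (batch, seq) token ids to raw logits.
from dataclasses import dataclass

import torch
import torch.nn.functional as F


@dataclass
class TrainStepResult:
    loss: float


def train_step(
    model: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
    batch: dict[str, torch.Tensor],
) -> TrainStepResult:
    """One forward/backward/optimizer step on a batch from a tokenized HF dataset."""
    model.train()
    tokens = batch["input_ids"]
    logits = model(tokens[:, :-1])  # next-token prediction: shift by one position
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1)
    )
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    return TrainStepResult(loss=loss.item())
```

Keeping wandb out of this function makes the step itself easy to unit test and reuse.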

Review thread on scripts/upload_tokens.py (outdated, resolved)

jettjaniak (Contributor, Author):

I don't think we need this script. We're hosting the dataset on HF, so we should host it in the llama2c github repo. How we generated it is perhaps more interesting and could be its own script, but whatever.
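
For context, once the tokens are in a datasets.Dataset, pushing them to the Hub is roughly a one-liner; a sketch with a placeholder repo id (not the actual upload script):

```python
# Sketch: uploading a tokenized dataset to the Hugging Face Hub.
# The repo id below is a placeholder; requires `huggingface-cli login` first.
from datasets import Dataset

tokenized = Dataset.from_dict({"tokens": [[1, 5, 9], [2, 7, 3]]})
tokenized.push_to_hub("your-org/your-tokenized-dataset")
```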

jaidhyani (Collaborator) left a review

I added a bunch of comments to try to make this obnoxiously huge PR slightly less of a pain to review.

Comment on lines +20 to +21:

    with:
      submodules: recursive

Collaborator:

We added this when we added the llama2c submodule. Ironically, we also removed llama2c in the course of developing this PR. Technically we don't need this anymore, but it's not a bad idea to have it around for any submodules we add in the future.

@@ -31,11 +33,11 @@ jobs:
      - name: dependencies
        run: |
          python -m pip install --upgrade pip
-         pip install -r requirements.txt
+         pip install -r requirements-nocuda.txt

Collaborator:

mamba really wants to run on CUDA, and there are optional-but-preferred dependencies for doing so. We want to use those when we can, because otherwise training mamba models would take foreverrrrrrrrr - but only on platforms that support it. We split the requirements into those that don't require CUDA (requirements-nocuda.txt) and those that do (still in requirements.txt, which automatically includes the -nocuda requirements). GitHub CI doesn't run in a CUDA env because why would Microsoft give away that much GPU compute? So we use requirements-nocuda.txt for CI.

          pip install -e .
      - name: black
        run: black --check .
      - name: isort
-       run: isort --profile black --check .
+       run: isort --check .

Collaborator:

The isort `--profile black` config is now implicit in pyproject.toml.

Comment on lines +9 to +12:

    bin
    include
    lib64
    pyvenv.cfg

Collaborator:

I think these are llama2c artifacts? Also technically no longer needed, but on the other hand these are generally things we'd want to exclude from git if anything with these names ever showed up.

Comment on lines +167 to +171:

    # ignore wandb files
    **/wandb/*
    **/*.wandb
    **/wandb-summary.json
    **/wandb-metadata.json

Collaborator:

Debugging the wandb integration involved a lot of wandb artifacts being created, and I was too lazy to change directories.

Review threads (outdated, resolved) on: src/delphi/train/config/model/delphi_mamba_config.py, src/delphi/train/config/model/delphi_model_config.py

Collaborator:

This is a weird one. Iteration parameters are always derivable from the config and the dataset size, but it's convenient to store them anyway, if only to cut down on how many parameters we're passing around. Originally implemented while trying to break the original script into functional pieces, and then never revisited because that seemed like too much effort.
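
A minimal illustration of the "derivable but convenient to store anyway" point; the names below are hypothetical, not the ones in the config module.

```python
# Hypothetical illustration: iteration parameters derived once from the config
# and dataset size, then passed around instead of being recomputed everywhere.
from dataclasses import dataclass


@dataclass
class IterationParams:
    steps_per_epoch: int
    total_steps: int


def derive_iteration_params(dataset_len: int, batch_size: int, epochs: int) -> IterationParams:
    steps_per_epoch = dataset_len // batch_size
    return IterationParams(
        steps_per_epoch=steps_per_epoch,
        total_steps=steps_per_epoch * epochs,
    )
```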

Collaborator:

The main training function! Given a GigaConfig, it does setup and runs the training loop. Most of the actual logic lives in train_step.
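
In outline, that setup-then-loop shape might look like the sketch below; plain arguments stand in for GigaConfig and step_fn stands in for train_step, so only the structure is meant to match the real code.

```python
# Rough structural sketch: setup, then a loop that delegates per-step logic to
# a step function, with logging kept at this outer level. Names are stand-ins.
from typing import Callable, Iterable

import torch


def run_training(
    model: torch.nn.Module,
    batches: Iterable[dict[str, torch.Tensor]],
    step_fn: Callable[[torch.nn.Module, torch.optim.Optimizer, dict[str, torch.Tensor]], float],
    lr: float = 1e-3,
) -> torch.nn.Module:
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)  # setup
    for step, batch in enumerate(batches):
        loss = step_fn(model, optimizer, batch)
        # wandb logging / checkpointing would go here, outside of step_fn
        print(f"step={step} loss={loss:.4f}")
    return model
```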

Review thread on src/delphi/train/utils.py (outdated, resolved)
jettjaniak mentioned this pull request on Mar 18, 2024
jaidhyani linked an issue on Mar 19, 2024 that may be closed by this pull request
jaidhyani (Collaborator) left a review comment

[meme image]

jaidhyani merged commit 35f0c69 into main on Mar 19, 2024
1 check passed
jettjaniak deleted the training-script branch on May 22, 2024 at 10:01