Training_script_refactor #54
Conversation
I don't think this is totally ready yet, but at this point I think it's probably worth it to (1) merge this into the
```python
gptconf = Llama2ModelArgs(**model_args)
model = Llama2Model(gptconf)
state_dict = checkpoint["model"]
# fix the keys of the state dictionary :(
```
Are we sure this still happens?
No idea tbqh, that's just something I copied and never checked.
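For reference: in nanoGPT, which this training code descends from, that fix strips the `_orig_mod.` prefix that `torch.compile` adds to saved parameter names, so it is still needed whenever the checkpoint was saved from a compiled model. A minimal sketch of that cleanup, assuming this code inherited the same situation (the checkpoint path is hypothetical):

```python
import torch

checkpoint = torch.load("out/ckpt.pt", map_location="cpu")  # hypothetical path
state_dict = checkpoint["model"]

# torch.compile wraps the module and prefixes every parameter name with
# "_orig_mod."; stripping the prefix makes the keys match an uncompiled model.
unwanted_prefix = "_orig_mod."
for key in list(state_dict.keys()):
    if key.startswith(unwanted_prefix):
        state_dict[key[len(unwanted_prefix):]] = state_dict.pop(key)

# The cleaned state dict can then be loaded into the model built above:
# model.load_state_dict(state_dict)
```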
```python
def set_iteration_params(config, train_ds, validation_ds) -> IterationParams:
    num_batches = len(train_ds) // config.batch_size
    num_steps = num_batches // config.gradient_accumulation_steps
    eval_iters = min(12, len(validation_ds) // config.batch_size)
```
what is the 12?
Something that was already there that I didn't want to change without understanding it first: https://github.com/delphi-suite/delphi/pull/31/files#diff-c113425bb7a4b6c38858b09ca918bc89bd243d139cdca30ab5dd386d9690935bR94
As far as I can tell this is a constant set by Karpathy so that evaluation uses at most 12 batches.
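If that reading is right, the constant just caps how many validation batches each evaluation pass consumes. A quick worked example with made-up sizes:

```python
batch_size = 64
validation_size = 1000  # hypothetical number of validation samples

# 1000 // 64 = 15 full batches are available, but the cap keeps eval cheap:
eval_iters = min(12, validation_size // batch_size)
assert eval_iters == 12  # each eval pass sees 12 * 64 = 768 samples

# With a small validation set, the floor division wins instead:
assert min(12, 500 // 64) == 7
```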
```python
def _default_indices(self):
    return list(range(len(self.batched_tokens)))

def shuffle(self, epoch: int):
```
I don't know if this is correct. `shuffle` alters the list in place; why are we changing the indices instead of the list?
Idempotency. I want `shuffle(x)` to result in the same state regardless of when it's called - so not dependent on the state of the shuffling before it's called. By making it a deterministic shuffle of the default indices we can guarantee that it doesn't matter when you call `shuffle(x)`, you'll still get the same result.

Technically we don't need this for reproducibility if we commit to always calling shuffle with the same arguments in the same sequence, but I prefer to have it work this way to minimize the amount of state that needs to be kept track of when debugging - I don't want to have to repeat every shuffle in a given sequence to reproduce a problem.
I like it that way
I just didn't see where you then apply the indices to the list, but if you do it somewhere I trust you.
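A minimal sketch of the idempotent shuffle described above, filling in the parts not shown in the diff (the class name and the epoch-seeded `random.Random` are assumptions about how the determinism is achieved):

```python
import random

class TokenBatches:
    def __init__(self, batched_tokens):
        self.batched_tokens = batched_tokens
        self.indices = self._default_indices()

    def _default_indices(self):
        return list(range(len(self.batched_tokens)))

    def shuffle(self, epoch: int):
        # Always restart from the default ordering and seed on the epoch,
        # so shuffle(x) produces the same permutation no matter how many
        # shuffles happened before it - idempotent and easy to reproduce.
        self.indices = self._default_indices()
        random.Random(epoch).shuffle(self.indices)

    def __getitem__(self, i):
        # The permutation is applied at access time, through the index list,
        # so batched_tokens itself is never reordered.
        return self.batched_tokens[self.indices[i]]
```

In a design like this, the indices get applied at lookup time rather than by mutating the underlying list.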
I left a couple of comments; the only noticeable thing is on the shuffle. The rest is minor points and then Mamba stuff that I should do.
Seems good for now, but in the future we should consider adding the dataset as a parameter to the config.
Could be both - a config param that defaults to a pre-defined constant.
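As a sketch of that compromise (the field and constant names are made up):

```python
from dataclasses import dataclass

# Hypothetical pre-defined constant for the current default dataset.
DEFAULT_DATASET = "delphi-suite/tinystories-tokenized"

@dataclass
class TrainingConfig:
    batch_size: int = 64
    gradient_accumulation_steps: int = 4
    # Configurable, but falls back to the constant when not overridden.
    dataset: str = DEFAULT_DATASET
```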
```python
def __init__(self, tokenized_docs, max_seq_len, device):
    self.device = device
    self.tokenized_docs = tokenized_docs
    self.doc_len = len(tokenized_docs[0]["tokens"])
```
"doc_len" = "document_length" = context length?
Document length, but I don't think that's context length. `max_seq_len` is model context length - or more accurately, how long training samples are.
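To illustrate the distinction with made-up numbers: each entry in `tokenized_docs` has a fixed token count (`doc_len`), while `max_seq_len` is how long each training sample is, so the two need not be equal (assuming simple non-overlapping chunking):

```python
# Three pre-tokenized documents, each the same length.
tokenized_docs = [
    {"tokens": [5, 17, 2, 9, 41, 8, 3, 0]},
    {"tokens": [7, 22, 1, 4, 19, 6, 2, 0]},
    {"tokens": [3, 11, 9, 8, 25, 7, 1, 0]},
]

doc_len = len(tokenized_docs[0]["tokens"])  # 8: tokens per document
max_seq_len = 4                             # length of each training sample

# One 8-token document yields two non-overlapping 4-token training samples.
assert doc_len // max_seq_len == 2
```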
```python
# setup eval callbacks
eval_callbacks = [save_checkpoint_if_needed]
```
Maybe I am missing something, but if `eval_callbacks` only includes saving the checkpoint, where does it log to wandb?
nvm, found it in line 44 lol
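The diff only shows checkpointing in the list; per the comment above, the wandb logging apparently lives elsewhere in the script, though in this kind of design it could equally be another callback. A sketch of the callback mechanism under that assumption (the callback signature and metric names are made up):

```python
import wandb

def log_to_wandb(eval_results: dict, step: int):
    # Push the eval metrics to Weights & Biases at the current step.
    wandb.log({"val/loss": eval_results["loss"]}, step=step)

def save_checkpoint_if_needed(eval_results: dict, step: int):
    ...  # checkpointing logic lives in the training script

eval_callbacks = [save_checkpoint_if_needed, log_to_wandb]

# Each evaluation pass then fires every callback in order:
# for callback in eval_callbacks:
#     callback(eval_results, step)
```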
Looks good to me. As discussed in Discord, I will start a training run tomorrow after the PR is merged to see whether the performance matches our current models.
This is a WIP refactor of the training script. It's not in a good state and there's a high probability I broke something while refactoring, but it has one redeeming quality: it runs. Does it run correctly and produce usable models? Excellent question, I have no idea yet.
From `src/`: `python delphi/train/training.py`