
Llm experiments #610

Open

laura-burdick-sil wants to merge 32 commits into master

Conversation

@laura-burdick-sil (Collaborator) commented Dec 12, 2024

Laura's code for LLM experiments


Mostly bug fixes: getting unsloth and full fine-tuning working smoothly on both SIL GPUs and ORU GPUs; separating out LLM formatting for training and generation

Move unsloth imports to appropriate scope (and only do unsloth imports if you're using unsloth)
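
A minimal sketch of the import-scoping idea, assuming a hypothetical `use_unsloth` flag; the actual function and flag names in the PR will differ:

```python
def load_model(model_name: str, use_unsloth: bool = False):
    """Load a causal LM, importing unsloth only on the code path that needs it."""
    if use_unsloth:
        # Import inside the unsloth-only branch so non-unsloth runs never touch it.
        from unsloth import FastLanguageModel

        model, tokenizer = FastLanguageModel.from_pretrained(
            model_name=model_name,
            load_in_4bit=True,  # placeholder setting
        )
    else:
        from transformers import AutoModelForCausalLM, AutoTokenizer

        model = AutoModelForCausalLM.from_pretrained(model_name)
        tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer
```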

Fixing typo - causal llm

Load all Bible data, generate llm tags for each Bible

Removing output

Script for selecting ten pivot source texts (based on how commonly they are used as source languages in FT experiment config files)

Save all Bible verses (with the ten pivot languages first)

Need to import unsloth FastLanguageModel in one more spot for inference

Don't save clearml checkpoints

Now works with multilingual data (changed input data functions / inference slightly)

More flexible preprocessing of all Scripture - uses dataframes to make it easier to sort and manipulate data before saving it to json
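
A rough illustration of the dataframe-based preprocessing, with invented column names and file paths (the real script's schema will differ):

```python
import pandas as pd

# Hypothetical rows: one record per verse, per Bible.
rows = [
    {"language": "eng", "bible_id": "engWEB", "verse_ref": "GEN 1:1", "text": "In the beginning..."},
    {"language": "fra", "bible_id": "fraLSG", "verse_ref": "GEN 1:1", "text": "Au commencement..."},
]

df = pd.DataFrame(rows)
# Sorting and filtering are easier on a dataframe than on nested dicts.
df = df.sort_values(["language", "bible_id", "verse_ref"])
df.to_json("all_verses.json", orient="records", force_ascii=False)
```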

Adding in Ethnologue data, adding code to load full Bible data without loading all Bible data

Add verse tag in between verses; generate new tokens to add to LLM
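
One way to do this with plain Hugging Face APIs, shown as a hedged sketch; the `<verse>` tag string and the model name are placeholders, and a later commit switches the token-adding step to unsloth's native helper:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

VERSE_TOKEN = "<verse>"            # assumed tag; the PR's actual token may differ
BASE_MODEL = "facebook/opt-125m"   # small stand-in model, not the PR's base LLM

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Register the verse tag as a special token and grow the embedding matrix to match.
tokenizer.add_special_tokens({"additional_special_tokens": [VERSE_TOKEN]})
model.resize_token_embeddings(len(tokenizer))

# Join a chapter's verses with the tag so the model sees explicit verse boundaries.
verses = ["In the beginning...", "And the earth was without form..."]
chapter_text = VERSE_TOKEN.join(verses)
```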

Adding verse token to set of tokens; separated code for 103 languages and for 300 languages

Removing old hf access tokens

Loading models from s3 bucket, multilingual model finetuning
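
A hedged sketch of the S3-loading pattern with `boto3`; bucket, prefix, and directory names are placeholders, and the PR's own loader may work differently:

```python
import pathlib

import boto3
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model_from_s3(bucket: str, prefix: str, local_dir: str = "downloaded_model"):
    """Download a saved model directory from S3, then load it with transformers."""
    s3 = boto3.client("s3")
    local = pathlib.Path(local_dir)
    local.mkdir(parents=True, exist_ok=True)
    # Assumes a flat model directory (config, tokenizer, weights) under the prefix.
    for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
        target = local / pathlib.Path(obj["Key"]).name
        s3.download_file(bucket, obj["Key"], str(target))
    model = AutoModelForCausalLM.from_pretrained(local_dir)
    tokenizer = AutoTokenizer.from_pretrained(local_dir)
    return model, tokenizer
```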

Refactoring, adding comments, adding support for loading unsloth models from file

Checkpointing during model training; no hardcoded list of model files
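
One common way to get periodic checkpoints and resumption without hardcoding checkpoint file names is the Hugging Face `Trainer` configuration below; the values are placeholders, not the PR's settings:

```python
from transformers import Trainer, TrainingArguments

def build_trainer(model, train_dataset):
    """Configure periodic checkpointing; interval and limits are placeholders."""
    args = TrainingArguments(
        output_dir="outputs",    # checkpoints land in outputs/checkpoint-<step>
        save_strategy="steps",
        save_steps=500,
        save_total_limit=2,      # keep only the most recent checkpoints
    )
    return Trainer(model=model, args=args, train_dataset=train_dataset)

# Resuming from the latest checkpoint in output_dir avoids any hardcoded
# list of model files:
#   trainer = build_trainer(model, train_dataset)
#   trainer.train(resume_from_checkpoint=True)
```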

Updated generation function (generate in batches, fixed new line error)
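
Batched generation in rough outline, using plain `transformers`; padding side, batch size, and decoding choices here are assumptions, not necessarily the PR's exact settings:

```python
import torch

def generate_batched(model, tokenizer, prompts, batch_size=8, max_new_tokens=128):
    """Generate completions for a list of prompts in fixed-size batches."""
    tokenizer.padding_side = "left"  # left-pad so new tokens append after the prompt
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    outputs = []
    for start in range(0, len(prompts), batch_size):
        batch = prompts[start:start + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True).to(model.device)
        with torch.no_grad():
            generated = model.generate(**inputs, max_new_tokens=max_new_tokens)
        # Strip the prompt tokens and decode only the newly generated part.
        new_tokens = generated[:, inputs["input_ids"].shape[1]:]
        outputs.extend(tokenizer.batch_decode(new_tokens, skip_special_tokens=True))
    return outputs
```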

Ability to just save adapter, rather than merging into base model before saving
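
The adapter-only vs. merged-save distinction, sketched with standard PEFT calls (unsloth also exposes its own saving helpers, which this PR may use instead); directory names are placeholders:

```python
def save_finetuned(model, tokenizer, out_dir: str, merge_into_base: bool = False):
    """Save either the LoRA adapter alone or the adapter merged into the base weights."""
    if merge_into_base:
        # Fold the LoRA weights into the base model; the result loads like a normal model.
        merged = model.merge_and_unload()
        merged.save_pretrained(out_dir)
    else:
        # Save only the small adapter; the base model is loaded separately at inference time.
        model.save_pretrained(out_dir)
    tokenizer.save_pretrained(out_dir)
```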

Updated imports, simplified code

Updated imports (froze some library versions), added structure for doing multiple training runs, changed adding tokens to the tokenizer to use the native unsloth code

Additional code for specific language families, etc.

Includes additional scoring and formatting options

Renaming files to differentiate from Crystal's code

Adding comments for functions, explaining what each file does

A few more experimental files, more comments

Removing language codes and names

Remove language names and language codes

@isaac091 (Collaborator) left a comment

Reviewed 2 of 9 files at r1, 7 of 7 files at r2, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @ddaspit, @LLMresearcher, and @mshannon-sil)

@mshannon-sil (Collaborator) left a comment

:lgtm:

Reviewed 2 of 9 files at r1, 7 of 7 files at r2, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @ddaspit and @LLMresearcher)
