Llm experiments #610
base: master
Conversation
The commits in this PR:

- Mostly bug fixes - getting unsloth and full finetune working smoothly on both SIL GPUs and ORU GPUs, separating out llm formatting for training and generation
- Move unsloth imports to appropriate scope, and only do unsloth imports if you're using unsloth (see the lazy-import sketch after this list)
- Fixing typo - causal llm
- Load all Bible data, generate llm tags for each Bible
- Removing output
- Script for selecting ten pivot source texts (based on how commonly they are used as source languages in FT experiment config files)
- Save all Bible verses (with ten pivot languages first)
- Need to import unsloth FastLanguageModel in one more spot for inference
- Don't save clearml checkpoints
- Now works with multilingual data (changed input data functions / inference slightly)
- More flexible preprocessing of all Scripture - uses dataframes to make it easier to sort and manipulate data before saving it to json (see the dataframe sketch below)
- Adding in Ethnologue data, adding code to load full Bible data without loading all Bible data
- Add verse tag in between verses; generate new tokens to add to LLM (see the token-adding sketch below)
- Adding verse token to set of tokens; separated code for 103 languages and for 300 languages
- Removing old hf access tokens
- Loading models from s3 bucket, multilingual model finetuning
- Refactoring, adding comments, adding support for loading unsloth models from file
- Checkpointing during model training, no hardcoded list of model files
- Updated generation function (generate in batches, fixed new line error; see the batched-generation sketch below)
- Ability to just save adapter, rather than merging into base model before saving (see the save sketch below)
- Updated imports, simplified code
- Updated imports (froze some library versions), added structure for doing multiple training runs, changed adding tokens to the tokenizer to use the native unsloth code
- Additional code for specific language families, etc.
- Includes additional scoring and formatting options
- Renaming files to differentiate from Crystal's code
- Adding comments for functions, explaining what each file does
- A few more experimental files, more comments
- Removing language codes and names
- Remove language names and language codes
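A rough sketch of the lazy-import change described above. The function name and `use_unsloth` flag are hypothetical stand-ins for however the PR wires this up; `FastLanguageModel.from_pretrained` is unsloth's documented loader.

```python
def load_causal_lm(model_name: str, use_unsloth: bool = False, max_seq_length: int = 2048):
    """Load a causal LM, importing unsloth only when it is actually requested."""
    if use_unsloth:
        # Lazy import: environments without unsloth installed never touch it.
        from unsloth import FastLanguageModel

        model, tokenizer = FastLanguageModel.from_pretrained(
            model_name=model_name,
            max_seq_length=max_seq_length,
            load_in_4bit=True,
        )
    else:
        from transformers import AutoModelForCausalLM, AutoTokenizer

        model = AutoModelForCausalLM.from_pretrained(model_name)
        tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer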
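For the verse-token commits, a minimal sketch of the plain Hugging Face route is below; `<verse>` is a hypothetical tag name, not necessarily the one the PR uses.

```python
VERSE_TOKEN = "<verse>"  # hypothetical tag; the actual token is defined in the PR

def add_verse_token(model, tokenizer):
    """Register the verse tag as a special token and grow the embedding matrix to match."""
    tokenizer.add_special_tokens({"additional_special_tokens": [VERSE_TOKEN]})
    model.resize_token_embeddings(len(tokenizer))
    return model, tokenizer
```

The later "native unsloth code" commit presumably swaps this manual resize for unsloth's own token-adding helper, which also handles re-initializing the new embedding rows.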
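The dataframe-based preprocessing might look like this sketch; the rows, column names, pivot set, and output file name are all illustrative assumptions, not the PR's actual schema.

```python
import pandas as pd

# Hypothetical rows; the real corpus comes from the project's Scripture extracts.
records = [
    {"language": "eng", "book": "GEN", "chapter": 1, "verse": 1, "text": "In the beginning..."},
    {"language": "spa", "book": "GEN", "chapter": 1, "verse": 1, "text": "En el principio..."},
]
pivot_languages = {"eng", "spa"}  # stand-in for the ten pivot source texts

df = pd.DataFrame(records)
# Sort pivot languages to the front before serializing, per the commit message.
df["is_pivot"] = df["language"].isin(pivot_languages)
df = df.sort_values(["is_pivot", "language"], ascending=[False, True]).drop(columns="is_pivot")
df.to_json("all_verses.json", orient="records", lines=True)
```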
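A hedged sketch of the batched-generation change. The batch size and the newline cleanup are illustrative; note that for an unsloth model, `FastLanguageModel.for_inference(model)` has to be called before generating, which is likely the "one more spot" import mentioned above.

```python
import torch

def generate_batched(model, tokenizer, prompts, batch_size=8, max_new_tokens=128):
    """Generate completions in fixed-size batches instead of one prompt at a time."""
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # many causal LMs ship without one
    tokenizer.padding_side = "left"  # decoder-only models should pad on the left
    outputs = []
    for start in range(0, len(prompts), batch_size):
        batch = prompts[start : start + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True).to(model.device)
        with torch.no_grad():
            generated = model.generate(**inputs, max_new_tokens=max_new_tokens)
        # Drop the prompt tokens, then strip stray leading/trailing newlines.
        completions = generated[:, inputs["input_ids"].shape[1]:]
        outputs.extend(
            text.strip()
            for text in tokenizer.batch_decode(completions, skip_special_tokens=True)
        )
    return outputs
```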
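Saving just the adapter versus a merged model could be toggled roughly like this; the function name and directory handling are placeholders, `save_pretrained_merged` is unsloth's merge-and-save helper, and the adapter branch is the standard PEFT save.

```python
def save_finetuned(model, tokenizer, output_dir: str, merge: bool = False):
    if merge:
        # unsloth helper: folds the LoRA weights into the base model before saving.
        model.save_pretrained_merged(output_dir, tokenizer, save_method="merged_16bit")
    else:
        # Standard PEFT save: writes only the small adapter weights, not the base model.
        model.save_pretrained(output_dir)
        tokenizer.save_pretrained(output_dir)
```

Saving only the adapter keeps checkpoints small, which fits the checkpointing-during-training and no-hardcoded-model-list commits above.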
Reviewed 2 of 9 files at r1, 7 of 7 files at r2, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @ddaspit, @LLMresearcher, and @mshannon-sil)
Reviewed 2 of 9 files at r1, 7 of 7 files at r2, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @ddaspit and @LLMresearcher)
Laura's code for LLM experiments