
Llm experiments #610

Open

laura-burdick-sil wants to merge 32 commits into master

Conversation

@laura-burdick-sil (Collaborator) commented Dec 12, 2024

Laura's code for LLM experiments


Mostly bug fixes: getting unsloth and full fine-tuning working smoothly on both SIL GPUs and ORU GPUs; separating out LLM formatting for training and generation

Move unsloth imports to appropriate scope (and only do unsloth imports if you're using unsloth)
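
A minimal sketch of the import-scoping idea, assuming a hypothetical `use_unsloth` flag; the actual function and flag names in the PR will differ:

```python
def load_model(model_name: str, use_unsloth: bool = False):
    """Load a causal LM, importing unsloth only on the code path that needs it."""
    if use_unsloth:
        # Import inside the unsloth-only branch so non-unsloth runs never touch it.
        from unsloth import FastLanguageModel

        model, tokenizer = FastLanguageModel.from_pretrained(
            model_name=model_name,
            load_in_4bit=True,  # placeholder setting
        )
    else:
        from transformers import AutoModelForCausalLM, AutoTokenizer

        model = AutoModelForCausalLM.from_pretrained(model_name)
        tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer
```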

Fixing typo - causal llm

Load all Bible data, generate llm tags for each Bible

Removing output

Script for selecting ten pivot source texts (based on how commonly they are used as source languages in FT experiment config files)

Save all Bible verses (with the ten pivot languages first)

Need to import unsloth FastLanguageModel in one more spot for inference

Don't save clearml checkpoints

Now works with multilingual data (changed input data functions / inference slightly)

More flexible preprocessing of all Scripture - uses dataframes to make it easier to sort and manipulate data before saving it to json
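
A rough illustration of the dataframe-based preprocessing, with invented column names and file paths (the real script's schema will differ):

```python
import pandas as pd

# Hypothetical rows: one record per verse, per Bible.
rows = [
    {"language": "eng", "bible_id": "engWEB", "verse_ref": "GEN 1:1", "text": "In the beginning..."},
    {"language": "fra", "bible_id": "fraLSG", "verse_ref": "GEN 1:1", "text": "Au commencement..."},
]

df = pd.DataFrame(rows)
# Sorting and filtering are easier on a dataframe than on nested dicts.
df = df.sort_values(["language", "bible_id", "verse_ref"])
df.to_json("all_verses.json", orient="records", force_ascii=False)
```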

Adding in Ethnologue data, adding code to load full Bible data without loading all Bible data

Add verse tag in between verses; generate new tokens to add to LLM
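
One way to do this with plain Hugging Face APIs, shown as a hedged sketch; the `<verse>` tag string and the model name are placeholders, and a later commit switches the token-adding step to unsloth's native helper:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

VERSE_TOKEN = "<verse>"            # assumed tag; the PR's actual token may differ
BASE_MODEL = "facebook/opt-125m"   # small stand-in model, not the PR's base LLM

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Register the verse tag as a special token and grow the embedding matrix to match.
tokenizer.add_special_tokens({"additional_special_tokens": [VERSE_TOKEN]})
model.resize_token_embeddings(len(tokenizer))

# Join a chapter's verses with the tag so the model sees explicit verse boundaries.
verses = ["In the beginning...", "And the earth was without form..."]
chapter_text = VERSE_TOKEN.join(verses)
```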

Adding verse token to set of tokens; separated code for 103 languages and for 300 languages

Removing old hf access tokens

Loading models from s3 bucket, multilingual model finetuning
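
A hedged sketch of the S3-loading pattern with `boto3`; bucket, prefix, and directory names are placeholders, and the PR's own loader may work differently:

```python
import pathlib

import boto3
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model_from_s3(bucket: str, prefix: str, local_dir: str = "downloaded_model"):
    """Download a saved model directory from S3, then load it with transformers."""
    s3 = boto3.client("s3")
    local = pathlib.Path(local_dir)
    local.mkdir(parents=True, exist_ok=True)
    # Assumes a flat model directory (config, tokenizer, weights) under the prefix.
    for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
        target = local / pathlib.Path(obj["Key"]).name
        s3.download_file(bucket, obj["Key"], str(target))
    model = AutoModelForCausalLM.from_pretrained(local_dir)
    tokenizer = AutoTokenizer.from_pretrained(local_dir)
    return model, tokenizer
```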

Refactoring, adding comments, adding support for loading unsloth models from file

Checkpointing during model training; no hardcoded list of model files
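
One common way to get periodic checkpoints and resumption without hardcoding checkpoint file names is the Hugging Face `Trainer` configuration below; the values are placeholders, not the PR's settings:

```python
from transformers import Trainer, TrainingArguments

def build_trainer(model, train_dataset):
    """Configure periodic checkpointing; interval and limits are placeholders."""
    args = TrainingArguments(
        output_dir="outputs",    # checkpoints land in outputs/checkpoint-<step>
        save_strategy="steps",
        save_steps=500,
        save_total_limit=2,      # keep only the most recent checkpoints
    )
    return Trainer(model=model, args=args, train_dataset=train_dataset)

# Resuming from the latest checkpoint in output_dir avoids any hardcoded
# list of model files:
#   trainer = build_trainer(model, train_dataset)
#   trainer.train(resume_from_checkpoint=True)
```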

Updated generation function (generate in batches, fixed new line error)
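
Batched generation in rough outline, using plain `transformers`; padding side, batch size, and decoding choices here are assumptions, not necessarily the PR's exact settings:

```python
import torch

def generate_batched(model, tokenizer, prompts, batch_size=8, max_new_tokens=128):
    """Generate completions for a list of prompts in fixed-size batches."""
    tokenizer.padding_side = "left"  # left-pad so new tokens append after the prompt
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    outputs = []
    for start in range(0, len(prompts), batch_size):
        batch = prompts[start:start + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True).to(model.device)
        with torch.no_grad():
            generated = model.generate(**inputs, max_new_tokens=max_new_tokens)
        # Strip the prompt tokens and decode only the newly generated part.
        new_tokens = generated[:, inputs["input_ids"].shape[1]:]
        outputs.extend(tokenizer.batch_decode(new_tokens, skip_special_tokens=True))
    return outputs
```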

Ability to just save adapter, rather than merging into base model before saving
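
The adapter-only vs. merged-save distinction, sketched with standard PEFT calls (unsloth also exposes its own saving helpers, which this PR may use instead); directory names are placeholders:

```python
def save_finetuned(model, tokenizer, out_dir: str, merge_into_base: bool = False):
    """Save either the LoRA adapter alone or the adapter merged into the base weights."""
    if merge_into_base:
        # Fold the LoRA weights into the base model; the result loads like a normal model.
        merged = model.merge_and_unload()
        merged.save_pretrained(out_dir)
    else:
        # Save only the small adapter; the base model is loaded separately at inference time.
        model.save_pretrained(out_dir)
    tokenizer.save_pretrained(out_dir)
```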

Updated imports, simplified code

Updated imports (froze some library versions), added structure for doing multiple training runs, changed adding tokens to the tokenizer to use the native unsloth code

Additional code for specific language families, etc.

Includes additional scoring and formatting options

Renaming files to differentiate from Crystal's code

Adding comments for functions, explaining what each file does

A few more experimental files, more comments

Removing language codes and names

Remove language names and language codes

@isaac091 (Collaborator) left a comment

Reviewed 2 of 9 files at r1, 7 of 7 files at r2, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @ddaspit, @LLMresearcher, and @mshannon-sil)

@mshannon-sil (Collaborator) left a comment

:lgtm:

Reviewed 2 of 9 files at r1, 7 of 7 files at r2, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @ddaspit and @LLMresearcher)
