How to fine-tune LLaMA 3.2 11B Vision using LoRA with the recent update? #1319
Hey! Oh hmm, for now you need to have 1 image paired with text during finetuning - I'm working on allowing (text only) + (text + image) finetuning, but for now that'll require a custom data collator.
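For readers looking for a starting point, a minimal sketch of what such a custom collator for (text + image) pairs might look like. It assumes a Hugging Face-style processor with a `.tokenizer` attribute and a set pad token; the dataset field names `"text"` and `"image"` and the class name are placeholders, not Unsloth's built-in collator.

```python
class VisionTextCollator:
    """Sketch of a collator that batches (image, text) pairs with a HF processor."""

    def __init__(self, processor):
        self.processor = processor

    def __call__(self, examples):
        # Field names "text" and "image" are placeholders for whatever the dataset uses.
        texts  = [ex["text"]  for ex in examples]   # each text should include the model's image placeholder token
        images = [ex["image"] for ex in examples]   # one PIL image per example
        batch = self.processor(
            text=texts,
            images=images,
            padding=True,
            return_tensors="pt",
        )
        # Standard causal-LM labels: copy input_ids and ignore padding in the loss.
        labels = batch["input_ids"].clone()
        labels[labels == self.processor.tokenizer.pad_token_id] = -100
        batch["labels"] = labels
        return batch
```

A collator like this could then be passed to the trainer through its `data_collator` argument.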
I see. My data collator is:
I would like to know how to put the image (image tokens) inside, so I can maybe hard-code it and drop it into the raw dataset as I did before. Also, I would like to not use an image at the beginning, and maybe use multiple images. Any suggestions?
@yukiarimo If possible, could you provide the dataset? I could try coding a custom data collator. As for this:

finetune_vision_layers     = True, # False if not finetuning vision part
finetune_language_layers   = True, # False if not finetuning language part
finetune_attention_modules = True, # False if not finetuning attention layers
finetune_mlp_modules       = True, # False if not finetuning MLP layers

The vision-language model usually has three components: a vision encoder, a projector that maps image features into the language model's embedding space, and the language model itself. So these parameters let you choose whether LoRA is applied to the vision part, the language part, and (within those) the attention and MLP modules.
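For context, these four flags are arguments of Unsloth's `FastVisionModel.get_peft_model`; a rough sketch of how they are typically used (exact defaults and extra arguments may differ between Unsloth versions):

```python
from unsloth import FastVisionModel

# Load the base vision model in 4-bit to keep memory low.
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Llama-3.2-11B-Vision-Instruct",
    load_in_4bit = True,
)

# Attach LoRA adapters; the four finetune_* flags pick which parts get adapters.
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers     = True,  # vision encoder
    finetune_language_layers   = True,  # language model
    finetune_attention_modules = True,  # attention blocks
    finetune_mlp_modules       = True,  # MLP blocks
    r = 16,
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
)
```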
Thanks for the explanation! Sorry, but I cannot provide the dataset because it is 100% sensitive personal data! But here's what the text dataset looks like: chunks of dialogs like this, split into roughly 16K characters per JSONL row. (By the way, you can just write it all as one document and then split it into JSON lines in Python, so it's easier and won't cut a dialog in the middle):
For the images, I have the same format but like this:
JSON lines look like this:
So, you said:
Okay, I got it, but what about these:
Do they exist in the vision model too? Cause I need to do a triple fine-tuning here:
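As a generic illustration only (not the dataset from this thread; every field value below is hypothetical), an image + text record in the conversation format commonly used with vision chat templates might look like:

```python
# Hypothetical record for one (image + text) example in the conversation format
# often used with vision chat templates. None of the values come from the
# dataset discussed in this thread.
sample = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image"},                                   # the paired image
                {"type": "text", "text": "User message goes here"},
            ],
        },
        {
            "role": "assistant",
            "content": [
                {"type": "text", "text": "Assistant reply goes here"},
            ],
        },
    ],
}
```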
@yukiarimo Forgive me for the late reply.

target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "embed_tokens", "lm_head"],

So basically these are the target modules in the model: specifically, the *_proj entries are the attention projections (q, k, v, o) and the MLP projections (gate, up, down), and so on.

Do they exist in the vision model too?
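For comparison, a rough sketch of how that `target_modules` list shows up in a text-only Unsloth LoRA setup (the model id is a placeholder and the arguments are illustrative; they may differ by version):

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Meta-Llama-3.1-8B",   # placeholder model id
    max_seq_length = 16384,
    load_in_4bit = True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
                      "gate_proj", "up_proj", "down_proj",      # MLP projections
                      "embed_tokens", "lm_head"],               # include these when new tokens were added
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
)
```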
I saw you used something like this:
But for (not vision) LLaMA 3.1 8B I used something like that:
So, can I do the same and what are these new options?
Also, my (raw text only) dataset looks like this:
So, for the image how do I do that? Can I make something like:
Note:
<yuki>, </yuki>, <yuna>, </yuna>, <data>, </data>, <kanojo>, </kanojo>, and <dialog>
are custom special tokens added by me in the vocabulary!
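As a side note on those custom special tokens, here is a generic sketch of how such tokens are usually registered with a Hugging Face tokenizer (not the author's actual code; the model id is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "unsloth/Meta-Llama-3.1-8B"   # placeholder; any causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Token list copied from the note above.
custom_tokens = ["<yuki>", "</yuki>", "<yuna>", "</yuna>",
                 "<data>", "</data>", "<kanojo>", "</kanojo>", "<dialog>"]

tokenizer.add_special_tokens({"additional_special_tokens": custom_tokens})
model.resize_token_embeddings(len(tokenizer))   # grow the embedding matrix for the new ids
```

This is also why `embed_tokens` and `lm_head` appear in the `target_modules` list earlier in the thread: the embeddings for newly added tokens need to be trainable.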