
How to fine-tune LLaMA 3.2 11B Vision using LoRA with the recent update? #1319

yukiarimo opened this issue Nov 21, 2024 · 5 comments

yukiarimo commented Nov 21, 2024

I saw you used something like this:

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers     = True, # False if not finetuning vision part
    finetune_language_layers   = True, # False if not finetuning language part
    finetune_attention_modules = True, # False if not finetuning attention layers
    finetune_mlp_modules       = True, # False if not finetuning MLP layers

    r = 16,           # The larger, the higher the accuracy, but might overfit
    lora_alpha = 16,  # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
    # target_modules = "all-linear", # Optional now! Can specify a list if needed
)

But for (non-vision) LLaMA 3.1 8B I used something like this:

model = FastLanguageModel.get_peft_model(
    model,
    r = 256, # 128 or 256
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "embed_tokens", "lm_head"
                      ],
    lora_alpha = 256, # 128 or 256
    lora_dropout = 0.1,
    bias = "all", # "all"
    use_gradient_checkpointing = "unsloth", # True - don't use False
    random_state = 42,
    use_rslora = True,
    loftq_config = None, # And LoftQ
)

So, can I do the same here, and what do these new options do?

finetune_vision_layers     = True, # False if not finetuning vision part
finetune_language_layers   = True, # False if not finetuning language part
finetune_attention_modules = True, # False if not finetuning attention layers
finetune_mlp_modules       = True, # False if not finetuning MLP layers

Also, my (raw text only) dataset looks like this:

<|begin_of_text|>
<dialog>
<kanojo>You're a know-it-all girl.</kanojo>
<yuki>How are you?</yuki>
<yuna>I'm fine</yuna>
<yuki>Who is Elon Musk?</yuki>
<yuna>He's a cool guy</yuna>

So, how do I do that for images? Can I make something like this:

<|begin_of_text|>
<dialog>
<kanojo>You're a know-it-all girl.</kanojo>
<yuki>How are you?</yuki>
<yuna>I'm fine</yuna>
<yuki>Please describe how do I look like? <data>{image_tokens}</data></yuki>
<yuna>You're adorable!</yuna>

Note: <yuki>, </yuki>, <yuna>, </yuna>, <data>, </data>, <kanojo>, </kanojo>, and <dialog> are custom special tokens that I added to the vocabulary!

danielhanchen (Contributor) commented:

Hey! Oh hmm, for now you need to have one image paired with the text during finetuning. I'm working on allowing (text only) + (text + image) finetuning, but for now that'll require a custom data collator.
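
For anyone who needs it right now, here is a minimal sketch of what such a custom collator could look like, assuming a Hugging Face-style processor (the object loaded alongside the model) that accepts text= and images=. The "text"/"image" field names and the pad-token label masking are assumptions, and genuinely mixed batches still need grouping at the sampler level, which is the hard part mentioned above.

class MixedTextImageCollator:
    # Sketch only: handles batches that are all-image or all-text; rows that
    # mix the two inside one batch need to be grouped (or given dummy images)
    # upstream, which is exactly the limitation described above.
    def __init__(self, processor):
        self.processor = processor

    def __call__(self, examples):
        texts  = [ex["text"] for ex in examples]       # assumed field name
        images = [ex.get("image") for ex in examples]  # None for text-only rows
        has_image = [img is not None for img in images]

        if all(has_image):
            # Vision batch: the processor builds pixel values (and, for
            # Llama 3.2 Vision, cross-attention masks) alongside the tokens.
            batch = self.processor(text=texts, images=images,
                                   return_tensors="pt", padding=True)
        elif not any(has_image):
            # Text-only batch: tokenize without any image inputs.
            batch = self.processor(text=texts, return_tensors="pt", padding=True)
        else:
            raise ValueError("Mixed text-only / text+image batch: group rows by type first.")

        # Standard causal-LM labels: copy input_ids and ignore padding positions.
        labels = batch["input_ids"].clone()
        labels[labels == self.processor.tokenizer.pad_token_id] = -100
        batch["labels"] = labels
        return batch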


yukiarimo commented Nov 22, 2024

I see. My data loading and formatting code is:

from datasets import load_dataset

def formatting_prompts_func(examples):
    # Each JSONL row already holds the fully formatted dialog in "text".
    texts = examples["text"]
    return {"text": texts}

dataset = load_dataset("json", data_files="/content/drive/MyDrive/datasets/all.jsonl")

I would like to know how to put the image (image tokens) inside, so I can maybe hard-code it and drop it into the raw dataset as I did before. Also, I would like to be able to leave the image out at the beginning and maybe use multiple images later. Any suggestions?

dame-cell commented:

@yukiarimo if possible, could you provide the dataset? I could try coding a custom data collator.

And as for this:

finetune_vision_layers     = True, # False if not finetuning vision part
finetune_language_layers   = True, # False if not finetuning language part
finetune_attention_modules = True, # False if not finetuning attention layers
finetune_mlp_modules       = True, # False if not finetuning MLP layers

A vision-language model usually has three components:

  • A vision encoder
  • A text encoder (the language model)
  • An MLP projector (that connects the embeddings from the vision encoder to the text encoder)

So the parameters:

  • finetune_vision_layers, if true, will fine-tune the vision model (encoder)
  • finetune_language_layers, if true, will fine-tune the text model (language model)
  • finetune_attention_modules, if true, will fine-tune the attention modules of both the text model and the vision model (I assume)
  • finetune_mlp_modules, if true, will fine-tune the MLP modules / projector (I think)
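
If it helps, one quick way to see what those flags actually selected is to list which modules received LoRA adapters after get_peft_model. This is generic PyTorch/PEFT introspection rather than an Unsloth-specific API, so treat it as a sketch:

# PEFT registers a `lora_A` submodule on every adapted layer, so filtering
# module names on that suffix shows where the adapters actually landed
# (vision layers vs. language layers, attention vs. MLP).
lora_modules = sorted({
    name.split(".lora_A")[0]
    for name, _ in model.named_modules()
    if ".lora_A" in name
})
print(f"{len(lora_modules)} modules carry LoRA adapters, e.g.:")
for name in lora_modules[:10]:
    print(" ", name)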


yukiarimo commented Nov 22, 2024

Thanks for the explanation!

Sorry, but I cannot provide the dataset because it is 100% sensitive personal data! But here's what the text dataset looks like: chunks of dialog like the one below, split into ~16K chunks, one per JSONL row. (By the way, you can just write everything as one document and then split it into JSON lines in Python, so it's easier and a dialog doesn't get cut in the middle.)

<|begin_of_text|>
<dialog>
<kanojo>You're a know-it-all girl.</kanojo>
<yuki>How are you?</yuki>
<yuna>I'm fine</yuna>
<yuki>Who is Elon Musk?</yuki>
<yuna>He's a cool guy</yuna>

For the images, I use the same format, but like this:

<|begin_of_text|>
<dialog>
<kanojo>You're a know-it-all girl.</kanojo>
<yuki>How are you?</yuki>
<yuna>I'm fine</yuna>
<yuki>Please describe how do I look like? <data>{img_3472.jpg}</data></yuki>
<yuna>You're adorable!</yuna>

JSON lines look like this:

{"text": "this is row one"}
{"text": "this is row two"}
{"text": "this is row three"}

Note: yes, I write all the datasets myself!
Note 2: we can try making a script for this format for now and then adapt it later! I just want to know how to work with images, because I've never done this before!
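
For the image rows, here is a rough sketch of how the <data>{...}</data> reference could be pulled out into an image path stored next to the text, so each JSONL row becomes a (text, image) pair a collator can consume. The regex, the image directory, and the one-image-per-row assumption are illustrative only:

import re
from datasets import load_dataset

IMAGE_DIR = "/content/drive/MyDrive/datasets/images"   # assumed location
DATA_TAG  = re.compile(r"<data>\{(.+?)\}</data>")

def attach_image_path(example):
    # Pull the filename out of <data>{img_3472.jpg}</data> if present;
    # text-only rows simply get image_path = None.
    match = DATA_TAG.search(example["text"])
    example["image_path"] = f"{IMAGE_DIR}/{match.group(1)}" if match else None
    return example

dataset = load_dataset("json", data_files="/content/drive/MyDrive/datasets/all.jsonl")
dataset = dataset.map(attach_image_path)

A collator (or a later map step) can then open the file with PIL and decide whether to keep the <data>...</data> span as-is or swap it for the model's own image placeholder token.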

So, you said:

"The vision-language model usually has three components..."

Okay, I got it, but what about these:

target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "embed_tokens", "lm_head"],

Do they exist in the vision model too? Because I need to do a triple fine-tuning here (sketched below):

  1. text only without lm_head+embed_tokens
  2. text only with lm_head+embed_tokens
  3. text+image with lm_head+embed_tokens
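
A rough sketch of how those three phases might map onto the new FastVisionModel.get_peft_model call, assuming it still accepts an explicit target_modules list (the commented-out line in the first snippet suggests it does) and that embed_tokens / lm_head can be targeted the same way as with FastLanguageModel; treat it as a starting point to verify, not a confirmed recipe:

BASE_TARGETS = ["q_proj", "k_proj", "v_proj", "o_proj",
                "gate_proj", "up_proj", "down_proj"]

# Phase 1: text only, without lm_head / embed_tokens
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers     = False,  # keep the vision encoder frozen
    finetune_language_layers   = True,
    finetune_attention_modules = True,
    finetune_mlp_modules       = True,
    target_modules = BASE_TARGETS,
    r = 256, lora_alpha = 256, lora_dropout = 0.1,
    use_rslora = True, random_state = 42,
)

# Phase 2: text only, with lm_head / embed_tokens:
#   target_modules = BASE_TARGETS + ["embed_tokens", "lm_head"]
# Phase 3: text + image, with lm_head / embed_tokens:
#   finetune_vision_layers = True, same target_modules as Phase 2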


dame-cell commented Nov 23, 2024

@yukiarimo forgive me for the late reply

target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "embed_tokens", "lm_head"],

So basically these are the target modules in the model: q_proj, k_proj, v_proj, and o_proj are the attention projections, and gate_proj, up_proj, and down_proj are the MLP projections.

  • embed_tokens: The layer that converts token IDs into dense vectors (used at the input of the model).
  • lm_head: The layer that converts the model's hidden states into vocabulary logits for token prediction (used at the output of the model).

"Do they exist in the vision model too?"

I'm not sure what vision encoder Llama 3.2 Vision is using, so I can't say for sure.
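
One way to check is to list the module names of the loaded model; this is plain PyTorch introspection, nothing Unsloth-specific, and the "vision" substring filter is just a heuristic for the Llama 3.2 Vision checkpoint, so verify the names against your own model:

# Which of the usual projection names appear inside the vision part?
PROJ_NAMES = ("q_proj", "k_proj", "v_proj", "o_proj",
              "gate_proj", "up_proj", "down_proj")

vision_hits = sorted({
    name for name, _ in model.named_modules()
    if "vision" in name and name.endswith(PROJ_NAMES)
})
print(f"{len(vision_hits)} projection modules found in the vision tower, e.g.:")
for name in vision_hits[:10]:
    print(" ", name)

embed_tokens and lm_head normally live on the language side only (the vision encoder uses patch embeddings instead), so they would not be expected to show up in that list.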
