
How to fine-tune LLaMA 3.2 11B Vision using LoRA with the recent update? #1319

yukiarimo opened this issue Nov 21, 2024 · 5 comments

yukiarimo commented Nov 21, 2024

I saw you used something like this:

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers     = True, # False if not finetuning vision part
    finetune_language_layers   = True, # False if not finetuning language part
    finetune_attention_modules = True, # False if not finetuning attention layers
    finetune_mlp_modules       = True, # False if not finetuning MLP layers

    r = 16,           # The larger, the higher the accuracy, but might overfit
    lora_alpha = 16,  # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
    # target_modules = "all-linear", # Optional now! Can specify a list if needed
)

But for (non-vision) LLaMA 3.1 8B I used something like this:

model = FastLanguageModel.get_peft_model(
    model,
    r = 256, # 128 or 256
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "embed_tokens", "lm_head"
                      ],
    lora_alpha = 256, # 128 or 256
    lora_dropout = 0.1,
    bias = "all", # "all"
    use_gradient_checkpointing = "unsloth", # True - don't use False
    random_state = 42,
    use_rslora = True,
    loftq_config = None, # And LoftQ
)

So, can I do the same here, and what do these new options do?

finetune_vision_layers     = True, # False if not finetuning vision part
finetune_language_layers   = True, # False if not finetuning language part
finetune_attention_modules = True, # False if not finetuning attention layers
finetune_mlp_modules       = True, # False if not finetuning MLP layers

Also, my (raw text only) dataset looks like this:

<|begin_of_text|>
<dialog>
<kanojo>You're a know-it-all girl.</kanojo>
<yuki>How are you?</yuki>
<yuna>I'm fine</yuna>
<yuki>Who is Elon Musk?</yuki>
<yuna>He's a cool guy</yuna>

So, how do I do that for images? Can I make something like this:

<|begin_of_text|>
<dialog>
<kanojo>You're a know-it-all girl.</kanojo>
<yuki>How are you?</yuki>
<yuna>I'm fine</yuna>
<yuki>Please describe how do I look like? <data>{image_tokens}</data></yuki>
<yuna>You're adorable!</yuna>

Note: <yuki>, </yuki>, <yuna>, </yuna>, <data>, </data>, <kanojo>, </kanojo>, and <dialog> are custom special tokens that I added to the vocabulary!

danielhanchen (Contributor) commented:

Hey! Oh hmm, for now you need to have one image paired with the text during finetuning. I'm working on allowing (text only) + (text + image) finetuning, but for now that'll require a custom data collator.
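
For anyone who needs it right now, here is a minimal sketch of what such a custom collator could look like, assuming a Hugging Face-style processor (the object loaded alongside the model) that accepts text= and images=. The "text"/"image" field names and the pad-token label masking are assumptions, and genuinely mixed batches still need grouping at the sampler level, which is the hard part mentioned above.

class MixedTextImageCollator:
    # Sketch only: handles batches that are all-image or all-text; rows that
    # mix the two inside one batch need to be grouped (or given dummy images)
    # upstream, which is exactly the limitation described above.
    def __init__(self, processor):
        self.processor = processor

    def __call__(self, examples):
        texts  = [ex["text"] for ex in examples]       # assumed field name
        images = [ex.get("image") for ex in examples]  # None for text-only rows
        has_image = [img is not None for img in images]

        if all(has_image):
            # Vision batch: the processor builds pixel values (and, for
            # Llama 3.2 Vision, cross-attention masks) alongside the tokens.
            batch = self.processor(text=texts, images=images,
                                   return_tensors="pt", padding=True)
        elif not any(has_image):
            # Text-only batch: tokenize without any image inputs.
            batch = self.processor(text=texts, return_tensors="pt", padding=True)
        else:
            raise ValueError("Mixed text-only / text+image batch: group rows by type first.")

        # Standard causal-LM labels: copy input_ids and ignore padding positions.
        labels = batch["input_ids"].clone()
        labels[labels == self.processor.tokenizer.pad_token_id] = -100
        batch["labels"] = labels
        return batch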


yukiarimo commented Nov 22, 2024

I see. My data loading and formatting code is:

from datasets import load_dataset

def formatting_prompts_func(examples):
    # Each JSONL row already holds the fully formatted dialog in "text".
    texts = examples["text"]
    return {"text": texts}

dataset = load_dataset("json", data_files="/content/drive/MyDrive/datasets/all.jsonl")

I would like to know how to put the image (image tokens) inside, so I can maybe hard-code it and drop it into the raw dataset as I did before. Also, I would like to be able to leave the image out at the beginning and maybe use multiple images later. Any suggestions?

dame-cell commented:

@yukiarimo if possible, could you provide the dataset? I could try coding a custom data collator.

And as for this:

finetune_vision_layers     = True, # False if not finetuning vision part
finetune_language_layers   = True, # False if not finetuning language part
finetune_attention_modules = True, # False if not finetuning attention layers
finetune_mlp_modules       = True, # False if not finetuning MLP layers

A vision-language model usually has three components:

  • A vision encoder
  • A text encoder (the language model)
  • An MLP projector (that connects the embeddings from the vision encoder to the text encoder)

So the parameters:

  • finetune_vision_layers, if true, will fine-tune the vision model (encoder)
  • finetune_language_layers, if true, will fine-tune the text model (language model)
  • finetune_attention_modules, if true, will fine-tune the attention modules of both the text model and the vision model (I assume)
  • finetune_mlp_modules, if true, will fine-tune the MLP modules / projector (I think)
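
If it helps, one quick way to see what those flags actually selected is to list which modules received LoRA adapters after get_peft_model. This is generic PyTorch/PEFT introspection rather than an Unsloth-specific API, so treat it as a sketch:

# PEFT registers a `lora_A` submodule on every adapted layer, so filtering
# module names on that suffix shows where the adapters actually landed
# (vision layers vs. language layers, attention vs. MLP).
lora_modules = sorted({
    name.split(".lora_A")[0]
    for name, _ in model.named_modules()
    if ".lora_A" in name
})
print(f"{len(lora_modules)} modules carry LoRA adapters, e.g.:")
for name in lora_modules[:10]:
    print(" ", name)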


yukiarimo commented Nov 22, 2024

Thanks for the explanation!

Sorry, but I cannot provide the dataset because it is 100% sensitive personal data! But here's what the text dataset looks like: chunks of dialog like the one below, split into ~16K chunks, one per JSONL row. (By the way, you can just write everything as one document and then split it into JSON lines in Python, so it's easier and a dialog doesn't get cut in the middle.)

<|begin_of_text|>
<dialog>
<kanojo>You're a know-it-all girl.</kanojo>
<yuki>How are you?</yuki>
<yuna>I'm fine</yuna>
<yuki>Who is Elon Musk?</yuki>
<yuna>He's a cool guy</yuna>

For the images, I use the same format, but like this:

<|begin_of_text|>
<dialog>
<kanojo>You're a know-it-all girl.</kanojo>
<yuki>How are you?</yuki>
<yuna>I'm fine</yuna>
<yuki>Please describe how do I look like? <data>{img_3472.jpg}</data></yuki>
<yuna>You're adorable!</yuna>

JSON lines look like this:

{"text": "this is row one"}
{"text": "this is row two"}
{"text": "this is row three"}

Note: yes, I write all the datasets myself!
Note 2: we can try making a script for this format for now and then adapt it later! I just want to know how to work with images, because I've never done this before!
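
For the image rows, here is a rough sketch of how the <data>{...}</data> reference could be pulled out into an image path stored next to the text, so each JSONL row becomes a (text, image) pair a collator can consume. The regex, the image directory, and the one-image-per-row assumption are illustrative only:

import re
from datasets import load_dataset

IMAGE_DIR = "/content/drive/MyDrive/datasets/images"   # assumed location
DATA_TAG  = re.compile(r"<data>\{(.+?)\}</data>")

def attach_image_path(example):
    # Pull the filename out of <data>{img_3472.jpg}</data> if present;
    # text-only rows simply get image_path = None.
    match = DATA_TAG.search(example["text"])
    example["image_path"] = f"{IMAGE_DIR}/{match.group(1)}" if match else None
    return example

dataset = load_dataset("json", data_files="/content/drive/MyDrive/datasets/all.jsonl")
dataset = dataset.map(attach_image_path)

A collator (or a later map step) can then open the file with PIL and decide whether to keep the <data>...</data> span as-is or swap it for the model's own image placeholder token.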

So, you said:

"The vision-language model usually has three components..."

Okay, I got it, but what about these:

target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "embed_tokens", "lm_head"],

Do they exist in the vision model too? Because I need to do a triple fine-tuning here (sketched below):

  1. text only without lm_head+embed_tokens
  2. text only with lm_head+embed_tokens
  3. text+image with lm_head+embed_tokens
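
A rough sketch of how those three phases might map onto the new FastVisionModel.get_peft_model call, assuming it still accepts an explicit target_modules list (the commented-out line in the first snippet suggests it does) and that embed_tokens / lm_head can be targeted the same way as with FastLanguageModel; treat it as a starting point to verify, not a confirmed recipe:

BASE_TARGETS = ["q_proj", "k_proj", "v_proj", "o_proj",
                "gate_proj", "up_proj", "down_proj"]

# Phase 1: text only, without lm_head / embed_tokens
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers     = False,  # keep the vision encoder frozen
    finetune_language_layers   = True,
    finetune_attention_modules = True,
    finetune_mlp_modules       = True,
    target_modules = BASE_TARGETS,
    r = 256, lora_alpha = 256, lora_dropout = 0.1,
    use_rslora = True, random_state = 42,
)

# Phase 2: text only, with lm_head / embed_tokens:
#   target_modules = BASE_TARGETS + ["embed_tokens", "lm_head"]
# Phase 3: text + image, with lm_head / embed_tokens:
#   finetune_vision_layers = True, same target_modules as Phase 2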


dame-cell commented Nov 23, 2024

@yukiarimo forgive me for the late reply

target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "embed_tokens", "lm_head"],

So basically these are the target modules in the model: q_proj, k_proj, v_proj, and o_proj are the attention projections, and gate_proj, up_proj, and down_proj are the MLP projections.

  • embed_tokens: The layer that converts token IDs into dense vectors (used at the input of the model).
  • lm_head: The layer that converts the model's hidden states into vocabulary logits for token prediction (used at the output of the model).

"Do they exist in the vision model too?"

I'm not sure what vision encoder Llama 3.2 Vision is using, so I can't say for sure.
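
One way to check is to list the module names of the loaded model; this is plain PyTorch introspection, nothing Unsloth-specific, and the "vision" substring filter is just a heuristic for the Llama 3.2 Vision checkpoint, so verify the names against your own model:

# Which of the usual projection names appear inside the vision part?
PROJ_NAMES = ("q_proj", "k_proj", "v_proj", "o_proj",
              "gate_proj", "up_proj", "down_proj")

vision_hits = sorted({
    name for name, _ in model.named_modules()
    if "vision" in name and name.endswith(PROJ_NAMES)
})
print(f"{len(vision_hits)} projection modules found in the vision tower, e.g.:")
for name in vision_hits[:10]:
    print(" ", name)

embed_tokens and lm_head normally live on the language side only (the vision encoder uses patch embeddings instead), so they would not be expected to show up in that list.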
