hardcode to use slow tokenizer && update the performance on examples dataset #11

Merged · 5 commits · Oct 3, 2024
3 changes: 3 additions & 0 deletions aria/model/processing_aria.py

@@ -228,9 +228,12 @@ def from_pretrained(
             image_processor_path,
             **cls._extract_kwargs(AriaVisionProcessor.from_pretrained, **kwargs),
         )
+        if "use_fast" in kwargs:
+            kwargs.pop("use_fast")
         try:
             tokenizer = AutoTokenizer.from_pretrained(
                 tokenizer_path,
+                use_fast=False,
                 **cls._extract_kwargs(AutoTokenizer.from_pretrained, **kwargs),
             )
             chat_template = tokenizer.chat_template
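The change does what the PR title says: it hardcodes the slow tokenizer. Any caller-supplied `use_fast` is popped from `kwargs` before `AutoTokenizer.from_pretrained` is called with an explicit `use_fast=False`; without the pop, a caller passing `use_fast=True` would trigger `TypeError: got multiple values for keyword argument 'use_fast'`. A standalone sketch of the same logic (the function name is invented here, and the real code additionally filters `kwargs` through `cls._extract_kwargs`):

```python
from transformers import AutoTokenizer

def load_slow_tokenizer(tokenizer_path: str, **kwargs):
    # Mirror the patched logic: drop any caller-supplied use_fast so it
    # cannot be passed twice and conflict with the hardcoded value below.
    kwargs.pop("use_fast", None)
    # Always request the slow (pure-Python) tokenizer, never the Rust-backed one.
    return AutoTokenizer.from_pretrained(tokenizer_path, use_fast=False, **kwargs)
```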
Binary file modified assets/nextqa_loss_lora.png
Binary file modified assets/nlvr2_loss_490_lora.png
Binary file modified assets/nlvr2_loss_980_lora.png
Binary file modified assets/refcoco_loss_lora.png
8 changes: 4 additions & 4 deletions examples/nextqa/README.md

@@ -14,13 +14,13 @@ unzip NExTVideo.zip
 # Training Configuration and Commands
 
 ## LoRA
-The LoRA training configuration is shown in [config_lora.yaml](../../examples/nextqa/config_lora.yaml). Please modify the paths to your Aria model, Aria tokenizer and nextqa dataset. This setting can run well on an A100 80GB using a 4k sequence length, which the longer visual context requires. We set `max_image_size` to 490 for video datasets.
+The LoRA training configuration is shown in [config_lora.yaml](../../examples/nextqa/config_lora.yaml). Please modify the paths to your Aria model, Aria tokenizer and nextqa dataset. This setting can run well on a single A100 80GB using a 4k sequence length, which the longer visual context requires. We set `max_image_size` to 490 for video datasets.
 
 > *Note:* In this configuration, we add LoRA to all modules in the LLM of Aria, but not to the vit or projector. If you want to add LoRA to the vit/projector, adjust `freeze_vit` or `freeze_projector`. You can also adjust `lora_target_modules` to choose sub-modules of the LLM blocks, and `freeze_llm_layers` to set the layers where you don't want LoRA.
 
-Command (on two 80GB A100s):
+Command (on a single 80GB A100):
 ```bash
-accelerate launch --config_file recipes/accelerate_configs/zero2.yaml --num_processes 2 aria/train.py --config examples/nextqa/config_lora.yaml --output_dir [YOUR_OUT_DIR]
+CUDA_VISIBLE_DEVICES=0 python aria/train.py --config examples/nextqa/config_lora.yaml --output_dir [YOUR_OUT_DIR]
 ```
 
 ## Full Params

@@ -43,7 +43,7 @@ CUDA_VISIBLE_DEVICES=0 python examples/nextqa/evaluation.py \
 The `Accuracy`:
 | Aria | LoRA SFT | Full Params SFT |
 |:----:|:--------:|:---------------:|
-|76.02 | 79.08 | 81.42 |
+|78.14 | 80.80 | 81.42 |
 
 ## Loss Curve
 These are the loss curves of `LoRA SFT` and `Full Params SFT`:
3 changes: 2 additions & 1 deletion examples/nextqa/evaluation.py

@@ -69,6 +69,7 @@ def load_model_and_tokenizer(args):
     model = AriaForConditionalGeneration.from_pretrained(
         args.base_model_path, device_map="auto", torch_dtype=torch.bfloat16
     ).eval()
+    model.pad_token_id = tokenizer.pad_token_id
 
     if args.peft_model_path:
         peft_config = PeftConfig.from_pretrained(args.peft_model_path)

@@ -134,7 +135,7 @@ def collate_fn(batch, processor, tokenizer):
         padding="longest",
         max_image_size=args.image_size,
     )
-    return inputs, batch, messages
+    return inputs, batch, texts
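The `pad_token_id` line matters for batched inference: with `padding="longest"`, shorter prompts in a batch carry pad tokens, and decoding needs to know which id they use. The same line is added to the nlvr2 and refcoco evaluation scripts below. A sketch under assumptions, where `model` and `tokenizer` come from `load_model_and_tokenizer(args)` as in the diff above, and the Aria modeling code is assumed to read the `model.pad_token_id` attribute during generation (stock transformers models consult `generation_config.pad_token_id` instead):

```python
import torch

prompts = ["What happens in the video?", "Describe the very first frame in detail."]
inputs = tokenizer(prompts, padding="longest", return_tensors="pt")

# Copy the pad id onto the model once at load time, instead of threading
# it through every generate() call; batched decoding then uses the same
# pad id the tokenizer used when padding the inputs.
model.pad_token_id = tokenizer.pad_token_id

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```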
12 changes: 6 additions & 6 deletions examples/nlvr2/README.md

@@ -18,18 +18,18 @@ unzip train.part3.zip
 # Training Configuration and Commands
 
 ## LoRA
-The LoRA training configuration is shown in [config_lora.yaml](../../examples/nlvr2/config_lora.yaml). Please modify the paths to your Aria model, Aria tokenizer and nlvr2 dataset. This setting can run well on an A100 (80GB) with a 2k input sequence length. You can specify the `max_image_size` (e.g., 980 or 490) on the command line.
+The LoRA training configuration is shown in [config_lora.yaml](../../examples/nlvr2/config_lora.yaml). Please modify the paths to your Aria model, Aria tokenizer and nlvr2 dataset. This setting can run well on a single A100 (80GB) with a 2k input sequence length. You can specify the `max_image_size` (e.g., 980 or 490) on the command line.
 
 > *Note:* In this configuration, we add LoRA to all modules in the LLM of Aria, but not to the vit or projector. If you want to add LoRA to the vit/projector, adjust `freeze_vit` or `freeze_projector`. You can also adjust `lora_target_modules` to choose sub-modules of the LLM blocks, and `freeze_llm_layers` to set the layers where you don't want LoRA.
 
-Command (on two 80GB A100s):
+Command (on a single 80GB A100):
 ```bash
-accelerate launch --config_file recipes/accelerate_configs/zero2.yaml --num_processes 2 aria/train.py --config examples/nlvr2/config_lora.yaml --max_image_size 980 --output_dir [YOUR_OUT_DIR]
+CUDA_VISIBLE_DEVICES=0 python aria/train.py --config examples/nlvr2/config_lora.yaml --max_image_size 980 --output_dir [YOUR_OUT_DIR]
 ```
 
 You can change the `max_image_size` to 490:
 ```bash
-accelerate launch --config_file recipes/accelerate_configs/zero2.yaml --num_processes 2 aria/train.py --config examples/nlvr2/config_lora.yaml --max_image_size 490 --output_dir [YOUR_OUT_DIR]
+CUDA_VISIBLE_DEVICES=0 python aria/train.py --config examples/nlvr2/config_lora.yaml --max_image_size 490 --output_dir [YOUR_OUT_DIR]
 ```
 
 ## Full Params

@@ -57,8 +57,8 @@ CUDA_VISIBLE_DEVICES=0 python examples/nlvr2/evaluation.py \
 The `Accuracy`:
 | `max_image_size` | Aria | LoRA SFT | Full Params SFT |
 |:----------------:|:----:|:--------:|:---------------:|
-|490 |88.09 | 91.27 | 92.24 |
-|980 |88.08 | 91.50 | 92.33 |
+|490 |86.56 | 91.32 | 92.24 |
+|980 |87.03 | 91.61 | 92.33 |
 
 # Loss Curve
 These are the loss curves of `LoRA Finetuning` (left) and `Full Params Finetuning` (right) with 490 and 980 `max_image_size`:
1 change: 1 addition & 0 deletions examples/nlvr2/evaluation.py

@@ -69,6 +69,7 @@ def load_model_and_tokenizer(args):
     model = AriaForConditionalGeneration.from_pretrained(
         args.base_model_path, device_map="auto", torch_dtype=torch.bfloat16
     ).eval()
+    model.pad_token_id = tokenizer.pad_token_id
 
     if args.peft_model_path:
         peft_config = PeftConfig.from_pretrained(args.peft_model_path)
8 changes: 4 additions & 4 deletions examples/refcoco/README.md

@@ -13,13 +13,13 @@ unzip images.zip
 # Training Configuration and Commands
 
 ## LoRA
-The LoRA training configuration is shown in [config_lora.yaml](../../examples/refcoco/config_lora.yaml). Please modify the paths to your Aria model, Aria tokenizer and refcoco dataset. This setting can run well on an A100 (80GB) with a 2k input sequence length. `max_image_size` is set to **980**.
+The LoRA training configuration is shown in [config_lora.yaml](../../examples/refcoco/config_lora.yaml). Please modify the paths to your Aria model, Aria tokenizer and refcoco dataset. This setting can run well on a single A100 (80GB) with a 2k input sequence length. `max_image_size` is set to **980**.
 
 > *Note:* In this configuration, we add LoRA to all modules in the LLM of Aria, but not to the vit or projector. If you want to add LoRA to the vit/projector, adjust `freeze_vit` or `freeze_projector`. You can also adjust `lora_target_modules` to choose sub-modules of the LLM blocks, and `freeze_llm_layers` to set the layers where you don't want LoRA.
 
-Command (on two 80GB A100s):
+Command (on a single 80GB A100):
 ```bash
-accelerate launch --config_file recipes/accelerate_configs/zero2.yaml --num_processes 2 aria/train.py --config examples/refcoco/config_lora.yaml --output_dir [YOUR_OUT_DIR]
+CUDA_VISIBLE_DEVICES=0 python aria/train.py --config examples/refcoco/config_lora.yaml --output_dir [YOUR_OUT_DIR]
 ```
 
 ## Full Params

@@ -58,7 +58,7 @@ CUDA_VISIBLE_DEVICES=0 python examples/refcoco/evaluation.py \
 The `Precision@1`:
 | Aria | LoRA SFT | Full Params SFT |
 |:----:|:--------:|:---------------:|
-|41.77 | 88.92 | 88.85 |
+|2.27 | 88.68 | 88.85 |
 
 # Loss Curve
 These are the loss curves of `LoRA Finetuning` (left) and `Full Params Finetuning` (right):
3 changes: 2 additions & 1 deletion examples/refcoco/evaluation.py

@@ -71,6 +71,7 @@ def load_model_and_tokenizer(args):
     model = AriaForConditionalGeneration.from_pretrained(
         args.base_model_path, device_map="auto", torch_dtype=torch.bfloat16
     ).eval()
+    model.pad_token_id = tokenizer.pad_token_id
 
     if args.peft_model_path:
         peft_config = PeftConfig.from_pretrained(args.peft_model_path)

@@ -128,7 +129,7 @@ def collate_fn(batch, processor, tokenizer):
         padding="longest",
         max_image_size=args.image_size,
    )
-    return inputs, batch, messages
+    return inputs, batch, texts
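Both `collate_fn` hunks (here and in the nextqa script) make the same one-word fix: the function now returns `texts`, the rendered prompt strings that were actually fed to the processor, rather than `messages`, the pre-rendering chat structure. A minimal sketch of the corrected shape; `build_prompt` and the `"image"` field are hypothetical stand-ins for the dataset-specific code in the real scripts:

```python
def collate_fn(batch, processor, tokenizer, image_size=980):
    # Hypothetical prompt construction; the real scripts build `texts`
    # from each example's chat messages via the tokenizer's chat template.
    texts = [build_prompt(item, tokenizer) for item in batch]
    images = [item["image"] for item in batch]
    inputs = processor(
        text=texts,
        images=images,
        return_tensors="pt",
        padding="longest",
        max_image_size=image_size,
    )
    # Return the strings that were actually tokenized, so downstream
    # evaluation can align predictions with the padded inputs.
    return inputs, batch, texts
```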