Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion
Ziyue Wang1, Chi Chen1, Yiqi Zhu1, Fuwen Luo1, Peng Li2†, Ming Yan3, Ji Zhang3, Fei Huang3†, Maosong Sun1, Yang Liu1,2
2 Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
3 Institute of Intelligent Computing, Alibaba Group
† Corresponding authors
With the bloom of Multimodal Large Language Models (MLLMs), the paradigm of extending Large Language Models (LLMs) with pre-trained vision encoders has shown remarkable abilities on visual reasoning and visual instruction-following tasks. However, this paradigm neglects essential cross-modality and inter-image interactions: the LLM is presented with isolated visual and textual features and never perceives the interleaved multimodal input as a whole. We refer to this issue as prior-LLM modality isolation, and it obscures a deeper understanding of multi-image and interleaved inputs.
To mitigate this issue, we propose a novel paradigm named Browse-and-Concentrate (Brote). It begins with a browsing phase that generates a condition context vector, which serves as a collection of browsing insights encapsulating the main intent and the visual information derived from the images. A concentrating phase then comprehends the multimodal inputs under the guidance of this condition context vector. Our paradigm yields notable improvements, raising the average accuracy on 7 multi-image benchmarks by 2.13% and 7.60% over strong baselines with 3B and 11B LLMs, respectively.
Our paradigm comprehends images progressively via two phases, browsing and concentrating. In the browsing phase, the MLLM browses the entire input and generates a condition context, denoted as C, as the browsing result. In the concentrating phase, the model then comprehends the multimodal inputs under the guidance of C. We refer to the model of the browsing phase as MB and the model of the concentrating phase as MC.
Brote can be further divided into two modes, explicit and implicit, according to how the browsing result C is incorporated. The explicit mode, denoted as Brote-EX, uses separate parameters (MB ≠ MC): it first generates C with MB and then infers the final answer with MC. In contrast, the implicit mode, denoted as Brote-IM, shares parameters between the two phases (MB = MC), allowing MC to predict the answer directly without explicitly producing intermediate vectors from a separate model.
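The difference between the two modes can be summarized in a minimal inference sketch. The model wrappers and their `browse` / `generate` methods below are hypothetical placeholders for illustration, not the released implementation.

```python
import torch

def brote_ex_inference(model_b, model_c, images, text):
    """Explicit mode (Brote-EX): browsing and concentrating use separate models."""
    # Browsing phase: M_B reads the entire interleaved input and produces
    # the condition context C, a vector summarizing intent and visual content.
    with torch.no_grad():
        condition_context = model_b.browse(images=images, text=text)  # C
    # Concentrating phase: M_C answers the query under the guidance of C.
    return model_c.generate(images=images, text=text,
                            condition_context=condition_context)

def brote_im_inference(model, images, text):
    """Implicit mode (Brote-IM): M_B = M_C share parameters, so the model
    predicts the answer directly without materializing C from another model."""
    return model.generate(images=images, text=text)
```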
To encourage further exploitation of the information in C for VL tasks, we propose a new training strategy named context-dropping training. The strategy intentionally omits particular inputs while requiring the model to infer the answer solely with the assistance of C, which motivates the model to compensate for the missing information using the provided condition context. We propose three dropping strategies (a data-construction sketch follows the list):
- Drop images: This involves two approaches: removing certain images entirely (Context Dropping (IMG-N)) and replacing the original images with blank placeholders (Context Dropping (IMG-B)).
- Drop text: We remove the text before the last image (Context Dropping (TXT)).
- Drop ALL: A combination of the above settings, denoted as ALL, in which each is applied with equal probability.
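A minimal sketch of how a context-dropping training example could be constructed is given below, assuming the input is a list of PIL images interleaved with text segments; the function name, the dropping probability `p`, and the simplified text bookkeeping are our own assumptions, not the paper's released code.

```python
import random
from PIL import Image

def apply_context_dropping(images, text_segments, mode="ALL", p=0.5):
    """Illustrative augmentation for context-dropping training (see the list above)."""
    if mode == "ALL":
        # Drop ALL: pick one of the dropping settings with equal probability.
        mode = random.choice(["IMG-N", "IMG-B", "TXT"])
    images, text_segments = list(images), list(text_segments)
    if mode == "IMG-N":
        # Drop images (IMG-N): remove each image with probability p.
        images = [im for im in images if random.random() > p]
    elif mode == "IMG-B":
        # Drop images (IMG-B): replace each image with a blank placeholder
        # of the same size with probability p.
        images = [Image.new("RGB", im.size) if random.random() < p else im
                  for im in images]
    elif mode == "TXT":
        # Drop text (TXT): remove the text preceding the last image,
        # keeping only the final text segment (simplified bookkeeping).
        text_segments = text_segments[-1:]
    return images, text_segments
```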
We report our results on in-context learning and multi-image / video benchmarks (first table) and on general VL benchmarks (second table):
Model | #Param LLM | VQAv2 | A-OKVQA | NLVR2 | DEMON | SEED | MSVD QA | MSRVTT QA | AVG
---|---|---|---|---|---|---|---|---|---
KOSMOS-1 | 1.3B | 51.8 | - | - | - | - | - | - | - |
InstructBLIP-XL | 3B | 31.76* | 39.13* | 52.59* | 32.59* | 52.7 | 43.40* | 12.12* | 37.77 |
MMICL-XL | 3B | 69.16 | 53.43* | 71.48* | 38.14* | 54.69* | 53.68 | 42.36* | 54.71 |
Otter | 7B | 45.39* | 38.42* | 49.54* | 24.51 | 39.7 | 25.87* | 9.78* | - |
VPG-C-LLaMA2 | 7B | - | - | - | 37.22 | - | - | - | - |
Flamingo-9B | 7B | 56.3 | - | - | - | - | 30.2 | 13.7 | - |
Brote-EX-XL | 3B | 69.97 | 56.00 | 71.41 | 37.33 | 57.51 | 53.02 | 43.14 | 55.48 |
Brote-IM-XL | 3B | 68.94 | 56.43 | 76.02 | 37.34 | 57.86 | 56.06 | 45.08 | 56.84 |
InstructBLIP-XXL | 11B | 48.21* | 45.92* | 64.54* | 33.00* | 50.81* | 44.30* | 15.49* | 43.18 |
MMICL-XXL | 11B | 70.56 | 54.85* | 56.16* | 36.30* | 56.66* | 52.19 | 39.46* | 52.18 |
EMU-2 | 33B | 67.0 | - | - | - | 62.8 | 49.0 | 31.4 | - |
Flamingo-80B | 70B | 63.1 | - | - | - | - | 35.6 | 17.4 | - |
Brote-EX-XXL | 11B | 70.86 | 59.94 | 70.42 | 38.70 | 59.31 | 54.42 | 45.24 | 57.00 |
Brote-IM-XXL | 11B | 71.71 | 60.31 | 80.71 | 38.94 | 61.64 | 57.29 | 45.94 | 59.78 |
- The best results for models smaller/larger than 10B are bolded separately, and the second-best results are underlined.
- VQAv2 and A-OKVQA are evaluated under the four-shot in-context learning setting.
- SEED refers to SEED-Bench, which contains both image and video tasks.
- For video benchmarks, we uniformly extract eight frames from each video clip to answer the questions (see the sampling sketch after these notes).
- For "AVG", we average the scores over all benchmarks in this table.
Model | #Param LLM | VQAv2 | A-OKVQA | ScienceQA-IMG | MME Perception | MME Cognition | MMBench | AVG |
---|---|---|---|---|---|---|---|---|
InstructBLIP-XL | 3B | 36.77 | 54.57 | 70.40 | 1093.70* | 281.43* | 69.68* | 68.52 |
MMICL-XL | 3B | 69.13 | 52.12* | 72.58* | 1184.54* | 277.86* | 73.11* | 75.81 |
LLaVA | 7B | - | - | - | 457.82 | 214.64 | 36.2 | - |
Otter | 7B | 57.89* | 41.92* | 63.10 | 1292.26 | 306.43 | 48.3 | 69.51 |
Brote-EX-XL | 3B | 69.90 | 52.93 | 71.15 | 1203.87 | 301.79 | 73.27 | 77.18 |
Brote-IM-XL | 3B | 70.24 | 53.40 | 72.58 | 1181.95 | 266.79 | 74.29 | 75.90 |
InstructBLIP-XXL | 11B | 63.69 | 57.10 | 70.60 | 1212.82* | 291.79* | 70.34* | 75.99 |
MMICL-XXL | 11B | 70.30 | 51.35* | 74.92* | 1313.88* | 311.79* | 76.58* | 80.41 |
MMICL-XXL (BLIP2) | 11B | 69.99 | - | - | 1381.74 | 428.93 | 65.24 | - |
Brote-EX-XXL | 11B | 71.58 | 56.47 | 77.69 | 1279.73 | 310.01 | 76.67 | 81.31 |
Brote-IM-XXL | 11B | 73.02 | 57.83 | 78.38 | 1284.13 | 300.00 | 77.34 | 81.66 |
- The best results for models smaller/larger than 10B are bolded separately, and the second-best results are underlined.
- VQAv2 and A-OKVQA are evaluated under the zero-shot setting.
- ScienceQA-IMG is evaluated under the zero-shot CoT (ZS-CoT) setting.
- For "AVG", we first average the MME scores over its subtasks and then average the scores of all benchmarks in this table.
📑 If you find our project helpful to your research, please consider citing:
    @article{wang2024browse,
      title={Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion},
      author={Wang, Ziyue and Chen, Chi and Zhu, Yiqi and Luo, Fuwen and Li, Peng and Yan, Ming and Zhang, Ji and Huang, Fei and Sun, Maosong and Liu, Yang},
      journal={arXiv preprint arXiv:2402.12195},
      year={2024}
    }