Image

Table of Contents

  • Image Understanding
    • Reading List
    • Datasets & Benchmarks
    • Evaluation Toolkits
    • Data Collection Tools
    • Tools
  • Image Generation
    • Reading List
    • Open-source Projects

Image Understanding

Reading List

Notes: INST = Instruction, FT = Fine-tuning, PT = Pre-training, ICL = In-Context Learning, ZS = Zero-Shot, FS = Few-Shot, RTr = Retrieval

| Paper | Base Language Model | Framework | Data | Code | Publication | Preprint | Affiliation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| xGen-MM (BLIP-3): A Family of Open Large Multimodal Models | Phi3 | PT + FT (interleaved resampler) | self-curation | | | 2408.08872 | Salesforce |
| LLaVA-OneVision: Easy Visual Task Transfer | Qwen-2 | PT + FT (knowledge + v-inst) | self-curation (one-vision) | LLaVA-OneVision | | 2408.03326 | ByteDance |
| Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | Vicuna | PT + FT | self-construct (cambrian) | cambrian | | 2406.16860 | Meta |
| Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | Vicuna/Mixtral/Yi | PT + FT | self-construct (mini-gemini) | MGM | | 2403.18814 | CUHK |
| MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | ? | PT + FT | self-construct + mixture | - | | 2403.09611 | Apple |
| An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models | Llama/Qwen | Prune | - | FastV | | 2403.06764 | Alibaba |
| DeepSeek-VL: Towards Real-World Vision-Language Understanding | DeepSeek LLM | PT + FT | mixture | DeepSeek-VL | | 2403.05525 | DeepSeek |
| Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models | LLaVA | FT | mixture-HR | LLaVA-HR | | 2403.03003 | XMU |
| Efficient Multimodal Learning from Data-centric Perspective | Phi, StableLM | PT + FT | LAION-2B | Bunny | | 2402.11530 | BAAI |
| Efficient Visual Representation Learning with Bidirectional State Space Model | SSM | efficient | | Vim | | 2401.09417 | HUST |
| AIM: Autoregressive Image Models | ViT | Scale | | ml-aim | | 2401.08541 | Apple |
| LEGO: Language Enhanced Multi-modal Grounding Model | Vicuna | PT + SFT | mixture + self-construct | LEGO | | 2401.06071 | ByteDance |
| COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training | gated cross-attn + latent array | PT + FT | mixture | cosmo | | 2401.00849 | NUS |
| Tracking with Human-Intent Reasoning | LLaMA (LLaVA) | PT + FT | mixture | TrackGPT | | 2312.17448 | Alibaba |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | Vicuna | PT + FT | mixture | InternVL | CVPR 2024 | 2312.14238 | Shanghai AI Lab |
| VCoder: Versatile Vision Encoders for Multimodal Large Language Models | LLaMA (LLaVA-1.5) | FT (depth encoder + segment encoder) | COCO Segmentation Text (COST) | VCoder | CVPR 2024 | 2312.14233 | Gatech |
| V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs | Vicuna-7B | FT + obj. refine (search) | mixture + self-construct (object) | vstar | | 2312.14135 | NYU |
| Osprey: Pixel Understanding with Visual Instruction Tuning | Vicuna | PT + FT | mixture | Osprey | | 2312.10032 | ZJU |
| Tokenize Anything via Prompting | SAM | PT | mixture (mainly SA-1B) | tokenize-anything | | 2312.09128 | BAAI |
| Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens | LLaVA | INST | video-chatgpt | Vista-LLaMA (web) | | 2312.08870 | ByteDance |
| Gemini: A Family of Highly Capable Multimodal Models | Transformer-Decoder | FT (language decoder + image decoder) | ? | ? | - | 2312.blog | Google |
| VILA: On Pre-training for Visual Language Models | Llama | PT + FT | self-construct + llava-1.5 | VILA | | 2312.07533 | NVIDIA |
| Honeybee: Locality-enhanced Projector for Multimodal LLM | LLaMA/Vicuna | PT + INST (projector) | mixture | Honeybee | CVPR 2024 | 2312.06742 | KakaoBrain |
| Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | vocabulary network + LLM | PT | mixture (doc, chart + open-domain) | Vary | | 2312.06109 | MEGVII |
| LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models | LLaMA (LLaVA) | FT (FT grounding model + INST FT) | RefCOCO + Flickr30K + LLaVA | LLaVA-Grounding | | 2312.02949 | MSR |
| Making Large Multimodal Models Understand Arbitrary Visual Prompts | LLaMA (LLaVA) | PT + INST FT | BLIP + LLaVA-1.5 | ViP-LLaVA | | 2312.00784 | Wisconsin-Madison |
| Sequential Modeling Enables Scalable Learning for Large Vision Models | LLaMA | PT (Visual Tokenizer) | mixture (430B visual tokens, 50 datasets, mainly from LAION) | LVM | | 2312.00785 | UCB |
| Compositional Chain-of-Thought Prompting for Large Multimodal Models | LLaVA | CoT (scene graph) | | | | 2311.17076 | UCB |
| GLaMM: Pixel Grounding Large Multimodal Model | Vicuna-1.5 | FT | self-construct (grounding-anything dataset) | GLaMM | | 2311.03356 | MBZUAI |
| Analyzing and Mitigating Object Hallucination in Large Vision-Language Models | GPT-3.5 + LLMs | FT (hallucination) | mixture | LURE | ICLR 2024 | 2310.00754 | UNC |
| CogVLM: Visual Expert For Large Language Models | Vicuna | PT + FT | self-construct + mixture | CogVLM | | 2309.github | Zhipu AI |
| GPT-4V(ision) System Card | GPT4 | | | - | | 2309.blog | OpenAI |
| Demystifying CLIP Data | CLIP | PT | curated & transparent CLIP dataset | MetaCLIP | | 2309.16671 | Meta |
| InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition | InternLM | PT + FT | mixture | InternLM-XComposer | | 2309.15112 | Shanghai AI Lab. |
| DreamLLM: Synergistic Multimodal Comprehension and Creation | LLaMA | PT + FT | mixture | DreamLLM | | 2309.11499 | MEGVII |
| LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization | LLaMA | PT + FT (Visual Tokenizer) | mixture | LaVIT | | 2309.04669 | Kuaishou |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Qwen | PT + FT | | Qwen-VL | | 2308.12966 | Alibaba |
| BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions | Vicuna-7B/Flan-T5-XXL | FT | same as InstructBLIP | BLIVA | | 2308.09936 | UCSD |
| The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World | Husky-7B | PT | AS-1B | all-seeing | | 2308.01907 | Shanghai AI Lab |
| LISA: Reasoning Segmentation via Large Language Model | LLaMA | PT | mixture | LISA | | 2308.00692 | SmartMore |
| Generative Pretraining in Multimodality, v2 | LLaMA, Diffusion | PT, Visual Decoder | mixture | Emu | | 2307.05222 | BAAI |
| What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | Vicuna-7B | PT + FT | mixture (empirical) | lynx | | 2307.02469 | ByteDance |
| Visual Instruction Tuning with Polite Flamingo | Flamingo | FT + (rewrite instruction) | PF-1M, LLaVA-Instruction-177K | Polite Flamingo | | 2307.01003 | Xiaobing |
| LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | Vicuna-13B | FT + MM-INST | self-construct (text-rich image) | LLaVAR | | 2306.17107 | Gatech |
| Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | Vicuna-7B/13B | FT + MM-INST | self-construct (referential dialogue) | Shikra | | 2306.15195 | SenseTime |
| KOSMOS-2: Grounding Multimodal Large Language Models to the World | Magneto | PT + obj | GrIT (90M images) | Kosmos-2 | | 2306.14824 | Microsoft |
| Aligning Large Multi-Modal Model with Robust Instruction Tuning | Vicuna (MiniGPT4-like) | FT + MM-INST | LRV-Instruction (150K INST, robust), GAVIE (evaluate) | LRV-Instruction | | 2306.14565 | UMD |
| LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Vicuna-7B/13B | FT + MM-INST | LAMM-Dataset (186K INST), LAMM-Benchmark | LAMM | | 2306.06687 | Shanghai AI Lab |
| Improving CLIP Training with Language Rewrites | CLIP + ChatGPT | FT + Data-aug | mixture | LaCLIP | | 2305.20088 | Google |
| ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | Vicuna-13B | FT + MM-INST | | ChatBridge | | 2305.16103 | CAS |
| Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models | LLaMA-7B/13B | FT adapter + MM-INST | self-construct (INST) | LaVIN | | 2305.15023 | Xiamen Univ. |
| IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models | ChatGPT | iterative, compositional (que, ans, rea), ZS | | IdealGPT | | 2305.14985 | Columbia |
| DetGPT: Detect What You Need via Reasoning | Robin, Vicuna | FT + MM-INST + detector | self-construct | DetGPT | | 2305.14167 | HKUST |
| VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks | Alpaca | | | VisionLLM | | 2305.11175 | Shanghai AI Lab. |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | Vicuna | | | InstructBLIP | | 2305.06500 | Salesforce |
| MultiModal-GPT: A Vision and Language Model for Dialogue with Humans | Flamingo | FT + MM-INST, LoRA | mixture | Multimodal-GPT | | 2305.04790 | NUS |
| Otter: A Multi-Modal Model with In-Context Instruction Tuning | Flamingo | | | Otter | | 2305.03726 | NTU |
| X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | ChatGPT | | | X-LLM | | 2305.04160 | CAS |
| LMEye: An Interactive Perception Network for Large Language Models | OPT, Bloomz, BLIP2 | PT, FT + MM-INST | self-construct | LingCloud | | 2305.03701 | HIT |
| Caption Anything: Interactive Image Description with Diverse Multimodal Controls | BLIP2, ChatGPT | ZS | | Caption Anything | | 2305.02677 | SUSTech |
| Multimodal Procedural Planning via Dual Text-Image Prompting | OFA, BLIP, GPT3 | | | TIP | | 2305.01795 | UCSB |
| Transfer Visual Prompt Generator across LLMs | FlanT5, OPT | projector + transfer strategy | | VPGTrans | | 2305.01278 | CUHK |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | LLaMA | | | LLaMA-Adapter | | 2304.15010 | Shanghai AI Lab. |
| mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality (mPLUG, mPLUG-2) | LLaMA | | | mPLUG-Owl | | 2304.14178 | DAMO Academy |
| MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models | Vicuna | | | MiniGPT4 | | 2304.10592 | KAUST |
| Visual Instruction Tuning | LLaMA | full-param. + INST tuning | LLaVA-Instruct-150K (150K INST by GPT4) | LLaVA | | 2304.08485 | Microsoft |
| Chain of Thought Prompt Tuning in Vision Language Models | - | Visual CoT | | - | | 2304.07919 | PKU |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | ChatGPT | | | MM-REACT | | 2303.11381 | Microsoft |
| ViperGPT: Visual Inference via Python Execution for Reasoning | Codex | | | ViperGPT | ICCV 2023 | 2303.08128 | Columbia |
| Scaling Vision-Language Models with Sparse Mixture of Experts | | (MoE + Scaling) | | | | 2303.07226 | Microsoft |
| ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions | ChatGPT, Flan-T5 (BLIP2) | | | ChatCaptioner | | 2303.06594 | KAUST |
| Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models | ChatGPT | | | Visual ChatGPT | | 2303.04671 | Microsoft |
| PaLM-E: An Embodied Multimodal Language Model | PaLM | | | | | 2303.03378 | Google |
| Prismer: A Vision-Language Model with An Ensemble of Experts | RoBERTa, OPT, BLOOM | | | Prismer | | 2303.02506 | NVIDIA |
| Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners | GPT3, CLIP, DINO, DALLE | FS, evaluate: img-cls | | CaFo | CVPR 2023 | 2303.02151 | CAS |
| Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering | GPT3 | RTr + candidates, ICL | evaluate: OKVQA, A-OKVQA | Prophet | CVPR 2023 | 2303.01903 | HDU |
| Language Is Not All You Need: Aligning Perception with Language Models | Magneto | | | KOSMOS-1 | | 2302.14045 | Microsoft |
| Scaling Vision Transformers to 22 Billion Parameters | | (CLIP + Scaling) | | | | 2302.05442 | Google |
| Multimodal Chain-of-Thought Reasoning in Language Models | T5 | FT + MM-CoT | | MM-COT | | 2302.00923 | Amazon |
| Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Caption | RETRO | | | | | 2302.04858 | NVIDIA |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Flan-T5 / Q-Former | | | BLIP2 | ICML 2023 | 2301.12597 | Salesforce |
| See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning | OPT | | | | | 2301.05226 | MIT-IBM |
| Generalized Decoding for Pixel, Image, and Language | GPT3 | | | X-GPT | | 2212.11270 | Microsoft |
| From Images to Textual Prompts: Zero-shot Visual Question Answering with Frozen Large Language Models | OPT | | | Img2LLM | CVPR 2023 | 2212.10846 | Salesforce |
| Visual Programming: Compositional visual reasoning without training | GPT3 | Compositional/Tool-Learning | | VisProg | CVPR 2023 best paper | 2211.11559 | AI2 |
| Language Models are General-Purpose Interfaces | DeepNorm | Semi-Causal | | METALM | | 2206.06336 | Microsoft |
| Language Models Can See: Plugging Visual Controls in Text Generation | GPT2 | | | MAGIC | | 2205.02655 | Tencent |
| Flamingo: a Visual Language Model for Few-Shot Learning | Chinchilla / adapter | | | Flamingo | NeurIPS 2022 | 2204.14198 | DeepMind |
| Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | GPT3, RoBERTa | | | Socratic Models | ICLR 2023 | 2204.00598 | Google |
| An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | GPT3 | LLM as KB, ICL | evaluate: OKVQA | PICa | AAAI 2022 | 2109.05014 | Microsoft |
| Multimodal Few-shot Learning with Frozen Language Models | Transformer-LM-7B (PT on C4) | ICL | Conceptual Captions | Frozen (unofficial) | NeurIPS 2021 | 2106.13884 | DeepMind |
| Perceiver: General Perception with Iterative Attention | Perceiver | latent array | | | ICML 2021 | 2103.03206 | DeepMind |
| Learning Transferable Visual Models From Natural Language Supervision | BERT / contrastive learning | | | CLIP | ICML 2021 | 2103.00020 | OpenAI |

Datasets & Benchmarks

Datasets

| Dataset | Source | Format | Paper | Preprint | Publication | Affiliation |
| --- | --- | --- | --- | --- | --- | --- |
| cambrian | mixture INST | instruction (10M) | Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | 2406.16860 | | Meta |
| MINT-1T | mixture | html/pdf/arxiv corpora (3.4B imgs) | MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens | 2406.11271 | | Salesforce |
| OmniCorpus | mixture | html corpora (10B imgs) | OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text | 2406.08418 | | Shanghai AI Lab |
| MGM | mixture INST | instruction (1.2M) | Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | 2403.18814 | | CUHK |
| ALLaVA-4V | LAION/Vision-FLAN (GPT4V) | instruction (505K/203K) | ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | 2401.06209 | | CUHKSZ |
| M3IT | mixture INST | self-construct INSTs (2.4M) | M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | 2306.04387 | | HKU |
| OBELICS | I-T pairs | corpora (353M imgs) | OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents | 2306.16527 | | Huggingface |
| LLaVA | mixture INST | instruction (675K) | LLaVA: Large Language and Vision Assistant | 2304.08485 | | Microsoft |
| LAION | I-T pairs | corpora (2.32B) | LAION-5B: An open large-scale dataset for training next generation image-text models | 2210.08402 | | UCB |

Benchmarks

| Benchmark | Task | Data | Paper | Preprint | Publication | Affiliation |
| --- | --- | --- | --- | --- | --- | --- |
| MMVP | QA (pattern error) | human-annotated (300) | Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs | 2401.06209 | | NYU |
| MMMU | QA (general domain) | human-collected (11.5K) | MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI | 2311.16502 | | OSU, UWaterloo |
| MLLM-Bench | General INST | human-collected | MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V | 2311.13951 | | CUHK |
| HallusionBench | QA (hallucination) | human-annotated (1,129) | HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination & Visual Illusion in Large Vision-Language Models | 2310.14566 | | UMD |
| MathVista | QA (math: IQTest, FunctionQA, PaperQA) | self-construct + mixture QA pairs (6K) | MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts | 2310.02255 | | Microsoft |
| VisIT-Bench | QA (general domain) | self-construct (592) | VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use | 2308.06595 | | LAION |
| SEED-Bench | QA (general domain) | self-construct (19K) | SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | 2307.16125 | | Tencent |
| MMBench | QA (general domain) | mixture (2.9K) | MMBench: Is Your Multi-modal Model an All-around Player? | 2307.06281 | | Shanghai AI Lab. |
| MME | QA (general domain) | self-construct (2.1K) | MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | 2306.13394 | | XMU |
| POPE | General (object hallucination) | | POPE: Polling-based Object Probing Evaluation for Object Hallucination | 2305.10355 | | RUC |
| DataComp | Curate I-T pairs | 12.8M I-T pairs | DataComp: In search of the next generation of multimodal datasets | 2304.14108 | | DataComp.AI |
| MM-Vet | General | mm-vet.zip | MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities | | | |
| INFOSEEK | VQA | OVEN (open-domain image) + human anno. | Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? | 2302.11713 | | Google |
| MultiInstruct | General INST (Grounded Caption, Text Localization, Referring Expression Selection, Question-Image Matching) | self-construct INSTs (62 * (5+5)) | MULTIINSTRUCT: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | 2212.10773 | ACL 2023 | Virginia Tech |
| ScienceQA | QA (elementary and high school science curricula) | self-construct QA pairs (21K) | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | 2209.09513 | NeurIPS 2022 | AI2 |

Evaluation Toolkits

  • VLMEvalKit, an open-source evaluation toolkit for large vision-language models (LVLMs); the Python package name is vlmeval (minimal usage sketch below).
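
A minimal usage sketch, assuming vlmeval is installed and following the toolkit's documented Python quickstart; the model key and image path below are illustrative and may differ across releases, so check the repository README for the current interface.

```python
# Sketch only: evaluate a supported LVLM on a single image with VLMEvalKit.
# Assumptions: `pip install vlmeval`, model weights are downloadable, and the
# model key / generate() interface match the repo's quickstart for your version.
from vlmeval.config import supported_VLM

model = supported_VLM['idefics_9b_instruct']()      # build one of the registered LVLMs
ret = model.generate(['assets/apple.jpg',           # interleaved input: image path ...
                      'What is in this image?'])    # ... followed by a text prompt
print(ret)                                          # free-form answer from the model
```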

Data Collection Tools

  • VisionDatasets, Scripts and logic to create high-quality pre-training and fine-tuning datasets for multi-modal models.
  • Visual-Instruction-Tuning, Scale up visual instruction tuning to millions of examples with GPT-4 (see the sketch after this list).
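
The general recipe behind such pipelines (as in the LLaVA-Instruct-150K data listed above) is to feed image annotations, such as captions and boxes, to a strong LLM and have it write conversations about the unseen image. Below is a minimal sketch of that recipe, assuming the official openai Python client; the prompt wording, model name, and caption/box inputs are illustrative placeholders, not any repository's actual pipeline.

```python
# Sketch: turn COCO-style captions + boxes into visual-instruction data with an LLM.
# Assumptions: `pip install openai`, OPENAI_API_KEY is set, and the model name below
# is a placeholder -- this illustrates the general LLaVA-style recipe only.
from openai import OpenAI

client = OpenAI()

def make_instruction_sample(caption: str, boxes: list[str]) -> str:
    """Ask the LLM to write a conversation about an image it cannot see."""
    context = f"Captions: {caption}\nObjects: {'; '.join(boxes)}"
    prompt = (
        "You are shown image annotations instead of the image itself.\n"
        f"{context}\n"
        "Write a multi-turn Q&A conversation about the image, as if you could see it."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",                                   # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return resp.choices[0].message.content

print(make_instruction_sample(
    "A man rides a bicycle down a rainy street.",
    ["person [0.31, 0.20, 0.56, 0.88]", "bicycle [0.28, 0.45, 0.60, 0.95]"],
))
```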

Tools

  • EasyOCR, Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, and Cyrillic (minimal usage sketch below).
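
A minimal usage sketch based on EasyOCR's documented quickstart; the image path is a placeholder.

```python
# Minimal EasyOCR sketch: detect and read text in an image.
# Assumes `pip install easyocr`; detection/recognition models download on first run.
import easyocr

reader = easyocr.Reader(['ch_sim', 'en'])       # load simplified-Chinese + English models
results = reader.readtext('example.jpg')        # placeholder image path
for bbox, text, confidence in results:          # each hit: corner points, string, score
    print(f"{text} ({confidence:.2f})")
```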

Image Generation

Reading List

This list also includes some insightful works that are not LLM-based.

| Paper | Base Language Model | Code | Publication | Preprint | Affiliation |
| --- | --- | --- | --- | --- | --- |
| JPEG-LM: LLMs as Image Generators with Canonical Codec Representations | JPEG-LM (codec-based LM) | | | 2408.08459 | Meta |
| Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction | GPT-2, VQ-GAN | VAR | | 2404.02905 | ByteDance |
| InstantID: Zero-shot Identity-Preserving Generation in Seconds | UNet | InstantID | | 2401.07519 | Instant |
| VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation | LLaMA, IP-Adapter (Diffusion) | VL-GPT | | | Tencent |
| LLMGA: Multimodal Large Language Model based Generation Assistant | LLaVA, UNet | LLMGA | | 2311.16500 | CUHK |
| AnyText: Multilingual Visual Text Generation And Editing | ControlNet (OCR) | AnyText | | 2311.03054 | Alibaba |
| MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens | UNet | MiniGPT-5 | | 2310.02239 | UCSC |
| NExT-GPT: Any-to-Any Multimodal LLM | Vicuna-7B, Diffusion | NExT-GPT | | 2309.05519 | NUS |
| Generative Pretraining in Multimodality | LLaMA, Diffusion | Emu | | 2307.05222 | BAAI |
| SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs | PaLM2, GPT3.5 | | | 2306.17842 | Google |
| LayoutGPT: Compositional Visual Planning and Generation with Large Language Models | ChatGPT | LayoutGPT | | 2305.15393 | UCSB |
| BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing | BLIP-2, UNet | BLIP-Diffusion | | 2305.14720 | Salesforce |
| CoDi: Any-to-Any Generation via Composable Diffusion | Diffusion | CoDi | | 2305.11846 | Microsoft & UNC |
| Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | ChatGPT, VAE | Accountable Textual Visual Chat | | 2303.05983 | CUHK |
| Denoising Diffusion Probabilistic Models | Diffusion | diffusion | | 2006.11239 | UCB |

Open-source Projects

  • open-prompts, open-source prompts for text-to-image models.
  • LLaMA2-Accessory, An open-source toolkit for pre-training, fine-tuning, and deployment of large language models (LLMs) and multimodal LLMs.
  • Gemini-vs-GPT4V, An in-depth qualitative comparative study of two pioneering models: Google's Gemini and OpenAI's GPT-4V(ision).
  • multimodal-maestro, Effective prompting for Large Multimodal Models like GPT-4 Vision or LLaVA.
    • by roboflow, 2023.11
  • VisCPM, A family of open-source large multimodal models supporting multimodal conversation (VisCPM-Chat) and text-to-image generation (VisCPM-Paint) in both Chinese and English.