Image

Table of Contents

  • Image Understanding
    • Reading List
    • Datasets & Benchmarks
    • Evaluation Toolkits
    • Data Collection Tools
    • Tools
  • Image Generation
    • Reading List
    • Open-source Projects

Image Understanding

Reading List

Notes: INST = Instruction, FT = Fine-tuning, PT = Pre-training, ICL = In-Context Learning, ZS = Zero-Shot, FS = Few-Shot, RTr = Retrieval

| Paper | Base Language Model | Framework | Data | Code | Publication | Preprint | Affiliation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| xGen-MM (BLIP-3): A Family of Open Large Multimodal Models | Phi3 | PT + FT (interleaved resampler) | self-curation | | | 2408.08872 | Salesforce |
| LLaVA-OneVision: Easy Visual Task Transfer | Qwen-2 | PT + FT (knowledge + v-inst) | self-curation (one-vision) | LLaVA-OneVision | | 2408.03326 | ByteDance |
| Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | Vicuna | PT + FT | self-construct (cambrian) | cambrian | | 2406.16860 | Meta |
| Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | Vicuna/Mixtral/Yi | PT + FT | self-construct (mini-gemini) | MGM | | 2403.18814 | CUHK |
| MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | ? | PT + FT | self-construct + mixture | - | | 2403.09611 | Apple |
| An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models | Llama/Qwen | Prune | - | FastV | | 2403.06764 | Alibaba |
| DeepSeek-VL: Towards Real-World Vision-Language Understanding | DeepSeek LLM | PT + FT | mixture | DeepSeek-VL | | 2403.05525 | DeepSeek |
| Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models | LLaVA | FT | mixture-HR | LLaVA-HR | | 2403.03003 | XMU |
| Efficient Multimodal Learning from Data-centric Perspective | Phi, StableLM | PT + FT | LAION-2B | Bunny | | 2402.11530 | BAAI |
| Efficient Visual Representation Learning with Bidirectional State Space Model | SSM | efficient | | Vim | | 2401.09417 | HUST |
| AIM: Autoregressive Image Models | ViT | Scale | | ml-aim | | 2401.08541 | Apple |
| LEGO: Language Enhanced Multi-modal Grounding Model | Vicuna | PT + SFT | mixture + self-construct | LEGO | | 2401.06071 | ByteDance |
| COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training | gated cross-attn + latent array | PT + FT | mixture | cosmo | | 2401.00849 | NUS |
| Tracking with Human-Intent Reasoning | LLaMA (LLaVA) | PT + FT | mixture | TrackGPT | | 2312.17448 | Alibaba |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | Vicuna | PT + FT | mixture | InternVL | CVPR 2024 | 2312.14238 | Shanghai AI Lab |
| VCoder: Versatile Vision Encoders for Multimodal Large Language Models | LLaMA (LLaVA-1.5) | FT (depth encoder + segment encoder) | COCO Segmentation Text (COST) | VCoder | CVPR 2024 | 2312.14233 | Gatech |
| V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs | Vicuna-7B | FT + obj. refine (search) | mixture + self-construct (object) | vstar | | 2312.14135 | NYU |
| Osprey: Pixel Understanding with Visual Instruction Tuning | Vicuna | PT + FT | mixture | Osprey | | 2312.10032 | ZJU |
| Tokenize Anything via Prompting | SAM | PT | mixture (mainly SA-1B) | tokenize-anything | | 2312.09128 | BAAI |
| Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens | LLaVA | INST | video-chatgpt | Vista-LLaMA (web) | | 2312.08870 | ByteDance |
| Gemini: A Family of Highly Capable Multimodal Models | Transformer-Decoder | FT (language decoder + image decoder) | ? | ? | - | 2312.blog | Google |
| VILA: On Pre-training for Visual Language Models | Llama | PT + FT | self-construct + llava-1.5 | VILA | | 2312.07533 | NVIDIA |
| Honeybee: Locality-enhanced Projector for Multimodal LLM | LLaMA/Vicuna | PT + INST (projector) | mixture | Honeybee | CVPR 2024 | 2312.06742 | KakaoBrain |
| Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | vocabulary network + LLM | PT | mixture (doc, chart + open-domain) | Vary | | 2312.06109 | MEGVII |
| LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models | LLaMA (LLaVA) | FT (FT grounding model + INST FT) | RefCOCO + Flickr30K + LLaVA | LLaVA-Grounding | | 2312.02949 | MSR |
| Making Large Multimodal Models Understand Arbitrary Visual Prompts | LLaMA (LLaVA) | PT + INST FT | BLIP + LLaVA-1.5 | ViP-LLaVA | | 2312.00784 | Wisconsin-Madison |
| Sequential Modeling Enables Scalable Learning for Large Vision Models | LLaMA | PT (Visual Tokenizer) | mixture (430B visual tokens, 50 datasets, mainly from LAION) | LVM | | 2312.00785 | UCB |
| Compositional Chain-of-Thought Prompting for Large Multimodal Models | LLaVA | CoT (scene graph) | | | | 2311.17076 | UCB |
| GLaMM: Pixel Grounding Large Multimodal Model | Vicuna-1.5 | FT | self-construct (grounding-anything dataset) | GLaMM | | 2311.03356 | MBZUAI |
| Analyzing and Mitigating Object Hallucination in Large Vision-Language Models | GPT-3.5 + LLMs | FT (hallucination) | mixture | LURE | ICLR 2024 | 2310.00754 | UNC |
| CogVLM: Visual Expert For Large Language Models | Vicuna | PT + FT | self-construct + mixture | CogVLM | | 2309.github | Zhipu AI |
| GPT-4V(ision) System Card | GPT4 | | | - | | 2309.blog | OpenAI |
| Demystifying CLIP Data | CLIP | PT | curated & transparent CLIP dataset | MetaCLIP | | 2309.16671 | Meta |
| InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition | InternLM | PT + FT | mixture | InternLM-XComposer | | 2309.15112 | Shanghai AI Lab. |
| DreamLLM: Synergistic Multimodal Comprehension and Creation | LLaMA | PT + FT | mixture | DreamLLM | | 2309.11499 | MEGVII |
| LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization | LLaMA | PT + FT (Visual Tokenizer) | mixture | LaVIT | | 2309.04669 | Kuaishou |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Qwen | PT + FT | | Qwen-VL | | 2308.12966 | Alibaba |
| BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions | Vicuna-7B/Flan-T5-XXL | FT | same as InstructBLIP | BLIVA | | 2308.09936 | UCSD |
| The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World | Husky-7B | PT | AS-1B | all-seeing | | 2308.01907 | Shanghai AI Lab |
| LISA: Reasoning Segmentation via Large Language Model | LLaMA | PT | mixture | LISA | | 2308.00692 | SmartMore |
| Generative Pretraining in Multimodality, v2 | LLaMA, Diffusion | PT, Visual Decoder | mixture | Emu | | 2307.05222 | BAAI |
| What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | Vicuna-7B | PT + FT | mixture (empirical) | lynx | | 2307.02469 | ByteDance |
| Visual Instruction Tuning with Polite Flamingo | Flamingo | FT + (rewrite instruction) | PF-1M, LLaVA-Instruction-177K | Polite Flamingo | | 2307.01003 | Xiaobing |
| LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | Vicuna-13B | FT + MM-INST | self-construct (text-rich image) | LLaVAR | | 2306.17107 | Gatech |
| Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | Vicuna-7B/13B | FT + MM-INST | self-construct (referential dialogue) | Shikra | | 2306.15195 | SenseTime |
| KOSMOS-2: Grounding Multimodal Large Language Models to the World | Magneto | PT + obj | GrIT (90M images) | Kosmos-2 | | 2306.14824 | Microsoft |
| Aligning Large Multi-Modal Model with Robust Instruction Tuning | Vicuna (MiniGPT4-like) | FT + MM-INST | LRV-Instruction (150K INST, robust), GAVIE (evaluate) | LRV-Instruction | | 2306.14565 | UMD |
| LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Vicuna-7B/13B | FT + MM-INST | LAMM-Dataset (186K INST), LAMM-Benchmark | LAMM | | 2306.06687 | Shanghai AI Lab |
| Improving CLIP Training with Language Rewrites | CLIP + ChatGPT | FT + Data-aug | mixture | LaCLIP | | 2305.20088 | Google |
| ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | Vicuna-13B | FT + MM-INST | | ChatBridge | | 2305.16103 | CAS |
| Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models | LLaMA-7B/13B | FT adapter + MM-INST | self-construct (INST) | LaVIN | | 2305.15023 | Xiamen Univ. |
| IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models | ChatGPT | iterative, compositional (que, ans, rea), ZS | | IdealGPT | | 2305.14985 | Columbia |
| DetGPT: Detect What You Need via Reasoning | Robin, Vicuna | FT + MM-INST + detector | self-construct | DetGPT | | 2305.14167 | HKUST |
| VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks | Alpaca | | | VisionLLM | | 2305.11175 | Shanghai AI Lab. |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | Vicuna | | | InstructBLIP | | 2305.06500 | Salesforce |
| MultiModal-GPT: A Vision and Language Model for Dialogue with Humans | Flamingo | FT + MM-INST, LoRA | mixture | Multimodal-GPT | | 2305.04790 | NUS |
| Otter: A Multi-Modal Model with In-Context Instruction Tuning | Flamingo | | | Otter | | 2305.03726 | NTU |
| X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | ChatGPT | | | X-LLM | | 2305.04160 | CAS |
| LMEye: An Interactive Perception Network for Large Language Models | OPT, Bloomz, BLIP2 | PT, FT + MM-INST | self-construct | LingCloud | | 2305.03701 | HIT |
| Caption Anything: Interactive Image Description with Diverse Multimodal Controls | BLIP2, ChatGPT | ZS | | Caption Anything | | 2305.02677 | SUSTech |
| Multimodal Procedural Planning via Dual Text-Image Prompting | OFA, BLIP, GPT3 | | | TIP | | 2305.01795 | UCSB |
| Transfer Visual Prompt Generator across LLMs | FlanT5, OPT | projector + transfer strategy | | VPGTrans | | 2305.01278 | CUHK |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | LLaMA | | | LLaMA-Adapter | | 2304.15010 | Shanghai AI Lab. |
| mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality (mPLUG, mPLUG-2) | LLaMA | | | mPLUG-Owl | | 2304.14178 | DAMO Academy |
| MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models | Vicuna | | | MiniGPT4 | | 2304.10592 | KAUST |
| Visual Instruction Tuning | LLaMA | full-param. + INST tuning | LLaVA-Instruct-150K (150K INST by GPT4) | LLaVA | | 2304.08485 | Microsoft |
| Chain of Thought Prompt Tuning in Vision Language Models | - | Visual CoT | | - | | 2304.07919 | PKU |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | ChatGPT | | | MM-REACT | | 2303.11381 | Microsoft |
| ViperGPT: Visual Inference via Python Execution for Reasoning | Codex | | | ViperGPT | ICCV 2023 | 2303.08128 | Columbia |
| Scaling Vision-Language Models with Sparse Mixture of Experts | | (MoE + Scaling) | | | | 2303.07226 | Microsoft |
| ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions | ChatGPT, Flan-T5 (BLIP2) | | | ChatCaptioner | | 2303.06594 | KAUST |
| Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models | ChatGPT | | | Visual ChatGPT | | 2303.04671 | Microsoft |
| PaLM-E: An Embodied Multimodal Language Model | PaLM | | | | | 2303.03378 | Google |
| Prismer: A Vision-Language Model with An Ensemble of Experts | RoBERTa, OPT, BLOOM | | | Prismer | | 2303.02506 | NVIDIA |
| Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners | GPT3, CLIP, DINO, DALLE | FS, evaluate: img-cls | | CaFo | CVPR 2023 | 2303.02151 | CAS |
| Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering | GPT3 | RTr + candidates, ICL | evaluate: OKVQA, A-OKVQA | Prophet | CVPR 2023 | 2303.01903 | HDU |
| Language Is Not All You Need: Aligning Perception with Language Models | Magneto | | | KOSMOS-1 | | 2302.14045 | Microsoft |
| Scaling Vision Transformers to 22 Billion Parameters | | (CLIP + Scaling) | | | | 2302.05442 | Google |
| Multimodal Chain-of-Thought Reasoning in Language Models | T5 | FT + MM-CoT | | MM-COT | | 2302.00923 | Amazon |
| Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Caption | RETRO | | | | | 2302.04858 | NVIDIA |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Flan-T5 / Q-Former | | | BLIP2 | ICML 2023 | 2301.12597 | Salesforce |
| See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning | OPT | | | | | 2301.05226 | MIT-IBM |
| Generalized Decoding for Pixel, Image, and Language | GPT3 | | | X-GPT | | 2212.11270 | Microsoft |
| From Images to Textual Prompts: Zero-shot Visual Question Answering with Frozen Large Language Models | OPT | | | Img2LLM | CVPR 2023 | 2212.10846 | Salesforce |
| Visual Programming: Compositional visual reasoning without training | GPT3 | Compositional/Tool-Learning | | VisProg | CVPR 2023 best paper | 2211.11559 | AI2 |
| Language Models are General-Purpose Interfaces | DeepNorm | Semi-Causal | | METALM | | 2206.06336 | Microsoft |
| Language Models Can See: Plugging Visual Controls in Text Generation | GPT2 | | | MAGIC | | 2205.02655 | Tencent |
| Flamingo: a Visual Language Model for Few-Shot Learning | Chinchilla / adapter | | | Flamingo | NeurIPS 2022 | 2204.14198 | DeepMind |
| Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | GPT3, RoBERTa | | | Socratic Models | ICLR 2023 | 2204.00598 | Google |
| An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | GPT3 | LLM as KB, ICL | evaluate: OKVQA | PICa | AAAI 2022 | 2109.05014 | Microsoft |
| Multimodal Few-shot Learning with Frozen Language Models | Transformer-LM-7B (PT on C4) | ICL | Conceptual Captions | Frozen (unofficial) | NeurIPS 2021 | 2106.13884 | DeepMind |
| Perceiver: General Perception with Iterative Attention | Perceiver | latent array | | | ICML 2021 | 2103.03206 | DeepMind |
| Learning Transferable Visual Models From Natural Language Supervision | BERT / contrastive learning | | | CLIP | ICML 2021 | 2103.00020 | OpenAI |

Datasets & Benchmarks

Datasets

| Dataset | Source | Format | Paper | Preprint | Publication | Affiliation |
| --- | --- | --- | --- | --- | --- | --- |
| cambrian | mixture INST | instruction (10M) | Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | 2406.16860 | | Meta |
| MINT-1T | mixture | html/pdf/arxiv corpora (3.4B imgs) | MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens | 2406.11271 | | Salesforce |
| OmniCorpus | mixture | html corpora (10B imgs) | OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text | 2406.08418 | | Shanghai AI Lab |
| MGM | mixture INST | instruction (1.2M) | Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | 2403.18814 | | CUHK |
| ALLaVA-4V | LAION/Vision-FLAN (GPT4V) | instruction (505K/203K) | ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | 2401.06209 | | CUHKSZ |
| M3IT | mixture INST | self-construct INSTs (2.4M) | M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | 2306.04387 | | HKU |
| OBELICS | I-T pairs | corpora (353M imgs) | OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents | 2306.16527 | | Huggingface |
| LLaVA | mixture INST | instruction (675K) | LLaVA: Large Language and Vision Assistant | 2304.08485 | | Microsoft |
| LAION | I-T pairs | corpora (2.32B) | LAION-5B: An open large-scale dataset for training next generation image-text models | 2210.08402 | | UCB |

Benchmarks

| Benchmark | Task | Data | Paper | Preprint | Publication | Affiliation |
| --- | --- | --- | --- | --- | --- | --- |
| MMVP | QA (pattern error) | human-annotated (300) | Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs | 2401.06209 | | NYU |
| MMMU | QA (general domain) | human-collected (11.5K) | MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI | 2311.16502 | | OSU, UWaterloo |
| MLLM-Bench | General INST | human-collected | MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V | 2311.13951 | | CUHK |
| HallusionBench | QA (hallucination) | human-annotated (1,129) | HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination & Visual Illusion in Large Vision-Language Models | 2310.14566 | | UMD |
| MathVista | QA (math: IQTest, FunctionQA, PaperQA) | self-construct + mixture QA pairs (6K) | MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts | 2310.02255 | | Microsoft |
| VisIT-Bench | QA (general domain) | self-construct (592) | VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use | 2308.06595 | | LAION |
| SEED-Bench | QA (general domain) | self-construct (19K) | SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | 2307.16125 | | Tencent |
| MMBench | QA (general domain) | mixture (2.9K) | MMBench: Is Your Multi-modal Model an All-around Player? | 2307.06281 | | Shanghai AI Lab. |
| MME | QA (general domain) | self-construct (2.1K) | MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | 2306.13394 | | XMU |
| POPE | General (object hallucination) | | POPE: Polling-based Object Probing Evaluation for Object Hallucination | 2305.10355 | | RUC |
| DataComp | Curate I-T pairs | 12.8M I-T pairs | DataComp: In search of the next generation of multimodal datasets | 2304.14108 | | DataComp.AI |
| MM-Vet | General | mm-vet.zip | MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities | | | |
| INFOSEEK | VQA | OVEN (open-domain image) + human anno. | Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? | 2302.11713 | | Google |
| MultiInstruct | General INST (Grounded Caption, Text Localization, Referring Expression Selection, Question-Image Matching) | self-construct INSTs (62 * (5+5)) | MULTIINSTRUCT: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | 2212.10773 | ACL 2023 | Virginia Tech |
| ScienceQA | QA (elementary and high school science curricula) | self-construct QA pairs (21K) | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | 2209.09513 | NeurIPS 2022 | AI2 |

Evaluation Toolkits

  • VLMEvalKit, an open-source evaluation toolkit for large vision-language models (LVLMs); the Python package name is vlmeval (minimal usage sketch below).
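
A minimal usage sketch, assuming vlmeval is installed and following the toolkit's documented Python quickstart; the model key and image path below are illustrative and may differ across releases, so check the repository README for the current interface.

```python
# Sketch only: evaluate a supported LVLM on a single image with VLMEvalKit.
# Assumptions: `pip install vlmeval`, model weights are downloadable, and the
# model key / generate() interface match the repo's quickstart for your version.
from vlmeval.config import supported_VLM

model = supported_VLM['idefics_9b_instruct']()      # build one of the registered LVLMs
ret = model.generate(['assets/apple.jpg',           # interleaved input: image path ...
                      'What is in this image?'])    # ... followed by a text prompt
print(ret)                                          # free-form answer from the model
```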

Data Collection Tools

  • VisionDatasets, Scripts and logic to create high-quality pre-training and fine-tuning datasets for multi-modal models.
  • Visual-Instruction-Tuning, Scale up visual instruction tuning to millions of examples with GPT-4 (see the sketch after this list).
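
The general recipe behind such pipelines (as in the LLaVA-Instruct-150K data listed above) is to feed image annotations, such as captions and boxes, to a strong LLM and have it write conversations about the unseen image. Below is a minimal sketch of that recipe, assuming the official openai Python client; the prompt wording, model name, and caption/box inputs are illustrative placeholders, not any repository's actual pipeline.

```python
# Sketch: turn COCO-style captions + boxes into visual-instruction data with an LLM.
# Assumptions: `pip install openai`, OPENAI_API_KEY is set, and the model name below
# is a placeholder -- this illustrates the general LLaVA-style recipe only.
from openai import OpenAI

client = OpenAI()

def make_instruction_sample(caption: str, boxes: list[str]) -> str:
    """Ask the LLM to write a conversation about an image it cannot see."""
    context = f"Captions: {caption}\nObjects: {'; '.join(boxes)}"
    prompt = (
        "You are shown image annotations instead of the image itself.\n"
        f"{context}\n"
        "Write a multi-turn Q&A conversation about the image, as if you could see it."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",                                   # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return resp.choices[0].message.content

print(make_instruction_sample(
    "A man rides a bicycle down a rainy street.",
    ["person [0.31, 0.20, 0.56, 0.88]", "bicycle [0.28, 0.45, 0.60, 0.95]"],
))
```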

Tools

  • EasyOCR, Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, and Cyrillic (minimal usage sketch below).
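
A minimal usage sketch based on EasyOCR's documented quickstart; the image path is a placeholder.

```python
# Minimal EasyOCR sketch: detect and read text in an image.
# Assumes `pip install easyocr`; detection/recognition models download on first run.
import easyocr

reader = easyocr.Reader(['ch_sim', 'en'])       # load simplified-Chinese + English models
results = reader.readtext('example.jpg')        # placeholder image path
for bbox, text, confidence in results:          # each hit: corner points, string, score
    print(f"{text} ({confidence:.2f})")
```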

Image Generation

Reading List

This list also includes some insightful works that are not LLM-based.

| Paper | Base Language Model | Code | Publication | Preprint | Affiliation |
| --- | --- | --- | --- | --- | --- |
| JPEG-LM: LLMs as Image Generators with Canonical Codec Representations | JPEG-LM (codec-based LM) | | | 2408.08459 | Meta |
| Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction | GPT-2, VQ-GAN | VAR | | 2404.02905 | ByteDance |
| InstantID: Zero-shot Identity-Preserving Generation in Seconds | UNet | InstantID | | 2401.07519 | Instant |
| VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation | LLaMA, IP-Adapter (Diffusion) | VL-GPT | | | Tencent |
| LLMGA: Multimodal Large Language Model based Generation Assistant | LLaVA, UNet | LLMGA | | 2311.16500 | CUHK |
| AnyText: Multilingual Visual Text Generation And Editing | ControlNet (OCR) | AnyText | | 2311.03054 | Alibaba |
| MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens | UNet | MiniGPT-5 | | 2310.02239 | UCSC |
| NExT-GPT: Any-to-Any Multimodal LLM | Vicuna-7B, Diffusion | NExT-GPT | | 2309.05519 | NUS |
| Generative Pretraining in Multimodality | LLaMA, Diffusion | Emu | | 2307.05222 | BAAI |
| SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs | PaLM2, GPT3.5 | | | 2306.17842 | Google |
| LayoutGPT: Compositional Visual Planning and Generation with Large Language Models | ChatGPT | LayoutGPT | | 2305.15393 | UCSB |
| BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing | BLIP-2, UNet | BLIP-Diffusion | | 2305.14720 | Salesforce |
| CoDi: Any-to-Any Generation via Composable Diffusion | Diffusion | CoDi | | 2305.11846 | Microsoft & UNC |
| Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | ChatGPT, VAE | Accountable Textual Visual Chat | | 2303.05983 | CUHK |
| Denoising Diffusion Probabilistic Models | Diffusion | diffusion | | 2006.11239 | UCB |

Open-source Projects

  • open-prompts, open-source prompts for text-to-image models.
  • LLaMA2-Accessory, An open-source toolkit for pre-training, fine-tuning, and deployment of large language models (LLMs) and multimodal LLMs.
  • Gemini-vs-GPT4V, An in-depth qualitative comparative study of two pioneering models: Google's Gemini and OpenAI's GPT-4V(ision).
  • multimodal-maestro, Effective prompting for Large Multimodal Models like GPT-4 Vision or LLaVA.
    • by roboflow, 2023.11
  • VisCPM, A family of open-source large multimodal models supporting multimodal conversation (VisCPM-Chat) and text-to-image generation (VisCPM-Paint) in both Chinese and English.