Table of Contents
Notes: INST = Instruction, FT = Finetune, PT = Pretraining, ICL = In-Context Learning, ZS = Zero-Shot, FS = Few-Shot, RTr = Retrieval
Paper | Base Language Model | Framework | Data | Code | Publication | Preprint | Affiliation |
---|---|---|---|---|---|---|---|
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models | Phi3 | PT + FT (interleaved resampler) | self-curation | 2408.08872 | Salesforce | ||
LLaVA-OneVision: Easy Visual Task Transfer | Qwen-2 | PT + FT (knowledge+v-inst) | self-curation (one-vision) | LLaVA-OneVision | 2408.03326 | ByteDance | |
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | Vicuna | PT + FT | self-construct (cambrian) | cambrian | 2406.16860 | Meta |
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | Vicuna/Mixtral/Yi | PT+FT | self-construct (mini-gemini) | MGM | 2403.18814 | CUHK |
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | ? | PT + FT | self-construct + mixture | - | 2403.09611 | Apple | |
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models | Llama/Qwen | Prune | - | FastV | 2403.06764 | Alibaba | |
DeepSeek-VL: Towards Real-World Vision-Language Understanding | DeepSeekLLM | PT+FT | mixture | DeepSeek-VL | 2403.05525 | Deepseek | |
Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models | LLaVA | FT | mixture-HR | LLaVA-HR | 2403.03003 | XMU | |
Efficient Multimodal Learning from Data-centric Perspective | Phi, StableLM | PT + FT | LAION-2B | Bunny | 2402.11530 | BAAI | |
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model | SSM | efficient | Vim | 2401.09417 | HUST | |
AIM: Autoregressive Image Models | ViT | Scale | ml-aim | 2401.08541 | Apple | ||
LEGO: Language Enhanced Multi-modal Grounding Model | Vicuna | PT + SFT | mixture + self-construct | LEGO | 2401.06071 | ByteDance |
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training | gated cross-attn + latent array | PT + FT | mixture | cosmo | 2401.00849 | NUS | |
Tracking with Human-Intent Reasoning | LLaMA (LLaVA) | PT+FT | mixture | TrackGPT | 2312.17448 | Alibaba | |
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | Vicuna | PT + FT | mixture | InternVL | CVPR 2024 | 2312.14238 | Shanghai AI Lab |
VCoder: Versatile Vision Encoders for Multimodal Large Language Models | LLaMA (LLaVA-1.5) | FT (depth encoder + segment encoder) | COCO Segmentation Text (COST) | VCoder | CVPR 2024 | 2312.14233 | Gatech |
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs | Vicuna-7B | FT + obj. refine (search) | mixture+self-construct(object) | vstar | 2312.14135 | NYU | |
Osprey: Pixel Understanding with Visual Instruction Tuning | Vicuna | PT+FT | mixture | Osprey | 2312.10032 | ZJU | |
Tokenize Anything via Prompting | SAM | PT | mixture (mainly SA-1B) | tokenize-anything | 2312.09128 | BAAI | |
Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens | LLaVA | INST | video-chatgpt | Vista-LLaMA (web) | 2312.08870 | ByteDance | |
Gemini: A Family of Highly Capable Multimodal Models | Transformer-Decoder | FT (language decoder + image decoder) | ? | ? | - | 2312.blog | |
VILA: On Pre-training for Visual Language Models | Llama | PT + FT | self-construct + llava-1.5 | VILA | 2312.07533 | NVIDIA | |
Honeybee: Locality-enhanced Projector for Multimodal LLM | LLaMA/Vicuna | PT+INST (projector) | mixture | Honeybee | CVPR 2024 | 2312.06742 | KakaoBrain |
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | vocabulary network + LLM | PT | mixture(doc,chart + opendomain) | Vary | 2312.06109 | MEGVII | |
LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models | LLaMA (LLaVA) | FT(FT grounding model + INSTFT) | RefCOCO + Flickr30K + LLaVA | LLaVA-Grounding | 2312.02949 | MSR | |
Making Large Multimodal Models Understand Arbitrary Visual Prompts | LLaMA (LLaVA) | PT+INSTFT | BLIP + LLaVA-1.5 | ViP-LLaVA | 2312.00784 | Wisconsin-Madison | |
Sequential Modeling Enables Scalable Learning for Large Vision Models | LLaMA | PT (Visual Tokenizer) | mixture (430B visual tokens, 50 datasets, mainly from LAION) | LVM | 2312.00785 | UCB |
Compositional Chain-of-Thought Prompting for Large Multimodal Models | LLaVA | CoT (scene graph) | 2311.17076 | UCB | |||
GLaMM: Pixel Grounding Large Multimodal Model | Vicuna-1.5 | FT | self-construct (grounding-anything-dataset) | GLaMM | 2311.03356 | MBZUAI |
Analyzing and Mitigating Object Hallucination in Large Vision-Language Models | GPT-3.5 + LLMs | FT (hallucination) | mixture | LURE | ICLR 2024 | 2310.00754 | UNC |
CogVLM: Visual Expert For Large Language Models | Vicuna | PT + FT | self-construct + mixture | CogVLM | 2309.github | Zhipu AI | |
GPT-4V(ision) System Card | GPT4 | ? | ? | ? | - | 2309.blog | OpenAI |
Demystifying CLIP Data | CLIP | PT | curated & transparent CLIP dataset | MetaCLIP | 2309.16671 | Meta | |
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition | InternLM | PT + FT | mixture | InternLM-XComposer | 2309.15112 | Shanghai AI Lab. | |
DreamLLM: Synergistic Multimodal Comprehension and Creation | LLaMA | PT + FT | mixture | DreamLLM | 2309.11499 | MEGVII | |
LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization | LLaMA | PT + FT (Visual Tokenizer) | mixture | LaVIT | 2309.04669 | Kuaishou | |
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Qwen | PT + FT | Qwen-VL | 2308.12966 | Alibaba | |
BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions | Vicuna-7B/Flan-t5-xxl | FT | same as InstructBLIP | BLIVA | 2308.09936 | UCSD | |
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World | Husky-7b | PT | AS-1B | all-seeing | 2308.01907 | Shanghai AI Lab | |
LISA: Reasoning Segmentation via Large Language Model | LLaMA | PT | mixture | LISA | 2308.00692 | SmartMore | |
Generative Pretraining in Multimodality, v2 | LLaMA,Diffusion | PT, Visual Decoder | mixture | Emu | 2307.05222 | BAAI | |
What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | Vicuna-7b | PT + FT | mixture (empirical) | lynx | 2307.02469 | ByteDance |
Visual Instruction Tuning with Polite Flamingo | Flamingo | FT + (rewrite instruction) | PF-1M, LLaVA-instruction-177k | Polite Flamingo | 2307.01003 | Xiaobing |
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | Vicuna-13B | FT + MM-INST | self-construct (text-rich image) | LLaVAR | 2306.17107 | Gatech | |
Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic | Vicuna-7B/13B | FT + MM-INST | self-construct (referential dialogue) | Shikra | 2306.15195 | SenseTime |
KOSMOS-2: Grounding Multimodal Large Language Models to the World | Magneto | PT + obj | Grit (90M images) | Kosmos-2 | 2306.14824 | Microsoft | |
Aligning Large Multi-Modal Model with Robust Instruction Tuning | Vicuna (MiniGPT4-like) | FT + MM-INST | LRV-Instruction (150K INST, robust), GAVIE (evaluate) | LRV-Instruction | 2306.14565 | UMD | |
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Vicuna-7B/13B | FT + MM-INST | LAMM-Dataset (186K INST), LAMM-Benchmark | LAMM | 2306.06687 | Shanghai AI Lab | |
Improving CLIP Training with Language Rewrites | CLIP + ChatGPT | FT + Data-aug | mixture | LaCLIP | 2305.20088 | ||
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | Vicuna-13B | FT + MM-INST | ChatBridge | 2305.16103 | CAS | ||
Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models | LLaMA-7B/13B | FT adapter + MM-INST | self-construct (INST) | LaVIN | 2305.15023 | Xiamen Univ. |
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models | ChatGPT | iterative, compositional (que, ans, rea) | ZS | IdealGPT | 2305.14985 | Columbia |
DetGPT: Detect What You Need via Reasoning | Robin, Vicuna | FT + MM-INST + detector | self-construct | DetGPT | 2305.14167 | HKUST | |
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks | Alpaca | VisionLLM | 2305.11175 | Shanghai AI Lab. | |||
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | Vicuna | InstructBLIP | 2305.06500 | Salesforce | |||
MultiModal-GPT: A Vision and Language Model for Dialogue with Humans | Flamingo | FT + MM-INST, LoRA | mixture | Multimodal-GPT | 2305.04790 | NUS | |
Otter: A Multi-Modal Model with In-Context Instruction Tuning | Flamingo | Otter | 2305.03726 | NTU | |||
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | ChatGPT | X-LLM | 2305.04160 | CAS | |||
LMEye: An Interactive Perception Network for Large Language Models | OPT,Bloomz,BLIP2 | PT, FT + MM-INST | self-construct | LingCloud | 2305.03701 | HIT | |
Caption anything: Interactive image description with diverse multimodal controls | BLIP2, ChatGPT | ZS | Caption Anything | 2305.02677 | SUSTech | ||
Multimodal Procedural Planning via Dual Text-Image Prompting | OFA, BLIP, GPT3 | TIP | 2305.01795 | UCSB | |||
Transfer Visual Prompt Generator across LLMs | FlanT5, OPT | projector + transfer strategy | VPGTrans | 2305.01278 | CUHK | |
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | LLaMA | LLaMA-Adapter | 2304.15010 | Shanghai AI Lab. | |||
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality,mPLUG, mPLUG-2 | LLaMA | mPLUG-Owl | 2304.14178 | DAMO Academy | |||
MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models | Vicuna | MiniGPT4 | 2304.10592 | KAUST | |
Visual Instruction Tuning | LLaMA | full-param. + INST tuning | LLaVA-Instruct-150K (150K INST by GPT4) | LLaVA | 2304.08485 | Microsoft | |
Chain of Thought Prompt Tuning in Vision Language Models | - | Visual CoT | - | 2304.07919 | PKU | ||
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | ChatGPT | MM-REACT | 2303.11381 | Microsoft | |||
ViperGPT: Visual Inference via Python Execution for Reasoning | Codex | ViperGPT | ICCV 2023 | 2303.08128 | Columbia | ||
Scaling Vision-Language Models with Sparse Mixture of Experts | (MOE + Scaling) | 2303.07226 | Microsoft | ||||
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions | ChatGPT, Flan-T5 (BLIP2) | ChatCaptioner | 2303.06594 | KAUST | |||
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models | ChatGPT | Visual ChatGPT | 2303.04671 | Microsoft | |||
PaLM-E: An Embodied Multimodal Language Model | PaLM | 2303.03378 | |||||
Prismer: A Vision-Language Model with An Ensemble of Experts | RoBERTa, OPT, BLOOM | Prismer | 2303.02506 | Nvidia | |||
Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners | GPT3, CLIP, DINO, DALLE | FS, evaluate: img-cls | CaFo | CVPR 2023 | 2303.02151 | CAS | |
Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering | GPT3 | RTr+candidates, ICL | evaluate: OKVQA, A-OKVQA | Prophet | CVPR 2023 | 2303.01903 | HDU |
Language Is Not All You Need: Aligning Perception with Language Models | Magneto | KOSMOS-1 | 2302.14045 | Microsoft | |||
Scaling Vision Transformers to 22 Billion Parameters | (CLIP + Scaling) | 2302.05442 | |||||
Multimodal Chain-of-Thought Reasoning in Language Models | T5 | FT + MM-CoT | MM-COT | 2302.00923 | Amazon | ||
Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Caption | RETRO | 2302.04858 | Nvidia | ||||
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Flan-T5 / qformer | BLIP2 | ICML 2023 | 2301.12597 | Salesforce | ||
See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning | OPT | 2301.05226 | MIT-IBM | ||||
Generalized Decoding for Pixel, Image, and Language | GPT3 | X-GPT | 2212.11270 | Microsoft | |||
From Images to Textual Prompts: Zero-shot Visual Question Answering with Frozen Large Language Models | OPT | Img2LLM | CVPR 2023 | 2212.10846 | Salesforce | ||
Visual Programming: Compositional visual reasoning without training | GPT3 | Compositional/Tool-Learning | VisProg | CVPR 2023 (best paper) | 2211.11559 | AI2 |
Language Models are General-Purpose Interfaces | DeepNorm | Semi-Causal | METALM | 2206.06336 | Microsoft | ||
Language Models Can See: Plugging Visual Controls in Text Generation | GPT2 | MAGIC | 2205.02655 | Tencent | |||
Flamingo: a Visual Language Model for Few-Shot Learning | Chinchilla / adapter | Flamingo | NeurIPS 2022 | 2204.14198 | DeepMind | |
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | GPT3, RoBERTa | Socratic Models | ICLR 2023 | 2204.00598 | |||
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | GPT3 | LLM as KB, ICL | evaluate: OKVQA | PICa | AAAI 2022 | 2109.05014 | Microsoft |
Multimodal few-shot learning with frozen language models | Transformer-LM-7B (PT on C4) | ICL | ConceptualCaptions | Frozen (unofficial) | NeurIPS 2021 | 2106.13884 | DeepMind |
Perceiver: General Perception with Iterative Attention | Perceiver | latent array | ICML 2021 | 2103.03206 | DeepMind | ||
Learning Transferable Visual Models From Natural Language Supervision | Bert / contrastive learning | CLIP | ICML 2021 | 2103.00020 | OpenAI |
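
For the contrastive-pretraining entry in the last row (CLIP), the snippet below is a minimal zero-shot classification sketch using OpenAI's `clip` package; the image path and label prompts are placeholders chosen for illustration, not taken from any paper above.

```python
# Minimal CLIP zero-shot classification sketch (illustrative; "cat.png" and the
# two label prompts are placeholders).
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("cat.png")).unsqueeze(0).to(device)
texts = clip.tokenize(["a photo of a cat", "a photo of a dog"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    # Cosine similarity between the image and each prompt, softmaxed into probabilities.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # one probability per text prompt
```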
Datasets
Dataset | Source | Format | Paper | Preprint | Publication | Affiliation |
---|---|---|---|---|---|---|
cambrian | mixture INST | instruction (10M) | Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | 2406.16860 | Meta |
MINT-1T | mixture html/pdf/arxiv | corpora (3.4B imgs) | MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens | 2406.11271 | Salesforce | |
OmniCorpus | mixture html | corpora (10B imgs) | OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text | 2406.08418 | Shanghai AI Lab | |
MGM | mixture INST | instruction (1.2M) | Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | 2403.18814 | CUHK |
ALLaVA-4V | LAION/Vision-FLAN (GPT4V) | instruction (505k/203k) | ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | 2401.06209 | CUHKSZ |
M3IT | mixture INST | self-construct INSTs (2.4M) | M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | 2306.04387 | HKU | |
OBELICS | I-T pairs | corpora (353M imgs) | OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents | 2306.16527 | Huggingface | |
LLaVA | mixture INST | instruction (675k) | LLaVA: Large Language and Vision Assistant | 2304.08485 | Microsoft |
LAION | I-T pairs | corpora (2.32b) | LAION-5B: An open large-scale dataset for training next generation image-text models | 2210.08402 | UCB |
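
As a reference point for the instruction-style datasets above (e.g., LLaVA), the record below sketches the LLaVA-style conversation JSON layout; the field names follow the LLaVA-Instruct release, while the id, image path, and dialogue content are invented for this example.

```python
# Illustrative LLaVA-style visual-instruction record (the id, image path, and
# dialogue are made up; only the field layout reflects the LLaVA-Instruct format).
import json

record = {
    "id": "000000123456",
    "image": "coco/train2017/000000123456.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is the person in the picture doing?"},
        {"from": "gpt", "value": "The person is riding a bicycle along a beachside path."},
    ],
}

print(json.dumps(record, indent=2))
```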
Benchmarks
Benchmark | Task | Data | Paper | Preprint | Publication | Affiliation |
---|---|---|---|---|---|---|
MMVP | QA (pattern error) | human-annotated (300) | Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs | 2401.06209 | NYU | |
MMMU | QA (general domain) | human collected (11.5K) | MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI | 2311.16502 | OSU, UWaterloo | |
MLLM-Bench | General INST | human collected | MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V | 2311.13951 | CUHK | |
HallusionBench | QA (hallucination) | human annotated (1129) | HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination & Visual Illusion in Large Vision-Language Models | 2310.14566 | UMD | |
MathVista | QA (math: IQTest, FunctionQA, PaperQA) | self-construct + mixture QA pairs (6K) | MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts | 2310.02255 | Microsoft |
VisIT-Bench | QA (general domain) | self-construct (592) | VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use | 2308.06595 | LAION | |
SEED-Bench | QA (general domain) | self-construct (19K) | SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | 2307.16125 | Tencent | |
MMBench | QA (general domain) | mixture (2.9K) | MMBench: Is Your Multi-modal Model an All-around Player? | 2307.06281 | Shanghai AI Lab. | |
MME | QA (general domain) | self-construct (2.1K) | MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | 2306.13394 | XMU | |
POPE | General (object hallucination) | | Evaluating Object Hallucination in Large Vision-Language Models | 2305.10355 | RUC |
DataComp | Curate I-T Pairs | 12.8B I-T pairs | DataComp: In search of the next generation of multimodal datasets | 2304.14108 | DataComp.AI |
MM-Vet | General | mm-vet.zip | MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities | |||
INFOSEEK | VQA | OVEN (open domain image) + Human Anno. | Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? | 2302.11713 | ||
MultiInstruct | General INST (Grounded Caption, Text Localization, Referring Expression Selection, Question-Image Matching) | self-construct INSTs (62 * (5+5)) | MULTIINSTRUCT: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | 2212.10773 | ACL 2023 | Virginia Tech |
ScienceQA | QA (elementary and high school science curricula) | self-construct QA-pairs (21K) | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | 2209.09513 | NeurIPS 2022 | AI2 |
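
Several hallucination benchmarks above (e.g., POPE) probe a model with yes/no questions such as "Is there a <object> in the image?" and score the answers against ground-truth object annotations, typically reporting accuracy, precision, recall, F1, and the yes-ratio. The sketch below shows that scoring step under those assumptions; the example predictions and labels are placeholders.

```python
# POPE-style yes/no scoring sketch (assumed metric set: accuracy, precision,
# recall, F1, yes-ratio; the sample predictions/labels below are placeholders).
def pope_scores(predictions, labels):
    """predictions/labels are equal-length lists of 'yes'/'no' strings."""
    tp = sum(p == "yes" and l == "yes" for p, l in zip(predictions, labels))
    fp = sum(p == "yes" and l == "no" for p, l in zip(predictions, labels))
    fn = sum(p == "no" and l == "yes" for p, l in zip(predictions, labels))
    tn = sum(p == "no" and l == "no" for p, l in zip(predictions, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(labels),
    }

print(pope_scores(["yes", "no", "yes", "no"], ["yes", "no", "no", "no"]))
```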
Evaluation Toolkits
- VLMEvalKit, an open-source evaluation toolkit for large vision-language models (LVLMs); the Python package name is vlmeval.
Data Collection Tools
- VisionDatasets, scripts and logic to create high-quality pretraining and finetuning datasets for multimodal models.
- Visual-Instruction-Tuning, scale up visual instruction tuning to millions of examples with GPT-4.
- EasyOCR, ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc. (a minimal usage sketch follows below).
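
A minimal EasyOCR usage sketch; the image path is a placeholder, and `gpu=True` can be used when a CUDA device is available.

```python
# Minimal EasyOCR usage sketch; "sample.jpg" is a placeholder image path.
import easyocr

# Build a reader for English and Simplified Chinese.
reader = easyocr.Reader(["en", "ch_sim"], gpu=False)

# readtext returns a list of (bounding_box, text, confidence) tuples.
for bbox, text, conf in reader.readtext("sample.jpg"):
    print(f"{conf:.2f}  {text}")
```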
Some insightful works beyond multimodal LLMs
Paper | Base Language Model | Code | Publication | Preprint | Affiliation |
---|---|---|---|---|---|
JPEG-LM: LLMs as Image Generators with Canonical Codec Representations | JPEG-LM (codec-based LM) | 2408.08459 | Meta | ||
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction | GPT-2,VQ-GAN | VAR | 2404.02905 | Bytedance | |
InstantID: Zero-shot Identity-Preserving Generation in Seconds | Unet | InstantID | 2401.07519 | Instant | |
VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation | LLaMA, IP-Adapter (Diffusion) | VL-GPT | Tencent | ||
LLMGA: Multimodal Large Language Model based Generation Assistant | LLaVA,Unet | LLMGA | 2311.16500 | CUHK | |
AnyText: Multilingual Visual Text Generation And Editing | ControlNet (OCR) | AnyText | 2311.03054 | Alibaba | |
MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens | Unet | MiniGPT-5 | 2310.02239 | UCSC | |
NExT-GPT: Any-to-Any Multimodal LLM | Vicuna-7B,Diffusion | NExT-GPT | 2309.05519 | NUS | |
Generative Pretraining in Multimodality | LLaMA,Diffusion | Emu | 2307.05222 | BAAI | |
SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs | PaLM2, GPT3.5 | 2306.17842 | |||
LayoutGPT: Compositional Visual Planning and Generation with Large Language Models | ChatGPT | LayoutGPT | 2305.15393 | UCSB | |
BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing | Blip2,u-net | BLIP-Diffusion | 2305.14720 | Salesforce | |
CoDi: Any-to-Any Generation via Composable Diffusion | Diffusion | CoDi | 2305.11846 | Microsoft & UNC | |
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | ChatGPT,VAE | Accountable Textual Visual Chat | 2303.05983 | CUHK | |
Denoising Diffusion Probabilistic Models | Diffusion | diffusion | 2006.11239 | UCB |
- open-prompts, open-source prompts for text-to-image models.
- LLaMA2-Accessory, an open-source toolkit for pretraining, finetuning, and deploying Large Language Models (LLMs) and multimodal LLMs.
- Gemini-vs-GPT4V, an in-depth qualitative comparative study of two pioneering models: Google's Gemini and OpenAI's GPT-4V(ision).
- multimodal-maestro, Effective prompting for Large Multimodal Models like GPT-4 Vision or LLaVA.
- by roboflow, 2023.11
- VisCPM, a family of open-source large multimodal models supporting multimodal conversation (VisCPM-Chat) and text-to-image generation (VisCPM-Paint) in both Chinese and English.
- by THU, 2023.07
- model: VisCPM-Chat, VisCPM-Paint