- A significant amount of credit and thanks goes to Haotian Liu and the LLaVA project team for establishing key patterns for both open-source Vision Language Models (VLMs) and multi-modal language models (MMLMs).
- This is only a sample, focusing on models with fewer than 10 billion parameters.
Model | # of Parameters | Architecture | Vision Encoder | Pooling | Modality Projection | Language Model Backbone | Fine-tuning Method | Flash Attention | Hugging Face Model Card |
---|---|---|---|---|---|---|---|---|---|
Idefics2-8b-base | 8B | Fully autoregressive | SigLIP-SO400M | Learned pooling | Modality projection layer | Mistral-7B | LoRA (Low-Rank Adaptation) | No | Idefics2-8b-base |
Idefics2-8b | 8B | Fully autoregressive | SigLIP-SO400M | Learned pooling | Modality projection layer | Mistral-7B | LoRA (Low-Rank Adaptation) | No | Idefics2-8b |
Idefics2-8b-chatty | 8B | Fully autoregressive | SigLIP-SO400M | Learned pooling | Modality projection layer | Mistral-7B | LoRA (Low-Rank Adaptation) | No | Idefics2-8b-chatty |
LLaVA-v1.6-mistral-7B | 7B | Autoregressive | openai/clip-vit-large-patch14-336 | Not specified | MLP projection (Linear + GELU + Linear) | Mistral-7B | Visual instruction tuning | No | LLaVA-v1.6-mistral-7b |
LLaVA-v1.6-vicuna-7B | 7B | Autoregressive | openai/clip-vit-large-patch14-336 | Not specified | MLP projection (Linear + GELU + Linear) | Vicuna-7B | Visual instruction tuning | No | LLaVA-v1.6-vicuna-7b |
Mantis-8B-clip-llama3 | 8B | Autoregressive (interleaved image-text) | openai/clip-vit-large-patch14-336 | Not specified | MLP projection (Linear + GELU + Linear) | Meta-Llama-3-8B-Instruct | Instruction tuning | Optional | Mantis-8B-clip-llama3 |
Mantis-8B-siglip-llama3 | 8B | Autoregressive (interleaved image-text) | SigLIP-SO400M | Not specified | MLP projection (Linear + GELU + Linear) | Meta-Llama-3-8B-Instruct | Instruction tuning | Optional | Mantis-8B-siglip-llama3 |
Phi-3-vision-128k-instruct | 4.2B | Hybrid autoregressive | CLIP ViT-L/14-336 | Not specified | MLP projection (Linear + GELU + Linear) | Phi-3-mini-128K | Hybrid instruction tuning | Yes | Phi-3-vision-128k-instruct |
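
Each of the checkpoints above loads through the standard Hugging Face `transformers` interfaces. The snippet below is a minimal, illustrative sketch (not a definitive recipe) that runs LLaVA-v1.6-mistral-7B on a single image; it assumes a CUDA-capable machine, a recent `transformers` release with LLaVA-NeXT support, and the Mistral-style `[INST] <image> ... [/INST]` prompt template documented for this checkpoint. The image path and question are placeholders.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    # attn_implementation="flash_attention_2",  # optional; requires the flash-attn package
)

image = Image.open("example.jpg")  # placeholder image path
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```

The Flash Attention column corresponds to the `attn_implementation` loading argument: for the checkpoints marked Optional or Yes it can be set to `"flash_attention_2"` when `flash-attn` is installed. The next table drills into the concrete PyTorch module structure of several of these checkpoints, as printed from the loaded models.
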
Model Name | Model Class | Vision Model | Vision Layers | Vision Embeddings | Vision Self Attention | Vision MLP | Vision LayerNorm | Multi-Modal Projector | Language Model | Language Layers | Language Self Attention | Language MLP | Language LayerNorm | LM Head |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TIGER-Lab/Mantis-8B-clip-llama3 | LlavaNextForConditionalGeneration | CLIPVisionModel | 24 layers | Conv2d(3, 1024) + Embedding(577) | CLIPAttention | CLIPMLP (1024 -> 4096 -> 1024) | LayerNorm((1024,), eps=1e-05) | Linear(1024 -> 4096) + GELU + Linear | LlamaForCausalLM | 32 layers | LlamaSdpaAttention | LlamaMLP (4096 -> 14336 -> 4096) | LlamaRMSNorm | Linear(4096 -> 128258) |
TIGER-Lab/Mantis-8B-siglip-llama3 | LlavaNextForConditionalGeneration | SiglipVisionModel | 27 layers | Conv2d(3, 1152) + Embedding(729) | SiglipAttention | SiglipMLP (1152 -> 4304 -> 1152) | LayerNorm((1152,), eps=1e-06) | Linear(1152 -> 4096) + GELU + Linear | LlamaForCausalLM | 32 layers | LlamaSdpaAttention | LlamaMLP (4096 -> 14336 -> 4096) | LlamaRMSNorm | Linear(4096 -> 128258) |
llava-hf/llava-1.5-7b-hf | LlavaForConditionalGeneration | CLIPVisionModel | 24 layers | Conv2d(3, 1024) + Embedding(577) | CLIPAttention | CLIPMLP (1024 -> 4096 -> 1024) | LayerNorm((1024,), eps=1e-05) | Linear(1024 -> 4096) + GELU + Linear | LlamaForCausalLM | 32 layers | LlamaSdpaAttention | LlamaMLP (4096 -> 11008 -> 4096) | LlamaRMSNorm | Linear(4096 -> 32064) |
microsoft/Phi-3-vision-128k-instruct | Phi3VForCausalLM | CLIPVisionModel | 24 layers | Conv2d(3, 1024) + Embedding(577) | CLIPAttention | CLIPMLP (1024 -> 4096 -> 1024) | LayerNorm((1024,), eps=1e-05) | Linear(4096 -> 3072) + GELU + Linear | Phi3DecoderLayer | 32 layers | Phi3FlashAttention2 | Phi3MLP (3072 -> 16384 -> 3072) | Phi3RMSNorm | Linear(3072 -> 32064) |
llava-hf/llava-v1.6-mistral-7b-hf | LlavaNextForConditionalGeneration | CLIPVisionModel | 24 layers | Conv2d(3, 1024) + Embedding(577) | CLIPAttention | CLIPMLP (1024 -> 4096 -> 1024) | LayerNorm((1024,), eps=1e-05) | Linear(1024 -> 4096) + GELU + Linear | MistralForCausalLM | 32 layers | MistralSdpaAttention | MistralMLP (4096 -> 14336 -> 4096) | MistralRMSNorm | Linear(4096 -> 32064) |
HuggingFaceM4/idefics2-8b-base | Idefics2ForConditionalGeneration | Idefics2VisionTransformer | 27 layers | Conv2d(3, 1152) + Embedding(4900) | Idefics2VisionAttention | Idefics2VisionMLP (1152 -> 4304 -> 1152) | LayerNorm((1152,), eps=1e-06) | Linear(1152 -> 4096) + GELU + Linear | MistralModel | 32 layers | MistralAttention | MistralMLP (4096 -> 14336 -> 4096) | MistralRMSNorm | Linear(4096 -> 32002) |
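
Every column in the table above can be read off the printed module tree and the config of a loaded checkpoint; the sketch below shows the idea for one row. It is a minimal, illustrative example assuming a recent `transformers` release and enough memory to materialize the weights in half precision; the Phi-3-vision checkpoint is the exception and loads through `AutoModelForCausalLM` with `trust_remote_code=True`.

```python
import torch
from transformers import AutoModelForVision2Seq

model_id = "TIGER-Lab/Mantis-8B-clip-llama3"  # any non-Phi-3 row from the table

model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # inspection only, so half precision is enough
    low_cpu_mem_usage=True,
)

# Model Class column
print(type(model).__name__)      # e.g. LlavaNextForConditionalGeneration

# Printing the model dumps the full module tree: vision tower, multi-modal
# projector, language backbone, and LM head, with the layer shapes
# reported in the table.
print(model)

# Layer counts can be read from the config instead of counting modules.
print(model.config.vision_config.num_hidden_layers)  # e.g. 24 for CLIP ViT-L/14-336
print(model.config.text_config.num_hidden_layers)    # e.g. 32 for Meta-Llama-3-8B
```

Dumping the module tree is what makes the projector structure (Linear + GELU + Linear) and the LM head vocabulary sizes visible at a glance, which is how the per-model differences above stand out.
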