
vlm_architecture

  • A significant amount of credit and thanks goes to Haotian Liu and the LLaVA project team for establishing key patterns for both open-source Vision Language Models (VLMs) and multi-modal language models (MMLMs).

VLM Comparison

  • This is only a sample, focusing on models with fewer than 10 billion parameters.
| Model | # of Parameters | Architecture | Vision Encoder | Pooling | Modality Projection | Language Model Backbone | Fine-tuning Method | Flash Attention | Hugging Face Model Card URL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Idefics2-8b-base | 8B | Fully autoregressive | SigLIP-SO400M | Learned pooling | Modality projection layer | Mistral-7B | LoRA (Low-Rank Adaptation) | No | Idefics2-8b-base |
| Idefics2-8b | 8B | Fully autoregressive | SigLIP-SO400M | Learned pooling | Modality projection layer | Mistral-7B | LoRA (Low-Rank Adaptation) | No | Idefics2-8b |
| Idefics2-8b-chatty | 8B | Fully autoregressive | SigLIP-SO400M | Learned pooling | Modality projection layer | Mistral-7B | LoRA (Low-Rank Adaptation) | No | Idefics2-8b-chatty |
| LLaVA-v1.6-mistral-7B | 7B | Auto-regressive | openai/clip-vit-large-patch14-336 | Not specified | Text-image interleaving | Mistral-7B | Multimodal instruction data | No | LLaVA-v1.6-mistral-7b |
| LLaVA-v1.6-vicuna-7B | 7B | Auto-regressive | openai/clip-vit-large-patch14-336 | Not specified | Text-image interleaving | Vicuna-7B | Multimodal instruction data | No | LLaVA-v1.6-vicuna-7b |
| Mantis-8B-clip-llama3 | 8B | Sequence-based | openai/clip-vit-large-patch14-336 | Not specified | Text-image interleaving | Meta-Llama-3-8B-Instruct | Instruction Tuning | Optional | Mantis-8B-clip-llama3 |
| Mantis-8B-siglip-llama3 | 8B | Sequence-based | siglip_vision_model | Not specified | Text-image interleaving | Meta-Llama-3-8B-Instruct | Instruction Tuning | Optional | Mantis-8B-siglip-llama3 |
| Phi-3-vision-128k-instruct | 4.2B | Hybrid auto-regressive | custom-vision-transformer-128k | Not specified | Flash Attention v2 | Custom LLM | Hybrid instruction tuning | Yes | Phi-3-vision-128k-instruct |
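To make the comparison concrete, here is a minimal sketch of loading one of the listed checkpoints (llava-hf/llava-v1.6-mistral-7b-hf) with Hugging Face transformers and running a single image + text prompt. The class names and `[INST] ... [/INST]` prompt format follow that model's public model card; the other rows use different classes (e.g. Idefics2ForConditionalGeneration, Phi3VForCausalLM) and chat templates, and the local image path is only an illustrative assumption.

```python
# Minimal inference sketch for one of the models in the comparison table.
# Assumes transformers >= 4.39 and a GPU with enough memory for a 7B model in fp16.
import torch
from PIL import Image
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"

processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")  # hypothetical local image
prompt = "[INST] <image>\nDescribe this image. [/INST]"  # Mistral-style chat format

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```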

Model Architectures

| Model Name | Model Class | Vision Model | Vision Layers | Vision Embeddings | Vision Self Attention | Vision MLP | Vision LayerNorm | Multi-Modal Projector | Language Model | Language Layers | Language Self Attention | Language MLP | Language LayerNorm | LM Head |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TIGER-Lab/Mantis-8B-clip-llama3 | LlavaNextForConditionalGeneration | CLIPVisionModel | 24 layers | Conv2d(3, 1024) + Embedding(577) | CLIPAttention | CLIPMLP (1024 -> 4096 -> 1024) | LayerNorm((1024,), eps=1e-05) | Linear(1024 -> 4096) + GELU + Linear | LlamaForCausalLM | 32 layers | LlamaSdpaAttention | LlamaMLP (4096 -> 14336 -> 4096) | LlamaRMSNorm | Linear(4096 -> 128258) |
| TIGER-Lab/Mantis-8B-siglip-llama3 | LlavaNextForConditionalGeneration | SiglipVisionModel | 27 layers | Conv2d(3, 1152) + Embedding(729) | SiglipAttention | SiglipMLP (1152 -> 4304 -> 1152) | LayerNorm((1152,), eps=1e-06) | Linear(1152 -> 4096) + GELU + Linear | LlamaForCausalLM | 32 layers | LlamaSdpaAttention | LlamaMLP (4096 -> 14336 -> 4096) | LlamaRMSNorm | Linear(4096 -> 128258) |
| llava-hf/llava-1.5-7b-hf | LlavaForConditionalGeneration | CLIPVisionModel | 24 layers | Conv2d(3, 1024) + Embedding(577) | CLIPAttention | CLIPMLP (1024 -> 4096 -> 1024) | LayerNorm((1024,), eps=1e-05) | Linear(1024 -> 4096) + GELU + Linear | LlamaForCausalLM | 32 layers | LlamaSdpaAttention | LlamaMLP (4096 -> 11008 -> 4096) | LlamaRMSNorm | Linear(4096 -> 32064) |
| microsoft/Phi-3-vision-128k-instruct | Phi3VForCausalLM | CLIPVisionModel | 24 layers | Conv2d(3, 1024) + Embedding(577) | CLIPAttention | CLIPMLP (1024 -> 4096 -> 1024) | LayerNorm((1024,), eps=1e-05) | Linear(4096 -> 3072) + GELU + Linear | Phi3DecoderLayer | 32 layers | Phi3FlashAttention2 | Phi3MLP (3072 -> 16384 -> 3072) | Phi3RMSNorm | Linear(3072 -> 32064) |
| llava-hf/llava-v1.6-mistral-7b-hf | LlavaNextForConditionalGeneration | CLIPVisionModel | 24 layers | Conv2d(3, 1024) + Embedding(577) | CLIPAttention | CLIPMLP (1024 -> 4096 -> 1024) | LayerNorm((1024,), eps=1e-05) | Linear(1024 -> 4096) + GELU + Linear | MistralForCausalLM | 32 layers | MistralSdpaAttention | MistralMLP (4096 -> 14336 -> 4096) | MistralRMSNorm | Linear(4096 -> 32064) |
| HuggingFaceM4/idefics2-8b-base | Idefics2ForConditionalGeneration | Idefics2VisionTransformer | 27 layers | Conv2d(3, 1152) + Embedding(4900) | Idefics2VisionAttention | Idefics2VisionMLP (1152 -> 4304 -> 1152) | LayerNorm((1152,), eps=1e-06) | Linear(1152 -> 4096) + GELU + Linear | MistralModel | 32 layers | MistralAttention | MistralMLP (4096 -> 14336 -> 4096) | MistralRMSNorm | Linear(4096 -> 32002) |
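A per-module breakdown like the table above can be reproduced by loading a checkpoint and printing its module tree. The sketch below is one possible workflow, not necessarily how the table was generated; it uses the llava-v1.6-mistral-7b checkpoint, and the exact submodule names (vision tower, projector, language model) vary slightly across transformers versions and architectures.

```python
# Sketch: inspect the vision encoder, multi-modal projector, and language backbone
# of a LLaVA-NeXT checkpoint. Assumes transformers >= 4.39.
import torch
from transformers import LlavaNextForConditionalGeneration

model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.float16
)

# Top-level submodules and their classes (names differ across transformers versions)
for name, module in model.named_children():
    print(f"{name} -> {module.__class__.__name__}")

# Layer counts come straight from the composite config
print("vision layers:  ", model.config.vision_config.num_hidden_layers)
print("language layers:", model.config.text_config.num_hidden_layers)

# Full module printout: embeddings, attention, MLP, LayerNorm, and LM head,
# i.e. the raw source for the cells in the architecture table
print(model)
```

The same pattern applies to the other rows: swap in the matching class (e.g. Idefics2ForConditionalGeneration for the Idefics2 checkpoints) and read the corresponding config fields.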
