Papers:
#Interaction+Generation:
Paper | Base Model | Framework | Data | Code | Publication | Preprint | Affiliation |
---|---|---|---|---|---|---|---|
Genie 2: A large-scale foundation world model | | | | genie2 | | | DeepMind |
#Multimodal #End2end Understanding+Generation:
Paper | Base Model | Framework | Data | Code | Publication | Preprint | Affiliation |
---|---|---|---|---|---|---|---|
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions | LLaMA3 | PT + FT | mixture (image, speech) | emova | | 2409.18042 | Huawei |
One Single Transformer to Unify Multimodal Understanding and Generation | Phi + MagViT2 | PT + FT (LM-loss + MAE-loss) | mixture (image) | Show-o | | 2408.12528 | NUS |
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model | - (transfusion) | PT (LM-loss + DDPM-loss) | self-collect (image) | | | 2408.11039 | Meta |
Anole: An Open, Autoregressive and Native Multimodal Models for Interleaved Image-Text Generation | chameleon | INST (interleaved) | mixture (image) | anole | | 2407.06135 | SJTU |
Explore the Limits of Omni-modal Pretraining at Scale | vicuna | PT + INST | mixture (image, video, audio, depth -> text) | MiCo | | 2406.09412 | Shanghai AI Lab |
X-VILA: Cross-Modality Alignment for Large Language Model | vicuna + SD | INST + Diffusion Decoder | mixture (image, video, audio) | | | 2405.19335 | NVIDIA |
Chameleon: Mixed-Modal Early-Fusion Foundation Models | - (chameleon) | PT + FT (AR + image detokenizer) | mixture (image) | chameleon | | 2405.09818 | Meta |
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation | LLaMA + SD | PT + INST (interleaved) | mixture (image) | SEED-X | | 2404.14396 | Tencent |
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | LLaMA2 + SD | INST + NAR-decoder | mixture (image, speech, music) | AnyGPT | | 2402.12226 | FDU |
World Model on Million-Length Video And Language With Blockwise RingAttention | LLaMA + VQGAN (LWM) | PT (long-context) | mixture (image, video) | LWM | | 2402.08268 | UCB |
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer | Vicuna + SD | PT + INST | mixture (image) | MM-Interleaved | | 2401.10208 | Shanghai AI Lab |
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action | T5X + VQGAN | PT + INST | mixture (image, audio, video, 3d) | unified-io-2 | | 2312.17172 | AI2 |
VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation | LLaMA + SD | PT + INST (interleaved) | mixture (image) | VL-GPT | | 2312.09251 | Tencent |
OneLLM: One Framework to Align All Modalities with Language | LLaMA2 | PT + INST (universal encoder + moe projector) | mixture (image, audio, point, depth, IMU, fMRI -> text) | OneLLM | CVPR2024 | 2312.03700 | Shanghai AI lab |
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | - | INST | mixture (video, infrared, depth, audio -> text) | LanguageBind | ICLR2024 | 2310.01852 | PKU |
DreamLLM: Synergistic Multimodal Comprehension and Creation | Vicuna + SD | PT + INST with projector (interleaved) | mixture (image) | DreamLLM | ICLR2024 | 2309.11499 | MEGVII |
NExT-GPT: Any-to-Any Multimodal LLM | Vicuna + SD | INST with projector | mixture (text -> audio/image/video) | NExT-GPT | ICML2024 | 2309.05519 | NUS |
LaVIT: Empower the Large Language Model to Understand and Generate Visual Content, video version | LLaMA + SD | PT + INST (vector quantization: CE + regression) | mixture (image) | LaVIT | ICLR2024 | 2309.04669 | Kuaishou |
Emu: Generative Pretraining in Multimodality, v2 | LLaMA + SD | PT (AR: CE + regression) | mixture (image) | Emu | ICLR2024 | 2307.05222 | BAAI |
Any-to-Any Generation via Composable Diffusion | SD-1.5 | individual diffusion -> latent attention | mixture (text -> audio/image/video; image -> audio/video) | CoDi | NeurIPS2023 | 2305.11846 | Microsoft |
ImageBind: One Embedding Space To Bind Them All | CLIP | Contrastive + Diffusion Decoder | mixture (image, video, audio, depth) | ImageBind | | 2305.05665 | Meta |
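
A minimal sketch of the "one transformer, two objectives" recipe noted in the Framework column for Transfusion-style entries (LM-loss + DDPM-loss): next-token cross-entropy on discrete text tokens plus a denoising MSE loss on noised image latents, computed by a single shared backbone. This is not any paper's official code; all module names, sizes, and the simple interpolation noise schedule are illustrative assumptions, and positional embeddings, attention masks, and the VAE that would produce real image latents are omitted.

```python
# Illustrative sketch only: joint LM + denoising loss over one shared transformer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedBackbone(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_layers=4, latent_dim=16):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d_model)
        self.img_in = nn.Linear(latent_dim, d_model)        # project image latents in
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)        # text: next-token logits
        self.img_out = nn.Linear(d_model, latent_dim)        # image: predicted noise

    def forward(self, text_ids, noised_latents):
        x = torch.cat([self.text_emb(text_ids), self.img_in(noised_latents)], dim=1)
        h = self.trunk(x)
        n_text = text_ids.shape[1]
        return self.lm_head(h[:, :n_text]), self.img_out(h[:, n_text:])

def unified_loss(model, text_ids, clean_latents, lambda_img=1.0):
    noise = torch.randn_like(clean_latents)
    t = torch.rand(clean_latents.shape[0], 1, 1)             # per-sample noise level
    noised = (1 - t) * clean_latents + t * noise              # toy noise schedule
    logits, pred_noise = model(text_ids[:, :-1], noised)
    lm = F.cross_entropy(logits.reshape(-1, logits.size(-1)), text_ids[:, 1:].reshape(-1))
    ddpm = F.mse_loss(pred_noise, noise)                       # denoising target
    return lm + lambda_img * ddpm

model = UnifiedBackbone()
text = torch.randint(0, 32000, (2, 17))    # 16 input tokens + 1 shifted target
latents = torch.randn(2, 64, 16)           # e.g. 8x8 patch latents per image
print(unified_loss(model, text, latents))
```

Pure-AR variants such as Chameleon or Show-o instead discretize images with a VQ tokenizer and keep only the cross-entropy term over the combined vocabulary.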
#Streaming #Real-Time #Online:
Paper | Base Model | Framework | Data | Code | Publication | Preprint | Affiliation |
---|---|---|---|---|---|---|---|
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions | Qwen2-1.8B | audio instruction + memory compression | self-construct | InternLM-XComposer-2.5-OmniLive | | 2412.09596 | Shanghai AI lab |
StreamChat: Chatting with Streaming Video | Qwen2.5 | kv-cache for streaming generation | self-construct | | | 2412.08646 | CUHK |
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format | llava-ov | grounding head | self-construct | MMDuet | | 2411.17991 | PKU |
Westlake-Omni: Open-Source Chinese Emotional Speech Interaction Large Language Model with Unified Discrete Sequence Modeling | qwen2 | LLM + AR Decoder | self-construct (Chinese) | Westlake-Omni | | | xinchen-ai |
Moshi: a speech-text foundation model for real-time dialogue | Helium-7B | RQ-Transformer | self-construct (7m hr (pt) + 2k hr (inst) + 160 hr (tts)) | moshi | | 2409.pdf | kyutai |
LLaMA-Omni: Seamless Speech Interaction with Large Language Models | LLaMA3 | speech-to-speech | self-construct (InstructS2S-200K) | LLaMA-Omni | | 2409.06666 | CAS |
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming, v2 | Qwen2 | audio generation with text instruction + parallel generation | self-construct (VoiceAssistant-400K) | mini-omni | | 2408.16725 | THU |
VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges | Vicuna | scene segment + recurrent | videochat2 | VideoLLaMB | | 2409.01071 | BIGAI |
VITA: Towards Open-Source Interactive Omni Multimodal LLM | Mixtral-8x7B | special tokens (<1>: audio; <2>: EOS; <3>: text) | mixture | VITA | | 2408.05211 | Tencent |
VideoLLM-online: Online Large Language Model for Streaming Video | Llama2/3 | Multi-turn dialogue + streaming loss | Ego4D | videollm-online | | 2406.11816 | NUS |
RT-DETR: DETRs Beat YOLOs on Real-time Object Detection | Dino + DETR | anchor-free | COCO | RT-DETR | | 2304.08069 | Baidu |
Streaming Dense Video Captioning | GIT/VidSeq + T5 | cluster visual token (memory) | | streaming_dvc | CVPR2024 | | |
Deformable DETR: Deformable Transformers for End-to-End Object Detection | ResNet+DETR | deformable-attention | COCO | Deformable-DETR | ICLR2021 | 2010.04159 | SenseTime |
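
A toy sketch of the streaming-inference pattern shared by several entries in the table above (VideoLLM-online, StreamChat, MMDuet): visual tokens from each incoming frame are appended to a persistent cache, and after every frame the model decides whether to stay silent or start talking. Everything here is a stand-in assumption, not any project's API: a GRU state imitates the transformer KV cache, `frame_encoder` imitates a visual tokenizer, and `SILENCE_ID` plays the role of a learned "keep listening" / EOS token.

```python
# Illustrative stand-in for streaming "when to speak" decoding; not real model code.
import torch
import torch.nn as nn

d_model, vocab_size, SILENCE_ID = 64, 1000, 0

frame_encoder = nn.Linear(3 * 16 * 16, d_model)          # stand-in visual tokenizer
gru = nn.GRU(d_model, d_model, batch_first=True)         # stand-in for the KV cache
lm_head = nn.Linear(d_model, vocab_size)

def lm_step(token_embs, state):
    """Advance the cache (GRU state) with new tokens; return logits and new state."""
    out, state = gru(token_embs, state)
    return lm_head(out[:, -1]), state

state = None                                              # persistent cache across frames
for t in range(8):                                        # pretend 8 frames arrive live
    frame = torch.randn(1, 3 * 16 * 16)                   # flattened incoming frame
    vis_tok = frame_encoder(frame).unsqueeze(1)           # 1 visual token per frame
    logits, state = lm_step(vis_tok, state)               # append to cache, re-score
    next_id = int(logits.argmax(-1))
    if next_id == SILENCE_ID:
        continue                                          # model chose to keep listening
    reply = []                                            # model decided to speak
    while next_id != SILENCE_ID and len(reply) < 20:
        reply.append(next_id)
        tok_emb = torch.randn(1, 1, d_model)              # placeholder token embedding
        logits, state = lm_step(tok_emb, state)
        next_id = int(logits.argmax(-1))
    print(f"frame {t}: reply tokens {reply}")
```

The key property is that per-frame cost stays roughly constant because past frames live in the cache and are never re-encoded.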
#Interactive #Duplex:
Paper | Base Model | Framework | Data | Code | Publication | Preprint | Affiliation |
---|---|---|---|---|---|---|---|
Enabling Real-Time Conversations with Minimal Training Costs | MiniCPM | AR + special token | self-curation (Ultra-Chat) | | | 2409.11727 | HIT |
Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models | MiniCPM | AR + time-slice <idle> | self-curation (Ultra-Chat) | duplex-model | | 2406.15718 | thunlp |
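
A minimal illustration of the "time-slice + <idle>" idea the duplex entries above describe: the dialogue is cut into fixed time slices, and in every slice the model must emit something, either real response tokens or an explicit idle token, so listening and speaking can overlap instead of alternating turn by turn. The serialization format and helper below are assumptions for illustration, not the papers' exact data format.

```python
# Illustrative duplex serialization: each time slice carries both channels.
IDLE = "<idle>"

def to_duplex_stream(slices):
    """slices: list of (user_chunk, assistant_chunk) per time slice; '' means silence."""
    stream = []
    for user_chunk, assistant_chunk in slices:
        stream.append(("user", user_chunk if user_chunk else IDLE))
        stream.append(("assistant", assistant_chunk if assistant_chunk else IDLE))
    return stream

# The user talks over two slices while the model idles, then the model answers
# across two slices -- overlap a turn-based format cannot express.
example = [
    ("What's the weather", ""),
    ("like in Paris?", ""),
    ("", "Right now it's"),
    ("", "sunny, about 24 degrees."),
]
for role, chunk in to_duplex_stream(example):
    print(f"{role:>9}: {chunk}")
```

Training on such interleaved slices is what lets an autoregressive model learn when to stay quiet and when to barge in, rather than always waiting for an end-of-turn signal.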
Projects:
- [2024.09] Open-Training-Moshi, a reproduction of the Moshi training process
- [2024.07] SAM2, Introducing Meta Segment Anything Model 2 (SAM 2)
- [2024.08] segment-anything-2-real-time, Run Segment Anything Model 2 on a live video stream
- [2024.06] LLaVA-Magvit2, LLaVA MagVit2: Combines MLLM Understanding and Generation with MagVit2
- [2024.05] GPT-4o system card, We’re announcing GPT-4o, our new flagship model that can reason across audio, vision, and text in real time.
#omni-modality
- [2024.06] ShareGPT4Omni Dataset, ShareGPT4Omni: Towards Building Omni Large Multi-modal Models with Comprehensive Multi-modal Annotations.
#streaming-data
- [2024.06] VideoLLM-online: Online Large Language Model for Streaming Video
- [2024.05] Streaming Long Video Understanding with Large Language Models
#streaming
- [2024.11] StreamingBench, StreamingBench evaluates Multimodal Large Language Models (MLLMs) in real-time, streaming video understanding tasks.
#timestampQA
- [2024.06] VStream-QA, Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams
#state #episodic