# Ominous

## Reading List

### Papers

#Interaction+Generation:

| Paper | Base Model | Framework | Data | Code | Publication | Preprint | Affiliation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Genie 2: A large-scale foundation world model | | | | genie2 | | | DeepMind |

#Multimodal #End2end Understanding+Generation:

| Paper | Base Model | Framework | Data | Code | Publication | Preprint | Affiliation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions | LLaMA3 | PT + FT | mixture (image, speech) | emova | | 2409.18042 | Huawei |
| One Single Transformer to Unify Multimodal Understanding and Generation | Phi + MagViT2 | PT + FT (LM-loss + MAE-loss) | mixture (image) | Show-o | | 2408.12528 | NUS |
| Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model | - (transfusion) | PT (LM-loss + DDPM-loss) | self-collect (image) | | | 2408.11039 | Meta |
| Anole: An Open, Autoregressive and Native Multimodal Models for Interleaved Image-Text Generation | chameleon | INST (interleaved) | mixture (image) | anole | | 2407.06135 | SJTU |
| Explore the Limits of Omni-modal Pretraining at Scale | vicuna | PT + INST | mixture (image, video, audio, depth -> text) | MiCo | | 2406.09412 | Shanghai AI Lab |
| X-VILA: Cross-Modality Alignment for Large Language Model | vicuna + SD | INST + Diffusion Decoder | mixture (image, video, audio) | | | 2405.19335 | NVIDIA |
| Chameleon: Mixed-Modal Early-Fusion Foundation Models | - (chameleon) | PT + FT (AR + image detokenizer) | mixture (image) | chameleon | | 2405.09818 | Meta |
| SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation | LLaMA + SD | PT + INST (interleaved) | mixture (image) | SEED-X | | 2404.14396 | Tencent |
| AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | LLaMA2 + SD | INST + NAR-decoder | mixture (image, speech, music) | AnyGPT | | 2402.12226 | FDU |
| World Model on Million-Length Video And Language With Blockwise RingAttention | LLaMA + VQGAN (LWM) | PT (long-context) | mixture (image, video) | LWM | | 2402.08268 | UCB |
| MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer | Vicuna + SD | PT + INST | mixture (image) | MM-Interleaved | | 2401.10208 | Shanghai AI Lab |
| Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action | T5X + VQGAN | PT + INST | mixture (image, audio, video, 3d) | unified-io-2 | | 2312.17172 | AI2 |
| VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation | LLaMA + SD | PT + INST (interleaved) | mixture (image) | VL-GPT | | 2312.09251 | Tencent |
| OneLLM: One Framework to Align All Modalities with Language | LLaMA2 | PT + INST (universal encoder + moe projector) | mixture (image, audio, point, depth, IMU, fMRI -> text) | OneLLM | CVPR2024 | 2312.03700 | Shanghai AI Lab |
| LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | - | INST | mixture (video, infrared, depth, audio -> text) | LanguageBind | ICLR2024 | 2310.01852 | PKU |
| DreamLLM: Synergistic Multimodal Comprehension and Creation | Vicuna + SD | PT + INST with projector (interleaved) | mixture (image) | DreamLLM | ICLR2024 | 2309.11499 | MEGVII |
| NExT-GPT: Any-to-Any Multimodal LLM | Vicuna + SD | INST with projector | mixture (text -> audio/image/video) | NExT-GPT | ICML2024 | 2309.05519 | NUS |
| LaVIT: Empower the Large Language Model to Understand and Generate Visual Content (video version) | LLaMA + SD | PT + INST (vector quantization: CE + regression) | mixture (image) | LaVIT | ICLR2024 | 2309.04669 | Kuaishou |
| Emu: Generative Pretraining in Multimodality (v2) | LLaMA + SD | PT (AR: CE + regression) | mixture (image) | Emu | ICLR2024 | 2307.05222 | BAAI |
| Any-to-Any Generation via Composable Diffusion | SD-1.5 | individual diffusion -> latent attention | mixture (text -> audio/image/video; image -> audio/video) | CoDi | NeurIPS2023 | 2305.11846 | Microsoft |
| ImageBind: One Embedding Space To Bind Them All | CLIP | Contrastive + Diffusion Decoder | mixture (image, video, audio, depth) | ImageBind | | 2305.05665 | Meta |
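
Several of the early-fusion entries above (Chameleon, AnyGPT, LWM) share the same basic recipe: images are encoded into discrete codebook indices, spliced into the text stream between boundary tokens, and the whole sequence is trained with the ordinary next-token loss. The sketch below illustrates only that sequence layout; the vocabulary size, special-token ids, and offsets are made-up values for illustration, not any listed repo's actual configuration.

```python
# Illustrative sketch of early-fusion token packing (assumed values, not any
# paper's real tokenizer): image VQ codes are shifted past the text vocabulary
# and wrapped in begin/end-image boundary tokens, so a single next-token
# objective covers both modalities.
from typing import List, Tuple

TEXT_VOCAB = 32_000                    # assumed text vocabulary size
BOI, EOI = TEXT_VOCAB, TEXT_VOCAB + 1  # assumed image boundary token ids
IMAGE_OFFSET = TEXT_VOCAB + 2          # image codebook ids live after the text vocab

def pack_interleaved(text_ids: List[int], image_codes: List[int]) -> List[int]:
    """Build one autoregressive training sequence: text, then the image as shifted VQ codes."""
    image_ids = [BOI] + [IMAGE_OFFSET + c for c in image_codes] + [EOI]
    return text_ids + image_ids

def next_token_pairs(seq: List[int]) -> List[Tuple[int, int]]:
    """Standard next-token targets: predict seq[t + 1] from the prefix ending at t."""
    return list(zip(seq[:-1], seq[1:]))

if __name__ == "__main__":
    text = [101, 7, 42]   # pretend token ids for a short caption
    vq = [5, 900, 13]     # pretend image codebook indices
    seq = pack_interleaved(text, vq)
    print(seq)            # [101, 7, 42, 32000, 32007, 32902, 32015, 32001]
    print(next_token_pairs(seq)[:3])
```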

#Streaming #Real-Time #Online

| Paper | Base Model | Framework | Data | Code | Publication | Preprint | Affiliation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions | Qwen2-1.8B | audio instruction + memory compression | self-construct | InternLM-XComposer-2.5-OmniLive | | 2412.09596 | Shanghai AI Lab |
| StreamChat: Chatting with Streaming Video | Qwen2.5 | kv-cache for streaming generation | self-construct | | | 2412.08646 | CUHK |
| VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format | llava-ov | grounding head | self-construct | MMDuet | | 2411.17991 | PKU |
| Westlake-Omni: Open-Source Chinese Emotional Speech Interaction Large Language Model with Unified Discrete Sequence Modeling | qwen2 | LLM + AR Decoder | self-construct (Chinese) | Westlake-Omni | | | xinchen-ai |
| Moshi: a speech-text foundation model for real-time dialogue | Helium-7B | RQ-Transformer | self-construct (7M hr (PT) + 2K hr (INST) + 160 hr (TTS)) | moshi | | 2409.pdf | kyutai |
| LLaMA-Omni: Seamless Speech Interaction with Large Language Models | LLaMA3 | speech-to-speech | self-construct (InstructS2S-200K) | LLaMA-Omni | | 2409.06666 | CAS |
| Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming (v2) | Qwen2 | audio generation with text instruction + parallel generation | self-construct (VoiceAssistant-400K) | mini-omni | | 2408.16725 | THU |
| VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges | Vicuna | scene segment + recurrent | videochat2 | VideoLLaMB | | 2409.01071 | BIGAI |
| VITA: Towards Open-Source Interactive Omni Multimodal LLM | Mixtral-8x7B | special tokens (<1>: audio; <2>: EOS; <3>: text) | mixture | VITA | | 2408.05211 | Tencent |
| VideoLLM-online: Online Large Language Model for Streaming Video | Llama2/3 | multi-turn dialogue + streaming loss | Ego4D | videollm-online | | 2406.11816 | NUS |
| RT-DETR: DETRs Beat YOLOs on Real-time Object Detection | Dino + DETR | anchor-free | COCO | RT-DETR | | 2304.08069 | Baidu |
| Streaming Dense Video Captioning | GIT/VidSeq + T5 | cluster visual token (memory) | | streaming_dvc | CVPR2024 | 2404.01297 | Google |
| Deformable DETR: Deformable Transformers for End-to-End Object Detection | ResNet + DETR | deformable-attention | COCO | Deformable-DETR | ICLR2021 | 2010.04159 | SenseTime |
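
A recurring mechanism in the streaming entries above (e.g. the streaming loss of VideoLLM-online and the grounding head of MMDuet) is a per-frame decision about whether the model should stay silent or start generating a reply. The loop below is a minimal sketch of that idea using assumed placeholder functions; the every-eighth-frame trigger and string "frames" stand in for the learned speak/stay-silent head and real decoding, and are not taken from any of the listed codebases.

```python
# Minimal sketch of a streaming "when to speak" loop (all helpers are assumed
# placeholders, not any listed paper's implementation).
from typing import List

def should_respond(context: List[str]) -> bool:
    # Stand-in for the learned per-frame speak/stay-silent decision
    # (e.g. an EOS-vs-response logit trained with a streaming loss).
    return len(context) % 8 == 0

def decode_reply(context: List[str]) -> str:
    # Stand-in for autoregressive decoding conditioned on the running context.
    return f"[reply after {len(context)} context items]"

def run_stream(frames: List[str]) -> List[str]:
    context: List[str] = []   # interleaved frame features and earlier replies
    replies: List[str] = []
    for frame in frames:
        context.append(frame)              # 1) ingest the newest frame
        if should_respond(context):        # 2) cheap per-frame decision
            reply = decode_reply(context)  # 3) run full decoding only when triggered
            context.append(reply)
            replies.append(reply)
    return replies

if __name__ == "__main__":
    print(run_stream([f"frame_{i:02d}" for i in range(24)]))
```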

#Interactive #Duplex

| Paper | Base Model | Framework | Data | Code | Publication | Preprint | Affiliation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Enabling Real-Time Conversations with Minimal Training Costs | MiniCPM | AR + special token | self-curation (Ultra-Chat) | | | 2409.11727 | HIT |
| Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models | MiniCPM | AR + time-slice <idle> | self-curation (Ultra-Chat) | duplex-model | | 2406.15718 | thunlp |
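
Both duplex entries above describe the same trick: user input and model output are interleaved in fixed time slices, and in slices where the model should keep listening it emits a dedicated <idle> token instead of speech. The sketch below illustrates only that decoding loop; `generate_slice` is a hypothetical stand-in for one step of the duplex LLM, which in the real systems makes this decision itself.

```python
# Minimal sketch of time-slice duplex decoding with an <idle> token
# (illustrative assumptions, not the duplex-model repo's code).
from typing import Iterable, List

IDLE = "<idle>"

def generate_slice(history: List[str]) -> str:
    # Hypothetical one-step duplex LLM: here it "speaks" only after the user
    # has produced a slice containing a question mark, otherwise it stays idle.
    return "Sure, here is an answer." if any("?" in h for h in history) else IDLE

def duplex_loop(user_slices: Iterable[str]) -> List[str]:
    history: List[str] = []   # shared timeline of user and model slices
    outputs: List[str] = []
    for user_text in user_slices:
        history.append(user_text)      # 1) append the incoming user slice
        out = generate_slice(history)  # 2) model emits speech or <idle>
        history.append(out)            # 3) the output is interleaved back in
        outputs.append(out)
    return outputs

if __name__ == "__main__":
    print(duplex_loop(["hi", "I have a", "question?", "thanks"]))
```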

### Projects

- [2024.09] Open-Training-Moshi, a reproduction of the Moshi training process
- [2024.07] SAM2, introducing Meta Segment Anything Model 2 (SAM 2)
- [2024.06] LLaVA-Magvit2, combines MLLM understanding and generation with MagVit2
- [2024.05] GPT-4o system card, OpenAI's flagship model that can reason across audio, vision, and text in real time

### Dataset

#omni-modality

- [2024.06] ShareGPT4Omni Dataset, ShareGPT4Omni: Towards Building Omni Large Multi-modal Models with Comprehensive Multi-modal Annotations

#streaming-data

- [2024.06] VideoLLM-online: Online Large Language Model for Streaming Video
- [2024.05] Streaming Long Video Understanding with Large Language Models

### Benchmark

#streaming

- [2024.11] StreamingBench, evaluates Multimodal Large Language Models (MLLMs) on real-time, streaming video understanding tasks

#timestampQA

- [2024.06] VStream-QA, Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams

#state #episodic

- [2024.04] OpenEQA, OpenEQA: Embodied Question Answering in the Era of Foundation Models
- [2021.10] Env-QA, Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments