# Ominous

## Reading List

### Papers

#Interaction+Generation:

| Paper | Base Model | Framework | Data | Code | Publication | Preprint | Affiliation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Genie 2: A large-scale foundation world model | | | | genie2 | | | DeepMind |

#Multimodal #End2end Understanding+Generation:

| Paper | Base Model | Framework | Data | Code | Publication | Preprint | Affiliation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions | LLaMA3 | PT + FT | mixture (image, speech) | emova | | 2409.18042 | Huawei |
| One Single Transformer to Unify Multimodal Understanding and Generation | Phi + MagViT2 | PT + FT (LM-loss + MAE-loss) | mixture (image) | Show-o | | 2408.12528 | NUS |
| Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model | - (transfusion) | PT (LM-loss + DDPM-loss) | self-collect (image) | | | 2408.11039 | Meta |
| Anole: An Open, Autoregressive and Native Multimodal Models for Interleaved Image-Text Generation | chameleon | INST (interleaved) | mixture (image) | anole | | 2407.06135 | SJTU |
| Explore the Limits of Omni-modal Pretraining at Scale | vicuna | PT + INST | mixture (image, video, audio, depth -> text) | MiCo | | 2406.09412 | Shanghai AI Lab |
| X-VILA: Cross-Modality Alignment for Large Language Model | vicuna + SD | INST + Diffusion Decoder | mixture (image, video, audio) | | | 2405.19335 | NVIDIA |
| Chameleon: Mixed-Modal Early-Fusion Foundation Models | - (chameleon) | PT + FT (AR + image detokenizer) | mixture (image) | chameleon | | 2405.09818 | Meta |
| SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation | LLaMA + SD | PT + INST (interleaved) | mixture (image) | SEED-X | | 2404.14396 | Tencent |
| AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | LLaMA2 + SD | INST + NAR-decoder | mixture (image, speech, music) | AnyGPT | | 2402.12226 | FDU |
| World Model on Million-Length Video And Language With Blockwise RingAttention | LLaMA + VQGAN (LWM) | PT (long-context) | mixture (image, video) | LWM | | 2402.08268 | UCB |
| MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer | Vicuna + SD | PT + INST | mixture (image) | MM-Interleaved | | 2401.10208 | Shanghai AI Lab |
| Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action | T5X + VQGAN | PT + INST | mixture (image, audio, video, 3d) | unified-io-2 | | 2312.17172 | AI2 |
| VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation | LLaMA + SD | PT + INST (interleaved) | mixture (image) | VL-GPT | | 2312.09251 | Tencent |
| OneLLM: One Framework to Align All Modalities with Language | LLaMA2 | PT + INST (universal encoder + moe projector) | mixture (image, audio, point, depth, IMU, fMRI -> text) | OneLLM | CVPR2024 | 2312.03700 | Shanghai AI Lab |
| LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | - | INST | mixture (video, infrared, depth, audio -> text) | LanguageBind | ICLR2024 | 2310.01852 | PKU |
| DreamLLM: Synergistic Multimodal Comprehension and Creation | Vicuna + SD | PT + INST with projector (interleaved) | mixture (image) | DreamLLM | ICLR2024 | 2309.11499 | MEGVII |
| NExT-GPT: Any-to-Any Multimodal LLM | Vicuna + SD | INST with projector | mixture (text -> audio/image/video) | NExT-GPT | ICML2024 | 2309.05519 | NUS |
| LaVIT: Empower the Large Language Model to Understand and Generate Visual Content (video version) | LLaMA + SD | PT + INST (vector quantization: CE + regression) | mixture (image) | LaVIT | ICLR2024 | 2309.04669 | Kuaishou |
| Emu: Generative Pretraining in Multimodality (v2) | LLaMA + SD | PT (AR: CE + regression) | mixture (image) | Emu | ICLR2024 | 2307.05222 | BAAI |
| Any-to-Any Generation via Composable Diffusion | SD-1.5 | individual diffusion -> latent attention | mixture (text -> audio/image/video; image -> audio/video) | CoDi | NeurIPS2023 | 2305.11846 | Microsoft |
| ImageBind: One Embedding Space To Bind Them All | CLIP | Contrastive + Diffusion Decoder | mixture (image, video, audio, depth) | ImageBind | | 2305.05665 | Meta |
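
Several of the early-fusion entries above (Chameleon, AnyGPT, LWM) share the same basic recipe: images are encoded into discrete codebook indices, spliced into the text stream between boundary tokens, and the whole sequence is trained with the ordinary next-token loss. The sketch below illustrates only that sequence layout; the vocabulary size, special-token ids, and offsets are made-up values for illustration, not any listed repo's actual configuration.

```python
# Illustrative sketch of early-fusion token packing (assumed values, not any
# paper's real tokenizer): image VQ codes are shifted past the text vocabulary
# and wrapped in begin/end-image boundary tokens, so a single next-token
# objective covers both modalities.
from typing import List, Tuple

TEXT_VOCAB = 32_000                    # assumed text vocabulary size
BOI, EOI = TEXT_VOCAB, TEXT_VOCAB + 1  # assumed image boundary token ids
IMAGE_OFFSET = TEXT_VOCAB + 2          # image codebook ids live after the text vocab

def pack_interleaved(text_ids: List[int], image_codes: List[int]) -> List[int]:
    """Build one autoregressive training sequence: text, then the image as shifted VQ codes."""
    image_ids = [BOI] + [IMAGE_OFFSET + c for c in image_codes] + [EOI]
    return text_ids + image_ids

def next_token_pairs(seq: List[int]) -> List[Tuple[int, int]]:
    """Standard next-token targets: predict seq[t + 1] from the prefix ending at t."""
    return list(zip(seq[:-1], seq[1:]))

if __name__ == "__main__":
    text = [101, 7, 42]   # pretend token ids for a short caption
    vq = [5, 900, 13]     # pretend image codebook indices
    seq = pack_interleaved(text, vq)
    print(seq)            # [101, 7, 42, 32000, 32007, 32902, 32015, 32001]
    print(next_token_pairs(seq)[:3])
```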

#Streaming #Real-Time #Online

| Paper | Base Model | Framework | Data | Code | Publication | Preprint | Affiliation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions | Qwen2-1.8B | audio instruction + memory compression | self-construct | InternLM-XComposer-2.5-OmniLive | | 2412.09596 | Shanghai AI Lab |
| StreamChat: Chatting with Streaming Video | Qwen2.5 | kv-cache for streaming generation | self-construct | | | 2412.08646 | CUHK |
| VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format | llava-ov | grounding head | self-construct | MMDuet | | 2411.17991 | PKU |
| Westlake-Omni: Open-Source Chinese Emotional Speech Interaction Large Language Model with Unified Discrete Sequence Modeling | qwen2 | LLM + AR Decoder | self-construct (Chinese) | Westlake-Omni | | | xinchen-ai |
| Moshi: a speech-text foundation model for real-time dialogue | Helium-7B | RQ-Transformer | self-construct (7M hr (PT) + 2K hr (INST) + 160 hr (TTS)) | moshi | | 2409.pdf | kyutai |
| LLaMA-Omni: Seamless Speech Interaction with Large Language Models | LLaMA3 | speech-to-speech | self-construct (InstructS2S-200K) | LLaMA-Omni | | 2409.06666 | CAS |
| Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming (v2) | Qwen2 | audio generation with text instruction + parallel generation | self-construct (VoiceAssistant-400K) | mini-omni | | 2408.16725 | THU |
| VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges | Vicuna | scene segment + recurrent | videochat2 | VideoLLaMB | | 2409.01071 | BIGAI |
| VITA: Towards Open-Source Interactive Omni Multimodal LLM | Mixtral-8x7B | special tokens (<1>: audio; <2>: EOS; <3>: text) | mixture | VITA | | 2408.05211 | Tencent |
| VideoLLM-online: Online Large Language Model for Streaming Video | Llama2/3 | multi-turn dialogue + streaming loss | Ego4D | videollm-online | | 2406.11816 | NUS |
| RT-DETR: DETRs Beat YOLOs on Real-time Object Detection | Dino + DETR | anchor-free | COCO | RT-DETR | | 2304.08069 | Baidu |
| Streaming Dense Video Captioning | GIT/VidSeq + T5 | cluster visual token (memory) | | streaming_dvc | CVPR2024 | 2404.01297 | Google |
| Deformable DETR: Deformable Transformers for End-to-End Object Detection | ResNet + DETR | deformable-attention | COCO | Deformable-DETR | ICLR2021 | 2010.04159 | SenseTime |
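
A recurring mechanism in the streaming entries above (e.g. the streaming loss of VideoLLM-online and the grounding head of MMDuet) is a per-frame decision about whether the model should stay silent or start generating a reply. The loop below is a minimal sketch of that idea using assumed placeholder functions; the every-eighth-frame trigger and string "frames" stand in for the learned speak/stay-silent head and real decoding, and are not taken from any of the listed codebases.

```python
# Minimal sketch of a streaming "when to speak" loop (all helpers are assumed
# placeholders, not any listed paper's implementation).
from typing import List

def should_respond(context: List[str]) -> bool:
    # Stand-in for the learned per-frame speak/stay-silent decision
    # (e.g. an EOS-vs-response logit trained with a streaming loss).
    return len(context) % 8 == 0

def decode_reply(context: List[str]) -> str:
    # Stand-in for autoregressive decoding conditioned on the running context.
    return f"[reply after {len(context)} context items]"

def run_stream(frames: List[str]) -> List[str]:
    context: List[str] = []   # interleaved frame features and earlier replies
    replies: List[str] = []
    for frame in frames:
        context.append(frame)              # 1) ingest the newest frame
        if should_respond(context):        # 2) cheap per-frame decision
            reply = decode_reply(context)  # 3) run full decoding only when triggered
            context.append(reply)
            replies.append(reply)
    return replies

if __name__ == "__main__":
    print(run_stream([f"frame_{i:02d}" for i in range(24)]))
```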

#Interactive #Duplex

| Paper | Base Model | Framework | Data | Code | Publication | Preprint | Affiliation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Enabling Real-Time Conversations with Minimal Training Costs | MiniCPM | AR + special token | self-curation (Ultra-Chat) | | | 2409.11727 | HIT |
| Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models | MiniCPM | AR + time-slice <idle> | self-curation (Ultra-Chat) | duplex-model | | 2406.15718 | thunlp |
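
Both duplex entries above describe the same trick: user input and model output are interleaved in fixed time slices, and in slices where the model should keep listening it emits a dedicated <idle> token instead of speech. The sketch below illustrates only that decoding loop; `generate_slice` is a hypothetical stand-in for one step of the duplex LLM, which in the real systems makes this decision itself.

```python
# Minimal sketch of time-slice duplex decoding with an <idle> token
# (illustrative assumptions, not the duplex-model repo's code).
from typing import Iterable, List

IDLE = "<idle>"

def generate_slice(history: List[str]) -> str:
    # Hypothetical one-step duplex LLM: here it "speaks" only after the user
    # has produced a slice containing a question mark, otherwise it stays idle.
    return "Sure, here is an answer." if any("?" in h for h in history) else IDLE

def duplex_loop(user_slices: Iterable[str]) -> List[str]:
    history: List[str] = []   # shared timeline of user and model slices
    outputs: List[str] = []
    for user_text in user_slices:
        history.append(user_text)      # 1) append the incoming user slice
        out = generate_slice(history)  # 2) model emits speech or <idle>
        history.append(out)            # 3) the output is interleaved back in
        outputs.append(out)
    return outputs

if __name__ == "__main__":
    print(duplex_loop(["hi", "I have a", "question?", "thanks"]))
```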

### Projects

- [2024.09] Open-Training-Moshi, a reproduction of the Moshi training process
- [2024.07] SAM2, introducing Meta Segment Anything Model 2 (SAM 2)
- [2024.06] LLaVA-Magvit2, combines MLLM understanding and generation with MagVit2
- [2024.05] GPT-4o system card, OpenAI's flagship model that can reason across audio, vision, and text in real time

### Dataset

#omni-modality

- [2024.06] ShareGPT4Omni Dataset, ShareGPT4Omni: Towards Building Omni Large Multi-modal Models with Comprehensive Multi-modal Annotations

#streaming-data

- [2024.06] VideoLLM-online: Online Large Language Model for Streaming Video
- [2024.05] Streaming Long Video Understanding with Large Language Models

### Benchmark

#streaming

- [2024.11] StreamingBench, evaluates Multimodal Large Language Models (MLLMs) on real-time, streaming video understanding tasks

#timestampQA

- [2024.06] VStream-QA, Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams

#state #episodic

- [2024.04] OpenEQA, OpenEQA: Embodied Question Answering in the Era of Foundation Models
- [2021.10] Env-QA, Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments