This reading list also collects video-language pretraining works from before the LLM era.
Notes: FT = finetune, VidL = video-language, MM = multimodal, INST = instruction.
Paper | Base Language Model | Framework | Data | Code | Publication | Preprint | Affiliation |
---|---|---|---|---|---|---|---|
LongVILA: Scaling Long-Context Visual Language Models for Long Videos | LLaMA3 | INST (LLM-extend) | Mixture | LongVILA | | 2408.10188 | NVIDIA
Long Context Transfer from Language to Vision | Qwen2 | INST (LLM-extend) | llava-next | LongVA | | 2406.16852 | NTU
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models | Vicuna | INST | Mixture (llava, videochat, ego4d, how2) | SALMONN | | 2406.15704 | Bytedance
VideoLLM-online: Online Large Language Model for Streaming Video | Llama2/3 | INST (+timestamp) | | videollm-online | | | NUS
Streaming Long Video Understanding with Large Language Models | Phi2, Vicuna | INST | Mixture (conceptual caption, howto100m, panda-700m, movieqa, msrvtt, star) | | | | Shanghai AI Lab
LLaVA-NeXT: A Strong Zero-shot Video Understanding Model | Vicuna, Yi | INST | LLaVA+ | LLaVA-NeXT | | 2404-blog | Bytedance
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | Vicuna, Yi | INST | VideoChat2 insts. | PLLaVA | | 2404.16994 | Bytedance
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | Vicuna | FT (memory retrieval) | - (task) | MA-LMM | | 2404.05726 | Meta
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | Llama, Mistral | PT+FT | mixture | MiniGPT4-video | | 2404.03413 | KAUST
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding | X+GPT4 | Video Agent | - | VideoAgent | | 2403.11481 | BIGAI
VideoAgent: Long-form Video Understanding with Large Language Model as Agent | LaViLa + GPT4 | Video Agent (Caption) | - | | | 2403.10517 | Stanford
Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding | - | - | | Video Mamba Suite | | 2403.09626 | Shanghai AI Lab
VideoMamba: State Space Model for Efficient Video Understanding | - | - | | VideoMamba | | 2403.06977 | Shanghai AI Lab
LLMs Meet Long Video: Advancing Long Video Comprehension with An Interactive Visual Adapter in LLMs | LLaMA | adapter | mixture | | | 2402.13546 | HIT
Video ReCap: Recursive Captioning of Hour-Long Videos | BLIP2, LaVila | Caption + dataset | mixture | VideoRecap | CVPR2024 | 2402.13250 | UNC
VideoPrism: A Foundational Visual Encoder for Video Understanding | (PaLM) | PT | mixture | | | 2402.13217 |
LVCHAT: Facilitating Long Video Comprehension | LLaMA | FT + position interleaved | VideoChat2 | LVChat | | 2402.12079 | UCSD
Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning | llama | MMINST+temporal prompt | self-construct (Moment-10M) | Momentor | | 2402.11435 | ZJU
World Model on Million-Length Video And Language With RingAttention | LLaMA2 | PT+FT | mixture | LWM | | 2402.08268 | UCB
Memory Consolidation Enables Long-Context Video Understanding | Bert | FT + memory (ViT) | | | | 2402.05861 | DeepMind
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | LLaMA2 | PT+FT | mixture | LaVIT | | 2402.03161 | PKU
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models | StableLM, Qwen, Phi2 | MoE | mixture (MM-INST) | MoE-LLaVA | | 2401.15947 | PKU
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models | X + GPT3.5 | Video Agent | - | DoraemonGPT | | 2401.08392 | ZJU
A Simple LLM Framework for Long-Range Video Question-Answering | Cap + GPT4 | Video Agent (Caption) | - | LLoVi | | 2312.17235 | UNC
Text-Conditioned Resampler For Long Form Video Understanding | BLIP2 | FT Resampler (blip2 on video) | - | | | 2312.11897 |
Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens | LLaVA | FT+Recur. Qformer | VideoChatGPT | | | 2312.08870 | ByteDance
A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames | PaLI, Bard | PT (vivit+adapter) | mixture | | | 2312.07395 |
LifelongMemory: Leveraging LLMs for Answering Queries in Egocentric Videos | LLaVA, GPT3.5 | Video Agent (Caption) | - | | | 2312.05269 | NYU
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | LLaMA2 | MM-INST | mixture (additional transcribed speech) | TimeChat | CVPR2024 | 2312.02051 | PKU
Zero-Shot Video Question Answering with Procedural Programs | GPT+X | Video Agent | - | | | 2312.00937 | CMU
VTimeLLM: Empower LLM to Grasp Video Moments | Vicuna | INST+temporal | mixture | VTimeLLM | CVPR2024 | 2311.18445 | THU
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | Vicuna | MM-INST | self-construct | LLaMA-VID | | 2311.17043 | CUHK
Vamos: Versatile Action Models for Video Understanding | GPT4, X | Video Agent (Caption) | - | Vamos | | 2311.13627 | Brown
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | Vicuna 1.5 | PT+FT | mixture | Video-LLaVA | | 2311.10122 | PKU
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | Vicuna | PT+FT | mixture | Chat-UniVi | | 2311.08046 | PKU
UniVTG: Towards Unified Video-Language Temporal Grounding | CLIP | PT | mixture | UniVTG | ICCV 2023 | 2307.16715 | NTU
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | Vicuna | FT | | MovieChat | CVPR2024 | 2307.16449 | Microsoft
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models | DeBerta | FT, RTr-Augmented | | | | 2306.11732 | CUHK
Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration | LLaMA | | | Macaw-LLM | | 2306.09093 | Tencent
Valley: Video assistant with large language model enhanced ability | Vicuna | PT, FT + MM-INST | mixture | Valley | | 2306.07207 | ByteDance
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Vicuna | | | Video-ChatGPT | | 2306.05424 | MBZUAI
Video-LLaMA: An Instruction-Finetuned Visual Language Model for Video Understanding | LLaMA | | | Video-LLaMA | | 2306.02858 | Alibaba
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | Bert | PT | mixture (audio, video, image) | VAST | Neurips2023 | 2305.18500 | CAS
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | Vicuna-13B | FT + MM-INST | | ChatBridge | | 2305.16103 | CAS
Self-Chained Image-Language Model for Video Localization and Question Answering | BLIP2 | 2-stage: localizer(LM) + answer | QVHighlights, FT VidL | SeViLA | Neurips2023 | 2305.06988 | UNC
VideoChat: Chat-Centric Video Understanding | Blip2 | | | VideoChat | | 2305.06355 | Shanghai AI Lab
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | ChatGPT | | | X-LLM | | 2305.04160 | CAS
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Bert | | | VALOR | | 2304.08345 | CAS
Verbs in Action: Improving verb understanding in video-language models | PaLM | | | | | 2304.06708 |
Video ChatCaptioner: Towards the Enriched Spatiotemporal Descriptions | ChatGPT, Flan-T5 (BLIP2) | | | ChatCaptioner | | 2304.04227 | KAUST
Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering | GPT2, GPT-Neo, GPT3 | | | | CVPR2023 workshop | 2304.03754 | Columbia Univ.
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | Bert | PT | mixture | Unmasked Teacher | ICCV 2023 | 2303.16058 | Shanghai AI Lab
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | T5 | | | Vid2Seq | | 2302.14115 | Google
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | Bert | | | | | 2212.14546 | Alibaba
VindLU: A Recipe for Effective Video-and-Language Pretraining | Bert | | | VindLU | | 2212.05051 | UNC
Learning Video Representations from Large Language Models | GPT2 | PT (data-augment) | Ego4D/HowTo100M | LaViLa | CVPR2023 | 2212.04501 | Meta
SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training | Bert | | | | | 2211.11446 | UW
CLOP: Video-and-Language Pre-Training with Knowledge Regularizations | Roberta | | | | MM 2022 | 2211.03314 | Baidu
Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning | Bert | | | | NIPS 2022 | 2210.06031 | Microsoft
OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | Bert | | | | NIPS 2022 | 2209.07526 | Microsoft
Clover: Towards A Unified Video-Language Alignment and Fusion Model | Bert | | | Clover | | 2207.07885 | Bytedance
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling | Bert-like | | | LAVENDER | CVPR 2023 | 2206.07160 | Microsoft
Revealing Single Frame Bias for Video-and-Language Learning | Bert | | | Singularity | | 2206.03428 | UNC
Label-Efficient Online Continual Object Detection in Streaming Video | - | (continual) | | Efficient-CLS | ICCV 2023 | 2206.00309 | NUS
Flamingo: a Visual Language Model for Few-Shot Learning | Chinchilla | | | Flamingo | NIPS 2022 | 2204.14198 | DeepMind
All in One: Exploring Unified Video-Language Pre-training | Bert-like | | | All-In-One | CVPR 2023 | 2203.07303 | NUS
End-to-end Generative Pretraining for Multimodal Video Captioning | Bert+GPT2 | | | | CVPR 2022 | 2201.08264 | Google
Align and Prompt: Video-and-Language Pre-training with Entity Prompts | Bert-like | | | ALPRO | CVPR 2022 | 2112.09583 | Salesforce
VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling, V2 | Bert | | | VIOLET | | 2111.12681 | Microsoft
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding | Bert | | | VideoCLIP | EMNLP 2021 | 2109.14084 | Facebook
MERLOT: Multimodal Neural Script Knowledge Models, V2 | Roberta | | | MERLOT | NIPS 2021 | 2106.02636 | AI2
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding | Bert | | | VLP | ACL Findings 2021 | 2105.09996 | Facebook
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | Bert-like | | | | NIPS 2021 | 2104.11178 | Google
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | Bert-like | | | CLIP4Clip | Neurocomputing 2022 | 2104.08860 | Microsoft
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | Bert | | | Frozen-in-Time | ICCV 2021 | 2104.00650 | Oxford
Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling | Bert | | | ClipBert | CVPR 2021 | 2102.06183 | Microsoft
ActBERT: Learning Global-Local Video-Text Representations | Bert | | | ActBert | CVPR 2020 | 2011.07231 | Baidu
Video Understanding as Machine Translation | T5 | | | | | 2006.07203 | Facebook
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training | Bert | | | HERO | EMNLP 2020 | 2005.00200 | Microsoft
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation | Bert | | | UniVL | | 2002.06353 | Microsoft
Learning Video Representations using Contrastive Bidirectional Transformer | Bert | | | | | 1906.05743 | Google
VideoBERT: A Joint Model for Video and Language Representation Learning | Bert | | | VideoBert (non-official) | ICCV 2019 | 1904.01766 | Google
Commonly Used Pretraining Tasks
- Masked Language Modeling (MLM)
- Causal Language Modeling (LM)
- Masked Vision Modeling (MVM)
  - Vision = Frame
  - Vision = Patch
  - Vision = Object
- Video-Language Matching (VLM)
- Video-Language Contrastive (VLC); a minimal sketch of this objective follows the list
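For concreteness, here is a minimal PyTorch sketch of the video-language contrastive (VLC) objective as a symmetric InfoNCE loss over paired video/text embeddings. The function name, tensor shapes, and temperature are illustrative assumptions, not the exact formulation of any paper above.

```python
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss; pairs at the same batch index are positives.

    video_emb, text_emb: (batch, dim) tensors from the video and text encoders.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature        # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)            # video -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)        # text -> video direction
    return (loss_v2t + loss_t2v) / 2
```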
Paper | Video Clips | Duration | Sentences | Domain | Download Link |
---|---|---|---|---|---|
(❗NOT AVAILABLE, 23 Feb 2024) Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | 2.5M (2M) | 18s | 2.5M | open (web) | WebVid-2M, WebVid-10M |
Howto100m: Learning a text-video embedding by watching hundred million narrated video clips | 136M (1.2M) | 4s | 136M | instruction (YouTube) | HowTo100M |
MERLOT: Multimodal neural script knowledge models | 180M (6M) | ~20m | ~720M | open (YouTube) | YT-Temporal-180M |
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | 100M (3.3M) | 13.4s | 100M | open (YouTube) | HD-VILA-100M |
Learning audio-video modalities from image captions | 10.3M (6.3M) | 10s | | open (web) | VideoCC |
CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos | 18M | 60s | | open (YouTube) | YTD-18M |
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | 10 M | 54.2s | 10 M | open (YOUKU) | Youku-mPLUG |
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | 234M (7.1M) | 11.7s | 234 M | open (YouTube) | InternVid |
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers | 70M | 8.5s | 70M | open | Panda-70M from HD-VILA-100M |
Dataset | Statistics | Source |
---|---|---|
Video-ChatGPT | 100k INST/10k videos | ActivityNet + (Human+GPT) Annotation |
Valley | 65k INST/100k videos | (VATEX + JukinMedia) + (Human+GPT) Annotation |
VideoChat | 11k INST/11k videos | WebVid + GPT Annotation |
TimeIT | 125k INST | Mixture + GPT Annotation |
- [NeurIPS 2023 D&B] VidChapters-7M, a large-scale dataset of user-chaptered videos; the authors study three tasks on top of this dataset and show that video chapter generation models trained on VidChapters-7M transfer well to dense video captioning.
Task | Paper | Download Link | Publication |
---|---|---|---|
Retrieval | Collecting Highly Parallel Data for Paraphrase Evaluation | MSVD | ACL 2011 |
Retrieval | A Dataset for Movie Description | LSMDC | CVPR 2015 |
Retrieval | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | MSR-VTT | CVPR 2016 |
Retrieval | Localizing Moments in Video with Natural Language | DiDeMo | ICCV 2017 |
Retrieval | Dense-Captioning Events in Videos | ActivityNet Caption | ICCV 2017 |
Retrieval | Towards Automatic Learning of Procedures from Web Instructional Videos | YouCook2 | AAAI 2018 |
OE QA | TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | TGIF-Frame | CVPR 2017 |
OE QA | A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering | LSMDC-FiB | CVPR 2017 |
OE QA | Video Question Answering via Gradually Refined Attention over Appearance and Motion | MSRVTT-QA,MSVD-QA | MM 2017 |
OE QA | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | ActivityNet-QA | AAAI 2019 |
MC QA | Learning Language-Visual Embedding for Movie Understanding with Natural-Language | LSMDC-MC | |
MC QA | TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | TGIF-Action, TGIF-Transition | CVPR 2017 |
MC QA | A Joint Sequence Fusion Model for Video Question Answering and Retrieval | MSRVTT-MC | ECCV 2018 |
Caption | Collecting Highly Parallel Data for Paraphrase Evaluation | MSVD | ACL 2011 |
Caption | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | MSR-VTT | CVPR 2016 |
Caption | VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research | VATEX | ICCV 2019 |
Dense Caption | Dense-Captioning Events in Videos | ActivityNet Caption | ICCV 2017 |
Dense Caption | Towards Automatic Learning of Procedures from Web Instructional Videos | YouCook2 | AAAI 2018 |
Dense Caption | Multimodal Pretraining for Dense Video Captioning | ViTT | AACL 2020 |
Action | HMDB: A large video database for human motion recognition | HMDB-51 | ICCV 2011 |
Action | UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild | UCF-101 | arXiv 2012 |
Action | ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding | ActivityNet-200 | CVPR 2015 |
Action | Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding | Charades-157 | ECCV 2016 |
Action | The Kinetics Human Action Video Dataset | Kinetics-400/600/700 |
Paper | Task | Duration | Domain | Link | Publication |
---|---|---|---|---|---|
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | Video QA | ~8m | movie | MovieChat | CVPR 2024 |
MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding | Video QA | ~8m | movie | MoVQA | |
EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Video QA | ~3m | open (ego) | EgoSchema | |
ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models | Temporal Grounding | ~60s (mix.) | open | ViLMA | ICLR 2024 |
From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering | Video QA | 9s | open | Causal-VidQA | CVPR 2022 |
VIOLIN: A Large-Scale Dataset for Video-and-Language Inference | Video Language Inference | 35.2s | movie | VIOLIN | CVPR 2020 |
TVQA: Localized, Compositional Video Question Answering | Video QA | 60-90s | movie | TVQA | EMNLP 2018 |
AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning | Video QA | 30s | open | AGQA | CVPR 2021 |
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions | Video QA | 44s | open | NExT-QA-MC, NExT-QA-OE | CVPR 2021 |
Towards Long-Form Video Understanding | Classification | 1-3m | movie | LVU | CVPR 2021 |
STAR: A Benchmark for Situated Reasoning in Real-World Videos | Video QA | 12s | open | Star | NIPS 2021 |
Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments | Video QA | 20s | virtual env. | Env-QA | ICCV 2021 |
COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis | Localization/Action Seg. | 3.36m | open (instruct) | COIN | CVPR 2019 |
Cross-task weakly supervised learning from instructional videos | Localization | 4m57s | open (instruct) | CrossTask | CVPR 2019 |
Social-IQ: A Question Answering Benchmark for Artificial Social Intelligence | Video QA | 60s | open | Social-IQ | CVPR 2019 |
Benchmark | Task | Data | Paper | Preprint | Publication | Affiliation |
---|---|---|---|---|---|---|
Video-Bench | MC (general domain) | mixture | | | | |
- Common Metrics on Video Quality: easily calculate FVD, PSNR, SSIM, and LPIPS for evaluating the quality of generated or predicted videos.
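As a concrete reference point, below is a minimal NumPy sketch of frame-level PSNR, the simplest of the metrics above; the function name and array shapes are illustrative. FVD, SSIM, and LPIPS involve learned features or local windows and are best computed with the linked toolkit.

```python
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two frames with pixel values in [0, max_val]."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

# Average per-frame PSNR of a generated clip vs. a reference clip, both assumed
# to be uint8 arrays of shape (num_frames, H, W, 3):
# clip_psnr = np.mean([psnr(p, t) for p, t in zip(pred_clip, ref_clip)])
```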
projects
(Video Agent)
VLog: transforms a video into a document using ChatGPT, CLIP, BLIP2, GRIT, Whisper, and LangChain.
tools
- VideoDB: lets developers 1) upload multiple videos to build a library or collection; 2) search across these videos and get real-time video responses or compilations; 3) publish a searchable collection on the ChatGPT store; 4) receive summarized text answers (RAG); 5) extract key insights from specific videos (e.g., "top points from episode 31").
- video2dataset: easily create large video datasets from video URLs; it can download and package 10M videos in 12 hours on a single 16-core machine.
- Match cutting: a match cut is a transition between a pair of shots that uses similar framing, composition, or action to fluidly bring the viewer from one scene to the next.
- Awesome-Video-Object-Segmentation: a curated list of video object segmentation (VOS) papers, datasets, and projects.
- pytube: a lightweight, dependency-free Python library (and command-line utility) for downloading YouTube videos (see the sketch after this list).
- movienet-tools: a movie toolbox providing basic tools and functions for movie-understanding research, so you can get started easily.
- PySceneDetect: a video scene cut detection and analysis tool (see the sketch after this list).
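A minimal sketch combining two of the tools above: pytube to download a video and PySceneDetect to split it into shots. The URL is a placeholder, the ContentDetector threshold is PySceneDetect's documented default, and exact APIs may differ across library versions, so treat this as illustrative rather than canonical usage.

```python
from pytube import YouTube
from scenedetect import ContentDetector, detect

URL = "https://www.youtube.com/watch?v=XXXXXXXXXXX"  # placeholder video URL

# Download the highest-resolution progressive stream to a local file.
YouTube(URL).streams.get_highest_resolution().download(filename="clip.mp4")

# ContentDetector flags a cut whenever frame-to-frame content change exceeds the threshold.
scenes = detect("clip.mp4", ContentDetector(threshold=27.0))
for start, end in scenes:
    print(f"shot {start.get_timecode()} -> {end.get_timecode()}")
```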
Survey
Reading List
Paper | Base Structure | Data | Code | Publication | Preprint | Affiliation |
---|---|---|---|---|---|---|
Video generation models as world simulators | Transformer | - | - | - | 2402.blog | OpenAI |
Vlogger: Make Your Dream A Vlog | Diffusion | | Vlogger | | 2401.09414 | Shanghai AI Lab
FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis | ControlNet | | FlowVid | | 2312.17681 | Meta
SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction | Diffusion | | SEINE | | 2310.20700 | Shanghai AI Lab
MotionDirector: Motion Customization of Text-to-Video Diffusion Models | Diffusion | | MotionDirector | | 2310.08465 | NUS
VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning | GPT4 + UNet | | VideoDirectorGPT | | 2309.15091 | UNC
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers | Transformer + VQVAE | self-construct | CogVideo | | 2205.15868 | THU
- T2VScore: Towards A Better Metric for Text-to-Video Generation
- Open-Sora: Democratizing Efficient Video Production for All
- Open Chat Video Editor: an open-source tool for automatic short-video generation