This reading list also collects video-language pretraining works from before the LLM era.
Notes: FT = finetune, VidL = video-language, MM = multimodal, INST = instruction.
Paper | Base Language Model | Framework | Data | Code | Publication | Preprint | Affiliation |
---|---|---|---|---|---|---|---|
LongVILA: Scaling Long-Context Visual Language Models for Long Videos | LLaMA3 | INST (LLM-extend) | Mixture | LongVILA | | 2408.10188 | NVIDIA
Long Context Transfer from Language to Vision | Qwen2 | INST (LLM-extend) | llava-next | LongVA | | 2406.16852 | NTU
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models | Vicuna | INST | Mixture (llava, videochat, ego4d, how2) | SALMONN | | 2406.15704 | Bytedance
VideoLLM-online: Online Large Language Model for Streaming Video | Llama2/3 | INST (+timestamp) | | videollm-online | | | NUS
Streaming Long Video Understanding with Large Language Models | Phi2, Vicuna | INST | Mixture (conceptual caption, howto100m, panda-700m, movieqa, msrvtt, star) | | | | Shanghai AI Lab
LLaVA-NeXT: A Strong Zero-shot Video Understanding Model | Vicuna, Yi | INST | LLaVA+ | LLaVA-NeXT | | 2404-blog | Bytedance
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | Vicuna, Yi | INST | VideoChat2 insts. | PLLaVA | | 2404.16994 | Bytedance
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | Vicuna | FT (memory retrieval) | - (task) | MA-LMM | | 2404.05726 | Meta
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | Llama, Mistral | PT+FT | mixture | MiniGPT4-video | | 2404.03413 | KAUST
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding | X+GPT4 | Video Agent | - | VideoAgent | | 2403.11481 | BIGAI
VideoAgent: Long-form Video Understanding with Large Language Model as Agent | LaViLa + GPT4 | Video Agent (Caption) | - | | | 2403.10517 | Stanford
Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding | - | - | | Video Mamba Suite | | 2403.09626 | Shanghai AI Lab
VideoMamba: State Space Model for Efficient Video Understanding | - | - | | VideoMamba | | 2403.06977 | Shanghai AI Lab
LLMs Meet Long Video: Advancing Long Video Comprehension with An Interactive Visual Adapter in LLMs | LLaMA | adapter | mixture | | | 2402.13546 | HIT
Video ReCap: Recursive Captioning of Hour-Long Videos | BLIP2, LaVila | Caption + dataset | mixture | VideoRecap | CVPR2024 | 2402.13250 | UNC
VideoPrism: A Foundational Visual Encoder for Video Understanding | (PaLM) | PT | mixture | | | 2402.13217 |
LVCHAT: Facilitating Long Video Comprehension | LLaMA | FT + position interleaved | VideoChat2 | LVChat | | 2402.12079 | UCSD
Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning | llama | MMINST+temporal prompt | self-construct (Moment-10M) | Momentor | | 2402.11435 | ZJU
World Model on Million-Length Video And Language With RingAttention | LLaMA2 | PT+FT | mixture | LWM | | 2402.08268 | UCB
Memory Consolidation Enables Long-Context Video Understanding | Bert | FT + memory (ViT) | | | | 2402.05861 | DeepMind
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | LLaMA2 | PT+FT | mixture | LaVIT | | 2402.03161 | PKU
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models | StableLM, Qwen, Phi2 | MoE | mixture (MM-INST) | MoE-LLaVA | | 2401.15947 | PKU
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models | X + GPT3.5 | Video Agent | - | DoraemonGPT | | 2401.08392 | ZJU
A Simple LLM Framework for Long-Range Video Question-Answering | Cap + GPT4 | Video Agent (Caption) | - | LLoVi | | 2312.17235 | UNC
Text-Conditioned Resampler For Long Form Video Understanding | BLIP2 | FT Resampler (blip2 on video) | - | | | 2312.11897 |
Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens | LLaVA | FT+Recur. Qformer | VideoChatGPT | | | 2312.08870 | ByteDance
A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames | PaLI, Bard | PT (vivit+adapter) | mixture | | | 2312.07395 |
LifelongMemory: Leveraging LLMs for Answering Queries in Egocentric Videos | LLaVA, GPT3.5 | Video Agent (Caption) | - | | | 2312.05269 | NYU
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | LLaMA2 | MM-INST | mixture (additional transcribed speech) | TimeChat | CVPR2024 | 2312.02051 | PKU
Zero-Shot Video Question Answering with Procedural Programs | GPT+X | Video Agent | - | | | 2312.00937 | CMU
VTimeLLM: Empower LLM to Grasp Video Moments | Vicuna | INST+temporal | mixture | VTimeLLM | CVPR2024 | 2311.18445 | THU
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | Vicuna | MM-INST | self-construct | LLaMA-VID | | 2311.17043 | CUHK
Vamos: Versatile Action Models for Video Understanding | GPT4, X | Video Agent (Caption) | - | Vamos | | 2311.13627 | Brown
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | Vicuna 1.5 | PT+FT | mixture | Video-LLaVA | | 2311.10122 | PKU
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | Vicuna | PT+FT | mixture | Chat-UniVi | | 2311.08046 | PKU
UniVTG: Towards Unified Video-Language Temporal Grounding | CLIP | PT | mixture | UniVTG | ICCV 2023 | 2307.16715 | NTU
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | Vicuna | FT | | MovieChat | CVPR2024 | 2307.16449 | Microsoft
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models | DeBerta | FT, RTr-Augmented | | | | 2306.11732 | CUHK
Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration | LLaMA | | | Macaw-LLM | | 2306.09093 | Tencent
Valley: Video assistant with large language model enhanced ability | Vicuna | PT, FT + MM-INST | mixture | Valley | | 2306.07207 | ByteDance
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Vicuna | | | Video-ChatGPT | | 2306.05424 | MBZUAI
Video-LLaMA: An Instruction-Finetuned Visual Language Model for Video Understanding | LLaMA | | | Video-LLaMA | | 2306.02858 | Alibaba
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | Bert | PT | mixture (audio, video, image) | VAST | Neurips2023 | 2305.18500 | CAS
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | Vicuna-13B | FT + MM-INST | | ChatBridge | | 2305.16103 | CAS
Self-Chained Image-Language Model for Video Localization and Question Answering | BLIP2 | 2-stage: localizer(LM) + answer | QVHighlights, FT VidL | SeViLA | Neurips2023 | 2305.06988 | UNC
VideoChat: Chat-Centric Video Understanding | Blip2 | | | VideoChat | | 2305.06355 | Shanghai AI Lab
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | ChatGPT | | | X-LLM | | 2305.04160 | CAS
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Bert | | | VALOR | | 2304.08345 | CAS
Verbs in Action: Improving verb understanding in video-language models | PaLM | | | | | 2304.06708 |
Video ChatCaptioner: Towards the Enriched Spatiotemporal Descriptions | ChatGPT, Flan-T5 (BLIP2) | | | ChatCaptioner | | 2304.04227 | KAUST
Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering | GPT2, GPT-Neo, GPT3 | | | | CVPR2023 workshop | 2304.03754 | Columbia Univ.
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | Bert | PT | mixture | Unmasked Teacher | ICCV 2023 | 2303.16058 | Shanghai AI Lab
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | T5 | | | Vid2Seq | | 2302.14115 | Google
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | Bert | | | | | 2212.14546 | Alibaba
VindLU: A Recipe for Effective Video-and-Language Pretraining | Bert | | | VindLU | | 2212.05051 | UNC
Learning Video Representations from Large Language Models | GPT2 | PT (data-augment) | Ego4D/HowTo100M | LaViLa | CVPR2023 | 2212.04501 | Meta
SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training | Bert | | | | | 2211.11446 | UW
CLOP: Video-and-Language Pre-Training with Knowledge Regularizations | Roberta | | | | MM 2022 | 2211.03314 | Baidu
Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning | Bert | | | | NIPS 2022 | 2210.06031 | Microsoft
OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | Bert | | | | NIPS 2022 | 2209.07526 | Microsoft
Clover: Towards A Unified Video-Language Alignment and Fusion Model | Bert | | | Clover | | 2207.07885 | Bytedance
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling | Bert-like | | | LAVENDER | CVPR 2023 | 2206.07160 | Microsoft
Revealing Single Frame Bias for Video-and-Language Learning | Bert | | | Singularity | | 2206.03428 | UNC
Label-Efficient Online Continual Object Detection in Streaming Video | - | (continual) | | Efficient-CLS | ICCV 2023 | 2206.00309 | NUS
Flamingo: a Visual Language Model for Few-Shot Learning | Chinchilla | | | Flamingo | NIPS 2022 | 2204.14198 | DeepMind
All in One: Exploring Unified Video-Language Pre-training | Bert-like | | | All-In-One | CVPR 2023 | 2203.07303 | NUS
End-to-end Generative Pretraining for Multimodal Video Captioning | Bert+GPT2 | | | | CVPR 2022 | 2201.08264 | Google
Align and Prompt: Video-and-Language Pre-training with Entity Prompts | Bert-like | | | ALPRO | CVPR 2022 | 2112.09583 | Salesforce
VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling, V2 | Bert | | | VIOLET | | 2111.12681 | Microsoft
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding | Bert | | | VideoCLIP | EMNLP 2021 | 2109.14084 | Facebook
MERLOT: Multimodal Neural Script Knowledge Models, V2 | Roberta | | | MERLOT | NIPS 2021 | 2106.02636 | AI2
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding | Bert | | | VLP | ACL Findings 2021 | 2105.09996 | Facebook
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | Bert-like | | | | NIPS 2021 | 2104.11178 | Google
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | Bert-like | | | CLIP4Clip | Neurocomputing 2022 | 2104.08860 | Microsoft
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | Bert | | | Frozen-in-Time | ICCV 2021 | 2104.00650 | Oxford
Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling | Bert | | | ClipBert | CVPR 2021 | 2102.06183 | Microsoft
ActBERT: Learning Global-Local Video-Text Representations | Bert | | | ActBert | CVPR 2020 | 2011.07231 | Baidu
Video Understanding as Machine Translation | T5 | | | | | 2006.07203 | Facebook
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training | Bert | | | HERO | EMNLP 2020 | 2005.00200 | Microsoft
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation | Bert | | | UniVL | | 2002.06353 | Microsoft
Learning Video Representations using Contrastive Bidirectional Transformer | Bert | | | | | 1906.05743 | Google
VideoBERT: A Joint Model for Video and Language Representation Learning | Bert | | | VideoBert (non-official) | ICCV 2019 | 1904.01766 | Google
Commonly Used Pretraining Tasks
- Masked Language Modeling (MLM)
- Causal Language Modeling (LM)
- Masked Vision Modeling (MVM)
  - Vision = Frame
  - Vision = Patch
  - Vision = Object
- Video-Language Matching (VLM)
- Video-Language Contrastive (VLC); a minimal sketch of this objective follows the list
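For concreteness, here is a minimal PyTorch sketch of the video-language contrastive (VLC) objective as a symmetric InfoNCE loss over paired video/text embeddings. The function name, tensor shapes, and temperature are illustrative assumptions, not the exact formulation of any paper above.

```python
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss; pairs at the same batch index are positives.

    video_emb, text_emb: (batch, dim) tensors from the video and text encoders.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature        # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)            # video -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)        # text -> video direction
    return (loss_v2t + loss_t2v) / 2
```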
Paper | Video Clips | Duration | Sentences | Domain | Download Link |
---|---|---|---|---|---|
(❗NOT AVAILABLE, 23 Feb 2024) Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | 2.5M (2M) | 18s | 2.5M | open (web) | WebVid-2M, WebVid-10M |
Howto100m: Learning a text-video embedding by watching hundred million narrated video clips | 136M (1.2M) | 4s | 136M | instruction (YouTube) | HowTo100M |
MERLOT: Multimodal neural script knowledge models | 180M (6M) | ~20m | ~720M | open (YouTube) | YT-Temporal-180M |
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | 100M (3.3M) | 13.4s | 100M | open (YouTube) | HD-VILA-100M |
Learning audio-video modalities from image captions | 10.3M (6.3M) | 10s | | open (web) | VideoCC |
CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos | 18M | 60s | | open (YouTube) | YTD-18M |
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | 10 M | 54.2s | 10 M | open (YOUKU) | Youku-mPLUG |
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | 234M (7.1M) | 11.7s | 234 M | open (YouTube) | InternVid |
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers | 70M | 8.5s | 70M | open | Panda-70M from HD-VILA-100M |
Dataset | Statistics | Source |
---|---|---|
Video-ChatGPT | 100k INST/10k videos | ActivityNet + (Human+GPT) Annotation |
Valley | 65k INST/100k videos | (VATEX + JukinMedia) + (Human+GPT) Annotation |
VideoChat | 11k INST/11k videos | WebVid + GPT Annotation |
TimeIT | 125k INST | Mixture + GPT Annotation |
- [NeurIPS 2023 D&B] VidChapters-7M, a large-scale dataset of user-chaptered videos; the authors study three tasks on top of this dataset and show that video chapter generation models trained on VidChapters-7M transfer well to dense video captioning.
Task | Paper | Download Link | Publication |
---|---|---|---|
Retrieval | Collecting Highly Parallel Data for Paraphrase Evaluation | MSVD | ACL 2011 |
Retrieval | A Dataset for Movie Description | LSMDC | CVPR 2015 |
Retrieval | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | MSR-VTT | CVPR 2016 |
Retrieval | Localizing Moments in Video with Natural Language | DiDeMo | ICCV 2017 |
Retrieval | Dense-Captioning Events in Videos | ActivityNet Caption | ICCV 2017 |
Retrieval | Towards Automatic Learning of Procedures from Web Instructional Videos | YouCook2 | AAAI 2018 |
OE QA | TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | TGIF-Frame | CVPR 2017 |
OE QA | A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering | LSMDC-FiB | CVPR 2017 |
OE QA | Video Question Answering via Gradually Refined Attention over Appearance and Motion | MSRVTT-QA,MSVD-QA | MM 2017 |
OE QA | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | ActivityNet-QA | AAAI 2019 |
MC QA | Learning Language-Visual Embedding for Movie Understanding with Natural-Language | LSMDC-MC | |
MC QA | TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | TGIF-Action, TGIF-Transition | CVPR 2017 |
MC QA | A Joint Sequence Fusion Model for Video Question Answering and Retrieval | MSRVTT-MC | ECCV 2018 |
Caption | Collecting Highly Parallel Data for Paraphrase Evaluation | MSVD | ACL 2011 |
Caption | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | MSR-VTT | CVPR 2016 |
Caption | VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research | VATEX | ICCV 2019 |
Dense Caption | Dense-Captioning Events in Videos | ActivityNet Caption | ICCV 2017 |
Dense Caption | Towards Automatic Learning of Procedures from Web Instructional Videos | YouCook2 | AAAI 2018 |
Dense Caption | Multimodal Pretraining for Dense Video Captioning | ViTT | AACL 2020 |
Action | HMDB: A large video database for human motion recognition | HMDB-51 | ICCV 2011 |
Action | UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild | UCF-101 | arXiv 2012 |
Action | ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding | ActivityNet-200 | CVPR 2015 |
Action | Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding | Charades-157 | ECCV 2016 |
Action | The Kinetics Human Action Video Dataset | Kinetics-400/600/700 |
Paper | Task | Duration | Domain | Link | Publication |
---|---|---|---|---|---|
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | Video QA | ~8m | movie | MovieChat | CVPR 2024 |
MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding | Video QA | ~8m | movie | MoVQA | |
EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Video QA | ~3m | open (ego) | EgoSchema | |
ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models | Temporal Grounding | ~60s (mix.) | open | ViLMA | ICLR 2024 |
From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering | Video QA | 9s | open | Causal-VidQA | CVPR 2022 |
VIOLIN: A Large-Scale Dataset for Video-and-Language Inference | Video Language Inference | 35.2s | movie | VIOLIN | CVPR 2020 |
TVQA: Localized, Compositional Video Question Answering | Video QA | 60-90s | movie | TVQA | EMNLP 2018 |
AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning | Video QA | 30s | open | AGQA | CVPR 2021 |
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions | Video QA | 44s | open | NExT-QA-MC, NExT-QA-OE | CVPR 2021 |
Towards Long-Form Video Understanding | Classification | 1-3m | movie | LVU | CVPR 2021 |
STAR: A Benchmark for Situated Reasoning in Real-World Videos | Video QA | 12s | open | Star | NIPS 2021 |
Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments | Video QA | 20s | virtual env. | Env-QA | ICCV 2021 |
COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis | Localization/Action Seg. | 3.36m | open (instruct) | COIN | CVPR 2019 |
Cross-task weakly supervised learning from instructional videos | Localization | 4m57s | open (instruct) | CrossTask | CVPR 2019 |
Social-IQ: A Question Answering Benchmark for Artificial Social Intelligence | Video QA | 60s | open | Social-IQ | CVPR 2019 |
Benchmark | Task | Data | Paper | Preprint | Publication | Affiliation |
---|---|---|---|---|---|---|
Video-Bench | MC (general domain) | mixture | | | | |
- Common Metrics on Video Quality: easily calculate FVD, PSNR, SSIM, and LPIPS for evaluating the quality of generated or predicted videos.
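As a concrete reference point, below is a minimal NumPy sketch of frame-level PSNR, the simplest of the metrics above; the function name and array shapes are illustrative. FVD, SSIM, and LPIPS involve learned features or local windows and are best computed with the linked toolkit.

```python
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two frames with pixel values in [0, max_val]."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

# Average per-frame PSNR of a generated clip vs. a reference clip, both assumed
# to be uint8 arrays of shape (num_frames, H, W, 3):
# clip_psnr = np.mean([psnr(p, t) for p, t in zip(pred_clip, ref_clip)])
```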
projects
(Video Agent)
VLog: transforms a video into a document using ChatGPT, CLIP, BLIP2, GRIT, Whisper, and LangChain.
tools
- VideoDB: lets developers 1) upload multiple videos to build a library or collection; 2) search across these videos and get real-time video responses or compilations; 3) publish a searchable collection on the ChatGPT store; 4) receive summarized text answers (RAG); 5) extract key insights from specific videos (e.g., "top points from episode 31").
- video2dataset: easily create large video datasets from video URLs; it can download and package 10M videos in 12 hours on a single 16-core machine.
- Match cutting: a match cut is a transition between a pair of shots that uses similar framing, composition, or action to fluidly bring the viewer from one scene to the next.
- Awesome-Video-Object-Segmentation: a curated list of video object segmentation (VOS) papers, datasets, and projects.
- pytube: a lightweight, dependency-free Python library (and command-line utility) for downloading YouTube videos (see the sketch after this list).
- movienet-tools: a movie toolbox providing basic tools and functions for movie-understanding research, so you can get started easily.
- PySceneDetect: a video scene cut detection and analysis tool (see the sketch after this list).
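A minimal sketch combining two of the tools above: pytube to download a video and PySceneDetect to split it into shots. The URL is a placeholder, the ContentDetector threshold is PySceneDetect's documented default, and exact APIs may differ across library versions, so treat this as illustrative rather than canonical usage.

```python
from pytube import YouTube
from scenedetect import ContentDetector, detect

URL = "https://www.youtube.com/watch?v=XXXXXXXXXXX"  # placeholder video URL

# Download the highest-resolution progressive stream to a local file.
YouTube(URL).streams.get_highest_resolution().download(filename="clip.mp4")

# ContentDetector flags a cut whenever frame-to-frame content change exceeds the threshold.
scenes = detect("clip.mp4", ContentDetector(threshold=27.0))
for start, end in scenes:
    print(f"shot {start.get_timecode()} -> {end.get_timecode()}")
```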
Survey
Reading List
Paper | Base Structure | Data | Code | Publication | Preprint | Affiliation |
---|---|---|---|---|---|---|
Video generation models as world simulators | Transformer | - | - | - | 2402.blog | OpenAI |
Vlogger: Make Your Dream A Vlog | Diffusion | | Vlogger | | 2401.09414 | Shanghai AI Lab
FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis | ControlNet | | FlowVid | | 2312.17681 | Meta
SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction | Diffusion | | SEINE | | 2310.20700 | Shanghai AI Lab
MotionDirector: Motion Customization of Text-to-Video Diffusion Models | Diffusion | | MotionDirector | | 2310.08465 | NUS
VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning | GPT4 + UNet | | VideoDirectorGPT | | 2309.15091 | UNC
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers | Transformer + VQVAE | self-construct | CogVideo | | 2205.15868 | THU
- T2VScore: Towards A Better Metric for Text-to-Video Generation
- Open-Sora: Democratizing Efficient Video Production for All
- Open Chat Video Editor: an open-source tool for automatic short-video generation