
Video

Table of Contents

  • Reading List
  • Pretraining Tasks
  • Datasets
  • Benchmarks
  • Metrics
  • Projects & Tools
  • Video Generation

Reading List

This reading list also collects video-language pretraining works from before the LLM era.

Notes: FT = finetune, PT = pretrain, VidL = video-language, MM = multimodal, INST = instruction.

Paper Base Language Model Framework Data Code Publication Preprint Affiliation
LongVILA: Scaling Long-Context Visual Language Models for Long Videos LLaMA3 INST (LLM-extend) Mixture LongVILA 2408.10188 NVIDIA
Long Context Transfer from Language to Vision Qwen2 INST (LLM-extend) llava-next LongVA 2406.16852 NTU
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models Vicuna INST Mixture (llava, videochat, ego4d, how2) SALMONN 2406.15704 Bytedance
VideoLLM-online: Online Large Language Model for Streaming Video Llama2/3 INST (+timestamp) videollm-online NUS
Streaming Long Video Understanding with Large Language Models Phi2, Vicuna INST Mixture (conceptual captions, howto100m, panda-70m, movieqa, msrvtt, star) Shanghai AI Lab
LLaVA-NeXT: A Strong Zero-shot Video Understanding Model Vicuna, Yi INST LLaVA+ LLaVA-NeXT 2404-blog Bytedance
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning Vicuna, Yi INST VideoChat2 insts. PLLaVA 2404.16994 Bytedance
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding Vicuna FT (memory retrieval) - (task) MA-LMM 2404.05726 Meta
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens Llama, Mistral PT+FT mixture MiniGPT4-video 2404.03413 KAUST
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding X+GPT4 Video Agent - VideoAgent 2403.11481 BIGAI
VideoAgent: Long-form Video Understanding with Large Language Model as Agent LaViLa + GPT4 Video Agent (Caption) - 2403.10517 Stanford
Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding - - Video Mamba Suite 2403.09626 Shanghai AI Lab
VideoMamba: State Space Model for Efficient Video Understanding - - VideoMamba 2403.06977 Shanghai AI Lab
LLMs Meet Long Video: Advancing Long Video Comprehension with An Interactive Visual Adapter in LLMs LLaMA adapter mixture 2402.13546 HIT
Video ReCap: Recursive Captioning of Hour-Long Videos BLIP2, LaVila Caption + dataset mixture VideoRecap CVPR2024 2402.13250 UNC
VideoPrism: A Foundational Visual Encoder for Video Understanding (PaLM) PT mixture 2402.13217 Google
LVCHAT: Facilitating Long Video Comprehension LLaMA FT + position interleaved VideoChat2 LVChat 2402.12079 UCSD
Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning LLaMA MM-INST + temporal prompt self-construct (Moment-10M) decode Momentor 2402.11435 ZJU
World Model on Million-Length Video And Language With RingAttention LLaMA2 PT+FT mixture LWM 2402.08268 UCB
Memory Consolidation Enables Long-Context Video Understanding Bert FT + memory(ViT) 2402.05861 DeepMind
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization LLaMA2 PT+FT mixture LaVIT 2402.03161 PKU
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models StableLM, Qwen, Phi2 MoE mixture (MM-INST) MoE-LLaVA 2401.15947 PKU
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models X + GPT3.5 Video Agent - DoraemonGPT 2401.08392 ZJU
A Simple LLM Framework for Long-Range Video Question-Answering Cap + GPT4 Video Agent (Caption) - LLoVi 2312.17235 UNC
Text-Conditioned Resampler For Long Form Video Understanding BLIP2 FT Resampler (blip2 on video) - 2312.11897 Google
Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens LLaVA FT+Recur. Qformer VideoChatGPT 2312.08870 ByteDance
A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames PaLI, Bard PT (vivit+adapter) mixture 2312.07395 Google
LifelongMemory: Leveraging LLMs for Answering Queries in Egocentric Videos LLaVA, GPT3.5 Video Agent (Caption) - 2312.05269 NYU
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding LLaMA2 MM-INST mixture (additional transcribed speech) TimeChat CVPR2024 2312.02051 PKU
Zero-Shot Video Question Answering with Procedural Programs GPT+X Video Agent - 2312.00937 CMU
VTimeLLM: Empower LLM to Grasp Video Moments Vicuna INST+temporal mixture VTimeLLM CVPR2024 2311.18445 THU
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models Vicuna MM-INST self-construct LLaMA-VID 2311.17043 CUHK
Vamos: Versatile Action Models for Video Understanding GPT4, X Video Agent (Caption) - Vamos 2311.13627 Brown
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection Vicuna 1.5 PT+FT mixture Video-LLaVA 2311.10122 PKU
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding Vicuna PT+FT mixture Chat-UniVi 2311.08046 PKU
UniVTG: Towards Unified Video-Language Temporal Grounding CLIP PT mixture UniVTG ICCV 2023 2307.16715 NTU
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding Vicuna FT MovieChat CVPR2024 2307.16449 Microsoft
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models DeBerta FT, RTr-Augmented 2306.11732 CUHK
Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration LLaMA Macaw-LLM 2306.09093 Tencent
Valley: Video assistant with large language model enhanced ability Vicuna PT, FT + MM-INST mixture Valley 2306.07207 ByteDance
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models Vicuna Video-ChatGPT 2306.05424 MBZUAI
Video-LLaMA: An Instruction-Finetuned Visual Language Model for Video Understanding LLaMA Video-LLaMA 2306.02858 Alibaba
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset Bert PT mixture (audio,video,image) VAST Neurips2023 2305.18500 CAS
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst Vicuna-13B FT + MM-INST ChatBridge 2305.16103 CAS
Self-Chained Image-Language Model for Video Localization and Question Answering BLIP2 2-stage: localizer(LM) + answer QVHighlights, FT VidL SeViLA Neurips2023 2305.06988 UNC
VideoChat: Chat-Centric Video Understanding Blip2 VideoChat 2305.06355 Shanghai AI Lab
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages ChatGPT X-LLM 2305.04160 CAS
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset Bert VALOR 2304.08345 CAS
Verbs in Action: Improving verb understanding in video-language models PaLM 2304.06708 Google
Video ChatCaptioner: Towards the Enriched Spatiotemporal Descriptions ChatGPT, Flan-T5 (BLIP2) ChatCaptioner 2304.04227 KAUST
Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering GPT2, GPT-Neo, GPT3 CVPR2023 workshop 2304.03754 Columbia Univ.
Unmasked Teacher: Towards Training-Efficient Video Foundation Models Bert PT mixture Unmasked Teacher ICCV 2023 2303.16058 Shanghai AI Lab
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning T5 Vid2Seq 2302.14115 Google
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training Bert 2212.14546 Alibaba
VindLU: A Recipe for Effective Video-and-Language Pretraining Bert VindLU 2212.05051 UNC
Learning Video Representations from Large Language Models GPT2 PT (data-augment) Ego4D/HowTo100M LaViLa CVPR2023 2212.04501 Meta
SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training Bert 2211.11446 UW
CLOP: Video-and-Language Pre-Training with Knowledge Regularizations Roberta MM 2022 2211.03314 Baidu
Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning Bert NIPS 2022 2210.06031 Microsoft
OmniVL: One Foundation Model for Image-Language and Video-Language Tasks Bert NIPS 2022 2209.07526 Microsoft
Clover: Towards A Unified Video-Language Alignment and Fusion Model Bert Clover 2207.07885 Bytedance
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling Bert-like LAVENDER CVPR 2023 2206.07160 Microsoft
Revealing Single Frame Bias for Video-and-Language Learning Bert Singularity 2206.03428 UNC
Label-Efficient Online Continual Object Detection in Streaming Video - (continual) Efficient-CLS ICCV 2023 2206.00309 NUS
Flamingo: a Visual Language Model for Few-Shot Learning Chinchilla Flamingo NIPS 2022 2204.14198 DeepMind
All in One: Exploring Unified Video-Language Pre-training Bert-like All-In-One CVPR 2023 2203.07303 NUS
End-to-end Generative Pretraining for Multimodal Video Captioning Bert+GPT2 CVPR 2022 2201.08264 Google
Align and Prompt: Video-and-Language Pre-training with Entity Prompts Bert-like ALPRO CVPR 2022 2112.09583 Salesforce
VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling, V2 Bert VIOLET 2111.12681 Microsoft
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding Bert VideoCLIP EMNLP 2021 2109.14084 Facebook
MERLOT: Multimodal Neural Script Knowledge Models, V2 Roberta MERLOT NIPS 2021 2106.02636 AI2
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding Bert VLP ACL Findings 2021 2105.09996 Facebook
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text Bert-like NIPS 2021 2104.11178 Google
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval Bert-like CLIP4Clip Neurocomputing 2022 2104.08860 Microsoft
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval Bert Frozen-in-Time ICCV 2021 2104.00650 Oxford
Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling Bert ClipBert CVPR 2021 2102.06183 Microsoft
ActBERT: Learning Global-Local Video-Text Representations Bert ActBert CVPR 2020 2011.07231 Baidu
Video Understanding as Machine Translation T5 2006.07203 Facebook
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training Bert HERO EMNLP 2020 2005.00200 Microsoft
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation Bert UniVL 2002.06353 Microsoft
Learning Video Representations using Contrastive Bidirectional Transformer Bert 1906.05743 Google
VideoBERT: A Joint Model for Video and Language Representation Learning Bert VideoBert (non-official) ICCV 2019 1904.01766 Google

Pretraining Tasks

Commonly Used Pretraining Tasks

  • Masked Language Modeling (MLM)
  • Causal Language Modeling (LM)
  • Masked Vision Modeling (MVM)
    • Vision = Frame
    • Vision = Patch
    • Vision = Object
  • Video Language Matching (VLM)
  • Video Language Contrastive (VLC); a minimal sketch of this objective follows the list
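
As a concrete example of the contrastive objective (VLC), below is a minimal sketch of the symmetric InfoNCE loss used in CLIP-style video-language pretraining (as in VideoCLIP or Frozen in Time). The embedding shapes and the `temperature` value are illustrative assumptions, not taken from any specific paper.

```python
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired clip/text embeddings.

    video_emb: (B, D) pooled video-clip embeddings
    text_emb:  (B, D) pooled sentence embeddings
    Pairs sharing a batch index are positives; all other pairs are negatives.
    """
    # L2-normalize so dot products are cosine similarities
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (B, B) similarity matrix; the diagonal holds the positive pairs
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the video-to-text and text-to-video cross-entropy terms
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```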

Datasets

Pretraining Corpora

| Paper | Clips (Videos) | Duration | Sentences | Domain | Download Link |
| --- | --- | --- | --- | --- | --- |
| Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | 2.5M (2M) | 18s | 2.5M | open (web) | WebVid-2M, WebVid-10M (❗not available as of 23 Feb 2024) |
| HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | 136M (1.2M) | 4s | 136M | instruction (YouTube) | HowTo100M |
| MERLOT: Multimodal Neural Script Knowledge Models | 180M (6M) | ~20m | ~720M | open (YouTube) | YT-Temporal-180M |
| Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | 100M (3.3M) | 13.4s | 100M | open (YouTube) | HD-VILA-100M |
| Learning Audio-Video Modalities from Image Captions | 10.3M (6.3M) | 10s | | open (web) | VideoCC |
| CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos | 18M | 60s | | open (YouTube) | YTD-18M |
| Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | 10M | 54.2s | 10M | open (Youku) | Youku-mPLUG |
| InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | 234M (7.1M) | 11.7s | 234M | open (YouTube) | InternVid |
| Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers | 70M | 8.5s | 70M | open (from HD-VILA-100M) | Panda-70M |

Video Instructions

| Dataset | Statistics | Source |
| --- | --- | --- |
| Video-ChatGPT | 100k INST / 10k videos | ActivityNet + (human + GPT) annotation |
| Valley | 65k INST / 100k videos | (VATEX + JukinMedia) + (human + GPT) annotation |
| VideoChat | 11k INST / 11k videos | WebVid + GPT annotation |
| TimeIT | 125k INST | mixture + GPT annotation |
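
The datasets above pair videos with GPT-generated (and sometimes human-refined) instruction-response pairs. The exact schema differs per dataset; the sketch below shows a hypothetical Video-ChatGPT-style sample, with all field names illustrative rather than taken from any specific release.

```python
# Hypothetical video instruction-tuning sample; field names are illustrative,
# since each dataset defines its own schema.
sample = {
    "video_id": "v_QOlSCBRmfWY",  # ActivityNet-style video identifier
    "question": "What activity is the person performing, and how does it progress?",
    "answer": (
        "The person kneads dough on a counter, folding and pressing it "
        "repeatedly before shaping it into a ball."
    ),
}
```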

Others

  • [NeurIPS 2023 D&B] VidChapters, a large-scale dataset of user-chaptered videos; the paper studies three tasks on top of it and shows that video chapter generation models trained on VidChapters-7M transfer well to dense video captioning.

Benchmarks

Common Downstream Tasks

Note: OE QA = open-ended question answering; MC QA = multiple-choice question answering.

| Task | Paper | Download Link | Publication |
| --- | --- | --- | --- |
| Retrieval | Collecting Highly Parallel Data for Paraphrase Evaluation | MSVD | ACL 2011 |
| Retrieval | A Dataset for Movie Description | LSMDC | CVPR 2015 |
| Retrieval | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | MSR-VTT | CVPR 2016 |
| Retrieval | Localizing Moments in Video with Natural Language | DiDeMo | ICCV 2017 |
| Retrieval | Dense-Captioning Events in Videos | ActivityNet Caption | ICCV 2017 |
| Retrieval | Towards Automatic Learning of Procedures from Web Instructional Videos | YouCook2 | AAAI 2018 |
| OE QA | TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | TGIF-Frame | CVPR 2017 |
| OE QA | A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering | LSMDC-FiB | CVPR 2017 |
| OE QA | Video Question Answering via Gradually Refined Attention over Appearance and Motion | MSRVTT-QA, MSVD-QA | MM 2017 |
| OE QA | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | ActivityNet-QA | AAAI 2019 |
| MC QA | Learning Language-Visual Embedding for Movie Understanding with Natural-Language | LSMDC-MC | |
| MC QA | TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | TGIF-Action, TGIF-Transition | CVPR 2017 |
| MC QA | A Joint Sequence Fusion Model for Video Question Answering and Retrieval | MSRVTT-MC | ECCV 2018 |
| Caption | Collecting Highly Parallel Data for Paraphrase Evaluation | MSVD | ACL 2011 |
| Caption | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | MSR-VTT | CVPR 2016 |
| Caption | VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research | VATEX | ICCV 2019 |
| Dense Caption | Dense-Captioning Events in Videos | ActivityNet Caption | ICCV 2017 |
| Dense Caption | Towards Automatic Learning of Procedures from Web Instructional Videos | YouCook2 | AAAI 2018 |
| Dense Caption | Multimodal Pretraining for Dense Video Captioning | ViTT | AACL 2020 |
| Action | HMDB: A large video database for human motion recognition | HMDB-51 | ICCV 2011 |
| Action | UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild | UCF-101 | arXiv 2012 |
| Action | ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding | ActivityNet-200 | CVPR 2015 |
| Action | Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding | Charades-157 | ECCV 2016 |
| Action | The Kinetics Human Action Video Dataset | Kinetics-400/600/700 | |

Advanced Downstream Tasks

Task-Specific Benchmarks

| Paper | Task | Duration | Domain | Link | Publication |
| --- | --- | --- | --- | --- | --- |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | Video QA | ~8m | movie | MovieChat | CVPR 2024 |
| MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding | Video QA | ~8m | movie | MoVQA | |
| EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Video QA | ~3m | open (ego) | EgoSchema | |
| ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models | Temporal Grounding | ~60s (mix.) | open | ViLMA | ICLR 2024 |
| From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering | Video QA | 9s | open | Causal-VidQA | CVPR 2022 |
| VIOLIN: A Large-Scale Dataset for Video-and-Language Inference | Video-Language Inference | 35.2s | movie | VIOLIN | CVPR 2020 |
| TVQA: Localized, Compositional Video Question Answering | Video QA | 60-90s | movie | TVQA | EMNLP 2018 |
| AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning | Video QA | 30s | open | AGQA | CVPR 2021 |
| NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions | Video QA | 44s | open | NExT-QA-MC, NExT-QA-OE | CVPR 2021 |
| Towards Long-Form Video Understanding | Classification | 1-3m | movie | LVU | CVPR 2021 |
| STAR: A Benchmark for Situated Reasoning in Real-World Videos | Video QA | 12s | open | STAR | NIPS 2021 |
| Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments | Video QA | 20s | virtual env. | Env-QA | ICCV 2021 |
| COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis | Localization / Action Seg. | 3.36m | open (instruct) | COIN | CVPR 2019 |
| Cross-task weakly supervised learning from instructional videos | Localization | 4m57s | open (instruct) | CrossTask | CVPR 2019 |
| Social-IQ: A Question Answering Benchmark for Artificial Social Intelligence | Video QA | 60s | open | Social-IQ | CVPR 2019 |

Multifaceted Benchmarks

| Benchmark | Task | Data | Paper | Preprint | Publication | Affiliation |
| --- | --- | --- | --- | --- | --- | --- |
| Video-Bench | MC (general domain) | mixture | | | | |

Metrics

Projects & Tools

Projects

  • (Video Agent) VLog: transform a video into a document with ChatGPT, CLIP, BLIP2, GRIT, Whisper, and LangChain.

Tools

  • VideoDB, which enables developers to: 1) upload multiple videos to create a library or collection; 2) search across these videos and get real-time video responses or compilations; 3) publish a searchable collection on the ChatGPT store; 4) receive summarized text answers (RAG); 5) gain key insights from specific videos (e.g., "Top points from episode 31").
  • video2dataset, for easily creating large video datasets from video URLs; it can download and package 10M videos in 12 hours on a single 16-core machine.
  • Match cutting: a match cut is a transition between a pair of shots that uses similar framing, composition, or action to fluidly bring the viewer from one scene to the next.
  • Awesome-Video-Object-Segmentation, a curated list of video object segmentation (VOS) papers, datasets, and projects.
  • pytube, a lightweight, dependency-free Python library (and command-line utility) for downloading YouTube videos.
  • movienet-tools, a movie toolbox providing basic tools and functions for research on movie understanding.
  • PySceneDetect, a video scene cut detection and analysis tool (a minimal sketch combining it with pytube follows this list).
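
To show how these tools compose, here is a minimal sketch that downloads a video with pytube and splits it into scenes with PySceneDetect. The URL and filenames are placeholders, and the snippet assumes `pytube` and `scenedetect` (0.6+) are installed from PyPI; pytube's downloader also tends to lag behind YouTube changes, so treat this as illustrative rather than guaranteed to run against today's site.

```python
from pytube import YouTube
from scenedetect import detect, ContentDetector

# 1) Download a progressive MP4 stream with pytube (URL is a placeholder)
yt = YouTube("https://www.youtube.com/watch?v=<VIDEO_ID>")
stream = yt.streams.filter(progressive=True, file_extension="mp4").get_highest_resolution()
stream.download(filename="video.mp4")

# 2) Find scene cuts with PySceneDetect's content-aware detector
for start, end in detect("video.mp4", ContentDetector()):
    print(f"Scene from {start.get_timecode()} to {end.get_timecode()}")
```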

Video Generation

Reading List

Survey

  • (2023-10) A Survey on Video Diffusion Models paper repo

Reading List

Paper Base Structure Data Code Publication Preprint Affiliation
Video generation models as world simulators Transformer - - - 2402.blog OpenAI
Vlogger: Make Your Dream A Vlog Diffusion Vlogger 2401.09414 Shanghai AI Lab
FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis ControlNet FlowVid 2312.17681 Meta
SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction Diffusion SEINE 2310.20700 Shanghai AI Lab
MotionDirector: Motion Customization of Text-to-Video Diffusion Models Diffusion MotionDirector 2310.08465 NUS
VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning GPT4 +UNet VideoDirectorGPT 2309.15091 UNC
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers Transformer +VQVAE self-construct CogVideo 2205.15868 THU

Metrics

  • T2VScore, T2VScore: Towards A Better Metric for Text-to-Video Generation

Projects