StreamingBench evaluates Multimodal Large Language Models (MLLMs) on real-time streaming video understanding tasks. 🌟
As MLLMs continue to advance, they remain largely focused on offline video comprehension, where all frames are pre-loaded before queries are made. This is far from the human ability to process and respond to video streams in real time, capturing the dynamic nature of multimedia content. To bridge this gap, StreamingBench introduces the first comprehensive benchmark for streaming video understanding in MLLMs.
- 🎯 Real-time Visual Understanding: Can the model process and respond to visual changes in real-time?
- 🔊 Omni-source Understanding: Does the model integrate visual and audio inputs synchronously in real-time video streams?
- 🎬 Contextual Understanding: Can the model comprehend the broader context within video streams?
- 📊 900 diverse videos
- 📝 4,500 human-annotated QA pairs
- ⏱️ Five questions per video at different timestamps
- Python 3.x
- moviepy
1. Download Dataset: Retrieve all necessary files from the StreamingBench Dataset.

2. Decompress Files: Extract the downloaded files and organize them in the `./data` directory as follows:

   ```
   StreamingBench/
   ├── data/
   │   ├── real/       # Unzip Real Time Visual Understanding_*.zip into this folder
   │   ├── omni/       # Unzip other .zip files into this folder
   │   ├── sqa/        # Unzip Sequential Question Answering_*.zip into this folder
   │   └── proactive/  # Unzip Proactive Output_*.zip into this folder
   ```

3. Preprocess Data: Run the following command to preprocess the data:

   ```shell
   cd ./scripts
   bash preprocess.sh
   ```
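The decompression step above can also be scripted with Python's standard library. This is a minimal sketch, assuming the downloaded archives sit in a `downloads/` folder (that folder name is an assumption; the glob patterns follow the archive names shown in the directory layout):

```python
import zipfile
from pathlib import Path

# Map archive-name patterns to their target subfolders under ./data.
# Order matters: the catch-all "*.zip" pattern must come last, so that
# archives already matched by a specific pattern are skipped.
TARGETS = {
    "Real Time Visual Understanding_*.zip": "data/real",
    "Sequential Question Answering_*.zip": "data/sqa",
    "Proactive Output_*.zip": "data/proactive",
    "*.zip": "data/omni",  # everything else goes to omni/
}

def organize(download_dir: str = "downloads") -> None:
    """Extract each downloaded archive into its target data/ subfolder."""
    seen = set()
    for pattern, target in TARGETS.items():
        out = Path(target)
        out.mkdir(parents=True, exist_ok=True)
        for archive in sorted(Path(download_dir).glob(pattern)):
            if archive in seen:
                continue  # already handled by a more specific pattern
            seen.add(archive)
            with zipfile.ZipFile(archive) as zf:
                zf.extractall(out)

# organize()  # run after downloading the archives
```

Since Python 3.7, dicts preserve insertion order, which is what lets the specific patterns take priority over the trailing catch-all.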
Prepare your own model for evaluation by following the instructions provided here. This guide will help you set up and configure your model to ensure it is ready for testing against the dataset.
Now you can run the benchmark:

```shell
bash eval.sh
```

This will run the benchmark and save the results to the specified output file. Then you can calculate the metrics using the following command:

```shell
bash stats.sh
```
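For reference, the accuracy reported at this stage boils down to simple counting over the graded QA pairs. Here is a minimal sketch assuming a hypothetical results list; the field names (`task`, `answer`, `ground_truth`) are illustrative, not the benchmark's actual schema:

```python
from collections import defaultdict

def accuracy_by_task(results):
    """Compute overall and per-task accuracy from graded QA results.

    `results` is a list of dicts with illustrative fields:
    {"task": str, "answer": str, "ground_truth": str}
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        total[r["task"]] += 1
        if r["answer"] == r["ground_truth"]:
            correct[r["task"]] += 1
    per_task = {t: correct[t] / total[t] for t in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_task
```

Per-task breakdowns matter here because the three task categories (real-time, omni-source, contextual) stress very different capabilities, and a single aggregate score can hide large gaps between them.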
Results are reported under two context settings:

- All Context: the model receives the entire video stream preceding the query time
- 60s Context: the model receives only the 60 seconds of context preceding the query time

Comparison of Main Experiment vs. 60 Seconds of Video Context:
"≤ xs" means that the answer is considered correct if the actual output time is within x seconds of the ground truth.
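That correctness criterion is a simple tolerance check on the predicted timestamp, which can be written as:

```python
def within_tolerance(output_time: float, gt_time: float, x: float) -> bool:
    """Return True if the actual output time falls within x seconds of
    the ground-truth time, i.e. the "<= x s" criterion."""
    return abs(output_time - gt_time) <= x
```

Note that the check is symmetric: an answer produced x seconds early counts the same as one produced x seconds late.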
@article{lin2024streaming,
title={StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding},
author={Junming Lin and Zheng Fang and Chi Chen and Zihao Wan and Fuwen Luo and Peng Li and Yang Liu and Maosong Sun},
journal={arXiv preprint arXiv:2411.03628},
year={2024}
}