This section includes the csv files listing the data samples in Panda-70M and the code to download the videos.

[Note] Please use the video2dataset tool from this repository to download the dataset; the video2dataset from the official repository does not work with our csv format.
| Split | Download | # Source Videos | # Samples | Video Duration | Storage Space |
|---|---|---|---|---|---|
| Training (full) | link (2.73 GB) | 3,779,763 | 70,723,513 | 167 khrs | ~36 TB |
| Training (10M) | link (504 MB) | 3,755,240 | 10,473,922 | 37.0 khrs | ~8.0 TB |
| Training (2M) | link (118 MB) | 800,000 | 2,400,000 | 7.56 khrs | ~1.6 TB |
| Validation | link (1.2 MB) | 2,000 | 6,000 | 18.5 hrs | ~4.0 GB |
| Testing | link (1.2 MB) | 2,000 | 6,000 | 18.5 hrs | ~4.0 GB |
- The validation and testing sets are collected from 2,000 source videos that do not appear in any training set, to avoid information leakage. For each source video, we randomly sample 3 clips.
- The training set (10M) is the high-quality subset of the training set (full). In this subset, we sample at most 3 clips from each source video to increase diversity, and all video-caption matching scores are larger than 0.43 to guarantee better caption quality.
- The training set (2M) is randomly sampled from the training set (10M) and includes 3 clips for each source video.
- [Note 1] The training csv files are large, so they are compressed into zip files. Please `unzip` them to get the csv files.
- [Note 2] We will remove video samples from our dataset upon request. Please contact tsaishienchen at gmail dot com.
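Before launching a full download, it can help to inspect a csv file locally and check the matching-score filter. The sketch below assumes the column names used by the download command further down (`url`, `caption`, `timestamp`, `matching_score`); the inline sample rows and timestamp format are illustrative, not real dataset entries.

```python
import csv
import io

# Illustrative stand-in for a few rows of a Panda-70M csv file; the real
# files come from the download links in the table above.
csv_text = """url,timestamp,caption,matching_score
https://www.youtube.com/watch?v=AAAAAAAAAAA,"[['0:00:01.000', '0:00:05.000']]",a dog runs across a grassy field,0.51
https://www.youtube.com/watch?v=BBBBBBBBBBB,"[['0:00:10.000', '0:00:12.000']]",a blurry indoor scene,0.31
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# The 10M subset keeps only clips whose video-caption matching score
# exceeds 0.43; the same filter can be applied to the full csv.
high_quality = [r for r in rows if float(r["matching_score"]) > 0.43]
print(f"{len(high_quality)}/{len(rows)} clips pass the 0.43 threshold")
```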
```bash
git clone https://github.com/snap-research/Panda-70M.git
cd Panda-70M/dataset_dataloading/video2dataset
pip install -e .
cd ..
```
Download the csv files and change the `<csv_file>` and `<output_folder>` arguments to download the corresponding data.
```bash
video2dataset --url_list="<csv_file>" \
  --url_col="url" \
  --caption_col="caption" \
  --clip_col="timestamp" \
  --output_folder="<output_folder>" \
  --save_additional_columns="[matching_score,desirable_filtering,shot_boundary_detection]" \
  --config="video2dataset/video2dataset/configs/panda70m.yaml"
```
| Error Message | Solution |
|---|---|
| `pyarrow.lib.ArrowTypeError: Expected bytes, got` | Your ffmpeg and ffmpeg-python versions are out-of-date. Update them with pip or conda. Please refer to this issue for more details. |
| `HTTP Error 403: Forbidden` | Your IP got blocked. Use a proxy for downloading. Please refer to this issue for more details. |
| `HTTP Error 429: Too Many Requests` | Your download requests reached a rate limit. Slow down the download speed by reducing `processes_count` and `thread_count` in the config file. Please refer to this issue for more details. |
| `YouTube said: ERROR - Precondition check failed` | Your yt-dlp version is out-of-date; install a nightly version. Please refer to this issue for more details. |
| In the json file: `"status": "failed_to_download"` & `"error_message":` | The YouTube video has been set to private or removed. Please skip this sample. |
| `YouTube: Skipping player responses from android clients` | The latest version of yt-dlp solves this issue. Please refer to this issue for more details. |
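Since failures are recorded in each clip's json file, one way to audit a finished run is to tally the `status` field across the output folder. This is a sketch; status values other than `failed_to_download` (mentioned above) are assumptions about video2dataset's output and may differ in your files.

```python
import json
from collections import Counter
from pathlib import Path

def count_statuses(output_folder):
    """Tally the "status" field across all per-clip json files so that
    permanently failed samples (e.g. private or removed videos) can be
    identified and skipped rather than retried."""
    counts = Counter()
    for meta_path in Path(output_folder).glob("*/*.json"):
        with open(meta_path) as f:
            meta = json.load(f)
        counts[meta.get("status", "unknown")] += 1
    return counts
```

Running `count_statuses("<output_folder>")` after a download gives a quick breakdown of how many clips succeeded versus failed.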
The code will download and store the data in the following format:

```
output-folder
├── 00000                     {shardID}
│   ├── 0000000_00000.mp4     {shardID + videoID _ clipID}
│   ├── 0000000_00000.txt
│   ├── 0000000_00000.json
│   ├── 0000000_00001.mp4
│   ├── 0000000_00001.txt
│   ├── 0000000_00001.json
│   ├── ...
│   ├── 0000099_00004.mp4
│   ├── 0000099_00004.txt
│   └── 0000099_00004.json
├── 00001
│   ├── 0000100_00000.mp4
│   ├── 0000100_00000.txt
│   ├── 0000100_00000.json
│   ...
...
```
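Given this layout, pairing each clip's video, caption, and metadata files is a matter of walking the shard directories. A minimal sketch, assuming every complete sample keeps the `.mp4`/`.txt`/`.json` naming shown above:

```python
from pathlib import Path

def iter_samples(output_folder):
    """Yield (video, caption, metadata) path triples, one per clip,
    following the <shard>/<shardID+videoID>_<clipID>.{mp4,txt,json}
    layout produced by the downloader."""
    for mp4 in sorted(Path(output_folder).glob("*/*.mp4")):
        txt = mp4.with_suffix(".txt")
        meta = mp4.with_suffix(".json")
        if txt.exists() and meta.exists():  # skip incomplete samples
            yield mp4, txt, meta
```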
- Each sample comes with 3 files: `.mp4` (video), `.txt` (caption), and `.json` (meta information).
- Meta information includes:
  - Caption
  - Matching score: the confidence score of each video-caption pair
  - [🔥 New] Desirability filtering: whether a video is a suitable training sample for a video generation model. There are six categories of filtering results: `desirable`, `0_low_desirable_score`, `1_still_foreground_image`, `2_tiny_camera_movement`, `3_screen_in_screen`, `4_computer_screen_recording`. Check here for examples of each category.
  - [🔥 New] Shot boundary detection: a list of intervals representing continuous shots within a video (predicted by TransNetV2). If the length of the list is one, the video consists of a single continuous shot without any shot boundaries.
  - Other metadata: video title, description, categories, subtitles, and more.
- [Note 1] The dataset is unshuffled, and clips from the same long video are stored in the same shard. Please shuffle them manually if needed.
- [Note 2] The videos are resized to 360 px in height. You can change `download_size` in the config file to get different video resolutions.
- [Note 3] The videos are downloaded with audio by default. You can change `download_audio` in the config file to turn off the audio and increase download speed.
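As one way to put the metadata fields above to use, the sketch below filters a clip's parsed json for desirable, single-shot samples. The field names are assumed to mirror the csv columns saved by the download command (`desirable_filtering`, `shot_boundary_detection`); the exact json schema is an assumption, so adapt the keys to the files you actually download.

```python
def is_single_shot_desirable(meta):
    """Return True for clips the desirability filter labels 'desirable'
    and whose shot-boundary list contains exactly one interval, i.e. a
    single continuous shot. Key names are assumed to mirror the csv
    columns; adjust them to your downloaded json files."""
    return (
        meta.get("desirable_filtering") == "desirable"
        and len(meta.get("shot_boundary_detection", [])) == 1
    )
```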
The code for data downloading is built upon video2dataset. Thanks for sharing the great codebase!