🐼 Panda-70M: Dataset Dataloading

This section includes the csv files that list the data samples in Panda-70M and the code to download the videos.

[Note] Please use the video2dataset tool from this repository to download the dataset; the video2dataset from the official repository does not work with our csv format.

Data Splitting and Download Links

| Split | Download | # Source Videos | # Samples | Video Duration | Storage Space |
| --- | --- | --- | --- | --- | --- |
| Training (full) | link (2.73 GB) | 3,779,763 | 70,723,513 | 167 khrs | ~36 TB |
| Training (10M) | link (504 MB) | 3,755,240 | 10,473,922 | 37.0 khrs | ~8.0 TB |
| Training (2M) | link (118 MB) | 800,000 | 2,400,000 | 7.56 khrs | ~1.6 TB |
| Validation | link (1.2 MB) | 2,000 | 6,000 | 18.5 hrs | ~4.0 GB |
| Testing | link (1.2 MB) | 2,000 | 6,000 | 18.5 hrs | ~4.0 GB |
  • The validation and testing sets are collected from 2,000 source videos that do not appear in any training set, to avoid information leakage into testing. For each source video, we randomly sample 3 clips.
  • The training set (10M) is a high-quality subset of the training set (full). In this subset, we sample at most 3 clips from each source video to increase diversity, and all video-caption matching scores are larger than 0.43 to guarantee better caption quality.
  • The training set (2M) is randomly sampled from the training set (10M) and includes 3 clips for each source video.
  • [Note 1] The training csv files are large, so they are compressed into zip files. Please unzip them to get the csv files; a quick way to inspect them is sketched below.
  • [Note 2] We will remove video samples from our dataset upon request. Please contact tsaishienchen at gmail dot com.
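
If you want to peek at a csv before downloading, here is a minimal pandas sketch. The column names (url, caption, timestamp, plus the additional columns) follow the video2dataset command in the next section; the filename and the use of ast.literal_eval for list-valued columns are assumptions about how the files are named and serialized.

```python
# Minimal sketch: inspect a Panda-70M csv file.
# The filename below is hypothetical; use the csv you actually unzipped.
import ast
import pandas as pd

df = pd.read_csv("panda70m_training_2m.csv")
print(df.columns.tolist())

# Each row is one source video; "timestamp" and "caption" hold one entry per
# clip. ast.literal_eval is a guess at how these list columns are serialized;
# check one cell by hand to confirm.
row = df.iloc[0]
clips = ast.literal_eval(row["timestamp"])
captions = ast.literal_eval(row["caption"])
print(len(clips), "clips in the first source video")
```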

Download Dataset

Set up the Repository and Environment

```
git clone https://github.com/snap-research/Panda-70M.git
cd Panda-70M/dataset_dataloading/video2dataset
pip install -e .
cd ..
```

Download Dataset

Download the csv files and change the <csv_file> and <output_folder> arguments to download the corresponding data.

```
video2dataset --url_list="<csv_file>" \
              --url_col="url" \
              --caption_col="caption" \
              --clip_col="timestamp" \
              --output_folder="<output_folder>" \
              --save_additional_columns="[matching_score,desirable_filtering,shot_boundary_detection]" \
              --config="video2dataset/video2dataset/configs/panda70m.yaml"
```
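
For example, to download the 2M training split into ./panda70m_2m (the csv filename here is hypothetical; use the name of the file you actually unzipped):

```
video2dataset --url_list="panda70m_training_2m.csv" \
              --url_col="url" \
              --caption_col="caption" \
              --clip_col="timestamp" \
              --output_folder="./panda70m_2m" \
              --save_additional_columns="[matching_score,desirable_filtering,shot_boundary_detection]" \
              --config="video2dataset/video2dataset/configs/panda70m.yaml"
```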

Known Issues

| Error Message | Solution |
| --- | --- |
| pyarrow.lib.ArrowTypeError: Expected bytes, got a 'list' object | Your ffmpeg and ffmpeg-python versions are out of date. Update them via pip or conda. Please refer to this issue for more details. |
| HTTP Error 403: Forbidden | Your IP got blocked. Use a proxy for downloading. Please refer to this issue for more details. |
| HTTP Error 429: Too Many Requests | Your download requests have hit a rate limit. Slow down the download speed by reducing processes_count and thread_count in the config file (see the config sketch below). Please refer to this issue for more details. |
| YouTube said: ERROR - Precondition check failed | Your yt-dlp version is out of date; install a nightly version. Please refer to this issue for more details. |
| In the json file: "status": "failed_to_download" & "error_message": "[Errno 2] No such file or directory: '/tmp/...'" | The YouTube video has been set to private or removed. Please skip this sample. |
| YouTube: Skipping player responses from android clients (got player responses for video ... instead of ...) | The latest version of yt-dlp solves this issue. Please refer to this issue for more details. |
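
Several of the fixes above involve editing the config file. As a reference, here is a hypothetical excerpt of video2dataset/video2dataset/configs/panda70m.yaml showing the keys mentioned in this README (download_size, download_audio, processes_count, thread_count); the actual file's layout and default values may differ.

```yaml
# Hypothetical excerpt of configs/panda70m.yaml; layout and defaults may
# differ from the file shipped in this repository.
reading:
  yt_args:
    download_size: 360     # target video height in px (see Dataset Format notes)
    download_audio: True   # set to False to skip audio and speed up downloads
distribution:
  processes_count: 16      # lower these two values if you hit HTTP Error 429
  thread_count: 16
```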

Dataset Format

The code will download and store the data in the following format:

```
output-folder
 ├── 00000                    {shardID}
 │    ├── 0000000_00000.mp4   {shardID + videoID _ clipID}
 │    ├── 0000000_00000.txt
 │    ├── 0000000_00000.json
 │    ├── 0000000_00001.mp4
 │    ├── 0000000_00001.txt
 │    ├── 0000000_00001.json
 │    ├── ...
 │    ├── 0000099_00004.mp4
 │    ├── 0000099_00004.txt
 │    └── 0000099_00004.json
 ├── 00001
 │    ├── 0000100_00000.mp4
 │    ├── 0000100_00000.txt
 │    ├── 0000100_00000.json
 │    └── ...
 └── ...
```
  • Each sample comes with 3 files: .mp4 (video), .txt (caption), and .json (meta information).
  • Meta information includes the following (a reading sketch follows this list):
    • Caption
    • Matching score: the confidence score of each video-caption pair
    • [🔥 New] Desirability filtering: whether a video is a suitable training sample for a video generation model. There are six categories of filtering results: desirable, 0_low_desirable_score, 1_still_foreground_image, 2_tiny_camera_movement, 3_screen_in_screen, 4_computer_screen_recording. Check here for examples of each category.
    • [🔥 New] Shot boundary detection: a list of intervals representing the continuous shots within a video (predicted by TransNetV2). If the list has length one, the video consists of a single continuous shot without any shot boundaries.
    • Other metadata: video title, description, categories, subtitles, and so on.
  • [Note 1] The dataset is unshuffled, so clips from the same long video are stored in the same shard. Please shuffle them manually if needed.
  • [Note 2] The videos are resized to 360 px in height. You can change download_size in the config file to get different video resolutions.
  • [Note 3] The videos are downloaded with audio by default. You can change download_audio in the config file to turn off audio and increase download speed.
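
The json files make it easy to filter clips after downloading. Below is a minimal sketch that walks the output shards and keeps only successfully downloaded, desirable, single-shot clips with a high matching score. The json key names (status, matching_score, desirable_filtering, shot_boundary_detection) are assumed to mirror the saved columns above; inspect one json file to confirm them.

```python
# Minimal sketch: filter downloaded clips by their json metadata.
# Assumes the directory layout shown above; key names are assumptions that
# mirror the saved columns -- check one json file to confirm.
import json
from pathlib import Path

output_folder = Path("<output_folder>")  # same path passed to video2dataset

kept = []
for meta_path in sorted(output_folder.glob("*/*.json")):
    meta = json.loads(meta_path.read_text())
    if meta.get("status") != "success":
        continue  # e.g. "failed_to_download" for private/removed videos
    if meta.get("desirable_filtering") != "desirable":
        continue  # drop the five undesirable categories
    if float(meta.get("matching_score", 0)) < 0.43:
        continue  # same threshold used for the 10M high-quality subset
    if len(meta.get("shot_boundary_detection", [])) != 1:
        continue  # keep only single-shot clips
    kept.append(meta_path.with_suffix(".mp4"))

print(f"kept {len(kept)} clips")
```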

Acknowledgements

The code for data downloading is built upon video2dataset. Thanks for sharing the great codebase!