YouTube-8M is nice, but it comes with a lot of extra stuff that you might not want. If you just want the video URLs and the labels, then you're in luck.
Video IDs and labels can be downloaded from:
- parsed_dataset_renamed_train.json
- parsed_dataset_renamed_train.pkl
- parsed_dataset_renamed_val.json
- parsed_dataset_renamed_val.pkl
Alternatively, run the download script:
python download_dataset.py
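Once one of the files listed above is on disk, loading it might look like the sketch below. The function names `load_split` and `video_url` are hypothetical and not part of this repo, and the schema of the files is not described here, so inspect the loaded object before relying on any particular layout. `video_url` also assumes the stored IDs are regular YouTube video IDs that fit the standard watch URL.

```python
import json
import pickle
from pathlib import Path


def load_split(path):
    """Load a parsed_dataset_renamed_* file, choosing the reader by extension."""
    path = Path(path)
    if path.suffix == ".json":
        with path.open("r", encoding="utf-8") as f:
            return json.load(f)
    if path.suffix == ".pkl":
        # Only unpickle files you trust: pickle can execute arbitrary code.
        with path.open("rb") as f:
            return pickle.load(f)
    raise ValueError(f"unsupported file type: {path.suffix!r}")


def video_url(video_id):
    """Build a watch URL, assuming video_id is a regular YouTube video ID."""
    return f"https://www.youtube.com/watch?v={video_id}"
```

For example, `data = load_split("parsed_dataset_renamed_val.json")` followed by `print(type(data), len(data))` is a cheap way to see what the file contains before writing any code against it.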
You can look at the videos and labels easily using the provided script:
pip install -r requirements.txt
python examine_videos.py
The script used to generate the files is also included in the repo.
If for whatever reason you want to regenerate the data, you can run something like the following, adjusting the paths to match your setup.
mkdir -p ~/data/yt8m/video && cd ~/data/yt8m/video
pip install tensorflow==1.14.0
curl data.yt8m.org/download.py | partition=2/video/train mirror=us python
curl data.yt8m.org/download.py | partition=2/video/validate mirror=us python
curl data.yt8m.org/download.py | partition=2/video/test mirror=us python
python parse_tfrecord.py
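For orientation, here is a rough sketch of what extracting IDs and labels from the downloaded records can look like. It is not the repo's parse_tfrecord.py. It assumes the video-level records are `tf.train.Example` protos with an `id` bytes feature and a `labels` int64 list, and it uses the TensorFlow 1.x reader `tf.python_io.tf_record_iterator` to match the pinned 1.14.0 install. The function names and the JSON output layout are hypothetical.

```python
import glob
import json


def example_to_record(example):
    """Pull the video id and label indices out of one tf.train.Example.

    Assumes the feature keys "id" (bytes) and "labels" (int64 list).
    """
    features = example.features.feature
    video_id = features["id"].bytes_list.value[0].decode("utf-8")
    labels = [int(v) for v in features["labels"].int64_list.value]
    return {"id": video_id, "labels": labels}


def parse_tfrecords(pattern):
    """Yield one {"id", "labels"} dict per record in files matching pattern."""
    # Imported lazily so example_to_record works without TensorFlow installed.
    import tensorflow as tf

    for path in sorted(glob.glob(pattern)):
        # TensorFlow 1.x record reader; each item is a serialized Example.
        for raw in tf.python_io.tf_record_iterator(path):
            yield example_to_record(tf.train.Example.FromString(raw))


def write_json(pattern, out_path):
    """Write all parsed records to out_path as a JSON list (hypothetical layout)."""
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(list(parse_tfrecords(pattern)), f)
```

A call such as `write_json("train*.tfrecord", "train.json")` would then collect one split; the glob pattern is an assumption about how the downloaded shards are named, so check the directory listing first.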