Merge pull request #4 from tsaishien-chen/main: Push code
Showing 386 changed files with 44,352 additions and 168 deletions.
`README.md`:

Ming-Hsuan Yang,
Sergey Tulyakov
[![arXiv](https://img.shields.io/badge/arXiv-2402.19479-b31b1b.svg)](https://arxiv.org/abs/2402.19479)
[![Project Page](https://img.shields.io/badge/Project-Website-green)](https://snap-research.github.io/Panda-70M)
## Introduction
Panda-70M is a large-scale dataset with 70M high-quality video-caption pairs.
This repository has three sections:
- [Dataset Dataloading](./dataset_dataloading) includes the csv files listing the data of Panda-70M and the code to download the dataset.
- [Splitting](./splitting) includes the code to split a long video into multiple semantically consistent short clips.
- [Captioning](./captioning) includes the proposed video captioning model trained on Panda-70M.
## Dataset
### Collection Pipeline
<p align="center" width="100%">
<a target="_blank"><img src="assets/collection_pipeline.gif" style="width: 100%; min-width: 200px; display: block; margin: auto;"></a>
</p>
### Download
| Split | Download | # Source Videos | # Samples | Video Duration | Storage Space |
|-----------------|----------|-----------------|-----------|----------------|---------------|
| Training (full) | [link](https://drive.google.com/file/d/1DeODUcdJCEfnTjJywM-ObmrlVg-wsvwz/view?usp=sharing) (2.01 GB) | 3,779,763 | 70,723,513 | 167 khrs | ~36 TB |
| Training (10M) | [link](https://drive.google.com/file/d/1Lrsb65HTJ2hS7Iuy6iPCmjoc3abbEcAX/view?usp=sharing) (381 MB) | 3,755,240 | 10,473,922 | 37.0 khrs | ~8.0 TB |
| Training (2M) | [link](https://drive.google.com/file/d/1jWTNGjb-hkKiPHXIbEA5CnFwjhA-Fq_Q/view?usp=sharing) (86.5 MB) | 800,000 | 2,400,000 | 7.56 khrs | ~1.6 TB |
| Validation | [link](https://drive.google.com/file/d/1cTCaC7oJ9ZMPSax6I4ZHvUT-lqxOktrX/view?usp=sharing) (803 KB) | 2,000 | 6,000 | 18.5 hrs | ~4.0 GB |
| Testing | [link](https://drive.google.com/file/d/1ee227tHEO-DT8AkX7y2q6-bfAtUL-yMI/view?usp=sharing) (803 KB) | 2,000 | 6,000 | 18.5 hrs | ~4.0 GB |

More details can be found in the [Dataset Dataloading](./dataset_dataloading) section.
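Before kicking off a multi-terabyte download, it can help to inspect the csv first. Below is a minimal sketch using pandas; the file name and column names (`videoID`, `url`, `timestamp`, `caption`) are assumptions based on the dataset description, so check the actual csv header and the [Dataset Dataloading](./dataset_dataloading) docs for the authoritative schema:

```python
# Peek at a Panda-70M csv before downloading any videos.
# File and column names here are assumptions; verify against the real csv header.
import pandas as pd

df = pd.read_csv("panda70m_training_2m.csv")  # hypothetical name for the 2M split
print(df.columns.tolist())                    # confirm the real schema first
print(len(df), "clips")
print(df.head(3))                             # e.g. videoID, url, timestamp, caption
```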
## Demonstration
### Video-Caption Pairs in Panda-70M
<table class="center">
<tr>
<td width=33.3% style="border: none"><img src="./assets/aIPu1xGNbhc.49.gif"></td>
<td width=33.3% style="border: none"><img src="./assets/AIyw1FO1aqs.57.gif"></td>
<td width=33.3% style="border: none"><img src="./assets/Kb8ON0iCs38.97.gif"></td>
</tr>
<tr style="text-align: center;">
<td width=33.3% style="border: none">A rhino and a lion are fighting in the dirt.</td>
<td width=33.3% style="border: none">A person is holding a long haired dachshund in their arms.</td>
<td width=33.3% style="border: none">A rocket launches into space on the launch pad.</td>
</tr>
</table>

<table class="center">
<tr>
<td width=33.3% style="border: none"><img src="./assets/AvVDsFBc6bA.0.gif"></td>
<td width=33.3% style="border: none"><img src="./assets/S-1NdEjjg7c.58.gif"></td>
<td width=33.3% style="border: none"><img src="./assets/10Y6wIEuG00.62.gif"></td>
</tr>
<tr style="text-align: center;">
<td width=33.3% style="border: none">A person is kneading dough and putting jam on it.</td>
<td width=33.3% style="border: none">A little boy is playing with a basketball in the city.</td>
<td width=33.3% style="border: none">A 3d rendering of a zoo with animals and a train.</td>
</tr>
</table>

<table class="center">
<tr>
<td width=33.3% style="border: none"><img src="./assets/_uQs-YDb5VA.9.gif"></td>
<td width=33.3% style="border: none"><img src="./assets/CgcadSRtAag.140.gif"></td>
<td width=33.3% style="border: none"><img src="./assets/1NMpoAqzJfY.25.gif"></td>
</tr>
<tr style="text-align: center;">
<td width=33.3% style="border: none">A person in blue gloves is connecting an electrical supply to an injector.</td>
<td width=33.3% style="border: none">There is a beach with waves and rocks in the foreground, and a city skyline in the background.</td>
<td width=33.3% style="border: none">It is a rally car driving on a dirt road in the countryside, with people watching from the side of the road.</td>
</tr>
</table>
<sup>**We will remove any video sample from our dataset / GitHub / project webpage upon request. Please contact tsaishienchen at gmail dot com to make a request.</sup>

Please check [here](https://snap-research.github.io/Panda-70M/more_samples) for more samples.
### Long Video Splitting and Captioning
https://github.com/tsaishien-chen/Panda-70M/assets/43384650/481b369a-122b-4571-a83e-416201ebd6c9

https://github.com/tsaishien-chen/Panda-70M/assets/43384650/fee5468d-815f-41a7-8202-bdb3b60fcac7
## License of Panda-70M
See [license](https://github.com/tsaishien-chen/Panda-70M/blob/main/LICENSE).
The video samples are collected from a publicly available dataset.
Users must follow [the related license](https://raw.githubusercontent.com/microsoft/XPretrain/main/hd-vila-100m/LICENSE) to use these video samples.
## Citation
If you find this project useful for your research, please cite our paper. :blush:

```bibtex
@article{chen2024panda70M,
  title = {Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers},
  author = {Chen, Tsai-Shien and Siarohin, Aliaksandr and Menapace, Willi and Deyneka, Ekaterina and Chao, Hsiang-wei and Jeon, Byung Eun and Fang, Yuwei and Lee, Hsin-Ying and Ren, Jian and Yang, Ming-Hsuan and Tulyakov, Sergey},
  journal = {arXiv preprint arXiv:2402.19479},
  year = {2024}
}
```
## Contact Information
**Tsai-Shien Chen**: [[email protected]](mailto:[email protected])
`captioning/README.md`:
# 🐼 Panda-70M: Video Captioning

## Introduction
We propose a video captioning model that generates a caption for a short video clip.
The model includes a vision branch (green) and a textual branch (blue), so captioning can benefit from both video and text inputs.
We release the checkpoint trained on Panda-70M.
<p align="center" width="100%">
<a target="_blank"><img src="assets/architecture.png" style="width: 60%; min-width: 200px; display: block; margin: auto;"></a>
</p>
## Preparations
### Setup Repository and Environment
```
git clone https://github.com/tsaishien-chen/Panda-70M.git
cd Panda-70M/captioning
# create a conda environment
conda create --name panda70m_captioning python=3.9 -y
conda activate panda70m_captioning
pip install -r requirements.txt
# install a Java runtime (required by the PTB tokenizer used for evaluation)
apt-get update -y
apt-get install -y default-jre
```
### Download Checkpoint
You can manually download the file [here](https://drive.google.com/file/d/1Gjp5LrgGJobcFi3AaXvLnzlY7IWXyaI5/view?usp=sharing) (3.82 GB) and move it to the `checkpoint` folder, or run:
```
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1Gjp5LrgGJobcFi3AaXvLnzlY7IWXyaI5' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1Gjp5LrgGJobcFi3AaXvLnzlY7IWXyaI5" -O checkpoint/checkpoint_best.pth && rm -rf /tmp/cookies.txt
```
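If the cookie-based `wget` one-liner stops working (Google Drive changes its download interstitial from time to time), the `gdown` package is a possible alternative. This is a suggestion rather than part of the official instructions; the file ID is taken from the link above:

```python
# Alternative checkpoint download via gdown (pip install gdown).
# Not part of the official setup; the file ID comes from the Google Drive link above.
import gdown

gdown.download(
    "https://drive.google.com/uc?id=1Gjp5LrgGJobcFi3AaXvLnzlY7IWXyaI5",
    "checkpoint/checkpoint_best.pth",
    quiet=False,
)
```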
### Prepare Vicuna
- Please follow the [instructions](https://github.com/lm-sys/FastChat/blob/main/docs/vicuna_weights_version.md) from FastChat to install the **vicuna-7b-v0** weights.
- **[Note]** Vicuna v0 is distributed as delta weights, so you need to apply them to the original LLaMA weights first; once processed, move the merged weights to the `vicuna_weights/vicuna-7b-v0` folder with a file list like [this](https://github.com/tsaishien-chen/Panda-70M/blob/main/captioning/vicuna_weights/vicuna-7b-v0/README.md). A sketch of the delta-application step follows below.
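Here is a minimal sketch of applying the delta with FastChat's `apply_delta` helper. The import path, function signature, and local paths are assumptions based on FastChat's documented workflow; if your FastChat version differs, use the equivalent CLI form (`python3 -m fastchat.model.apply_delta ...`) described in the linked instructions:

```python
# Merge the Vicuna v0 delta into base LLaMA weights (a sketch; verify against FastChat docs).
from fastchat.model.apply_delta import apply_delta

apply_delta(
    base_model_path="path/to/llama-7b-hf",            # assumption: LLaMA-7B weights you obtained separately
    target_model_path="vicuna_weights/vicuna-7b-v0",  # where this repo expects the merged weights
    delta_path="lmsys/vicuna-7b-delta-v0",            # delta weights hosted on the Hugging Face Hub
)
```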
## Quick Demo
```
python inference.py --video-list inputs/video_list.txt --prompt-list inputs/prompt_list.txt
```
The code will caption two test videos listed in `video_list.txt`, using the extra textual information in `prompt_list.txt` as additional input. Here are some output examples:
<table class="center">
<tr style="line-height: 0">
<td width=30% style="border: none; text-align: center"><b>Input Video</b></td>
<td width=50% style="border: none; text-align: center"><b>Input Text</b></td>
<td width=20% style="border: none; text-align: center"><b>Output Caption</b></td>
</tr>
<tr>
<td width=30% style="border: none"><img src="assets/video1.gif" style="width:100%"></td>
<td width=50% style="border: none; text-align: center"><sup>
Some information about a video you will get:<br>
Transcription: Today we're gonna take a quick look at the 1966 Ford Mustang GT 289 v8 under the hood.<br>
Metadata: ['Old VS New - 1966 Ford Mustang GT & 2018 Ford Mustang | Just a Quick Look', 'Lets check out this beautiful 1966 Ford Mustang GT 289 in the showroom with the 2018 Ford Mustang!']<br>
Please look at the video and faithfully summarize it in one sentence.</sup></td>
<td width=20% style="border: none; text-align: center">A red mustang parked in a showroom with american flags hanging from the ceiling.</td>
</tr>
<tr>
<td width=30% style="border: none"><img src="assets/video2.gif" style="width:100%"></td>
<td width=50% style="border: none; text-align: center">Please faithfully summarize the following video in one sentence.</td>
<td width=20% style="border: none; text-align: center">An aerial view of a city with a river running through it.</td>
</tr>
</table>
<sup>**We will remove any video sample from our dataset / GitHub / project webpage upon request. Please contact tsaishienchen at gmail dot com to make a request.</sup>

- **[Note]** You might get different outputs due to the randomness of the LLM's generation.
## Evaluation
### Zero-shot Captioning Performance
| | BLEU-4 | ROUGE-L | METEOR | CIDEr | BertScore |
|------------|--------|---------|--------|-------|-----------|
| **MSRVTT** | 25.4% | 50.1% | 27.7% | 31.5% | 87.9% |
| **MSVD** | 32.8% | 61.2% | 35.3% | 49.2% | 90.2% |

- **[Note]** The results might not be perfectly reproducible due to the randomness of the LLM's generation and can deviate by around ±0.5%.
### Prepare Testing Data
- You can download the video samples here [[MSRVTT](https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip) / [MSVD](https://www.cs.utexas.edu/users/ml/clamp/videoDescription/)] and move them to the `test_datasets/video_samples/MSRVTT` or `test_datasets/video_samples/MSVD` folder.
- The caption annotations of the testing samples are already saved in the `test_datasets/anno_downstream` folder.
### Evaluation
```
# MSRVTT
python inference.py --video-list test_datasets/video_list/msrvtt_test.txt --output-json msrvtt_caption.json
python compute_results.py --predict-json msrvtt_caption.json --target-json test_datasets/anno_downstream/msrvtt_caption_test.json
# MSVD
python inference.py --video-list test_datasets/video_list/msvd_test.txt --output-json msvd_caption.json
python compute_results.py --predict-json msvd_caption.json --target-json test_datasets/anno_downstream/msvd_caption_test.json
```
## Acknowledgements
The code for video captioning is built upon [Video-LLaMA](https://github.com/DAMO-NLP-SG/Video-LLaMA).
Thanks for sharing the great work!
`checkpoint/README.md`:

Put the model checkpoint here
`compute_results.py`:
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider
from bert_score import score as bert_score_compute
from collections import defaultdict
import argparse
import json


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Evaluation")
    parser.add_argument("--predict-json", required=True, help="prediction json file.")
    parser.add_argument("--target-json", required=True, help="ground truth json file.")
    args = parser.parse_args()

    # predictions: {video_name: caption}; targets: [{"video": ..., "caption": [...]}, ...]
    preds = json.load(open(args.predict_json))
    gt = json.load(open(args.target_json))
    pds = defaultdict(list)
    gts = defaultdict(list)
    pds_all = []
    gts_all = []

    for i, data in enumerate(gt):
        video, captions = data["video"], data["caption"]
        pds[i].append({"image_id": video, "caption": preds[video]})
        # repeat the prediction once per reference caption for corpus-level BERTScore
        pds_all += [preds[video]] * len(captions)

        for caption in captions:
            gts[i].append({"image_id": video, "caption": caption})
        gts_all += captions

    tokenizer = PTBTokenizer()
    pds = tokenizer.tokenize(pds)
    gts = tokenizer.tokenize(gts)
    scorers = [(Bleu(4), ["Bleu_1", "Bleu_2", "Bleu_3", "Bleu_4"]),
               (Meteor(), "METEOR"),
               (Rouge(), "ROUGE_L"),
               (Cider(), "CIDEr")]

    eval_dict = {}
    for scorer, method in scorers:
        score, scores = scorer.compute_score(gts, pds)
        if scorer.method() == "Bleu":
            eval_dict["BLEU4"] = score[3]  # keep only BLEU-4 from the four BLEU-n scores
        else:
            eval_dict[scorer.method()] = score

    _, _, score = bert_score_compute(pds_all, gts_all, lang='en', verbose=False)
    eval_dict["BERTScore"] = score.mean().item()

    for k, v in eval_dict.items():
        print("%s: %.2f%%" % (k, v * 100))
`eval_configs/panda70M_eval.yaml`:
model:
  arch: video_llama
  model_type: pretrain_vicuna
  input_prompt: True
  ckpt: "checkpoint/checkpoint_best.pth"

  # Q-Former
  num_query_token: 32

  # Vicuna
  llama_model: "vicuna_weights/vicuna-7b-v0"

  # Branch
  fusion_head_layers: 2
  max_frame_pos: 32
  fusion_header_type: "seqTransf"
  num_video_query_token: 32
  num_text_query_token: 32
  input_vid2tex_query_embed: True
  detach_video_query_embed: True

  max_caption_len: 48
  max_prompt_len: 200
  start_sym: "<s>"
  end_sym: "</s>"

datasets:
  hdvila:
    vis_processor:
      train:
        name: "alpro_video_eval"
        n_frms: 8
        image_size: 224
    text_processor:
      train:
        name: "blip_caption"

run:
  task: video_text_pretrain
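`inference.py` hands this file to Video-LLaMA's `Config` wrapper via `--cfg-path`; if you just want to inspect or tweak a field outside that wrapper, plain PyYAML is enough. A minimal sketch:

```python
# Inspect the eval config without going through the Video-LLaMA Config wrapper.
import yaml  # pip install pyyaml

with open("eval_configs/panda70M_eval.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["model"]["ckpt"])         # checkpoint/checkpoint_best.pth
print(cfg["model"]["llama_model"])  # vicuna_weights/vicuna-7b-v0
```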
`inference.py`:
import argparse
import json
import torch
from video_llama.common.config import Config
from video_llama.common.registry import registry
from video_llama.processors.video_processor import load_video
from tqdm import tqdm


class DotDict(dict):
    """dot.notation access to dictionary attributes"""
    __getattr__ = dict.get
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Inference")
    parser.add_argument("--cfg-path", default="eval_configs/panda70M_eval.yaml", help="path to configuration file.")
    parser.add_argument("--video-list", required=True, help="list of input videos.")
    parser.add_argument("--output-json", default=None, help="output json file. Leave none to print out the results.")
    parser.add_argument("--prompt-list", default=None, help="list of corresponding input prompts. Leave none if no prompt input.")
    args = parser.parse_args()
    cfg = Config(args)

    # build the captioning model from the config and move it to the GPU
    model_config = cfg.model_cfg
    model_cls = registry.get_model_class(model_config.arch)
    model = model_cls.from_config(model_config).to("cuda")
    model.eval()

    vis_processor_cfg = DotDict({"name": "alpro_video_eval", "n_frms": 8, "image_size": 224})
    vis_processor = registry.get_processor_class(vis_processor_cfg.name).from_config(vis_processor_cfg)
    text_processor_cfg = DotDict({"name": "blip_caption", "max_words": 100})
    text_processor = registry.get_processor_class(text_processor_cfg.name).from_config(text_processor_cfg)

    batch_size = 16

    videos = open(args.video_list, "r").read().splitlines()
    if args.prompt_list:
        # prompts are separated by blank lines in the prompt file
        prompts = open(args.prompt_list, "r").read().split("\n\n")

    results = {}
    for i in tqdm(range(0, len(videos), batch_size)):
        video_batch = []
        video_path_batch = []
        prompt_batch = []

        for j in range(i, min(i + batch_size, len(videos))):
            try:
                video_path = videos[j]
                video = load_video(video_path=video_path, n_frms=8, sampling="uniform")
                video = vis_processor.transform(video)
                assert video.shape == torch.Size([3, 8, 224, 224])
            except Exception as e:
                print(e)
                continue

            video_batch.append(video)
            video_path_batch.append(video_path.split('/')[-1])
            prompt_batch.append(prompts[j] if args.prompt_list else "Please faithfully summarize the following video in one sentence.")

        if not video_batch:
            # skip the batch entirely if every video in it failed to load
            continue

        video_batch = torch.stack(video_batch).to("cuda")
        outputs = model.inference(video_batch, prompt_batch)

        # zip keeps videos, prompts, and captions aligned even when some videos failed to load
        for video_path, prompt, output in zip(video_path_batch, prompt_batch, outputs):
            output = output.capitalize() + "."
            if args.output_json:
                results[video_path] = output
            else:
                print("=====" * 20)
                print("[Input video]", video_path)
                print("[Input prompt]")
                print(prompt)
                print("[Output caption]", output)

    if args.output_json:
        with open(args.output_json, "w") as f:
            f.write(json.dumps(results, indent=4))
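With `--output-json`, the script stores a flat `{video file name: caption}` mapping (see `results[video_path] = output` above). A quick way to eyeball a finished run, assuming the MSRVTT command from the Evaluation section produced `msrvtt_caption.json`:

```python
# Print a few captions from an inference run saved with --output-json.
import json

with open("msrvtt_caption.json") as f:   # produced by the MSRVTT command above
    captions = json.load(f)

for name, caption in list(captions.items())[:5]:
    print(f"{name} -> {caption}")
```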
`inputs/prompt_list.txt`:
Some information about a video you will get:
Transcription: Today we're gonna take a quick look at the 1966 Ford Mustang GT 289 v8 under the hood.
Metadata: ['Old VS New - 1966 Ford Mustang GT & 2018 Ford Mustang | Just a Quick Look', 'Lets check out this beautiful 1966 Ford Mustang GT 289 in the showroom with the 2018 Ford Mustang!']
Please look at the video and faithfully summarize it in one sentence.

Please faithfully summarize the following video in one sentence.
`inputs/video_list.txt`:
inputs/video1.mp4
inputs/video2.mp4