Multi-subject Open-set Personalization in Video Generation
Tsai-Shien Chen,
Aliaksandr Siarohin,
Willi Menapace,
Yuwei Fang,
Kwot Sin Lee,
Ivan Skorokhodov,
Kfir Aberman,
Jun-Yan Zhu,
Ming-Hsuan Yang,
Sergey Tulyakov
In this paper, we introduce MSRVTT-Personalization, a new benchmark for the task of video personalization. It aims at accurate assessment of subject fidelity and supports various conditioning modes, including conditioning on face crops, on single or multiple arbitrary subjects, and on foreground objects together with the background.
We include the testing dataset and evaluation protocol in this repository. We show a test sample of MSRVTT-Personalization below:
Each test sample pairs a ground-truth video with its personalization annotations (face crops, subject images, foreground objects, and the background).
MSRVTT-Personalization evaluates a model across six metrics:
- Text similarity (Text-S)
- Video similarity (Vid-S)
- Subject similarity (Subj-S)
- Face similarity (Face-S)
- Dynamic degree (Dync-D)
- Temporal consistency (T-Cons)
Quantitative evaluation:
Subject mode of MSRVTT-Personalization (takes an entire subject image as the condition)
| Method | Text-S | Vid-S | Subj-S | Dync-D | T-Cons |
|---|---|---|---|---|---|
| ELITE | 0.245 | 0.620 | 0.359 | - | - |
| VideoBooth | 0.222 | 0.612 | 0.395 | 0.448 | 0.963 |
| DreamVideo | 0.261 | 0.611 | 0.310 | 0.311 | 0.956 |
| VideoAlchemy | 0.269 | 0.732 | 0.617 | 0.466 | 0.993 |
Face mode of MSRVTT-Personalization (takes a face crop as the condition)
| Method | Text-S | Vid-S | Face-S | Dync-D | T-Cons |
|---|---|---|---|---|---|
| IP-Adapter | 0.251 | 0.648 | 0.269 | - | - |
| PhotoMaker | 0.278 | 0.569 | 0.189 | - | - |
| Magic-Me | 0.251 | 0.602 | 0.135 | 0.418 | 0.974 |
| VideoAlchemy | 0.273 | 0.687 | 0.382 | 0.424 | 0.994 |
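The Face-S column measures identity preservation between the conditional face crop and the generated frames. Below is a minimal hedged sketch of that kind of score; how the identity embeddings are extracted (e.g., with the ArcFace checkpoint downloaded in Step 2) is abstracted away, and only the comparison is shown.

```python
# Hedged sketch of a Face-S-style score: average cosine similarity between the
# reference face's identity embedding and per-frame identity embeddings of the
# generated video. Embedding extraction is abstracted away.
import torch
import torch.nn.functional as F

def face_similarity(ref_embedding: torch.Tensor, frame_embeddings: torch.Tensor) -> float:
    """ref_embedding: (D,) reference identity; frame_embeddings: (N, D) per-frame identities."""
    ref = F.normalize(ref_embedding, dim=-1)
    frames = F.normalize(frame_embeddings, dim=-1)
    return (frames @ ref).mean().item()

# Shape-only example with random tensors:
# print(face_similarity(torch.randn(512), torch.randn(16, 512)))
```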
Qualitative evaluation:
Subject mode of MSRVTT-Personalization
(Qualitative comparison of ELITE, VideoBooth, DreamVideo, and VideoAlchemy against the ground truth.)
Face mode of MSRVTT-Personalization
(Qualitative comparison of IP-Adapter, PhotoMaker, Magic-Me, and VideoAlchemy against the ground truth.)
- Step 1: Download videos and annotations
- Step 2: Download model checkpoints
- Step 3: Build environments
- Step 4: Prepare MSRVTT-Personalization dataset
- Step 5: Generate videos and collect ground truth embeddings (optional)
- Step 6: Run evaluation
git clone https://github.com/snap-research/VideoAlchemy.git
cd VideoAlchemy
- Download videos of original MSR-VTT
cd msrvtt_personalization
wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip
unzip MSRVTT.zip
mkdir msrvtt_videos
mv MSRVTT/videos/all/video*.mp4 msrvtt_videos
rm -r MSRVTT MSRVTT.zip
- Download annotations of MSRVTT-Personalization
# Manually download
# https://drive.google.com/file/d/1LPPvXRTtmGDFUwTvMlaCMC3ofi72LhPW/view
# and put it under the `msrvtt_personalization` folder
unzip msrvtt_personalization_annotation.zip
rm msrvtt_personalization_annotation.zip
- Download ArcFace checkpoint
cd ../models/arcface
mkdir weight; cd weight
# Manually download
# https://1drv.ms/u/c/4A83B6B633B029CC/AcwpsDO2toMggEpCFgAAAAA?e=aNn50a
# and put it under the `models/arcface/weight` folder
- Download YOLOv9 face detection checkpoint
cd ../../YOLOv9
mkdir weight; cd weight
# Manually download
# https://drive.google.com/file/d/15K4e08lcZiiQrXmdsnm2BhcoNS3MOMmx/view
# and put it under the `models/YOLOv9/weight` folder
- Download Grounded-SAM checkpoints
cd ../../Grounded-Segment-Anything
mkdir checkpoints; cd checkpoints
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth
- Download RAFT checkpoint
cd ../../raft
wget https://dl.dropboxusercontent.com/s/4j4z58wuv8o0mfz/models.zip
unzip models.zip
rm models.zip
- Download VideoBooth checkpoints (optional)
cd ../../demo/VideoBooth
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
git lfs install
mkdir pretrained_models
git clone https://huggingface.co/yumingj/VideoBooth_models
mv VideoBooth_models pretrained_models
git clone https://huggingface.co/CompVis/stable-diffusion-v1-4
mv stable-diffusion-v1-4 pretrained_models
- Build evaluation environment
cd ../..
conda env create -f environment.yml
- Build VideoBooth environment (optional)
cd demo/VideoBooth
conda env create -f environment.yml
- Parse the conditional text and images
  - The files will be stored in the `msrvtt_personalization_data` folder by default.
  - This step is slow as it includes the prediction of background images. It takes around 3 hours on a single 80GB A100 GPU.
conda activate msrvtt-personalization
cd ../../msrvtt_personalization
python prepare_msrvtt_personalization_data.py
- Generate the ground-truth embedding files
  - The files will be stored in the `msrvtt_personalization_embeddings` folder by default (a quick sanity check follows the command below).
python prepare_msrvtt_personalization_embeddings.py
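The short sketch below lists what this step produced; it assumes you are still in the `msrvtt_personalization` folder and that the script wrote to its default output folder.

```python
# Optional sanity check (assumptions: run from `msrvtt_personalization`, default
# output folder used): list the generated embedding files and their sizes.
from pathlib import Path

embeddings_dir = Path("msrvtt_personalization_embeddings")
for path in sorted(embeddings_dir.iterdir()):
    print(f"{path.name}: {path.stat().st_size / 1e6:.1f} MB")
```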
- Prepare the lists of conditional input text and images
  - Note that you can modify the code to customize the testing set and the input conditions. The subject mode and the face mode are currently supported.
  - Each mode is expected to include `videos.txt`, listing the names of the testing samples; `prompts.txt`, listing the conditional text prompts; and `word_tag.txt`, listing the conditional entity words and their corresponding images (a minimal reading example follows the command below).
python prepare_conditional_prompts_and_images_list.py
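For reference, here is a minimal sketch of reading the generated lists back. It assumes `videos.txt` and `prompts.txt` are aligned line by line and sit in a per-mode folder such as `msrvtt_personalization_data/subject_mode`; the exact layout is defined by the scripts above, and `word_tag.txt` parsing is omitted.

```python
# Hedged sketch: pair each test sample name with its prompt. Adjust `mode_dir`
# to wherever the scripts above actually wrote the lists for your chosen mode.
from pathlib import Path

mode_dir = Path("msrvtt_personalization_data/subject_mode")  # assumed location
videos = (mode_dir / "videos.txt").read_text().splitlines()
prompts = (mode_dir / "prompts.txt").read_text().splitlines()

assert len(videos) == len(prompts), "videos.txt and prompts.txt should be aligned"
for name, prompt in zip(videos, prompts):
    print(f"{name}: {prompt}")
```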
- Use VideoBooth as an example to demonstrate how to generate videos from the conditional inputs of MSRVTT-Personalization
  - The generated videos will be stored in the `demo/VideoBooth/msrvtt_personalization_subject_mode/outputs` folder by default.
conda activate videobooth
cd ../demo/VideoBooth
python -m msrvtt_personalization_subject_mode.prepare_inputs
python -m msrvtt_personalization_subject_mode.generate_videos
conda deactivate
- Collect ground truth embeddings for each video
python -m msrvtt_personalization_subject_mode.collect_groundtruth_embeddings
- Evaluate the generated videos
  - Set `$num_gpus` to activate distributed evaluation.
  - Make sure the ground-truth embedding files (such as `text_embeddings.pt`, `video_embeddings.pt`, `subject_embeddings.pkl.gz`, and `face_embeddings.pt`) are in the same folder as the generated videos; see the code for reference (a pre-flight check sketch follows the example command below).
  - Modify `evaluation_config` to select the evaluation metrics you want to include.
  - The evaluation results will be stored in a JSON file under the same folder as `video_folder`.
cd ../../evaluation_protocol
python -m torch.distributed.run --nproc_per_node=$num_gpus -m scripts.evaluate_distributed \
    --evaluation_config /path/to/evaluation_config \
    --video_folder /path/to/video_folder \
    --num_frames $num_frames
- Example command for the evaluation of VideoBooth (optional)
python -m torch.distributed.run --nproc_per_node=8 -m scripts.evaluate_distributed --evaluation_config configs/evaluation_config_msrvtt_personalization_subject_mode.yaml --video_folder ../demo/VideoBooth/msrvtt_personalization_subject_mode/outputs --num_frames 16
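Before launching a run, the small pre-flight check below (a sketch using the VideoBooth example paths above, relative to `evaluation_protocol`) verifies that the required ground-truth embedding files sit next to the generated videos.

```python
# Hedged pre-flight check: confirm the ground-truth embedding files listed above
# are in the same folder as the generated videos. The folder path is the
# VideoBooth example from this README; adjust it for your own outputs.
from pathlib import Path

video_folder = Path("../demo/VideoBooth/msrvtt_personalization_subject_mode/outputs")
required = [
    "text_embeddings.pt",
    "video_embeddings.pt",
    "subject_embeddings.pkl.gz",
    "face_embeddings.pt",
]
missing = [name for name in required if not (video_folder / name).exists()]
print("all embedding files present" if not missing else f"missing: {missing}")
```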
To add
If you find this project useful for your research, please cite our paper. 😊
@article{chen2025dreamalchemist,
title = {Multi-subject Open-set Personalization in Video Generation},
author = {Chen, Tsai-Shien and Siarohin, Aliaksandr and Menapace, Willi and Fang, Yuwei and Lee, Kwot Sin and Skorokhodov, Ivan and Aberman, Kfir and Zhu, Jun-Yan and Yang, Ming-Hsuan and Tulyakov, Sergey},
journal = {arXiv preprint arXiv:2501.00000},
year = {2025}
}
Tsai-Shien Chen: [email protected]