Multi-subject Open-set Personalization in Video Generation
Tsai-Shien Chen,
Aliaksandr Siarohin,
Willi Menapace,
Yuwei Fang,
Kwot Sin Lee,
Ivan Skorokhodov,
Kfir Aberman,
Jun-Yan Zhu,
Ming-Hsuan Yang,
Sergey Tulyakov
In this paper, we introduce MSRVTT-Personalization, a new benchmark for the task of video personalization. It aims at accurate assessment of subject fidelity and supports various conditioning modes, including conditioning on face crops, on single or multiple arbitrary subjects, and on foreground objects together with the background.
We include the testing dataset and evaluation protocol in this repository. We show a test sample of MSRVTT-Personalization below:
Each test sample pairs a ground-truth video with its personalization annotations (face crops, subject images, foreground objects, and the background).
MSRVTT-Personalization evaluates a model across six metrics:
- Text similarity (Text-S)
- Video similarity (Vid-S)
- Subject similarity (Subj-S)
- Face similarity (Face-S)
- Dynamic degree (Dync-D)
- Temporal consistency (T-Cons)
Quantitative evaluation:
Subject mode of MSRVTT-Personalization (takes an entire subject image as the condition)
| Method | Text-S | Vid-S | Subj-S | Dync-D | T-Cons |
|---|---|---|---|---|---|
| ELITE | 0.245 | 0.620 | 0.359 | - | - |
| VideoBooth | 0.222 | 0.612 | 0.395 | 0.448 | 0.963 |
| DreamVideo | 0.261 | 0.611 | 0.310 | 0.311 | 0.956 |
| VideoAlchemy | 0.269 | 0.732 | 0.617 | 0.466 | 0.993 |
Face mode of MSRVTT-Personalization (takes a face crop as the condition)
| Method | Text-S | Vid-S | Face-S | Dync-D | T-Cons |
|---|---|---|---|---|---|
| IP-Adapter | 0.251 | 0.648 | 0.269 | - | - |
| PhotoMaker | 0.278 | 0.569 | 0.189 | - | - |
| Magic-Me | 0.251 | 0.602 | 0.135 | 0.418 | 0.974 |
| VideoAlchemy | 0.273 | 0.687 | 0.382 | 0.424 | 0.994 |
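The Face-S column measures identity preservation between the conditional face crop and the generated frames. Below is a minimal hedged sketch of that kind of score; how the identity embeddings are extracted (e.g., with the ArcFace checkpoint downloaded in Step 2) is abstracted away, and only the comparison is shown.

```python
# Hedged sketch of a Face-S-style score: average cosine similarity between the
# reference face's identity embedding and per-frame identity embeddings of the
# generated video. Embedding extraction is abstracted away.
import torch
import torch.nn.functional as F

def face_similarity(ref_embedding: torch.Tensor, frame_embeddings: torch.Tensor) -> float:
    """ref_embedding: (D,) reference identity; frame_embeddings: (N, D) per-frame identities."""
    ref = F.normalize(ref_embedding, dim=-1)
    frames = F.normalize(frame_embeddings, dim=-1)
    return (frames @ ref).mean().item()

# Shape-only example with random tensors:
# print(face_similarity(torch.randn(512), torch.randn(16, 512)))
```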
Qualitative evaluation:
Subject mode of MSRVTT-Personalization
(Qualitative comparison of ELITE, VideoBooth, DreamVideo, and VideoAlchemy against the ground truth.)
Face mode of MSRVTT-Personalization
(Qualitative comparison of IP-Adapter, PhotoMaker, Magic-Me, and VideoAlchemy against the ground truth.)
- Step 1: Download videos and annotations
- Step 2: Download model checkpoints
- Step 3: Build environments
- Step 4: Prepare MSRVTT-Personalization dataset
- Step 5: Generate videos and collect ground truth embeddings (optional)
- Step 6: Run evaluation
git clone https://github.com/snap-research/VideoAlchemy.git
cd VideoAlchemy
- Download videos of original MSR-VTT
cd msrvtt_personalization
wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip
unzip MSRVTT.zip
mkdir msrvtt_videos
mv MSRVTT/videos/all/video*.mp4 msrvtt_videos
rm -r MSRVTT MSRVTT.zip
- Download annotations of MSRVTT-Personalization
# Manually download
# https://drive.google.com/file/d/1LPPvXRTtmGDFUwTvMlaCMC3ofi72LhPW/view
# and put it under the `msrvtt_personalization` folder
unzip msrvtt_personalization_annotation.zip
rm msrvtt_personalization_annotation.zip
- Download ArcFace checkpoint
cd ../models/arcface
mkdir weight; cd weight
# Manually download
# https://1drv.ms/u/c/4A83B6B633B029CC/AcwpsDO2toMggEpCFgAAAAA?e=aNn50a
# and put it under the `models/arcface/weight` folder
- Download YOLOv9 face detection checkpoint
cd ../../YOLOv9
mkdir weight; cd weight
# Manually download
# https://drive.google.com/file/d/15K4e08lcZiiQrXmdsnm2BhcoNS3MOMmx/view
# and put it under the `models/YOLOv9/weight` folder
- Download Grounded-SAM checkpoints
cd ../../Grounded-Segment-Anything
mkdir checkpoints; cd checkpoints
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth
- Download RAFT checkpoint
cd ../../raft
wget https://dl.dropboxusercontent.com/s/4j4z58wuv8o0mfz/models.zip
unzip models.zip
rm models.zip
- Download VideoBooth checkpoints (optional)
cd ../../demo/VideoBooth
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
git lfs install
mkdir pretrained_models
git clone https://huggingface.co/yumingj/VideoBooth_models
mv VideoBooth_models pretrained_models
git clone https://huggingface.co/CompVis/stable-diffusion-v1-4
mv stable-diffusion-v1-4 pretrained_models
- Build evaluation environment
cd ../..
conda env create -f environment.yml
- Build VideoBooth environment (optional)
cd demo/VideoBooth
conda env create -f environment.yml
- Parse the conditional text and images
  - The files will be stored in the `msrvtt_personalization_data` folder by default.
  - This step is slow as it includes the prediction of background images. It takes around 3 hours on a single 80GB A100 GPU.
conda activate msrvtt-personalization
cd ../../msrvtt_personalization
python prepare_msrvtt_personalization_data.py
- Generate the ground-truth embedding files
  - The files will be stored in the `msrvtt_personalization_embeddings` folder by default (a quick sanity check follows the command below).
python prepare_msrvtt_personalization_embeddings.py
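The short sketch below lists what this step produced; it assumes you are still in the `msrvtt_personalization` folder and that the script wrote to its default output folder.

```python
# Optional sanity check (assumptions: run from `msrvtt_personalization`, default
# output folder used): list the generated embedding files and their sizes.
from pathlib import Path

embeddings_dir = Path("msrvtt_personalization_embeddings")
for path in sorted(embeddings_dir.iterdir()):
    print(f"{path.name}: {path.stat().st_size / 1e6:.1f} MB")
```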
- Prepare the lists of conditional input text and images
  - Note that you can modify the code to customize the testing set and the input conditions. The subject mode and the face mode are currently supported.
  - Each mode is expected to include `videos.txt`, listing the names of the testing samples; `prompts.txt`, listing the conditional text prompts; and `word_tag.txt`, listing the conditional entity words and their corresponding images (a minimal reading example follows the command below).
python prepare_conditional_prompts_and_images_list.py
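For reference, here is a minimal sketch of reading the generated lists back. It assumes `videos.txt` and `prompts.txt` are aligned line by line and sit in a per-mode folder such as `msrvtt_personalization_data/subject_mode`; the exact layout is defined by the scripts above, and `word_tag.txt` parsing is omitted.

```python
# Hedged sketch: pair each test sample name with its prompt. Adjust `mode_dir`
# to wherever the scripts above actually wrote the lists for your chosen mode.
from pathlib import Path

mode_dir = Path("msrvtt_personalization_data/subject_mode")  # assumed location
videos = (mode_dir / "videos.txt").read_text().splitlines()
prompts = (mode_dir / "prompts.txt").read_text().splitlines()

assert len(videos) == len(prompts), "videos.txt and prompts.txt should be aligned"
for name, prompt in zip(videos, prompts):
    print(f"{name}: {prompt}")
```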
- Use VideoBooth as an example to demonstrate how to generate videos from the conditional inputs of MSRVTT-Personalization
  - The generated videos will be stored in the `demo/VideoBooth/msrvtt_personalization_subject_mode/outputs` folder by default.
conda activate videobooth
cd ../demo/VideoBooth
python -m msrvtt_personalization_subject_mode.prepare_inputs
python -m msrvtt_personalization_subject_mode.generate_videos
conda deactivate
- Collect ground truth embeddings for each video
python -m msrvtt_personalization_subject_mode.collect_groundtruth_embeddings
- Evaluate the generated videos
  - Set `$num_gpus` to activate distributed evaluation.
  - Make sure the ground-truth embedding files (such as `text_embeddings.pt`, `video_embeddings.pt`, `subject_embeddings.pkl.gz`, and `face_embeddings.pt`) are in the same folder as the generated videos; see the code for reference (a pre-flight check sketch follows the example command below).
  - Modify `evaluation_config` to select the evaluation metrics you want to include.
  - The evaluation results will be stored in a JSON file under the same folder as `video_folder`.
cd ../../evaluation_protocol
python -m torch.distributed.run --nproc_per_node=$num_gpus -m scripts.evaluate_distributed \
    --evaluation_config /path/to/evaluation_config \
    --video_folder /path/to/video_folder \
    --num_frames $num_frames
- Example command for the evaluation of VideoBooth (optional)
python -m torch.distributed.run --nproc_per_node=8 -m scripts.evaluate_distributed --evaluation_config configs/evaluation_config_msrvtt_personalization_subject_mode.yaml --video_folder ../demo/VideoBooth/msrvtt_personalization_subject_mode/outputs --num_frames 16
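Before launching a run, the small pre-flight check below (a sketch using the VideoBooth example paths above, relative to `evaluation_protocol`) verifies that the required ground-truth embedding files sit next to the generated videos.

```python
# Hedged pre-flight check: confirm the ground-truth embedding files listed above
# are in the same folder as the generated videos. The folder path is the
# VideoBooth example from this README; adjust it for your own outputs.
from pathlib import Path

video_folder = Path("../demo/VideoBooth/msrvtt_personalization_subject_mode/outputs")
required = [
    "text_embeddings.pt",
    "video_embeddings.pt",
    "subject_embeddings.pkl.gz",
    "face_embeddings.pt",
]
missing = [name for name in required if not (video_folder / name).exists()]
print("all embedding files present" if not missing else f"missing: {missing}")
```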
To add
If you find this project useful for your research, please cite our paper. 😊
@article{chen2025dreamalchemist,
title = {Multi-subject Open-set Personalization in Video Generation},
author = {Chen, Tsai-Shien and Siarohin, Aliaksandr and Menapace, Willi and Fang, Yuwei and Lee, Kwot Sin and Skorokhodov, Ivan and Aberman, Kfir and Zhu, Jun-Yan and Yang, Ming-Hsuan and Tulyakov, Sergey},
journal = {arXiv preprint arXiv:2501.00000},
year = {2025}
}
Tsai-Shien Chen: [email protected]