QD-DETR : Query-Dependent Video Representation for Moment Retrieval and Highlight Detection (CVPR 2023 Paper)
by WonJun Moon*1, SangEek Hyun*1, SangUk Park2, Dongchan Park2, Jae-Pil Heo1
1 Sungkyunkwan University, 2 Pyler, * Equal Contribution
[Arxiv] [Paper] [Project Page] [Video]
- Note: the Charades-STA experiments reported with C3D features were actually conducted with I3D features and compared against I3D benchmarking tables. The features are provided here from the VSLNet GitHub. Sorry for the inconvenience.
- Our new paper on moment retrieval and highlight detection, CG-DETR (Correlation-Guided Query-Dependency Calibration in Video Representation Learning for Temporal Grounding), is now available at [CG-DETR arxiv]. Code will soon be available at [CG-DETR Github].
0. Clone this repo
1. Prepare datasets
(2023/11/21) For a newer version of instructions for preparing datasets, please refer to CG-DETR.
QVHighlights : Download the official feature files for the QVHighlights dataset from Moment-DETR.
Download moment_detr_features.tar.gz (8GB) and extract it under the '../features' directory. You can change the data directory by modifying 'feat_root' in the shell scripts under the 'qd_detr/scripts/' directory.
tar -xf path/to/moment_detr_features.tar.gz
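If the features live somewhere else, point the scripts at them by editing 'feat_root'. A minimal sketch, assuming a script such as qd_detr/scripts/train.sh sets this variable (the custom path below is just an example):
feat_root=/data/qvhighlights/features   # default assumed by the scripts: ../features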
TVSum : Download the feature files for the TVSum dataset from UMT.
Download TVSum (69.1MB), and either extract it under the '../features/tvsum/' directory or change 'feat_root' in the TVSum shell files under 'qd_detr/scripts/tvsum/'.
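A sketch of the extraction step, where the archive name stands in for whatever file the UMT link provides:
mkdir -p ../features/tvsum
tar -xf path/to/tvsum_features.tar.gz -C ../features/tvsum/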
2. Install dependencies. Python version 3.7 is required.
pip install -r requirements.txt
For anaconda setup, please refer to the official Moment-DETR github.
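For reference, a minimal Anaconda setup could look like the following (the environment name is an example; the Moment-DETR GitHub instructions are authoritative):
conda create -n qd_detr python=3.7 -y
conda activate qd_detr
pip install -r requirements.txt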
Training with video only and with video + audio can be executed by running the shell scripts below:
bash qd_detr/scripts/train.sh --seed 2018
bash qd_detr/scripts/train_audio.sh --seed 2018
To calculate the standard deviation reported in the paper, we ran with five different seeds: 0, 1, 2, 3, and 2018 (2018 is the seed used in Moment-DETR). The best validation accuracy is achieved at the last epoch.
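For example, the five runs can be launched with a simple loop over the seeds listed above:
for seed in 0 1 2 3 2018; do
    bash qd_detr/scripts/train.sh --seed ${seed}
done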
Once the model is trained, hl_val_submission.jsonl and hl_test_submission.jsonl can be generated by running inference.sh:
bash qd_detr/scripts/inference.sh results/{direc}/model_best.ckpt 'val'
bash qd_detr/scripts/inference.sh results/{direc}/model_best.ckpt 'test'
where {direc} is the name of the directory containing the saved checkpoint.
For more details on submission, check standalone_eval/README.md.
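For example, if the checkpoint was saved under a directory named video_only-2018 (a hypothetical name), the validation submission would be produced with:
bash qd_detr/scripts/inference.sh results/video_only-2018/model_best.ckpt 'val'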
Pretraining with ASR captions is also available. To launch pretraining, run:
bash qd_detr/scripts/pretrain.sh
This will pretrain the QD-DETR model on ASR captions for 100 epochs; the pretrained checkpoints and other experiment log files will be written into results.
With the pretrained checkpoint PRETRAIN_CHECKPOINT_PATH, we can launch finetuning as:
bash qd_detr/scripts/train.sh --resume ${PRETRAIN_CHECKPOINT_PATH}
Note that this finetuning process is the same as standard training except that it initializes weights from a pretrained checkpoint.
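For instance, assuming the pretraining run saved a checkpoint under results (the directory and file name below are hypothetical):
PRETRAIN_CHECKPOINT_PATH=results/pretrain_asr/model_best.ckpt
bash qd_detr/scripts/train.sh --resume ${PRETRAIN_CHECKPOINT_PATH}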
For TVSum, training with video only and with video + audio can be executed by running the shell scripts below:
bash qd_detr/scripts/tvsum/train_tvsum.sh
bash qd_detr/scripts/tvsum/train_tvsum_audio.sh
Best results are stored in 'results_[domain_name]/best_metric.jsonl'.
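To quickly check the stored metrics after training, you can print the file directly. A sketch, assuming [domain_name] expands to a TVSum domain code such as VT:
cat results_VT/best_metric.jsonl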
The following are also available, as we use the official Moment-DETR implementation as our basis; for instructions, check their GitHub:
- Pretraining with ASR captions
- Running predictions on customized datasets
| Method (Modality) | Model file |
| --- | --- |
| QD-DETR (Video+Audio) Checkpoint | link |
| QD-DETR (Video only) Checkpoint | link |
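A downloaded checkpoint can be evaluated with the same inference script shown above. A sketch, where the local file name is just an example:
bash qd_detr/scripts/inference.sh path/to/qd_detr_video_only.ckpt 'val'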
If you find this repository useful, please use the following entry for citation.
@inproceedings{moon2023query,
title={Query-dependent video representation for moment retrieval and highlight detection},
author={Moon, WonJun and Hyun, Sangeek and Park, SangUk and Park, Dongchan and Heo, Jae-Pil},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={23023--23033},
year={2023}
}
If there are any questions, feel free to contact the authors: WonJun Moon ([email protected]), Sangeek Hyun ([email protected]).
The annotation files and many parts of the implementation are borrowed from Moment-DETR. Following Moment-DETR, our code is also released under the MIT license.