Simplified Chinese | English
Before getting started, you need to install the following additional dependency:

```bash
python -m pip install paddlenlp
```
T2VLAD is proposed by Baidu in CVPR 2021 for text-video retrieval. Text-video retrieval is a challenging task that aims to search for relevant video content based on natural language descriptions. The key to this problem is to measure text-video similarities in a joint embedding space. T2VLAD designs an efficient global-local alignment method. The model achieves consistent improvements on three standard text-video retrieval benchmarks and outperforms the state-of-the-art by a clear margin.
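As a rough illustration of retrieval in a joint embedding space (not the model's actual encoders or this repo's API; the arrays and helper below are hypothetical), text and video features are projected into the same space and candidates are ranked by cosine similarity:

```python
import numpy as np

def rank_videos(text_emb, video_embs):
    """Rank candidate videos for one text query by cosine similarity.

    text_emb:   (D,) joint-space embedding of the query caption (hypothetical).
    video_embs: (N, D) joint-space embeddings of N candidate videos (hypothetical).
    Returns indices of videos sorted from most to least similar.
    """
    # L2-normalize so the dot product equals cosine similarity.
    t = text_emb / np.linalg.norm(text_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    sims = v @ t                # (N,) cosine similarities
    return np.argsort(-sims)    # best match first

# Toy usage with random features standing in for real encoder outputs.
rng = np.random.default_rng(0)
order = rank_videos(rng.standard_normal(256), rng.standard_normal((5, 256)))
print(order)
```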
Please refer to the MSR-VTT data download and preparation doc: MSR-VTT data.
Download the data and move it to the `data/MSRVTT` folder.
- Train T2VLAD on MSR-VTT with the following scripts:

```bash
export CUDA_VISIBLE_DEVICES=0
python3.7 train.py --config ./configs/msrvtt_transformers.json
```
T2VLAD uses the Ranger optimizer during training. The Ranger optimizer is not supported here yet; for now, the AdamW optimizer can be used to complete training.
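A minimal sketch of swapping in AdamW with PaddlePaddle (the learning rate, weight decay, and `model` below are placeholders, not values taken from this repo's config):

```python
import paddle

# Placeholder model; in practice this would be the T2VLAD network.
model = paddle.nn.Linear(256, 256)

# AdamW stands in for Ranger until the latter is supported.
optimizer = paddle.optimizer.AdamW(
    learning_rate=1e-4,               # placeholder value
    parameters=model.parameters(),
    weight_decay=0.01,                # placeholder value
)
```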
- Evaluation is performed on the downstream task, i.e. text-video clip retrieval on the MSR-VTT dataset. Test accuracy can be obtained with the following scripts:

```bash
export CUDA_VISIBLE_DEVICES=0
python3.7 test.py --config ./configs/msrvtt_transformers.json --resume ./T2VLAD_msrvtt.pdparams
```
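For reference, the reported retrieval metrics can be computed from a query-to-candidate similarity matrix roughly as follows (a hedged sketch; the matrix and helper below are illustrative, not this repo's evaluation code):

```python
import numpy as np

def retrieval_metrics(sims):
    """Compute R@1/R@5/R@10 and Median R from a (num_queries, num_candidates)
    similarity matrix where sims[i, i] is the score of the ground-truth pair."""
    ranks = []
    for i, row in enumerate(sims):
        # Rank of the ground-truth candidate among all candidates (1 = best).
        order = np.argsort(-row)
        ranks.append(int(np.where(order == i)[0][0]) + 1)
    ranks = np.asarray(ranks)
    return {
        "R@1": float(np.mean(ranks <= 1) * 100),
        "R@5": float(np.mean(ranks <= 5) * 100),
        "R@10": float(np.mean(ranks <= 10) * 100),
        "Median R": float(np.median(ranks)),
    }

# Toy usage with a random similarity matrix in place of real model scores.
print(retrieval_metrics(np.random.default_rng(0).standard_normal((20, 20))))
```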
Accuracy on MSR-VTT:

Text:

| R@1 | R@5 | R@10 | Median R | checkpoints |
| :--: | :--: | :--: | :--: | :--: |
| 29.5 | 59.0 | 70.1 | 4 | T2VLAD.pdparams |

Video:

| R@1 | R@5 | R@10 | Median R |
| :--: | :--: | :--: | :--: |
| 26.1 | 54.7 | 68.1 | 4 |
- T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval, Xiaohan Wang, Linchao Zhu, Yi Yang