
iKUN: Speak to Trackers without Retraining

[arXiv paper]

(Figure: iKUN framework overview)

Abstract

Referring multi-object tracking (RMOT) aims to track multiple objects based on input textual descriptions. Previous works realize it by simply integrating an extra textual module into the multi-object tracker. However, they typically need to retrain the entire framework and have difficulties in optimization. In this work, we propose an insertable Knowledge Unification Network, termed iKUN, to enable communication with off-the-shelf trackers in a plug-and-play manner. Concretely, a knowledge unification module (KUM) is designed to adaptively extract visual features based on textual guidance. Meanwhile, to improve the localization accuracy, we present a neural version of Kalman filter (NKF) to dynamically adjust process noise and observation noise based on the current motion status. Moreover, to address the problem of open-set long-tail distribution of textual descriptions, a test-time similarity calibration method is proposed to refine the confidence score with pseudo frequency. Extensive experiments on Refer-KITTI dataset verify the effectiveness of our framework. Finally, to speed up the development of RMOT, we also contribute a more challenging dataset, Refer-Dance, by extending public DanceTrack dataset with motion and dressing descriptions.

Experiments

(Figure: experimental results)

Data Preparation

Download Refer-KITTI and our prepared files. Please organize them as follows:

path_to_data_and_files
├── CLIP
│   ├── RN50.pt
│   └── ViT-B-32.pt
├── NeuralSORT
│   ├── 0005
│   ├── 0011
│   └── 0013
├── Refer-KITTI
│   ├── gt_template
│   ├── expression
│   └── KITTI
│       ├── labels_with_ids
│       └── training
├── iKUN.pth
├── iKUN_cascade_attention.pth
├── iKUN_cross_correlation.pth
├── iKUN_test-first_modulation.pth
├── Refer-KITTI_labels.json
└── textual_features.json

Then set the default value of --save_root in opts.py to your path_to_data_and_files.
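
For reference, a minimal sketch of how that option is typically declared (the actual contents of opts.py may differ; only the --save_root name comes from this README, the surrounding argparse scaffolding is illustrative):

# opts.py (illustrative sketch; only --save_root is taken from this README)
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    '--save_root',
    type=str,
    default='/your/path_to_data_and_files',  # point this at your data root
)
opt = parser.parse_args()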

You can download our constructed Refer-Dance dataset from Baidu Disk.

Requirements

  • python==3.8
  • torch==2.0.1
  • torchvision==0.15.2
  • tensorboard==2.13.0
  • numpy==1.21.0
  • einops==0.6.1
  • ftfy==6.1.1
  • regex==2023.5.5
  • tqdm==4.65.0
  • clip==1.0

Here is a from-scratch setup script:

conda create python=3.8.16 -n iKUN_Git --y
conda activate iKUN_Git
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.7 -c pytorch -c nvidia --y
pip install six==1.16.0
pip install tensorboard==2.13.0
pip install einops==0.6.1
pip install ftfy regex tqdm
pip install git+https://github.com/openai/CLIP.git # or setup from your local CLIP with `python setup.py develop`

Note: you need to slightly modify the source code of CLIP following issue #12.
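
Once the environment is ready, a quick sanity check can confirm that the pinned packages import and that the RN50 weights load. This is a minimal sketch assuming the standard openai/CLIP API; the example text query is illustrative only:

# sanity_check.py (sketch): verify the environment after installation
import torch
import clip  # installed from openai/CLIP above

print(torch.__version__)          # expect 2.0.1
print(torch.cuda.is_available())  # True if the CUDA 11.7 build matches your driver

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, preprocess = clip.load('RN50', device=device)  # caches RN50.pt if missing
with torch.no_grad():
    text_feat = model.encode_text(clip.tokenize(['cars in red']).to(device))
print(text_feat.shape)  # torch.Size([1, 1024]) for RN50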

Test

For direct testing, you can run the following command to generate the results of the baseline model:

python test.py --test_ckpt iKUN.pth

To test the three designs of KUM, you can run:

python test.py --kum_mode 'cascade attention' --test_ckpt iKUN_cascade_attention.pth
python test.py --kum_mode 'cross correlation' --test_ckpt iKUN_cross_correlation.pth
python test.py --kum_mode 'text-first modulation' --test_ckpt iKUN_test-first_modulation.pth

To run the full iKUN, i.e., cascade attention & similarity calibration, please run:

python test.py --kum_mode 'cascade attention' --test_ckpt iKUN_cascade_attention.pth --similarity_calibration

Then you can evaluate the results following the official evaluation commands, obtaining 44.56% HOTA, 32.05% DetA, and 62.48% AssA.
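
As a rough sanity check on these numbers: at each localization threshold, HOTA is defined as the geometric mean of DetA and AssA, so the aggregated scores should approximately satisfy the same relation:

# Rough consistency check of the reported scores
deta, assa = 0.3205, 0.6248
print((deta * assa) ** 0.5)  # ~0.4475, close to the reported HOTA of 44.56%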

Train

You can run the following command to train the baseline model:

python train.py --exp_name my_exp

Then generate the test results by running:

python test.py --exp_name my_exp --test_ckpt 'my_exp/epoch99.pth'

To train the three designs of KUM, you can run:

python train.py --exp_name my_exp --kum_mode 'cascade attention'
python train.py --exp_name my_exp --kum_mode 'cross correlation'
python train.py --exp_name my_exp --kum_mode 'text-first modulation'

Citation

@InProceedings{Du_2024_CVPR,
    author    = {Du, Yunhao and Lei, Cheng and Zhao, Zhicheng and Su, Fei},
    title     = {iKUN: Speak to Trackers without Retraining},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {19135-19144}
}
