Feature Extraction

Most of the code comes from VideoMAEv2: Scaling Video Masked Autoencoders with Dual Masking.

Prepare

export BASE_DIR=/xxx/AIcity2024-track3
cd $BASE_DIR/feature_extraction

Installation

Please follow the instructions in INSTALL.md.

Pretrain Weights

For the final competition we used an open-source pretrained weight. Downloading it from the original source requires submitting an application, so we provide the link below to get the weight directly.

Download vit_g_hybrid_pt_1200e_k710_ft and put it into $BASE_DIR/feature_extraction/weights.

Fine-tune model on official data

To fine-tune the model (VideoMAEv2 initialized with the pretrained weight mentioned above) on the official dataset A1 with 8xA100-40G GPUs, use the following command:

bash scripts/finetune/track3_vit_g_A1_ft.sh

We also provide a fine-tuned model: videomae-v2_finetune_aicity.pth 🤗. You can download it and put it into $BASE_DIR/feature_extraction/weights for feature extraction.
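Before running the later steps, it can help to confirm that the downloaded weights landed in the expected directory. The following is a minimal sketch, not part of the repository: the helper name `find_missing_weights` is hypothetical, and it assumes `BASE_DIR` is exported as described in the Prepare step.

```python
import os
from pathlib import Path


def find_missing_weights(weights_dir, filenames):
    """Return the entries of `filenames` that are not regular files in `weights_dir`.

    Hypothetical helper for illustration; it is not provided by the repository.
    """
    weights_dir = Path(weights_dir)
    return [name for name in filenames if not (weights_dir / name).is_file()]


if __name__ == "__main__":
    # Assumes BASE_DIR was exported as in the Prepare step; falls back to ".".
    base_dir = os.environ.get("BASE_DIR", ".")
    weights_dir = Path(base_dir) / "feature_extraction" / "weights"
    # Only the fine-tuned checkpoint name is taken from this document.
    missing = find_missing_weights(weights_dir, ["videomae-v2_finetune_aicity.pth"])
    if missing:
        print("Missing in", weights_dir, ":", ", ".join(missing))
    else:
        print("All expected weights found in", weights_dir)
```

Add the pretrained weight's file name to the list if you plan to run the fine-tuning step yourself.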

Inference to extract video features

You can extract video features by running:

# extract video feats for A1
python inference_video_feature_vitg.py \
    --video_dir $BASE_DIR/data/crop_videos/A1 \
    --ckpt_pth weights/videomae-v2_finetune_aicity.pth \
    --output_dir $BASE_DIR/data/extracted_features/A1

# extract video feats for A2
python inference_video_feature_vitg.py \
    --video_dir $BASE_DIR/data/crop_videos/A2 \
    --ckpt_pth weights/videomae-v2_finetune_aicity.pth \
    --output_dir $BASE_DIR/data/extracted_features/A2
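The two commands above differ only in the dataset name, so they can be generated in a loop. The sketch below is one way to do this from Python. It uses only the script name and flags shown above; the function names `build_extract_cmd` and `extract_all` are hypothetical, and it assumes the script is run from `$BASE_DIR/feature_extraction`.

```python
import os
import subprocess
import sys

# Checkpoint path as used in the commands above (relative to feature_extraction/).
DEFAULT_CKPT = "weights/videomae-v2_finetune_aicity.pth"


def build_extract_cmd(base_dir, split, ckpt=DEFAULT_CKPT):
    """Build the argument list for one dataset split (e.g. "A1" or "A2")."""
    return [
        sys.executable, "inference_video_feature_vitg.py",
        "--video_dir", f"{base_dir}/data/crop_videos/{split}",
        "--ckpt_pth", ckpt,
        "--output_dir", f"{base_dir}/data/extracted_features/{split}",
    ]


def extract_all(base_dir, splits=("A1", "A2"), dry_run=True):
    """Print (dry_run=True) or execute the extraction command for each split."""
    for split in splits:
        cmd = build_extract_cmd(base_dir, split)
        if dry_run:
            print(" ".join(cmd))
        else:
            # Assumption: the script lives in $BASE_DIR/feature_extraction.
            subprocess.run(cmd, check=True, cwd=f"{base_dir}/feature_extraction")


if __name__ == "__main__":
    # Dry run by default, so nothing is executed until dry_run=False is passed.
    extract_all(os.environ.get("BASE_DIR", "."), dry_run=True)
```

Running the sketch as-is only prints the commands; pass `dry_run=False` to `extract_all` to launch the extraction.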

We also provide extracted features for datasets A1 and A2 here: extracted feats. Download them and put them into $BASE_DIR/data/extracted_features/.
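Whether the features are extracted locally or downloaded, a quick coverage check can catch videos that have no matching feature file. The sketch below rests on an assumption that this document does not confirm: that each video produces one feature file with the same base name (extension aside). The helper name `videos_without_features` is hypothetical.

```python
import os
from pathlib import Path


def videos_without_features(video_dir, feature_dir):
    """List video file names whose stem has no matching file in `feature_dir`.

    Assumption (not confirmed by this document): one feature file per video,
    sharing the video's base name.
    """
    feature_dir = Path(feature_dir)
    feature_stems = set()
    if feature_dir.is_dir():
        feature_stems = {p.stem for p in feature_dir.iterdir() if p.is_file()}
    return sorted(
        p.name
        for p in Path(video_dir).iterdir()
        if p.is_file() and p.stem not in feature_stems
    )


if __name__ == "__main__":
    base_dir = Path(os.environ.get("BASE_DIR", "."))
    for split in ("A1", "A2"):
        video_dir = base_dir / "data" / "crop_videos" / split
        if not video_dir.is_dir():
            print(split, ": video directory not found:", video_dir)
            continue
        missing = videos_without_features(
            video_dir, base_dir / "data" / "extracted_features" / split
        )
        print(split, ":", len(missing), "video(s) without a feature file")
```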
