This repo contains the supported code and configuration files to reproduce object detection results of Swin Transformer. It is based on mmdetection.
Updates:

- 05/11/2021: Models for MoBY are released.
- 04/12/2021: Initial commits.
Mask R-CNN:

Backbone | Pretrain | Lr Schd | box mAP | mask mAP | #params | FLOPs | config | log | model |
---|---|---|---|---|---|---|---|---|---|
Swin-T | ImageNet-1K | 1x | 43.7 | 39.8 | 48M | 267G | config | github/baidu | github/baidu |
Swin-T | ImageNet-1K | 3x | 46.0 | 41.6 | 48M | 267G | config | github/baidu | github/baidu |
Swin-S | ImageNet-1K | 3x | 48.5 | 43.3 | 69M | 359G | config | github/baidu | github/baidu |
Cascade Mask R-CNN:

Backbone | Pretrain | Lr Schd | box mAP | mask mAP | #params | FLOPs | config | log | model |
---|---|---|---|---|---|---|---|---|---|
Swin-T | ImageNet-1K | 1x | 48.1 | 41.7 | 86M | 745G | config | github/baidu | github/baidu |
Swin-T | ImageNet-1K | 3x | 50.4 | 43.7 | 86M | 745G | config | github/baidu | github/baidu |
Swin-S | ImageNet-1K | 3x | 51.9 | 45.0 | 107M | 838G | config | github/baidu | github/baidu |
Swin-B | ImageNet-1K | 3x | 51.9 | 45.0 | 145M | 982G | config | github/baidu | github/baidu |
RepPoints V2:

Backbone | Pretrain | Lr Schd | box mAP | mask mAP | #params | FLOPs | config | log | model |
---|---|---|---|---|---|---|---|---|---|
Swin-T | ImageNet-1K | 3x | 50.0 | - | 45M | 283G | config | github | github |
Mask RepPoints V2:

Backbone | Pretrain | Lr Schd | box mAP | mask mAP | #params | FLOPs | config | log | model |
---|---|---|---|---|---|---|---|---|---|
Swin-T | ImageNet-1K | 3x | 50.4 | 43.8 | 47M | 292G | config | github | github |
Notes:

- Pre-trained models can be downloaded from Swin Transformer for ImageNet Classification.
- The access code for baidu is `swin`.
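As a minimal sketch, a downloaded classification checkpoint can be wired into a detection config via `model.pretrained` (the path below is illustrative, not a file shipped with this repo):

```python
# illustrative override in a config file; the checkpoint path is an assumption
model = dict(pretrained='checkpoints/swin_tiny_patch4_window7_224.pth')
```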
Mask R-CNN (MoBY pre-trained):

Backbone | Pretrain | Lr Schd | box mAP | mask mAP | #params | FLOPs | config | log | model |
---|---|---|---|---|---|---|---|---|---|
Swin-T | ImageNet-1K | 1x | 43.6 | 39.6 | 48M | 267G | config | github/baidu | github/baidu |
Swin-T | ImageNet-1K | 3x | 46.0 | 41.7 | 48M | 267G | config | github/baidu | github/baidu |
Cascade Mask R-CNN (MoBY pre-trained):

Backbone | Pretrain | Lr Schd | box mAP | mask mAP | #params | FLOPs | config | log | model |
---|---|---|---|---|---|---|---|---|---|
Swin-T | ImageNet-1K | 1x | 48.1 | 41.5 | 86M | 745G | config | github/baidu | github/baidu |
Swin-T | ImageNet-1K | 3x | 50.2 | 43.5 | 86M | 745G | config | github/baidu | github/baidu |
Notes:
- The drop path rate needs to be tuned for best performance; an example override is sketched after this list.
- MoBY pre-trained models can be downloaded from MoBY with Swin Transformer.
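A minimal sketch of such an override, assuming the backbone config exposes `drop_path_rate` as in the main Swin repo (0.2 is only a placeholder value):

```python
# placeholder value; tune per model, head, and schedule
model = dict(backbone=dict(drop_path_rate=0.2))
```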
Please refer to get_started.md for installation and dataset preparation.
To evaluate a detector, run:

```bash
# single-gpu testing
python tools/test.py <CONFIG_FILE> <DET_CHECKPOINT_FILE> --eval bbox segm

# multi-gpu testing
tools/dist_test.sh <CONFIG_FILE> <DET_CHECKPOINT_FILE> <GPU_NUM> --eval bbox segm
```
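For example, evaluating the 3x Mask R-CNN Swin-T model on COCO with 8 GPUs might look like this (the checkpoint filename is illustrative; the config path follows this repo's naming pattern):

```bash
tools/dist_test.sh configs/swin/mask_rcnn_swin_tiny_patch4_window7_mstrain_480-800_adamw_3x_coco.py \
    mask_rcnn_swin_tiny_3x.pth 8 --eval bbox segm
```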
To train a detector with pre-trained models, run:
```bash
# single-gpu training
python tools/train.py <CONFIG_FILE> --cfg-options model.pretrained=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

# multi-gpu training
tools/dist_train.sh <CONFIG_FILE> <GPU_NUM> --cfg-options model.pretrained=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]
```
For example, to train a Cascade Mask R-CNN model with a Swin-T backbone and 8 GPUs, run:

```bash
tools/dist_train.sh configs/swin/cascade_mask_rcnn_swin_tiny_patch4_window7_mstrain_480-800_giou_4conv1f_adamw_3x_coco.py 8 --cfg-options model.pretrained=<PRETRAIN_MODEL>
```
Note: `use_checkpoint` is used to save GPU memory. Please refer to this page for more details.
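For instance, to enable it together with a pre-trained backbone:

```bash
tools/dist_train.sh <CONFIG_FILE> <GPU_NUM> --cfg-options model.pretrained=<PRETRAIN_MODEL> model.backbone.use_checkpoint=True
```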
We use apex for mixed precision training by default. To install apex, run:
```bash
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```
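If the CUDA/C++ extension build fails on your toolchain, apex's README also documents a Python-only build (slower, but functional):

```bash
pip install -v --disable-pip-version-check --no-cache-dir ./
```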
If you would like to disable apex, change the runner type to `EpochBasedRunner` and comment out the following code block in the configuration files:

```python
# do not use mmdet version fp16
fp16 = None
optimizer_config = dict(
    type="DistOptimizerHook",
    update_interval=1,
    grad_clip=None,
    coalesce=True,
    bucket_size_mb=-1,
    use_fp16=True,
)
```
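For reference, a minimal sketch of the resulting non-apex setup, assuming standard mmdetection defaults (`max_epochs=36` corresponds to a 3x schedule):

```python
# plain mmdetection training setup without apex
runner = dict(type='EpochBasedRunner', max_epochs=36)
optimizer_config = dict(grad_clip=None)  # default optimizer hook, no fp16
```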
To cite Swin Transformer:

```bibtex
@article{liu2021Swin,
  title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
  author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},
  journal={arXiv preprint arXiv:2103.14030},
  year={2021}
}
```
Other links:

- Image Classification: See Swin Transformer for Image Classification.
- Semantic Segmentation: See Swin Transformer for Semantic Segmentation.
- Self-Supervised Learning: See MoBY with Swin Transformer.
- Video Recognition: See Video Swin Transformer.
I added a Swin Transformer MoE backbone (referred to as Swin-T MoE hereafter). MoE (Mixture of Experts) is a technique that expands a model's parameter count, which can improve performance, by routing each token to a small subset of expert sub-networks. The Swin-T MoE implementation uses Microsoft's Tutel framework. A minimal illustration of the MoE idea is sketched below.
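To make the routing idea concrete, here is a tiny, self-contained top-1 gated MoE feed-forward layer in PyTorch. It is purely illustrative (all names are hypothetical) and is not the Tutel or Swin-T MoE implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Minimal top-1 gated mixture-of-experts FFN (illustration only)."""

    def __init__(self, dim, num_experts=4, hidden=256):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)  # router: token -> expert scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (num_tokens, dim)
        probs = F.softmax(self.gate(x), dim=-1)  # routing probabilities
        top_p, top_i = probs.max(dim=-1)         # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_i == e                    # tokens routed to expert e
            if mask.any():
                out[mask] = top_p[mask].unsqueeze(-1) * expert(x[mask])
        return out

# each token only pays for one expert, but capacity grows with num_experts
moe = TinyMoE(dim=96)
tokens = torch.randn(10, 96)
print(moe(tokens).shape)  # torch.Size([10, 96])
```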
Install Tutel:

```bash
python3 -m pip uninstall tutel -y
python3 -m pip install --user --upgrade git+https://github.com/microsoft/tutel@main
```
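A quick sanity check that the install worked, assuming the `tutel.moe` module path documented in Tutel's README:

```bash
python3 -c "from tutel import moe; print('tutel OK')"
```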
You can check out Swin-T MoE at `.\mmdet\models\backbones\swin_transformer_moe.py`.
I provided the relevant configuration file for reference; it contains the parameters and modified configuration for the Swin-T MoE backbone network:

`.\configs\swin\cascade_mask_rcnn_swin_moe_tiny_patch4_window7_mstrain_480-800_giou_4conv1f_adamw_3x_coco.py`
As the output of Swin-T MoE differs from that of Swin-T, I modified the `extract_feat` function in `.\mmdet\models\detectors\two_stage.py`; a sketch of the idea follows.
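The exact diff is not reproduced here; the following is a hypothetical sketch of what the change could look like, assuming the MoE backbone returns a `(feature_maps, moe_aux_loss)` pair while the plain backbone returns feature maps only:

```python
def extract_feat(self, img):
    """Extract backbone + neck features, unpacking an MoE auxiliary loss.

    Hypothetical sketch: the guard below distinguishes a (features, loss)
    pair from the usual tuple of per-stage feature maps.
    """
    out = self.backbone(img)
    if isinstance(out, tuple) and len(out) == 2 and isinstance(out[0], (list, tuple)):
        feats, self.moe_aux_loss = out  # stash the aux loss for training
    else:
        feats, self.moe_aux_loss = out, None
    if self.with_neck:
        feats = self.neck(feats)
    return feats
```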
You can change the config according to your needs.
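Training then follows the same multi-GPU pattern shown earlier, e.g. with 8 GPUs:

```bash
tools/dist_train.sh configs/swin/cascade_mask_rcnn_swin_moe_tiny_patch4_window7_mstrain_480-800_giou_4conv1f_adamw_3x_coco.py 8 --cfg-options model.pretrained=<PRETRAIN_MODEL>
```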