feat: add model script and training configs of fastvit #741

Open · wants to merge 4 commits into base: main
91 changes: 91 additions & 0 deletions configs/fastvit/README.md
# FastViT
<!--- Guideline: use url linked to abstract in ArXiv instead of PDF for fast loading. -->

> [FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization](https://arxiv.org/abs/2303.14189)

## Introduction
<!--- Guideline: Introduce the model and architectures. Cite if you use/adopt paper explanation from others. -->

The recent amalgamation of transformer and convolutional designs has led to steady improvements in accuracy and efficiency of the models. In this work, we introduce FastViT, a hybrid vision transformer architecture that obtains the state-of-the-art latency-accuracy trade-off. To this end, we introduce a novel token mixing operator, RepMixer, a building block of FastViT, that uses structural reparameterization to lower the memory access cost by removing skip-connections in the network. We further apply train-time overparametrization and large kernel convolutions to boost accuracy and empirically show that these choices have minimal effect on latency. We show that our model is 3.5x faster than CMT, a recent state-of-the-art hybrid transformer architecture, 4.9x faster than EfficientNet, and 1.9x faster than ConvNeXt on a mobile device for the same accuracy on the ImageNet dataset. At similar latency, our model obtains 4.2% better Top-1 accuracy on ImageNet than MobileOne. Our model consistently outperforms competing architectures across several tasks -- image classification, detection, segmentation and 3D mesh regression with significant improvement in latency on both a mobile device and a desktop GPU. Furthermore, our model is highly robust to out-of-distribution samples and corruptions, improving over competing robust models. [1]
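
To make the structural reparameterization behind RepMixer concrete, here is a minimal NumPy sketch (illustrative only, not code from this PR) of folding the train-time identity skip-connection into a single depthwise convolution kernel, which is what lets the inference-time network drop the skip-connection entirely:

```python
import numpy as np

def reparameterize_repmixer(dw_kernel: np.ndarray) -> np.ndarray:
    """Fold the identity branch of y = x + dwconv(x) into the conv kernel.

    The identity map equals a depthwise conv with a centered Dirac kernel,
    so x + conv(x, W) can be computed by a single conv(x, W + I).
    BatchNorm folding is omitted here for brevity.

    dw_kernel: depthwise kernel of shape (channels, 1, K, K).
    """
    channels, _, k, _ = dw_kernel.shape
    identity = np.zeros_like(dw_kernel)
    identity[:, 0, k // 2, k // 2] = 1.0  # Dirac delta per channel
    return dw_kernel + identity
```

The reparameterized kernel reproduces the train-time output exactly while avoiding the extra memory traffic of the residual branch.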

<!--- Guideline: If an architecture table/figure is available in the paper, put one here and cite for intuitive illustration. -->

## Results
<!--- Guideline:
Table Format:
- Model: model name in lower case with _ separator.
- Context: Training context denoted as {device}x{pieces}-{MS mode}, where mindspore mode can be G - graph mode or F - pynative mode with ms function. For example, D910x8-G is for training on 8 pieces of Ascend 910 NPU using graph mode.
- Top-1 and Top-5: Keep 2 digits after the decimal point.
- Params (M): # of model parameters in millions (10^6). Keep 2 digits after the decimal point
- Recipe: Training recipe/configuration linked to a yaml config file. Use absolute url path.
- Download: url of the pretrained model weights. Use absolute url path.
-->

Our reproduced model performance on ImageNet-1K is reported as follows.

<div align="center">

| Model | Context | Top-1 (%) | Top-5 (%) | Params (M) | Recipe | Download |
|-----------|----------|-----------|-----------|------------|-----------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------|
| FastViT-T8 | D910x8-G | 74.25     | 91.97     | 48         | [yaml](https://github.com/mindspore-lab/mindcv/blob/main/configs/fastvit/fastvit_t8_ascend.yaml) |          |

</div>

#### Notes
- Context: Training context denoted as {device}x{pieces}-{MS mode}, where mindspore mode can be G - graph mode or F - pynative mode with ms function. For example, D910x8-G is for training on 8 pieces of Ascend 910 NPU using graph mode.
- Top-1 and Top-5: Accuracy reported on the validation set of ImageNet-1K.


## Quick Start
### Preparation

#### Installation
Please refer to the [installation instructions](https://github.com/mindspore-lab/mindcv#installation) in MindCV.

#### Dataset Preparation
Please download the [ImageNet-1K](https://www.image-net.org/challenges/LSVRC/2012/index.php) dataset for model training and validation.
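
MindCV's `imagenet` dataset loader assumes the standard folder layout, with one sub-directory per class under `train/` and `val/`. A quick sanity check of that layout (a hypothetical snippet, not part of MindCV):

```python
import os

# Assumed layout: <root>/train/<class>/*.JPEG and <root>/val/<class>/*.JPEG
root = "/path/to/imagenet"
for split in ("train", "val"):
    split_dir = os.path.join(root, split)
    classes = [d for d in os.listdir(split_dir)
               if os.path.isdir(os.path.join(split_dir, d))]
    print(f"{split}: {len(classes)} class folders")  # expect 1000 for ImageNet-1K
```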

### Training
<!--- Guideline: Avoid using shell script in the command line. Python script preferred. -->

* Distributed Training

It is easy to reproduce the reported results with the pre-defined training recipe. For distributed training on multiple Ascend 910 devices, please run:

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/fastvit/fastvit_t8_ascend.yaml --data_dir /path/to/imagenet
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

**Note:** As the global batch size (batch_size x num_devices) is an important hyper-parameter, it is recommended to keep the global batch size unchanged for reproduction or adjust the learning rate linearly to a new global batch size.
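
For example, this recipe sets `lr: 0.001` for `batch_size: 128` on 8 devices, i.e. a global batch size of 1024. A minimal sketch of the linear scaling rule (a hypothetical helper, not a MindCV utility):

```python
def scale_lr(base_lr: float, base_global_bs: int, new_global_bs: int) -> float:
    """Linear LR scaling: keep lr / global_batch_size constant."""
    return base_lr * new_global_bs / base_global_bs

# Recipe default: lr=0.001 at a global batch size of 128 x 8 = 1024.
print(scale_lr(0.001, 1024, 4 * 128))  # 4 devices -> 0.0005
print(scale_lr(0.001, 1024, 1 * 128))  # 1 device  -> 0.000125
```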

* Standalone Training

If you want to train or finetune the model on a smaller dataset without distributed training, please run:

```shell
# standalone training on a CPU/GPU/Ascend device
python train.py --config configs/fastvit/fastvit_t8_ascend.yaml --data_dir /path/to/dataset --distribute False
```

### Validation

To validate the accuracy of the trained model, you can use `validate.py` and pass the checkpoint path via `--ckpt_path`.

```shell
python validate.py -c configs/fastvit/fastvit_t8_ascend.yaml --data_dir /path/to/imagenet --ckpt_path /path/to/ckpt
```
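
For reference, Top-1/Top-5 in the results table are plain top-k accuracies over the ImageNet-1K validation set (50,000 images). A minimal NumPy sketch of the metric (illustrative only; the actual computation happens inside `validate.py`):

```python
import numpy as np

def topk_accuracy(logits: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Fraction of samples whose true label is among the k highest logits."""
    topk = np.argsort(logits, axis=1)[:, -k:]     # (N, k) indices of top-k classes
    hits = (topk == labels[:, None]).any(axis=1)  # is the true label among them?
    return float(hits.mean())

logits = np.random.randn(8, 1000)            # fake batch of 8 predictions
labels = np.random.randint(0, 1000, size=8)
print(topk_accuracy(logits, labels, 1), topk_accuracy(logits, labels, 5))
```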

### Deployment

To deploy online inference services with the trained model efficiently, please refer to the [deployment tutorial](https://mindspore-lab.github.io/mindcv/tutorials/deployment/).

## References
<!--- Guideline: Citation format GB/T 7714 is suggested. -->

[1] Vasu P K A, Gabriel J, Zhu J, et al. FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization[J]. arXiv preprint arXiv:2303.14189, 2023.
60 changes: 60 additions & 0 deletions configs/fastvit/fastvit_t8_ascend.yaml
# system
mode: 0
distribute: True
num_parallel_workers: 8
val_while_train: True
val_interval: 1
log_interval: 100

# dataset
dataset: "imagenet"
data_dir: "/path/to/imagenet"
shuffle: True
dataset_download: False
batch_size: 128

# augmentation
image_resize: 224
scale: [0.08, 1.0]
ratio: [0.75, 1.333]
hflip: 0.5
vflip: 0.0
interpolation: "bicubic"
re_prob: 0.1
mixup: 0.8
cutmix: 1.0
color_jitter: 0.4
auto_augment: "randaug-m7-mstd0.5"

# model
model: "fastvit_t8"
num_classes: 1000
pretrained: False
keep_checkpoint_max: 10
ckpt_save_policy: "latest_k"
ckpt_save_interval: 1
ckpt_save_dir: "./ckpt"
epoch_size: 300
dataset_sink_mode: True
ema_decay: 0.9995
amp_level: "O2"
loss_scale_type: "auto"

# loss
loss: "CE"
label_smoothing: 0.1

# lr scheduler
scheduler: "cosine_decay"
lr: 0.001
min_lr: 0.0
warmup_epochs: 5
warmup_factor: 0.01
decay_epochs: 295

# optimizer
opt: "adamw"
momentum: 0.9
weight_decay: 0.05
filter_bias_and_bn: True
use_nesterov: False
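
Conceptually, `train.py` reads this YAML as the set of defaults and lets command-line flags override individual keys (the authoritative merge logic lives in MindCV's [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py)). A simplified sketch of that mechanism, under those assumptions:

```python
import yaml  # requires PyYAML

def load_config(yaml_path: str, overrides: dict) -> dict:
    """Merge YAML recipe defaults with explicit overrides (simplified)."""
    with open(yaml_path) as f:
        cfg = yaml.safe_load(f)  # flat dict, e.g. {"model": "fastvit_t8", ...}
    cfg.update(overrides)        # command-line flags win over the recipe
    return cfg

# Equivalent in spirit to:
#   python train.py -c configs/fastvit/fastvit_t8_ascend.yaml --data_dir /path/to/imagenet
cfg = load_config("configs/fastvit/fastvit_t8_ascend.yaml",
                  {"data_dir": "/path/to/imagenet"})
print(cfg["model"], cfg["epoch_size"], cfg["lr"])  # fastvit_t8 300 0.001
```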