简体中文 | English

PaddlePaddle Vision Transformers


State-of-the-art Visual Transformer and MLP Models for PaddlePaddle

🤖 PaddlePaddle Visual Transformers (PaddleViT, or PPViT) provides developers with high-performance Transformer model implementations for computer vision. Our implementations cover Visual Transformers, Visual Attentions, and MLP-based vision models. PaddleViT also integrates the layers, utilities, optimizers, schedulers, data augmentations, and training/validation scripts commonly used with PaddlePaddle 2.1+. We keep track of SOTA ViT and MLP models and provide complete training and evaluation code. The core goal of PaddleViT is to give users easy access to state-of-the-art CV algorithms.

🤖 PaddleViT provides models and tools for multiple vision tasks, such as image classification, object detection, semantic segmentation, and GAN. Each model architecture is defined in a standalone Python module, so users can quickly start research and experiments. We also provide pretrained weight files that you can load and fine-tune on your own datasets. PaddleViT further integrates common tools and modules for custom datasets, data preprocessing, performance evaluation, and distributed training.

🤖 PaddleViT is built on the deep learning framework PaddlePaddle. Project tutorials on Paddle AI Studio are also provided (coming soon), making it easy for new users to get started.

Visual Tasks

PaddleViT provides models and tools for a number of vision tasks; please visit the corresponding links for details.

We also provide tutorials:

  • Free PaddleViT online course: here

Features

  1. Complete implementations of SOTA models

    • SOTA Transformer models for multiple CV tasks
    • High-performance data processing and training methods
    • Continuously updated implementations of the latest SOTA algorithms
  2. Easy-to-use tools

    • Model variants built through simple configuration
    • Modular design of utility functions and tools
    • Low barrier of entry for educators and practitioners
    • All models implemented in a unified framework
  3. Easy customization

    • Best practices provided for the implementation of every model
    • Model implementations that are easy to adapt to custom configurations
    • Model files can be used standalone for quick reproduction
  4. High performance

    • DDP (multi-process training/validation, one process per GPU)
    • Mixed-precision (AMP) training; see the sketch after this list
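
The following is a minimal, self-contained sketch of how AMP and multi-GPU (DDP) training are typically wired together in PaddlePaddle 2.x. It is illustrative only: the toy model and synthetic batches are placeholders, not PaddleViT's actual training loop (see the main_single_gpu.py / main_multi_gpu.py scripts in each model folder for that).

import paddle
import paddle.distributed as dist

def main():
    # one process per GPU; launch with: python -m paddle.distributed.launch train.py
    dist.init_parallel_env()

    # toy stand-in for a PaddleViT classifier
    model = paddle.nn.Sequential(paddle.nn.Flatten(), paddle.nn.Linear(3 * 224 * 224, 1000))
    model = paddle.DataParallel(model)  # synchronize gradients across processes
    criterion = paddle.nn.CrossEntropyLoss()
    optimizer = paddle.optimizer.AdamW(learning_rate=1e-3, parameters=model.parameters())
    scaler = paddle.amp.GradScaler(init_loss_scaling=1024)

    model.train()
    for step in range(10):  # synthetic batches instead of a real DataLoader
        images = paddle.randn([8, 3, 224, 224])
        labels = paddle.randint(0, 1000, [8])
        with paddle.amp.auto_cast():  # run eligible ops in float16
            loss = criterion(model(images), labels)
        scaled = scaler.scale(loss)  # scale the loss to avoid fp16 gradient underflow
        scaled.backward()
        scaler.minimize(optimizer, scaled)  # unscale gradients and apply the update
        optimizer.clear_grad()

if __name__ == '__main__':
    main()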

ViT Models

Image Classification (Transformers)

  1. ViT (from Google), released with paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
  2. DeiT (from Facebook and Sorbonne), released with paper Training data-efficient image transformers & distillation through attention, by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
  3. Swin Transformer (from Microsoft), released with paper Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
  4. VOLO (from Sea AI Lab and NUS), released with paper VOLO: Vision Outlooker for Visual Recognition, by Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, Shuicheng Yan.
  5. CSwin Transformer (from USTC and Microsoft), released with paper CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows, by Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, Baining Guo.
  6. CaiT (from Facebook and Sorbonne), released with paper Going deeper with Image Transformers, by Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, Hervé Jégou.
  7. PVTv2 (from NJU/HKU/NJUST/IIAI/SenseTime), released with paper PVTv2: Improved Baselines with Pyramid Vision Transformer, by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao.
  8. Shuffle Transformer (from Tencent), released with paper Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer, by Zilong Huang, Youcheng Ben, Guozhong Luo, Pei Cheng, Gang Yu, Bin Fu.
  9. T2T-ViT (from NUS and YITU), released with paper Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet, by Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis EH Tay, Jiashi Feng, Shuicheng Yan.
  10. CrossViT (from IBM), released with paper CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification, by Chun-Fu Chen, Quanfu Fan, Rameswar Panda.
  11. BEiT (from Microsoft Research), released with paper BEiT: BERT Pre-Training of Image Transformers, by Hangbo Bao, Li Dong, Furu Wei.
  12. Focal Transformer (from Microsoft), released with paper Focal Self-attention for Local-Global Interactions in Vision Transformers, by Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao.
  13. Mobile-ViT (from Apple), released with paper MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer, by Sachin Mehta, Mohammad Rastegari.
  14. ViP (from National University of Singapore), released with Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition, by Qibin Hou and Zihang Jiang and Li Yuan and Ming-Ming Cheng and Shuicheng Yan and Jiashi Feng.
  15. XCiT (from Facebook/Inria/Sorbonne), released with paper XCiT: Cross-Covariance Image Transformers, by Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, Hervé Jegou.
  16. PiT (from NAVER/Sogang University), released with paper Rethinking Spatial Dimensions of Vision Transformers, by Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, Seong Joon Oh.
  17. HaloNet, (from Google), released with paper Scaling Local Self-Attention for Parameter Efficient Visual Backbones, by Ashish Vaswani, Prajit Ramachandran, Aravind Srinivas, Niki Parmar, Blake Hechtman, Jonathon Shlens.
  18. PoolFormer, (from Sea AI Lab/NUS), released with paper MetaFormer is Actually What You Need for Vision, by Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, Shuicheng Yan.
  19. BoTNet, (from UC Berkeley/Google), released with paper Bottleneck Transformers for Visual Recognition, by Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, Ashish Vaswani.
  20. CvT (from McGill/Microsoft), released with paper CvT: Introducing Convolutions to Vision Transformers, by Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang.
  21. HvT (from Monash University), released with paper Scalable Vision Transformers with Hierarchical Pooling, by Zizheng Pan, Bohan Zhuang, Jing Liu, Haoyu He, Jianfei Cai.
  22. TopFormer (from HUST/Tencent/Fudan/ZJU), released with paper TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation, by Wenqiang Zhang, Zilong Huang, Guozhong Luo, Tao Chen, Xinggang Wang, Wenyu Liu, Gang Yu, Chunhua Shen.
  23. ConvNeXt (from FAIR/UCBerkeley), released with paper A ConvNet for the 2020s, by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie.
  24. CoaT (from UCSD), released with paper Co-Scale Conv-Attentional Image Transformers, by Weijian Xu, Yifan Xu, Tyler Chang, Zhuowen Tu.
  25. ResT (from NJU), released with paper ResT: An Efficient Transformer for Visual Recognition, by Qinglong Zhang, Yubin Yang.
  26. ResTV2 (from NJU), released with paper ResT V2: Simpler, Faster and Stronger, by Qinglong Zhang, Yubin Yang.

Image Classification (MLP & others)

  1. MLP-Mixer (from Google), released with paper MLP-Mixer: An all-MLP Architecture for Vision, by Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy
  2. ResMLP (from Facebook/Sorbonne/Inria/Valeo), released with paper ResMLP: Feedforward networks for image classification with data-efficient training, by Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, Hervé Jégou.
  3. gMLP (from Google), released with paper Pay Attention to MLPs, by Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le.
  4. FF Only (from Oxford), released with paper Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet, by Luke Melas-Kyriazi.
  5. RepMLP (from BNRist/Tsinghua/MEGVII/Aberystwyth), released with paper RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition, by Xiaohan Ding, Chunlong Xia, Xiangyu Zhang, Xiaojie Chu, Jungong Han, Guiguang Ding.
  6. CycleMLP (from HKU/SenseTime), released with paper CycleMLP: A MLP-like Architecture for Dense Prediction, by Shoufa Chen, Enze Xie, Chongjian Ge, Ding Liang, Ping Luo.
  7. ConvMixer (from Anonymous), released with Patches Are All You Need?, by Anonymous.
  8. ConvMLP (from UO/UIUC/PAIR), released with ConvMLP: Hierarchical Convolutional MLPs for Vision, by Jiachen Li, Ali Hassani, Steven Walton, Humphrey Shi.
  9. RepLKNet (from Tsinghua/MEGVII/Aberystwyth), released with Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs, by Xiaohan Ding, Xiangyu Zhang, Yizhuang Zhou, Jungong Han, Guiguang Ding, Jian Sun.
  10. MobileOne (from Apple), released with An Improved One millisecond Mobile Backbone, by Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel, Anurag Ranjan.

Coming soon:

  1. DynamicViT (from Tsinghua/UCLA/UW), released with paper DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification, by Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, Cho-Jui Hsieh.

Object Detection

  1. DETR (from Facebook), released with paper End-to-End Object Detection with Transformers, by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko.
  2. Swin Transformer (from Microsoft), released with paper Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
  3. PVTv2 (from NJU/HKU/NJUST/IIAI/SenseTime), released with paper PVTv2: Improved Baselines with Pyramid Vision Transformer, by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao.

Coming soon:

  1. Focal Transformer (from Microsoft), released with paper Focal Self-attention for Local-Global Interactions in Vision Transformers, by Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao.
  2. UP-DETR (from Tencent), released with paper UP-DETR: Unsupervised Pre-training for Object Detection with Transformers, by Zhigang Dai, Bolun Cai, Yugeng Lin, Junying Chen.

Semantic Segmentation

Available models:

  1. SETR (from Fudan/Oxford/Surrey/Tencent/Facebook), released with paper Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers, by Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H.S. Torr, Li Zhang.
  2. DPT (from Intel), released with paper Vision Transformers for Dense Prediction, by René Ranftl, Alexey Bochkovskiy, Vladlen Koltun.
  3. Swin Transformer (from Microsoft), released with paper Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
  4. Segmenter (from Inria), released with paper Segmenter: Transformer for Semantic Segmentation, by Robin Strudel, Ricardo Garcia, Ivan Laptev, Cordelia Schmid.
  5. Trans2seg (from HKU/Sensetime/NJU), released with paper Segmenting Transparent Object in the Wild with Transformer, by Enze Xie, Wenjia Wang, Wenhai Wang, Peize Sun, Hang Xu, Ding Liang, Ping Luo.
  6. SegFormer (from HKU/NJU/NVIDIA/Caltech), released with paper SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers, by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
  7. CSwin Transformer (from USTC and Microsoft), released with paper CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
  8. TopFormer (from HUST/Tencent/Fudan/ZJU), released with paper TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation

Coming soon:

  1. FTN (from Baidu), released with paper Fully Transformer Networks for Semantic Image Segmentation, by Sitong Wu, Tianyi Wu, Fangjian Lin, Shengwei Tian, Guodong Guo.
  2. Shuffle Transformer (from Tencent), released with paper Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer, by Zilong Huang, Youcheng Ben, Guozhong Luo, Pei Cheng, Gang Yu, Bin Fu.
  3. Focal Transformer (from Microsoft), released with paper Focal Self-attention for Local-Global Interactions in Vision Transformers, by Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao.
  4. CSwin Transformer (from USTC and Microsoft), released with paper CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (https://arxiv.org/abs/2107.00652), by Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, Baining Guo.

GAN

  1. TransGAN (from UT-Austin and MIT-IBM Watson AI Lab), released with paper TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up, by Yifan Jiang, Shiyu Chang, Zhangyang Wang.
  2. Styleformer (from Seoul National University), released with paper Styleformer: Transformer based Generative Adversarial Networks with Style Vector, by Jeeseung Park, Younggeun Kim.

Coming soon:

  1. ViTGAN (from UCSD/Google), released with paper ViTGAN: Training GANs with Vision Transformers, by Kwonjoon Lee, Huiwen Chang, Lu Jiang, Han Zhang, Zhuowen Tu, Ce Liu.

Installation

Prerequisites

  • Linux/MacOS/Windows
  • Python 3.6/3.7
  • PaddlePaddle 2.1.0+
  • CUDA10.2+

Note: it is recommended to install the latest version of PaddlePaddle to avoid some CUDA errors when training PaddleViT. For installing the PaddlePaddle stable release, please refer to this link; for the PaddlePaddle develop version, please refer to this link.

Installation

  1. Create and activate a conda virtual environment.

    conda create -n paddlevit python=3.7 -y
    conda activate paddlevit
  2. Install PaddlePaddle following the official instructions, e.g.,

    conda install paddlepaddle-gpu==2.1.2 cudatoolkit=10.2 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/

    Note: please change the paddlepaddle and cuda versions according to your environment.

  3. Install dependencies.

    • General dependencies:
      pip install yacs pyyaml
      
    • Dependencies for segmentation:
      pip install cityscapesScripts
      
      Install the detail package:
      git clone https://github.com/ccvl/detail-api
      cd detail-api/PythonAPI
      make
      make install
    • Dependencies for GAN:
      pip install lmdb
      
  4. Clone the project from GitHub (a quick installation check is sketched below):

    git clone https://github.com/BR-IDL/PaddleViT.git 
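
    After installation, you can optionally run a quick sanity check of the PaddlePaddle build. This is a generic PaddlePaddle check, not a PaddleViT-specific script:

    # verify that PaddlePaddle is installed and, if applicable, can see the GPU
    import paddle
    paddle.utils.run_check()   # runs a small built-in test program
    print(paddle.__version__)
    print('compiled with CUDA:', paddle.is_compiled_with_cuda())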
    

Pretrained Models and Downloads (Model Zoo)

Image Classification

| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop pct | Interp | Link |
|---|---|---|---|---|---|---|---|---|
| vit_base_patch32_224 | 80.68 | 95.61 | 88.2M | 4.4G | 224 | 0.875 | bicubic | google/baidu(ubyr) |
| vit_base_patch32_384 | 83.35 | 96.84 | 88.2M | 12.7G | 384 | 1.0 | bicubic | google/baidu(3c2f) |
| vit_base_patch16_224 | 84.58 | 97.30 | 86.4M | 17.0G | 224 | 0.875 | bicubic | google/baidu(qv4n) |
| vit_base_patch16_384 | 85.99 | 98.00 | 86.4M | 49.8G | 384 | 1.0 | bicubic | google/baidu(wsum) |
| vit_large_patch16_224 | 85.81 | 97.82 | 304.1M | 59.9G | 224 | 0.875 | bicubic | google/baidu(1bgk) |
| vit_large_patch16_384 | 87.08 | 98.30 | 304.1M | 175.9G | 384 | 1.0 | bicubic | google/baidu(5t91) |
| vit_large_patch32_384 | 81.51 | 96.09 | 306.5M | 44.4G | 384 | 1.0 | bicubic | google/baidu(ieg3) |
| swin_t_224 | 81.37 | 95.54 | 28.3M | 4.4G | 224 | 0.9 | bicubic | google/baidu(h2ac) |
| swin_s_224 | 83.21 | 96.32 | 49.6M | 8.6G | 224 | 0.9 | bicubic | google/baidu(ydyx) |
| swin_b_224 | 83.60 | 96.46 | 87.7M | 15.3G | 224 | 0.9 | bicubic | google/baidu(h4y6) |
| swin_b_384 | 84.48 | 96.89 | 87.7M | 45.5G | 384 | 1.0 | bicubic | google/baidu(7nym) |
| swin_b_224_22kto1k | 85.27 | 97.56 | 87.7M | 15.3G | 224 | 0.9 | bicubic | google/baidu(6ur8) |
| swin_b_384_22kto1k | 86.43 | 98.07 | 87.7M | 45.5G | 384 | 1.0 | bicubic | google/baidu(9squ) |
| swin_l_224_22kto1k | 86.32 | 97.90 | 196.4M | 34.3G | 224 | 0.9 | bicubic | google/baidu(nd2f) |
| swin_l_384_22kto1k | 87.14 | 98.23 | 196.4M | 100.9G | 384 | 1.0 | bicubic | google/baidu(5g5e) |
| deit_tiny_distilled_224 | 74.52 | 91.90 | 5.9M | 1.1G | 224 | 0.875 | bicubic | google/baidu(rhda) |
| deit_small_distilled_224 | 81.17 | 95.41 | 22.4M | 4.3G | 224 | 0.875 | bicubic | google/baidu(pv28) |
| deit_base_distilled_224 | 83.32 | 96.49 | 87.2M | 17.0G | 224 | 0.875 | bicubic | google/baidu(5f2g) |
| deit_base_distilled_384 | 85.43 | 97.33 | 87.2M | 49.9G | 384 | 1.0 | bicubic | google/baidu(qgj2) |
| volo_d1_224 | 84.12 | 96.78 | 26.6M | 6.6G | 224 | 1.0 | bicubic | google/baidu(xaim) |
| volo_d1_384 | 85.24 | 97.21 | 26.6M | 19.5G | 384 | 1.0 | bicubic | google/baidu(rr7p) |
| volo_d2_224 | 85.11 | 97.19 | 58.6M | 13.7G | 224 | 1.0 | bicubic | google/baidu(d82f) |
| volo_d2_384 | 86.04 | 97.57 | 58.6M | 40.7G | 384 | 1.0 | bicubic | google/baidu(9cf3) |
| volo_d3_224 | 85.41 | 97.26 | 86.2M | 19.8G | 224 | 1.0 | bicubic | google/baidu(a5a4) |
| volo_d3_448 | 86.50 | 97.71 | 86.2M | 80.3G | 448 | 1.0 | bicubic | google/baidu(uudu) |
| volo_d4_224 | 85.89 | 97.54 | 192.8M | 42.9G | 224 | 1.0 | bicubic | google/baidu(vcf2) |
| volo_d4_448 | 86.70 | 97.85 | 192.8M | 172.5G | 448 | 1.0 | bicubic | google/baidu(nd4n) |
| volo_d5_224 | 86.08 | 97.58 | 295.3M | 70.6G | 224 | 1.0 | bicubic | google/baidu(ymdg) |
| volo_d5_448 | 86.92 | 97.88 | 295.3M | 283.8G | 448 | 1.0 | bicubic | google/baidu(qfcc) |
| volo_d5_512 | 87.05 | 97.97 | 295.3M | 371.3G | 512 | 1.15 | bicubic | google/baidu(353h) |
| cswin_tiny_224 | 82.81 | 96.30 | 22.3M | 4.2G | 224 | 0.9 | bicubic | google/baidu(4q3h) |
| cswin_small_224 | 83.60 | 96.58 | 34.6M | 6.5G | 224 | 0.9 | bicubic | google/baidu(gt1a) |
| cswin_base_224 | 84.23 | 96.91 | 77.4M | 14.6G | 224 | 0.9 | bicubic | google/baidu(wj8p) |
| cswin_base_384 | 85.51 | 97.48 | 77.4M | 43.1G | 384 | 1.0 | bicubic | google/baidu(rkf5) |
| cswin_large_224 | 86.52 | 97.99 | 173.3M | 32.5G | 224 | 0.9 | bicubic | google/baidu(b5fs) |
| cswin_large_384 | 87.49 | 98.35 | 173.3M | 96.1G | 384 | 1.0 | bicubic | google/baidu(6235) |
| cait_xxs24_224 | 78.38 | 94.32 | 11.9M | 2.2G | 224 | 1.0 | bicubic | google/baidu(j9m8) |
| cait_xxs36_224 | 79.75 | 94.88 | 17.2M | 33.1G | 224 | 1.0 | bicubic | google/baidu(nebg) |
| cait_xxs24_384 | 80.97 | 95.64 | 11.9M | 6.8G | 384 | 1.0 | bicubic | google/baidu(2j95) |
| cait_xxs36_384 | 82.20 | 96.15 | 17.2M | 10.1G | 384 | 1.0 | bicubic | google/baidu(wx5d) |
| cait_s24_224 | 83.45 | 96.57 | 46.8M | 8.7G | 224 | 1.0 | bicubic | google/baidu(m4pn) |
| cait_xs24_384 | 84.06 | 96.89 | 26.5M | 15.1G | 384 | 1.0 | bicubic | google/baidu(scsv) |
| cait_s24_384 | 85.05 | 97.34 | 46.8M | 26.5G | 384 | 1.0 | bicubic | google/baidu(dnp7) |
| cait_s36_384 | 85.45 | 97.48 | 68.1M | 39.5G | 384 | 1.0 | bicubic | google/baidu(e3ui) |
| cait_m36_384 | 86.06 | 97.73 | 270.7M | 156.2G | 384 | 1.0 | bicubic | google/baidu(r4hu) |
| cait_m48_448 | 86.49 | 97.75 | 355.8M | 287.3G | 448 | 1.0 | bicubic | google/baidu(imk5) |
| pvtv2_b0 | 70.47 | 90.16 | 3.7M | 0.6G | 224 | 0.875 | bicubic | google/baidu(dxgb) |
| pvtv2_b1 | 78.70 | 94.49 | 14.0M | 2.1G | 224 | 0.875 | bicubic | google/baidu(2e5m) |
| pvtv2_b2 | 82.02 | 95.99 | 25.4M | 4.0G | 224 | 0.875 | bicubic | google/baidu(are2) |
| pvtv2_b2_linear | 82.06 | 96.04 | 22.6M | 3.9G | 224 | 0.875 | bicubic | google/baidu(a4c8) |
| pvtv2_b3 | 83.14 | 96.47 | 45.2M | 6.8G | 224 | 0.875 | bicubic | google/baidu(nc21) |
| pvtv2_b4 | 83.61 | 96.69 | 62.6M | 10.0G | 224 | 0.875 | bicubic | google/baidu(tthf) |
| pvtv2_b5 | 83.77 | 96.61 | 82.0M | 11.5G | 224 | 0.875 | bicubic | google/baidu(9v6n) |
| shuffle_vit_tiny | 82.39 | 96.05 | 28.5M | 4.6G | 224 | 0.875 | bicubic | google/baidu(8a1i) |
| shuffle_vit_small | 83.53 | 96.57 | 50.1M | 8.8G | 224 | 0.875 | bicubic | google/baidu(xwh3) |
| shuffle_vit_base | 83.95 | 96.91 | 88.4M | 15.5G | 224 | 0.875 | bicubic | google/baidu(1gsr) |
| t2t_vit_7 | 71.68 | 90.89 | 4.3M | 1.0G | 224 | 0.9 | bicubic | google/baidu(1hpa) |
| t2t_vit_10 | 75.15 | 92.80 | 5.8M | 1.3G | 224 | 0.9 | bicubic | google/baidu(ixug) |
| t2t_vit_12 | 76.48 | 93.49 | 6.9M | 1.5G | 224 | 0.9 | bicubic | google/baidu(qpbb) |
| t2t_vit_14 | 81.50 | 95.67 | 21.5M | 4.4G | 224 | 0.9 | bicubic | google/baidu(c2u8) |
| t2t_vit_19 | 81.93 | 95.74 | 39.1M | 7.8G | 224 | 0.9 | bicubic | google/baidu(4in3) |
| t2t_vit_24 | 82.28 | 95.89 | 64.0M | 12.8G | 224 | 0.9 | bicubic | google/baidu(4in3) |
| t2t_vit_t_14 | 81.69 | 95.85 | 21.5M | 4.4G | 224 | 0.9 | bicubic | google/baidu(4in3) |
| t2t_vit_t_19 | 82.44 | 96.08 | 39.1M | 7.9G | 224 | 0.9 | bicubic | google/baidu(mier) |
| t2t_vit_t_24 | 82.55 | 96.07 | 64.0M | 12.9G | 224 | 0.9 | bicubic | google/baidu(6vxc) |
| t2t_vit_14_384 | 83.34 | 96.50 | 21.5M | 13.0G | 384 | 1.0 | bicubic | google/baidu(r685) |
| cross_vit_tiny_224 | 73.20 | 91.90 | 6.9M | 1.3G | 224 | 0.875 | bicubic | google/baidu(scvb) |
| cross_vit_small_224 | 81.01 | 95.33 | 26.7M | 5.2G | 224 | 0.875 | bicubic | google/baidu(32us) |
| cross_vit_base_224 | 82.12 | 95.87 | 104.7M | 20.2G | 224 | 0.875 | bicubic | google/baidu(jj2q) |
| cross_vit_9_224 | 73.78 | 91.93 | 8.5M | 1.6G | 224 | 0.875 | bicubic | google/baidu(mjcb) |
| cross_vit_15_224 | 81.51 | 95.72 | 27.4M | 5.2G | 224 | 0.875 | bicubic | google/baidu(n55b) |
| cross_vit_18_224 | 82.29 | 96.00 | 43.1M | 8.3G | 224 | 0.875 | bicubic | google/baidu(xese) |
| cross_vit_9_dagger_224 | 76.92 | 93.61 | 8.7M | 1.7G | 224 | 0.875 | bicubic | google/baidu(58ah) |
| cross_vit_15_dagger_224 | 82.23 | 95.93 | 28.1M | 5.6G | 224 | 0.875 | bicubic | google/baidu(qwup) |
| cross_vit_18_dagger_224 | 82.51 | 96.03 | 44.1M | 8.7G | 224 | 0.875 | bicubic | google/baidu(qtw4) |
| cross_vit_15_dagger_384 | 83.75 | 96.75 | 28.1M | 16.4G | 384 | 1.0 | bicubic | google/baidu(w71e) |
| cross_vit_18_dagger_384 | 84.17 | 96.82 | 44.1M | 25.8G | 384 | 1.0 | bicubic | google/baidu(99b6) |
| beit_base_patch16_224_pt22k | 85.21 | 97.66 | 87M | 12.7G | 224 | 0.9 | bicubic | google/baidu(fshn) |
| beit_base_patch16_384_pt22k | 86.81 | 98.14 | 87M | 37.3G | 384 | 1.0 | bicubic | google/baidu(arvc) |
| beit_large_patch16_224_pt22k | 87.48 | 98.30 | 304M | 45.0G | 224 | 0.9 | bicubic | google/baidu(2ya2) |
| beit_large_patch16_384_pt22k | 88.40 | 98.60 | 304M | 131.7G | 384 | 1.0 | bicubic | google/baidu(qtrn) |
| beit_large_patch16_512_pt22k | 88.60 | 98.66 | 304M | 234.0G | 512 | 1.0 | bicubic | google/baidu(567v) |
| Focal-T | 82.03 | 95.86 | 28.9M | 4.9G | 224 | 0.875 | bicubic | google/baidu(i8c2) |
| Focal-T (use conv) | 82.70 | 96.14 | 30.8M | 4.9G | 224 | 0.875 | bicubic | google/baidu(smrk) |
| Focal-S | 83.55 | 96.29 | 51.1M | 9.4G | 224 | 0.875 | bicubic | google/baidu(dwd8) |
| Focal-S (use conv) | 83.85 | 96.47 | 53.1M | 9.4G | 224 | 0.875 | bicubic | google/baidu(nr7n) |
| Focal-B | 83.98 | 96.48 | 89.8M | 16.4G | 224 | 0.875 | bicubic | google/baidu(8akn) |
| Focal-B (use conv) | 84.18 | 96.61 | 93.3M | 16.4G | 224 | 0.875 | bicubic | google/baidu(5nfi) |
| mobilevit_xxs | 70.31 | 89.68 | 1.32M | 0.44G | 256 | 1.0 | bicubic | google/baidu(axpc) |
| mobilevit_xs | 74.47 | 92.02 | 2.33M | 0.95G | 256 | 1.0 | bicubic | google/baidu(hfhm) |
| mobilevit_s | 76.74 | 93.08 | 5.59M | 1.88G | 256 | 1.0 | bicubic | google/baidu(34bg) |
| mobilevit_s $\dag$ | 77.83 | 93.83 | 5.59M | 1.88G | 256 | 1.0 | bicubic | google/baidu(92ic) |
| vip_s7 | 81.50 | 95.76 | 25.1M | 7.0G | 224 | 0.875 | bicubic | google/baidu(mh9b) |
| vip_m7 | 82.75 | 96.05 | 55.3M | 16.4G | 224 | 0.875 | bicubic | google/baidu(hvm8) |
| vip_l7 | 83.18 | 96.37 | 87.8M | 24.5G | 224 | 0.875 | bicubic | google/baidu(tjvh) |
| xcit_nano_12_p16_224_dist | 72.32 | 90.86 | 3.1M | 0.6G | 224 | 1.0 | bicubic | google/baidu(7qvz) |
| xcit_nano_12_p16_384_dist | 75.46 | 92.70 | 3.1M | 1.6G | 384 | 1.0 | bicubic | google/baidu(1y2j) |
| xcit_large_24_p16_224_dist | 84.92 | 97.13 | 189.1M | 35.9G | 224 | 1.0 | bicubic | google/baidu(kfv8) |
| xcit_large_24_p16_384_dist | 85.76 | 97.54 | 189.1M | 105.5G | 384 | 1.0 | bicubic | google/baidu(ffq3) |
| xcit_nano_12_p8_224_dist | 76.33 | 93.10 | 3.0M | 2.2G | 224 | 1.0 | bicubic | google/baidu(jjs7) |
| xcit_nano_12_p8_384_dist | 77.82 | 94.04 | 3.0M | 6.3G | 384 | 1.0 | bicubic | google/baidu(dmc1) |
| xcit_large_24_p8_224_dist | 85.40 | 97.40 | 188.9M | 141.4G | 224 | 1.0 | bicubic | google/baidu(y7gw) |
| xcit_large_24_p8_384_dist | 85.99 | 97.69 | 188.9M | 415.5G | 384 | 1.0 | bicubic | google/baidu(9xww) |
| pit_ti | 72.91 | 91.40 | 4.8M | 0.5G | 224 | 0.9 | bicubic | google/baidu(ydmi) |
| pit_ti_distill | 74.54 | 92.10 | 5.1M | 0.5G | 224 | 0.9 | bicubic | google/baidu(7k4s) |
| pit_xs | 78.18 | 94.16 | 10.5M | 1.1G | 224 | 0.9 | bicubic | google/baidu(gytu) |
| pit_xs_distill | 79.31 | 94.36 | 10.9M | 1.1G | 224 | 0.9 | bicubic | google/baidu(ie7s) |
| pit_s | 81.08 | 95.33 | 23.4M | 2.4G | 224 | 0.9 | bicubic | google/baidu(kt1n) |
| pit_s_distill | 81.99 | 95.79 | 24.0M | 2.5G | 224 | 0.9 | bicubic | google/baidu(hhyc) |
| pit_b | 82.44 | 95.71 | 73.5M | 10.6G | 224 | 0.9 | bicubic | google/baidu(uh2v) |
| pit_b_distill | 84.14 | 96.86 | 74.5M | 10.7G | 224 | 0.9 | bicubic | google/baidu(3e6g) |
| halonet26t | 79.10 | 94.31 | 12.5M | 3.2G | 256 | 0.95 | bicubic | google/baidu(ednv) |
| halonet50ts | 81.65 | 95.61 | 22.8M | 5.1G | 256 | 0.94 | bicubic | google/baidu(3j9e) |
| poolformer_s12 | 77.24 | 93.51 | 11.9M | 1.8G | 224 | 0.9 | bicubic | google/baidu(zcv4) |
| poolformer_s24 | 80.33 | 95.05 | 21.3M | 3.4G | 224 | 0.9 | bicubic | google/baidu(nedr) |
| poolformer_s36 | 81.43 | 95.45 | 30.8M | 5.0G | 224 | 0.9 | bicubic | google/baidu(fvpm) |
| poolformer_m36 | 82.11 | 95.69 | 56.1M | 8.9G | 224 | 0.95 | bicubic | google/baidu(whfp) |
| poolformer_m48 | 82.46 | 95.96 | 73.4M | 11.8G | 224 | 0.95 | bicubic | google/baidu(374f) |
| botnet50 | 77.38 | 93.56 | 20.9M | 5.3G | 224 | 0.875 | bicubic | google/baidu(wh13) |
| CvT-13-224 | 81.59 | 95.67 | 20M | 4.5G | 224 | 0.875 | bicubic | google/baidu(vev9) |
| CvT-21-224 | 82.46 | 96.00 | 32M | 7.1G | 224 | 0.875 | bicubic | google/baidu(t2rv) |
| CvT-13-384 | 83.00 | 96.36 | 20M | 16.3G | 384 | 1.0 | bicubic | google/baidu(wswt) |
| CvT-21-384 | 83.27 | 96.16 | 32M | 24.9G | 384 | 1.0 | bicubic | google/baidu(hcem) |
| CvT-13-384-22k | 83.26 | 97.09 | 20M | 16.3G | 384 | 1.0 | bicubic | google/baidu(c7m9) |
| CvT-21-384-22k | 84.91 | 97.62 | 32M | 24.9G | 384 | 1.0 | bicubic | google/baidu(9jxe) |
| CvT-w24-384-22k | 87.58 | 98.47 | 277M | 193.2G | 384 | 1.0 | bicubic | google/baidu(bbj2) |
| HVT-Ti-1 | 69.45 | 89.28 | 5.7M | 0.6G | 224 | 0.875 | bicubic | google/baidu(egds) |
| HVT-S-0 | 80.30 | 95.15 | 22.0M | 4.6G | 224 | 0.875 | bicubic | google/baidu(hj7a) |
| HVT-S-1 | 78.06 | 93.84 | 22.1M | 2.4G | 224 | 0.875 | bicubic | google/baidu(tva8) |
| HVT-S-2 | 77.41 | 93.48 | 22.1M | 1.9G | 224 | 0.875 | bicubic | google/baidu(bajp) |
| HVT-S-3 | 76.30 | 92.88 | 22.1M | 1.6G | 224 | 0.875 | bicubic | google/baidu(rjch) |
| HVT-S-4 | 75.21 | 92.34 | 22.1M | 1.6G | 224 | 0.875 | bicubic | google/baidu(ki4j) |
| mlp_mixer_b16_224 | 76.60 | 92.23 | 60.0M | 12.7G | 224 | 0.875 | bicubic | google/baidu(xh8x) |
| mlp_mixer_l16_224 | 72.06 | 87.67 | 208.2M | 44.9G | 224 | 0.875 | bicubic | google/baidu(8q7r) |
| resmlp_24_224 | 79.38 | 94.55 | 30.0M | 6.0G | 224 | 0.875 | bicubic | google/baidu(jdcx) |
| resmlp_36_224 | 79.77 | 94.89 | 44.7M | 9.0G | 224 | 0.875 | bicubic | google/baidu(33w3) |
| resmlp_big_24_224 | 81.04 | 95.02 | 129.1M | 100.7G | 224 | 0.875 | bicubic | google/baidu(r9kb) |
| resmlp_12_distilled_224 | 77.95 | 93.56 | 15.3M | 3.0G | 224 | 0.875 | bicubic | google/baidu(ghyp) |
| resmlp_24_distilled_224 | 80.76 | 95.22 | 30.0M | 6.0G | 224 | 0.875 | bicubic | google/baidu(sxnx) |
| resmlp_36_distilled_224 | 81.15 | 95.48 | 44.7M | 9.0G | 224 | 0.875 | bicubic | google/baidu(vt85) |
| resmlp_big_24_distilled_224 | 83.59 | 96.65 | 129.1M | 100.7G | 224 | 0.875 | bicubic | google/baidu(4jk5) |
| resmlp_big_24_22k_224 | 84.40 | 97.11 | 129.1M | 100.7G | 224 | 0.875 | bicubic | google/baidu(ve7i) |
| gmlp_s16_224 | 79.64 | 94.63 | 19.4M | 4.5G | 224 | 0.875 | bicubic | google/baidu(bcth) |
| ff_only_tiny (linear_tiny) | 61.28 | 84.06 | | | 224 | 0.875 | bicubic | google/baidu(mjgd) |
| ff_only_base (linear_base) | 74.82 | 91.71 | | | 224 | 0.875 | bicubic | google/baidu(m1jc) |
| repmlp_res50_light_224 | 77.01 | 93.46 | 87.1M | 3.3G | 224 | 0.875 | bicubic | google/baidu(b4fg) |
| cyclemlp_b1 | 78.85 | 94.60 | 15.1M | | 224 | 0.9 | bicubic | google/baidu(mnbr) |
| cyclemlp_b2 | 81.58 | 95.81 | 26.8M | | 224 | 0.9 | bicubic | google/baidu(jwj9) |
| cyclemlp_b3 | 82.42 | 96.07 | 38.3M | | 224 | 0.9 | bicubic | google/baidu(v2fy) |
| cyclemlp_b4 | 82.96 | 96.33 | 51.8M | | 224 | 0.875 | bicubic | google/baidu(fnqd) |
| cyclemlp_b5 | 83.25 | 96.44 | 75.7M | | 224 | 0.875 | bicubic | google/baidu(s55c) |
| convmixer_1024_20 | 76.94 | 93.35 | 24.5M | 9.5G | 224 | 0.96 | bicubic | google/baidu(qpn9) |
| convmixer_768_32 | 80.16 | 95.08 | 21.2M | 20.8G | 224 | 0.96 | bicubic | google/baidu(m5s5) |
| convmixer_1536_20 | 81.37 | 95.62 | 51.8M | 72.4G | 224 | 0.96 | bicubic | google/baidu(xqty) |
| convmlp_s | 76.76 | 93.40 | 9.0M | 2.4G | 224 | 0.875 | bicubic | google/baidu(3jz3) |
| convmlp_m | 79.03 | 94.53 | 17.4M | 4.0G | 224 | 0.875 | bicubic | google/baidu(vyp1) |
| convmlp_l | 80.15 | 95.00 | 42.7M | 10.0G | 224 | 0.875 | bicubic | google/baidu(ne5x) |
| topformer_tiny | 65.98 | 87.32 | 1.5M | 0.13G | 224 | 0.875 | bicubic | google/baidu |
| topformer_small | 72.44 | 91.17 | 3.1M | 0.24G | 224 | 0.875 | bicubic | google/baidu |
| topformer_base | 75.25 | 92.67 | 5.1M | 0.37G | 224 | 0.875 | bicubic | google/baidu |

Object Detection

| Model | Backbone | box_mAP | Model |
|---|---|---|---|
| DETR | ResNet50 | 42.0 | google/baidu(n5gk) |
| DETR | ResNet101 | 43.5 | google/baidu(bxz2) |
| Mask R-CNN | Swin-T 1x | 43.7 | google/baidu(qev7) |
| Mask R-CNN | Swin-T 3x | 46.0 | google/baidu(m8fg) |
| Mask R-CNN | Swin-S 3x | 48.4 | google/baidu(hdw5) |
| Mask R-CNN | pvtv2_b0 | 38.3 | google/baidu(3kqb) |
| Mask R-CNN | pvtv2_b1 | 41.8 | google/baidu(k5aq) |
| Mask R-CNN | pvtv2_b2 | 45.2 | google/baidu(jh8b) |
| Mask R-CNN | pvtv2_b2_linear | 44.1 | google/baidu(8ipt) |
| Mask R-CNN | pvtv2_b3 | 46.9 | google/baidu(je4y) |
| Mask R-CNN | pvtv2_b4 | 47.5 | google/baidu(n3ay) |
| Mask R-CNN | pvtv2_b5 | 47.4 | google/baidu(jzq1) |

Semantic Segmentation

Pascal Context

| Model | Backbone | Batch_size | mIoU (ss) | mIoU (ms+flip) | Backbone_checkpoint | Model_checkpoint | ConfigFile |
|---|---|---|---|---|---|---|---|
| SETR_Naive | ViT_large | 16 | 52.06 | 52.57 | google/baidu(owoj) | google/baidu(xdb8) | config |
| SETR_PUP | ViT_large | 16 | 53.90 | 54.53 | google/baidu(owoj) | google/baidu(6sji) | config |
| SETR_MLA | ViT_Large | 8 | 54.39 | 55.16 | google/baidu(owoj) | google/baidu(wora) | config |
| SETR_MLA | ViT_large | 16 | 55.01 | 55.87 | google/baidu(owoj) | google/baidu(76h2) | config |

Cityscapes

| Model | Backbone | Batch_size | Iteration | mIoU (ss) | mIoU (ms+flip) | Backbone_checkpoint | Model_checkpoint | ConfigFile |
|---|---|---|---|---|---|---|---|---|
| SETR_Naive | ViT_Large | 8 | 40k | 76.71 | 79.03 | google/baidu(owoj) | google/baidu(g7ro) | config |
| SETR_Naive | ViT_Large | 8 | 80k | 77.31 | 79.43 | google/baidu(owoj) | google/baidu(wn6q) | config |
| SETR_PUP | ViT_Large | 8 | 40k | 77.92 | 79.63 | google/baidu(owoj) | google/baidu(zmoi) | config |
| SETR_PUP | ViT_Large | 8 | 80k | 78.81 | 80.43 | google/baidu(owoj) | baidu(f793) | config |
| SETR_MLA | ViT_Large | 8 | 40k | 76.70 | 78.96 | google/baidu(owoj) | baidu(qaiw) | config |
| SETR_MLA | ViT_Large | 8 | 80k | 77.26 | 79.27 | google/baidu(owoj) | baidu(6bgj) | config |

ADE20K

| Model | Backbone | Batch_size | Iteration | mIoU (ss) | mIoU (ms+flip) | Backbone_checkpoint | Model_checkpoint | ConfigFile |
|---|---|---|---|---|---|---|---|---|
| SETR_Naive | ViT_Large | 16 | 160k | 47.57 | 48.12 | google/baidu(owoj) | baidu(lugq) | config |
| SETR_PUP | ViT_Large | 16 | 160k | 49.12 | 49.51 | google/baidu(owoj) | baidu(udgs) | config |
| SETR_MLA | ViT_Large | 8 | 160k | 47.80 | 49.34 | google/baidu(owoj) | baidu(mrrv) | config |
| DPT | ViT_Large | 16 | 160k | 47.21 | - | google/baidu(owoj) | baidu(ts7h) | config |
| Segmenter | ViT_Tiny | 16 | 160k | 38.45 | - | TODO | baidu(1k97) | config |
| Segmenter | ViT_Small | 16 | 160k | 46.07 | - | TODO | baidu(i8nv) | config |
| Segmenter | ViT_Base | 16 | 160k | 49.08 | - | TODO | baidu(hxrl) | config |
| Segmenter | ViT_Large | 16 | 160k | 51.82 | - | TODO | baidu(wdz6) | config |
| Segmenter_Linear | DeiT_Base | 16 | 160k | 47.34 | - | TODO | baidu(5dpv) | config |
| Segmenter | DeiT_Base | 16 | 160k | 49.27 | - | TODO | baidu(3kim) | config |
| Segformer | MIT-B0 | 16 | 160k | 38.37 | - | TODO | baidu(ges9) | config |
| Segformer | MIT-B1 | 16 | 160k | 42.20 | - | TODO | baidu(t4n4) | config |
| Segformer | MIT-B2 | 16 | 160k | 46.38 | - | TODO | baidu(h5ar) | config |
| Segformer | MIT-B3 | 16 | 160k | 48.35 | - | TODO | baidu(g9n4) | config |
| Segformer | MIT-B4 | 16 | 160k | 49.01 | - | TODO | baidu(e4xw) | config |
| Segformer | MIT-B5 | 16 | 160k | 49.73 | - | TODO | baidu(uczo) | config |
| UperNet | Swin_Tiny | 16 | 160k | 44.90 | 45.37 | - | baidu(lkhg) | config |
| UperNet | Swin_Small | 16 | 160k | 47.88 | 48.90 | - | baidu(vvy1) | config |
| UperNet | Swin_Base | 16 | 160k | 48.59 | 49.04 | - | baidu(y040) | config |
| UperNet | CSwin_Tiny | 16 | 160k | 49.46 | - | baidu(l1cp) | baidu(y1eq) | config |
| UperNet | CSwin_Small | 16 | 160k | 50.88 | - | baidu(6vwk) | baidu(fz2e) | config |
| UperNet | CSwin_Base | 16 | 160k | 50.64 | - | baidu(0ys7) | baidu(83w3) | config |
| TopFormer | TopFormer_Base | 16 | 160k | 38.3 | - | google/baidu | google/baidu(ufxt) | config |
| TopFormer | TopFormer_Base | 32 | 160k | 39.2 | - | google/baidu | google/baidu(ufxt) | config |
| TopFormer | TopFormer_Small | 16 | 160k | 36.5 | - | google/baidu | google/baidu(ufxt) | config |
| TopFormer | TopFormer_Small | 32 | 160k | 37.0 | - | google/baidu | google/baidu(ufxt) | config |
| TopFormer | TopFormer_Tiny | 16 | 160k | 33.6 | - | google/baidu | google/baidu(ufxt) | config |
| TopFormer | TopFormer_Tiny | 32 | 160k | 34.6 | - | google/baidu | google/baidu(ufxt) | config |
| TopFormer | TopFormer_Tiny | 16 | 160k | 32.5 | - | google/baidu | google/baidu(ufxt) | config |
| TopFormer | TopFormer_Tiny | 32 | 160k | 33.4 | - | google/baidu | google/baidu(ufxt) | config |
| Trans2seg_Medium | Resnet50c | 32 | 160k | 36.81 | - | google/baidu(4dd5) | google/baidu(i2nt) | config |

Trans10kV2

| Model | Backbone | Batch_size | Iteration | mIoU (ss) | mIoU (ms+flip) | Backbone_checkpoint | Model_checkpoint | ConfigFile |
|---|---|---|---|---|---|---|---|---|
| Trans2seg_Medium | Resnet50c | 16 | 16k | 75.97 | - | google/baidu(4dd5) | google/baidu(w25r) | config |

GAN

| Model | FID | Image Size | Crop_pct | Interpolation | Model |
|---|---|---|---|---|---|
| styleformer_cifar10 | 2.73 | 32 | 1.0 | lanczos | google/baidu(ztky) |
| styleformer_stl10 | 15.65 | 48 | 1.0 | lanczos | google/baidu(i973) |
| styleformer_celeba | 3.32 | 64 | 1.0 | lanczos | google/baidu(fh5s) |
| styleformer_lsun | 9.68 | 128 | 1.0 | lanczos | google/baidu(158t) |

*Results are evaluated with the fid50k_full metric on the Cifar10, STL10, CelebA, and LSUN-church datasets.

Quick Example for Image Classification

To use a model with pretrained weights, go to the corresponding subfolder, e.g., /image_classification/ViT/, then download the .pdparams weight file and change the related file paths in the Python script. The model config files are located in ./configs/.

Assuming the downloaded pretrained weight file is stored at ./vit_base_patch16_224.pdparams, use the vit_base_patch16_224 model in Python as follows (a minimal inference sketch follows the snippet):

import paddle

from config import get_config
from visual_transformer import build_vit as build_model
# config files in ./configs/
config = get_config('./configs/vit_base_patch16_224.yaml')
# build model
model = build_model(config)
# load pretrained weights
model_state_dict = paddle.load('./vit_base_patch16_224')
model.set_dict(model_state_dict)
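
The snippet below continues the example above and is a minimal sketch of running inference; a random tensor stands in for a real, preprocessed image (for real predictions you would apply ImageNet normalization and the crop settings listed in the Model Zoo table):

model.eval()
# dummy batch standing in for a preprocessed 224x224 RGB image
x = paddle.randn([1, 3, 224, 224])
with paddle.no_grad():
    logits = model(x)  # [1, 1000] class logits for ImageNet-1k models
    probs = paddle.nn.functional.softmax(logits, axis=-1)
print('predicted class id:', int(paddle.argmax(probs, axis=-1)))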

🤖 For detailed usage, please refer to the README file in each model's folder.

Evaluation

To evaluate a ViT model on the ImageNet2012 validation set using a single GPU, run the following from the command line:

sh run_eval.sh

or

CUDA_VISIBLE_DEVICES=0 \
python main_single_gpu.py \
    -cfg=./configs/vit_base_patch16_224.yaml \
    -dataset=imagenet2012 \
    -batch_size=16 \
    -data_path=/path/to/dataset/imagenet/val \
    -eval \
    -pretrained=/path/to/pretrained/model/vit_base_patch16_224  # .pdparams is NOT needed
To run evaluation on multiple GPUs:
sh run_eval_multi.sh

or

CUDA_VISIBLE_DEVICES=0,1,2,3 \
python main_multi_gpu.py \
    -cfg=./configs/vit_base_patch16_224.yaml \
    -dataset=imagenet2012 \
    -batch_size=16 \
    -data_path=/path/to/dataset/imagenet/val \
    -eval \
    -pretrained=/path/to/pretrained/model/vit_base_patch16_224  # .pdparams is NOT needed

Training

To train a ViT model on ImageNet2012 using a single GPU, run the following from the command line:

sh run_train.sh

or

CUDA_VISIBLE_DEVICES=0 \
python main_single_gpu.py \
  -cfg=./configs/vit_base_patch16_224.yaml \
  -dataset=imagenet2012 \
  -batch_size=32 \
  -data_path=/path/to/dataset/imagenet/train
To run training on multiple GPUs:
sh run_train_multi.sh

or

CUDA_VISIBLE_DEVICES=0,1,2,3 \
python main_multi_gpu.py \
    -cfg=./configs/vit_base_patch16_224.yaml \
    -dataset=imagenet2012 \
    -batch_size=16 \
    -data_path=/path/to/dataset/imagenet/train

Contributing

  • We encourage and appreciate your contributions to PaddleViT; please see CONTRIBUTING.md for our workflow and code style.

License

  • This repo is released under the Apache-2.0 license.

Contact

If you have any questions, please create an issue on our GitHub.