A curated list of vision-transformer-related resources, including surveys, papers, source code, etc. Maintainer: Murufeng
We are looking for a maintainer! Let me know if you are interested.
Please feel free to open a pull request or an issue to add papers and source code.
- S^2-MLPv2: Improved Spatial-Shift MLP Architecture for Vision
- CycleMLP: A MLP-like Architecture for Dense Prediction (GitHub)
- 1. Object Detection
- 2. Super-Resolution
- 3. Image/Semantic Segmentation
- 4. GAN / Generative / Adversarial
- 5. Tracking
- 6. Video
- 7. Multimodal
- 8. Human Pose Estimation
- 9. Neural Architecture Search (NAS)
- 10. Face Recognition
- 11. Person Re-identification
- 12. Crowd Detection
- 13. Medical Image Processing
- 14. Image Style Transfer
- 15. Low-level Vision (denoising, deraining, restoration, deblurring, etc.)
- A Survey of Transformers
- Authors & affiliation: Xipeng Qiu's group, Fudan University; Tianyang Lin, Yuxin Wang, Xiangyang Liu, Xipeng Qiu
- Date: 2021.06.08
- A Survey on Visual Transformer
- Authors & affiliation: Huawei Noah's Ark Lab; Kai Han, Yunhe Wang, Hanting Chen, et al.
- Date: 2021.01.30
- Transformers in Vision: A Survey
- Authors: Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, Mubarak Shah
- Date: 2021.01.04
- An overview of Microsoft Research's trilogy of leaderboard-topping Transformer models
- Shuicheng Yan's team proposes VOLO, which tops CV benchmarks and is the first to reach 87.1% on ImageNet without extra training data
- Hierarchical cascaded Transformer: ETH Zurich proposes TransCNN, which significantly reduces computational/space complexity
- Shuicheng Yan's and Ming-Ming Cheng's teams open-source ViP, which introduces a 3D information-encoding mechanism and needs neither convolution nor attention
- Jiwen Lu's team at Tsinghua proposes DynamicViT, an efficient ViT with dynamic token sparsification (a minimal token-pruning sketch follows this list)
- Not all images are worth 16x16 words: Tsinghua & Huawei propose a dynamic ViT with adaptive sequence length
- Can attention allow MLPs to fully replace CNNs? What are the future research directions?
- Beyond Swin Transformer: Google proposes NesT, with faster convergence, stronger robustness, and better performance
- Visual Parser: Representing Part-whole Hierarchies with Transformers (University of Oxford)
- Crossformer: A versatile vision Transformer based on cross-scale attention (Zhejiang University & Columbia University)
- Contextual Transformer Networks for Visual Recognition (Tao Mei's team, JD AI)
- Rethinking and Improving Relative Position Encoding for Vision Transformer (Sun Yat-sen University & Microsoft)
- Local-to-Global Self-Attention in Vision Transformers (Inception Institute of Artificial Intelligence, IIAI)
- Polarized Self-Attention: Towards High-quality Pixel-wise Regression (a new self-attention mechanism that tops segmentation benchmarks)
- A Unified Efficient Pyramid Transformer for Semantic Segmentation (Mu Li, Amazon & Fudan University)
- Huawei Noah's Ark Lab proposes CMT: convolutional neural networks meet vision Transformers
- Beyond Swin: a Transformer tops three major vision tasks; Microsoft's new work on Focal Self-Attention (paper)
- A sequel to Multi-Scale DenseNet? Tsinghua and Huawei open-source a dynamic ViT that handles Transformer downsampling (paper)
- CSWin Transformer: a vision Transformer backbone with cross-shaped windows (paper)
- Facebook proposes Early Convolutions Help Transformers See Better
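
Several entries above (DynamicViT, the adaptive-sequence-length DVT) revolve around dynamically reducing how many tokens a ViT processes. The sketch below illustrates only the core token-pruning idea: score each token with a small MLP and keep the top-k. The class and parameter names (`TokenPruner`, `keep_ratio`) are illustrative, and the actual papers use differentiable masking and end-to-end training schedules that this sketch omits.

```python
import torch
import torch.nn as nn

class TokenPruner(nn.Module):
    """Illustrative sketch of dynamic token sparsification: score tokens
    with a small MLP and keep only the top-k highest-scoring ones.
    (DynamicViT itself uses differentiable Gumbel-softmax masking and
    trains the scorer end to end, which is omitted here.)"""

    def __init__(self, dim: int, keep_ratio: float = 0.7):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.scorer = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        scores = self.scorer(tokens).squeeze(-1)      # (batch, num_tokens)
        k = max(1, int(tokens.shape[1] * self.keep_ratio))
        idx = scores.topk(k, dim=1).indices           # kept-token indices
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
        return tokens.gather(1, idx)                  # (batch, k, dim)

pruner = TokenPruner(dim=384, keep_ratio=0.7)
x = torch.randn(2, 196, 384)    # a 14x14 grid of patch tokens
print(pruner(x).shape)          # torch.Size([2, 137, 384])
```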
- VOLO: Vision Outlooker for Visual Recognition
- Paper: https://arxiv.org/abs/2106.13112
- Code: https://github.com/sail-sg/volo
- Authors: Shuicheng Yan, Jiashi Feng. The first model to reach 87.1% accuracy on ImageNet without any extra training data
- Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition
- Scaling Vision Transformers
- CAT: Cross Attention in Vision Transformer
- CoAtNet: Marrying Convolution and Attention for All Data Sizes
- Authors & affiliation: Google; Zihang Dai, Hanxiao Liu, Quoc V. Le, Mingxing Tan
- Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer
- Container: Context Aggregation Network
- Aggregating Nested Transformers
- X-volution: On the unification of convolution and self-attention
- Video Swin Transformer
- Dynamic Head: Unifying Object Detection Heads with Attentions
- [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
- [VTs] Visual Transformers: Token-based Image Representation and Processing for Computer Vision
- [So-ViT] So-ViT: Mind Visual Tokens for Vision Transformer
- [Token Labeling] Token Labeling: Training a 85.5% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet
- [LeViT] LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference
- [CrossViT] CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification
- [CeiT] Incorporating Convolution Designs into Visual Transformers
- [DeepViT] DeepViT: Towards Deeper Vision Transformer
- [TNT] Transformer in Transformer
- [T2T-ViT] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
- [BoTNet] Bottleneck Transformers for Visual Recognition
- [Visformer] Visformer: The Vision-friendly Transformer
- [ConTNet] ConTNet: Why not use convolution and transformer at the same time?
- [DeiT] Training data-efficient image transformers & distillation through attention
- [Twins] Twins: Revisiting Spatial Attention Design in Vision Transformers
- Scaling Vision Transformers
- [GasHis-Transformer] GasHis-Transformer: A Multi-scale Visual Transformer Approach for Gastric Histopathology Image Classification
- [Vision Transformer] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR) (see the patch-embedding sketch after this list)
- [RegionViT] Regional-to-Local Attention for Vision Transformers
- [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
- [FPT] Feature Pyramid Transformer (ECCV)
- [PiT] Rethinking Spatial Dimensions of Vision Transformers
- [CoaT] Co-Scale Conv-Attentional Image Transformers
- [LocalViT] LocalViT: Bringing Locality to Vision Transformers
- [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
- [DPT] Vision Transformers for Dense Prediction
- [MViT] Mask Vision Transformer for Facial Expression Recognition in the wild
- [Shuffle Transformer] Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer
- [TransCNN] Transformer in Convolutional Neural Networks
- [ResT] ResT: An Efficient Transformer for Visual Recognition
- [CPVT] Do We Really Need Explicit Position Encodings for Vision Transformers?
- [ConViT] ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
- [CoaT] Co-Scale Conv-Attentional Image Transformers
- [CvT] CvT: Introducing Convolutions to Vision Transformers
- [ConTNet] ConTNet: Why not use convolution and transformer at the same time?
- [CeiT] Incorporating Convolution Designs into Visual Transformers
- [BoTNet] Bottleneck Transformers for Visual Recognition
- [CPTR] CPTR: Full Transformer Network for Image Captioning
- [DynamicViT] DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification
- [DVT] Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length
- [LeViT] LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference
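
Most backbones in the list above inherit ViT's tokenization step: cut the image into fixed-size patches and linearly embed each one (hence "an image is worth 16x16 words"). The sketch below shows the standard strided-convolution implementation of that patch embedding; the dimensions follow ViT-Base, and the class name is our own.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """ViT-style patch embedding: a Conv2d whose kernel size and stride
    both equal the patch size applies exactly one linear projection to
    each non-overlapping patch."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                     # (B, 3, 224, 224) -> (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)  # -> (B, 196, 768): one token per patch

embed = PatchEmbed()
tokens = embed(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```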
- [UP-DETR] UP-DETR: Unsupervised Pre-training for Object Detection with Transformers (CVPR)
- [Deformable DETR] Deformable DETR: Deformable Transformers for End-to-End Object Detection (ICLR)
- [DETR] End-to-End Object Detection with Transformers (ECCV) (see the set-prediction sketch after this list)
- [Meta-DETR] Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning
- [DA-DETR] DA-DETR: Domain Adaptive Detection Transformer by Hybrid Attention
- [DETReg] Unsupervised Pretraining with Region Priors for Object Detection
- [Pointformer] 3D Object Detection with Pointformer
- [ViT-FRCNN] Toward Transformer-Based Object Detection
- Oriented Object Detection with Transformer
- [YOLOS] You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection
- [COTR] COTR: Convolution in Transformer Network for End to End Polyp Detection
- [TransVOD] End-to-End Video Object Detection with Spatial-Temporal Transformers
- [CAT] CAT: Cross-Attention Transformer for One-Shot Object Detection
- [M2TR] M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection
- Transformer Transforms Salient Object Detection and Camouflaged Object Detection
- [SSTN] SSTN: Self-Supervised Domain Adaptation Thermal Object Detection for Autonomous Driving
- [TSP-FCOS] Rethinking Transformer-based Set Prediction for Object Detection
- [ACT] End-to-End Object Detection with Adaptive Clustering Transformer
- [PED] DETR for Pedestrian Detection
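
The DETR family above frames detection as direct set prediction: a transformer decoder maps a fixed set of learned object queries to class logits and normalized boxes, with no anchors or NMS. Below is a heavily simplified sketch of that interface under assumed dimensions; `MiniDETR` is a hypothetical name, and the real DETR adds a CNN backbone, positional encodings, and Hungarian matching during training.

```python
import torch
import torch.nn as nn

class MiniDETR(nn.Module):
    """Illustrative DETR-style set prediction: learned object queries are
    decoded against image features into per-object class logits and
    (cx, cy, w, h) boxes. Backbone, positional encodings, and Hungarian
    matching are omitted."""

    def __init__(self, dim=256, num_queries=100, num_classes=91):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.transformer = nn.Transformer(d_model=dim, batch_first=True)
        self.class_head = nn.Linear(dim, num_classes + 1)  # +1 for "no object"
        self.box_head = nn.Linear(dim, 4)

    def forward(self, features: torch.Tensor):
        # features: (batch, num_tokens, dim), e.g. a flattened backbone feature map
        q = self.queries.unsqueeze(0).expand(features.shape[0], -1, -1)
        h = self.transformer(src=features, tgt=q)   # (batch, num_queries, dim)
        return self.class_head(h), self.box_head(h).sigmoid()

model = MiniDETR()
feats = torch.randn(2, 49, 256)   # e.g. a 7x7 feature map flattened to 49 tokens
logits, boxes = model(feats)
print(logits.shape, boxes.shape)  # torch.Size([2, 100, 92]) torch.Size([2, 100, 4])
```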
- [DPT] Vision Transformers for Dense Prediction
- Fully Transformer Networks for Semantic Image Segmentation
- [TransVOS] TransVOS: Video Object Segmentation with Transformers
- [SegFormer] SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers
- [VisTR] End-to-End Video Instance Segmentation with Transformers (CVPR)
- [Trans2Seg] Segmenting Transparent Objects in the Wild with Transformer
- SETR: Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
- Affiliations: Fudan University, University of Oxford, University of Surrey, Tencent Youtu, Facebook
- Project page: https://fudan-zvg.github.io/SETR/
- Code: https://github.com/fudan-zvg/SETR
- Paper: https://arxiv.org/abs/2012.15840
- [GANsformer] Generative Adversarial Transformers
- [TransGAN] Two Transformers Can Make One Strong GAN
- [AOT-GAN] Aggregated Contextual Transformations for High-Resolution Image Inpainting
- [STGT] Spatial-Temporal Graph Transformer for Multiple Object Tracking
- Transformer Tracking
- [TransCenter] TransCenter: Transformers with Dense Queries for Multiple-Object Tracking
- [TrackFormer] TrackFormer: Multi-Object Tracking with Transformers
- [TransTrack] TransTrack: Multiple-Object Tracking with Transformer
- Video Swin Transformer
- Anticipative Video Transformer
- [TimeSformer] Is Space-Time Attention All You Need for Video Understanding?
- [VidTr] VidTr: Video Transformer Without Convolutions
- [ViViT] ViViT: A Video Vision Transformer
- [VTN] Video Transformer Network
- [VisTR] End-to-End Video Instance Segmentation with Transformers (CVPR)
- [STTN] Learning Joint Spatial-Temporal Transformations for Video Inpainting (ECCV)
- ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
- [TransPose] TransPose: Towards Explainable Human Pose Estimation by Transformer
- [TFPose] TFPose: Direct Human Pose Estimation with Transformers
- Lifting Transformer for 3D Human Pose Estimation in Video
- [BossNAS] BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search
- Vision Transformer Architecture Search
- [UNETR] UNETR: Transformers for 3D Medical Image Segmentation
- [U-Transformer] U-Net Transformer: Self and Cross Attention for Medical Image Segmentation
- [MedT] Medical Transformer: Gated Axial-Attention for Medical Image Segmentation
- [TransUNet] TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation
- StyTr2: Unbiased Image Style Transfer with Transformers
- [IPT] Pre-Trained Image Processing Transformer (CVPR)
- [SDNet] Multi-branch network for single image deraining using Swin
- Uformer: A General U-Shaped Transformer for Image Restoration
- Chasing Sparsity in Vision Transformers: An End-to-End Exploration
- MViT: Mask Vision Transformer for Facial Expression Recognition in the wild
- [CPTR] CPTR: Full Transformer Network for Image Captioning
- Learn to Dance with AIST++: Music Conditioned 3D Dance Generation
- Deepfake Video Detection Using Convolutional Vision Transformer
- Training Vision Transformers for Image Retrieval
- To be updated.