This is an unofficial PyTorch implementation of the paper Vision Transformer for Small-Size Datasets.
The configuration has been trained on CIFAR-10 and shows promising results.
The main components of the paper are:
The ViT architecture:
The Shifted Patch Tokenizer (which increases the locality inductive bias):
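As a rough sketch of the idea (the class and helper names below are illustrative, not the actual API of `models.py`): the input image is shifted by half a patch size in the four diagonal directions, the shifted copies are concatenated with the original along the channel axis, and the result is split into patches, normalized, and linearly projected.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def diagonal_shift(x, dx, dy):
    """Shift a (B, C, H, W) image by (dx, dy) pixels, zero-padding the border."""
    b, c, h, w = x.shape
    # F.pad order for 4D input: (left, right, top, bottom)
    x = F.pad(x, (max(dx, 0), max(-dx, 0), max(dy, 0), max(-dy, 0)))
    top, left = max(-dy, 0), max(-dx, 0)
    return x[..., top:top + h, left:left + w]


class ShiftedPatchTokenizer(nn.Module):
    """Sketch of Shifted Patch Tokenization (hypothetical names)."""

    def __init__(self, patch_size=4, in_chans=3, dim=192):
        super().__init__()
        self.patch_size = patch_size
        # Original image + 4 diagonally shifted copies, flattened per patch.
        patch_dim = 5 * in_chans * patch_size ** 2
        self.norm = nn.LayerNorm(patch_dim)
        self.proj = nn.Linear(patch_dim, dim)

    def forward(self, x):
        s = self.patch_size // 2
        # Concatenate the image with its four half-patch diagonal shifts.
        x = torch.cat(
            [x] + [diagonal_shift(x, dx, dy)
                   for dx, dy in [(s, s), (s, -s), (-s, s), (-s, -s)]],
            dim=1,
        )
        # Patchify: (B, patch_dim, N) -> (B, N, patch_dim)
        patches = F.unfold(x, kernel_size=self.patch_size,
                           stride=self.patch_size).transpose(1, 2)
        return self.proj(self.norm(patches))
```

For a 32x32 CIFAR-10 image with a patch size of 4, this yields 64 tokens, each projected from a richer 5x-channel patch than a plain ViT tokenizer would see.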
The Locality Self-Attention:
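A minimal sketch of the two changes Locality Self-Attention makes to standard self-attention: the fixed 1/sqrt(d) scaling becomes a learnable temperature, and the diagonal of the attention matrix is masked with -inf so a token cannot attend to itself. Class and parameter names here are assumptions, not the repo's actual API.

```python
import torch
import torch.nn as nn


class LocalitySelfAttention(nn.Module):
    """Sketch of Locality Self-Attention (LSA), hypothetical naming."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads = heads
        head_dim = dim // heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        # Learnable temperature, initialized to the standard 1/sqrt(d) scale.
        self.temperature = nn.Parameter(torch.tensor(head_dim ** -0.5))

    def forward(self, x):
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.heads, d // self.heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (b, heads, n, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.temperature
        # Diagonal masking: a token may not attend to itself.
        mask = torch.eye(n, dtype=torch.bool, device=x.device)
        attn = attn.masked_fill(mask, float("-inf"))
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)
```

Together, the sharper (learned) temperature and the removal of the dominant self-token logit push the softmax to distribute attention across *other* tokens, which is where the locality bias comes from.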
These components can be found in `models.py`.
To do:
- Use `register_buffer` for the -inf mask in the Locality Self-Attention
- Add learning-rate warmup
- Visualize the attention maps
- Track the learnable scaling coefficient of the attention in TensorBoard
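The first item above could look roughly like this: the diagonal mask is built once in `__init__` and stored with `register_buffer`, so it follows the module across devices (`.to(device)`) without being treated as a learnable parameter or recreated on every forward pass. The class name and fixed `num_tokens` assumption are illustrative.

```python
import torch
import torch.nn as nn


class MaskedAttentionScores(nn.Module):
    """Sketch: precompute the -inf diagonal mask as a (non-persistent) buffer.
    Assumes a fixed sequence length (e.g. patches + class token)."""

    def __init__(self, num_tokens):
        super().__init__()
        # Registered buffer: moves with .to(device), excluded from gradients;
        # persistent=False keeps it out of the state_dict.
        self.register_buffer(
            "diag_mask", torch.eye(num_tokens, dtype=torch.bool), persistent=False
        )

    def forward(self, attn_logits):
        # attn_logits: (batch, heads, n, n); mask out the self-attention logits.
        return attn_logits.masked_fill(self.diag_mask, float("-inf"))
```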