Zakariae EL ASRI - Nicolas GREVET
In this project, we provide the details of reimplementing a ConvMixer (a new convolutional neural network architecture) on a new dataset. We then check whether ConvMixer performs better than a Transformer-based model on this dataset.
A tutorial is available in this Google Colab notebook: https://colab.research.google.com/drive/1kkSIgFQIAP-0Yt2spVbQBwZv2MsBtWuM?usp=sharing
For many years, the mainstream architecture in computer vision was the CNN, until the Vision Transformer (ViT), a transformer-based model, showed promising performance. In later works it was improved to outperform CNNs on many vision tasks. When image resolutions are very large, the quadratic computational complexity of self-attention is a major bottleneck for vision. To tackle this problem, ViTs introduced patch embeddings, which group small regions of the image into single input features. This raises the idea that the gains of vision transformers are due, in part, to the patch representation of the input. The question is to determine which factor is more important: the patch representation or the self-attention.
With this in mind, Trockman et al. [1] presented a new idea in computer vision: an architecture named ConvMixer that abandons the pyramidal design of CNNs and replaces it with an isotropic one operating on patches.
The paper shows that the new architecture outperforms ViT for similar parameter counts and dataset sizes.
This model drops the historical pyramidal design of ConvNets, in which the number of feature channels increases while the spatial resolution decreases. Instead, it uses an isotropic architecture similar to transformers, where the main computations are performed with convolutions rather than self-attention. The architecture is very simple: a patch-embedding stage followed by repeated convolutional blocks.
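A minimal PyTorch sketch of this architecture, closely following the reference implementation given in the paper [1] (the default hyperparameters below mirror the specifications listed further down):

```python
import torch.nn as nn

class Residual(nn.Module):
    """Wraps a module with a skip connection: x -> fn(x) + x."""
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x

def ConvMixer(dim, depth, kernel_size=7, patch_size=7, n_classes=10):
    return nn.Sequential(
        # Patch embedding: a strided convolution maps each patch to a dim-dimensional vector.
        nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),
        nn.GELU(),
        nn.BatchNorm2d(dim),
        # `depth` identical (isotropic) blocks: the depthwise convolution mixes spatial
        # locations, the 1x1 pointwise convolution mixes channels.
        *[nn.Sequential(
            Residual(nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
                nn.GELU(),
                nn.BatchNorm2d(dim),
            )),
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(),
            nn.BatchNorm2d(dim),
        ) for _ in range(depth)],
        # Global average pooling followed by a linear classifier.
        nn.AdaptiveAvgPool2d((1, 1)),
        nn.Flatten(),
        nn.Linear(dim, n_classes),
    )
```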
In this project, we use the Imagenette [2] dataset. It is very similar to ImageNet, but much less expensive to work with. Imagenette consists of a subset of 10 classes from ImageNet (tench, English springer, cassette player, chain saw, church, French horn, garbage truck, gas pump, golf ball, parachute). It comes in two versions, '320 px' and '160 px', and has a 70/30 train/validation split.
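One way to load the data, assuming the imagenette2-160 archive from the fastai repository has been downloaded and extracted locally (the path and the `train`/`val` folder names below are assumptions about that layout, not taken from the report):

```python
from torchvision import datasets, transforms

DATA_DIR = "imagenette2-160"  # assumed local path to the extracted archive

# Basic preprocessing only; the training-time augmentation is listed below.
base_tf = transforms.Compose([
    transforms.Resize((160, 160)),
    transforms.ToTensor(),
])

train_set = datasets.ImageFolder(f"{DATA_DIR}/train", transform=base_tf)
valid_set = datasets.ImageFolder(f"{DATA_DIR}/val", transform=base_tf)
```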
We trained the model on Imagenette-160 classification with the following specifications (a minimal sketch wiring them together is given after the list):
- Data augmentation: Random Horizontal Flip and Random Resized Crop
- Kernel size and patch size: 7
- ConvMixer depth: 12 layers
- Hidden dimension: 256
- Activation function: GELU
- One Cycle Learning Rate Policy
- Optimizer: AdamW
- ViT baseline: 6 transformer layers with 8 heads in the Multi-Head Attention block
- ResNet-18 baseline
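A minimal training-loop sketch showing how these pieces fit together, reusing the `ConvMixer` and `train_set` defined above (the batch size, maximum learning rate, and weight decay are illustrative assumptions, not values from the report):

```python
import torch
from torch.utils.data import DataLoader
from torchvision import transforms

# Training-time augmentation as listed above (crop size 160 matches Imagenette-160).
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(160),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
train_set.transform = train_tf

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ConvMixer(dim=256, depth=12, kernel_size=7, patch_size=7, n_classes=10).to(device)

epochs, batch_size, max_lr = 30, 64, 1e-3  # assumed values for illustration
train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=2)

optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr, weight_decay=1e-2)
# One-cycle learning-rate policy over the whole training run.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=max_lr, epochs=epochs, steps_per_epoch=len(train_loader))
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(epochs):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        scheduler.step()  # OneCycleLR is stepped once per batch
```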
The convergence of the training and validation losses over the epochs (30 at most) shows that our model is not overfitting the training data.
Our main objective is not to achieve the best possible accuracy but to compare the three models and show that patches are a major factor in such architectures. We see that the ConvMixer model clearly performs better within the same range of parameter counts (between 0.8M and 1.6M).
Project_Patches_or_Attention.ipynb: the tutorial notebook from Google Colab.
Patches or Attention _ Project Deep Learning.pdf: the final report for the project.
[1] Trockman, Asher, and J. Zico Kolter. "Patches Are All You Need?" arXiv:2201.09792 (2022).
[2] Jeremy Howard. Imagenette, 2019. URL: https://github.com/fastai/imagenette/.