This repository contains an implementation of a hybrid architecture that combines convolutional layers and transformers for classification tasks.
The ConvTransformer model uses convolutional layers for local feature extraction and transformer blocks for capturing global dependencies. This hybrid approach lets the model learn both local and global patterns in the input data, making it suitable for a range of classification tasks.
The ConvTransformer model consists of the following components:
- Convolutional Layers: A series of 1D convolutional layers extracts local features from the input data. Each convolutional layer is followed by batch normalization, LeakyReLU activation, and average pooling.
- Positional Encoding: Positional encoding is added to the input sequence to incorporate positional information into the transformer layers.
- Transformer Blocks: Each transformer block consists of multi-head self-attention and feed-forward layers. The self-attention mechanism allows the model to capture dependencies between different positions in the input sequence.
- Classification Head: The output of the transformer blocks is averaged globally and then passed through an MLP head for classification. The MLP head consists of linear layers with batch normalization, LeakyReLU activation, and dropout. (A minimal sketch of these components appears below.)
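For concreteness, here is a minimal, self-contained PyTorch sketch of how these components might be assembled. It illustrates the general pattern described above rather than the repository's actual code; the class names (`ConvBlock`, `PositionalEncoding`, `TransformerStage`), the sinusoidal encoding, and the use of `nn.TransformerEncoderLayer` are assumptions.

```python
import math

import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Illustrative conv stage: Conv1d -> BatchNorm -> LeakyReLU -> AvgPool."""
    def __init__(self, in_ch, out_ch, kernel_size, stride):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size, stride, padding=kernel_size // 2),
            nn.BatchNorm1d(out_ch),
            nn.LeakyReLU(),
            nn.AvgPool1d(2),  # halves the sequence length
        )

    def forward(self, x):  # x: (batch, channels, length)
        return self.block(x)

class PositionalEncoding(nn.Module):
    """Standard sinusoidal positional encoding (Vaswani et al., 2017)."""
    def __init__(self, embed_size, max_len=5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, embed_size, 2).float()
                             * (-math.log(10000.0) / embed_size))
        pe = torch.zeros(max_len, embed_size)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, embed_size)

    def forward(self, x):  # x: (batch, seq_len, embed_size)
        return x + self.pe[:, : x.size(1)]

class TransformerStage(nn.Module):
    """Self-attention blocks, global average pooling, then an MLP head."""
    def __init__(self, embed_size, num_heads, num_blocks,
                 mlp_hidden_dim, num_classes, dropout):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_size, nhead=num_heads,
            dim_feedforward=mlp_hidden_dim, dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_blocks)
        self.head = nn.Sequential(
            nn.Linear(embed_size, mlp_hidden_dim),
            nn.BatchNorm1d(mlp_hidden_dim),
            nn.LeakyReLU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_hidden_dim, num_classes))

    def forward(self, x):      # x: (batch, seq_len, embed_size)
        x = self.encoder(x)    # global dependencies via self-attention
        x = x.mean(dim=1)      # global average over the sequence dimension
        return self.head(x)

# Quick shape check with assumed sizes: 9-channel signals of length 16000.
conv = nn.Sequential(ConvBlock(9, 32, 8, 1), ConvBlock(32, 64, 5, 1),
                     ConvBlock(64, 128, 3, 1))
feats = conv(torch.randn(4, 9, 16000))                    # (4, 128, 2000)
tokens = PositionalEncoding(128)(feats.transpose(1, 2))   # (4, 2000, 128)
print(TransformerStage(128, 4, 2, 128, 10, 0.1)(tokens).shape)  # torch.Size([4, 10])
```

Note how, in this sketch, the final convolutional channel count (128) doubles as the transformer embedding size, so the conv output can be transposed directly into a token sequence.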
To use the ConvTransformer model, follow these steps:
- Clone the repository:

  ```bash
  git clone https://github.com/sm1899/convtransformer-classification.git
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```
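The usage example below also assumes PyTorch is installed, since the model and the example input tensors are built with `torch`.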
Here's an example of how to use the ConvTransformer model:
```python
import torch

from convtransformer import ConvTransformerClassifier

# Instantiate the model
input_size = 16000               # number of samples per input signal
num_classes = 10
conv_channels = [32, 64, 128]    # output channels of each convolutional layer
conv_kernel_sizes = [8, 5, 3]
conv_strides = [1, 1, 1]
embed_size = 128
num_heads = 4
num_transformer_blocks = 2
mlp_hidden_dim = 128
dropout = 0.1

model = ConvTransformerClassifier(
    input_size, num_classes, conv_channels, conv_kernel_sizes,
    conv_strides, embed_size, num_heads, num_transformer_blocks,
    mlp_hidden_dim, dropout,
)

# Prepare input data: (batch_size, input_channels, num_samples)
input_tensor = torch.randn((32, 9, 16000))

# Forward pass
output = model(input_tensor)
print(output.shape)
```
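With a batch of 32 and `num_classes = 10`, the printed shape should be `torch.Size([32, 10])`: one score per class for each example in the batch. The sketch below shows one way to train the classifier, continuing from the snippet above with random tensors as stand-in data; it assumes the model returns raw (unnormalized) logits, which is what `nn.CrossEntropyLoss` expects.

```python
import torch
import torch.nn as nn

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()  # combines log-softmax and NLL loss

model.train()
for step in range(10):  # replace the random tensors with batches from a DataLoader
    inputs = torch.randn(32, 9, 16000)             # (batch, channels, samples)
    labels = torch.randint(0, num_classes, (32,))  # integer class targets
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss {loss.item():.4f}")
```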
The ConvTransformer model has been evaluated on an urban sound dataset with various hyperparameters and models. Detailed comparative results can be found in the `audio_transformer_from_scratch` Jupyter notebook.
This project is licensed under the MIT License.
We would like to acknowledge the resources and papers that inspired and influenced the development of the ConvTransformer model.