This project implements an image captioning model that combines a CNN with a Transformer. The architecture is as follows:
CNN as Feature Extractor:
- We use EfficientNetB0 as the Convolutional Neural Network (CNN) to extract features from the images.
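A minimal sketch (with random data) of how the CNN's output becomes the Transformer's input. In the real pipeline the feature map would come from Keras, e.g. `tf.keras.applications.EfficientNetB0(include_top=False)`; the shapes below assume a 224x224 input, for which EfficientNetB0's final feature map is 7x7x1280.

```python
import numpy as np

# Stand-in for the EfficientNetB0 feature map (random data here);
# for a 224x224 image the top conv block outputs shape (7, 7, 1280).
feature_map = np.random.rand(7, 7, 1280).astype("float32")

# Flatten the 7x7 spatial grid into 49 "image tokens" of width 1280 --
# the sequence the Transformer encoder consumes.
image_tokens = feature_map.reshape(-1, feature_map.shape[-1])
print(image_tokens.shape)  # (49, 1280)
```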
Transformer Encoder:
- The feature sequence extracted by the CNN is passed to the encoder, which consists of a stack of Transformer layers.
- These layers apply self-attention over the image features, producing the contextualized representation the decoder attends to.
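The core operation of each encoder layer can be sketched as scaled dot-product self-attention. This is a simplified single-head version with identity projections and a toy feature width of 64, not the project's actual layer:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoder_self_attention(x):
    """Single-head self-attention with identity Q/K/V projections:
    every image token attends to every other image token."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)  # (seq, seq) pairwise similarities
    weights = softmax(scores)      # each row is a distribution over tokens
    return weights @ x             # contextualized tokens, same shape as x

tokens = np.random.rand(49, 64)   # 49 image tokens, toy width 64
out = encoder_self_attention(tokens)
```

A real encoder layer adds learned projections, multiple heads, residual connections, layer normalization, and a feed-forward block around this core.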
Transformer Decoder:
- The decoder, also built from Transformer layers, generates captions token by token, conditioning on the features provided by the encoder.
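The two attention steps inside a decoder layer can be sketched as below: causally masked self-attention over the caption tokens generated so far, followed by cross-attention over the encoder's image tokens. As in the encoder sketch, this uses a single head, identity projections, and toy dimensions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, kv, mask=None):
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # hide masked positions
    return softmax(scores) @ kv

T, d = 5, 64                          # 5 caption tokens generated so far
caption = np.random.rand(T, d)
encoder_out = np.random.rand(49, d)   # image tokens from the encoder

causal = np.tril(np.ones((T, T), dtype=bool))  # token t sees tokens <= t
x = attend(caption, caption, mask=causal)      # masked self-attention
x = attend(x, encoder_out)                     # cross-attention over image
```

The causal mask is what lets the decoder be trained on whole captions while still generating left to right at inference time.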
Data:
- Each input image is associated with five captions.
- Data preprocessing and augmentation techniques are applied to improve the model's robustness and performance.
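A sketch of caption preprocessing for the five reference captions per image: lowercase, strip punctuation, and wrap each caption in start/end tokens so the decoder knows where a caption begins and ends. The token names and cleaning rules here are illustrative, not taken from the project:

```python
import re

def preprocess_caption(caption):
    # Illustrative cleaning: lowercase, drop punctuation, collapse spaces.
    caption = caption.lower().strip()
    caption = re.sub(r"[^a-z0-9 ]", "", caption)
    caption = re.sub(r"\s+", " ", caption).strip()
    # Hypothetical start/end markers for the decoder.
    return f"<start> {caption} <end>"

# Each image maps to five captions, all preprocessed the same way.
captions = ["A dog runs.", "The dog is running!", "A brown dog outside.",
            "Dog sprinting on grass.", "A running dog."]
cleaned = [preprocess_caption(c) for c in captions]
print(cleaned[0])  # <start> a dog runs <end>
```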
Model Training:
- Once the model is trained, the results, metrics, and the trained model itself are logged to MLflow and DagsHub for experiment tracking and model management.
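A sketch of what a logging step might look like. The parameter and metric values below are placeholders, not the project's actual results; the commented lines show the corresponding MLflow tracking calls, with the DagsHub URI left as a placeholder:

```python
# Placeholder run outputs -- illustrative values, not real results.
run_params = {"backbone": "EfficientNetB0", "embed_dim": 512, "epochs": 30}
run_metrics = {"train_loss": 1.42, "val_loss": 1.87}

# The corresponding MLflow calls (DagsHub exposes an MLflow endpoint):
# import mlflow
# mlflow.set_tracking_uri("https://dagshub.com/<user>/<repo>.mlflow")
# with mlflow.start_run():
#     mlflow.log_params(run_params)
#     mlflow.log_metrics(run_metrics)
#     mlflow.log_artifact("caption_model.keras")  # the trained model file
```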