# Image Captioning using Transformers

This project implements an image captioning model that combines a CNN feature extractor with a Transformer encoder-decoder. The architecture is as follows:

## Architecture

*(Architecture diagram: ImageCaptioningArchitecture.drawio)*

1. **CNN as Feature Extractor**
   - EfficientNetB0 serves as the Convolutional Neural Network (CNN) that extracts feature maps from the input images.
2. **Transformer Encoder**
   - The extracted CNN features are passed to an encoder built from stacked Transformer layers.
   - These layers refine the features and prepare them for decoding.
3. **Transformer Decoder**
   - The decoder, also composed of Transformer layers, generates a caption token by token, attending to the encoded image features.
4. **Data**
   - Each input image is paired with five reference captions.
   - Data preprocessing and augmentation techniques are applied to improve the model's robustness and performance.
5. **Model Training**
   - After training, the results, metrics, and the trained model itself are logged with MLflow and DagsHub for efficient experiment tracking and model management.
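The encoder/decoder flow described above can be sketched in plain NumPy. This is only an illustration of how features move through the model; all dimensions (patch count, embedding size, caption length) are assumed values, not taken from the project.

```python
# Minimal sketch of the CNN-features -> encoder -> decoder data flow.
# All sizes below are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention, the core op of every Transformer layer."""
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
num_patches, d_model, cap_len = 49, 256, 12  # hypothetical sizes

# 1. Stand-in for EfficientNetB0 output: one feature vector per spatial patch.
cnn_features = rng.standard_normal((1, num_patches, d_model))

# 2. Encoder: self-attention over the image features.
encoded = attention(cnn_features, cnn_features, cnn_features)

# 3. Decoder: caption token embeddings cross-attend to the encoded features.
caption_emb = rng.standard_normal((1, cap_len, d_model))
decoded = attention(caption_emb, encoded, encoded)

print(decoded.shape)  # (1, 12, 256)
```

In the real model each attention call sits inside a full Transformer layer (multi-head attention, residual connections, feed-forward blocks), and the decoder is additionally masked so each token only attends to earlier ones.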

## Output Examples

## MLflow Experiment Tracking

*(Screenshots of MLflow experiment runs)*