This repository trains an image captioning model using a CNN and Transformers.

skp-1997/Image-Captioning-with-MLFLOW

Image Captioning using Transformers

This project implements an image captioning model that combines a CNN with a Transformer encoder and decoder. The pipeline is as follows:

Architecture

[Figure: image captioning architecture diagram]

  1. CNN as Feature Extractor:

    • We use EfficientNetB0 as the Convolutional Neural Network (CNN) to extract features from the images.
  2. Transformer Encoder:

    • The extracted features from the CNN are passed to the encoder, which consists of multiple Transformer layers.
    • These layers apply self-attention over the image features, producing a contextualized sequence that the decoder can attend to.
  3. Transformer Decoder:

    • The decoder, which also comprises Transformer layers, generates each caption one token at a time while attending to the features provided by the encoder.
  4. Data:

    • Each input image is associated with five captions.
    • Data preprocessing and augmentation techniques are applied to improve the model's robustness and performance.
  5. Model Training:

    • After training, the results, metrics, and the trained model itself are logged with MLflow and DagsHub for experiment tracking.
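
The first three steps above can be sketched in Keras as follows. This is a minimal illustration under stated assumptions, not the repository's actual code: the 299x299 input size, the vocabulary size, sequence length, layer widths, and the names `num_image_tokens` and `build_captioner` are all hypothetical, and only one encoder block and one decoder block are shown where the project uses several layers. Positional embeddings are omitted for brevity.

```python
import math

EFFICIENTNET_B0_CHANNELS = 1280  # channel depth of EfficientNetB0's final feature map
EFFICIENTNET_STRIDE = 32         # total downsampling factor of the backbone


def num_image_tokens(height, width):
    """Number of feature vectors the encoder receives for one image.

    Assumes the backbone rounds up at each of its five stride-2 stages.
    """
    return (math.ceil(height / EFFICIENTNET_STRIDE)
            * math.ceil(width / EFFICIENTNET_STRIDE))


def build_captioner(image_size=(299, 299), vocab_size=10000, seq_len=24,
                    embed_dim=256, num_heads=2, ff_dim=512):
    """Build a small CNN + Transformer captioner (all sizes are assumptions)."""
    # Imported lazily so the shape helper above works without TensorFlow.
    from tensorflow import keras
    from tensorflow.keras import layers

    # 1. CNN feature extractor. weights=None avoids a download here;
    #    pretrained ImageNet weights are a common alternative.
    images = keras.Input(shape=(*image_size, 3), name="image")
    cnn = keras.applications.EfficientNetB0(
        include_top=False, weights=None, input_shape=(*image_size, 3))
    features = cnn(images)
    # Flatten the spatial grid into a sequence of feature vectors.
    features = layers.Reshape((-1, EFFICIENTNET_B0_CHANNELS))(features)

    # 2. One Transformer encoder block: self-attention over image features.
    x = layers.Dense(embed_dim, activation="relu")(features)
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)(x, x)
    encoded = layers.LayerNormalization()(layers.Add()([x, attn]))

    # 3. One Transformer decoder block: causal self-attention over the
    #    caption tokens, then cross-attention to the encoded image.
    tokens = keras.Input(shape=(seq_len,), dtype="int32", name="tokens")
    y = layers.Embedding(vocab_size, embed_dim)(tokens)
    self_attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)(
        y, y, use_causal_mask=True)
    y = layers.LayerNormalization()(layers.Add()([y, self_attn]))
    cross_attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)(
        y, encoded)
    y = layers.LayerNormalization()(layers.Add()([y, cross_attn]))
    ff = layers.Dense(ff_dim, activation="relu")(y)
    ff = layers.Dense(embed_dim)(ff)
    y = layers.LayerNormalization()(layers.Add()([y, ff]))

    # Per-position logits over the vocabulary (next-token prediction).
    logits = layers.Dense(vocab_size)(y)
    return keras.Model([images, tokens], logits, name="captioner_sketch")
```

With TensorFlow installed, `build_captioner().summary()` prints the layer shapes. The helper `num_image_tokens` shows how the input resolution determines the length of the sequence the encoder processes.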

Output Examples

MLflow Experiment Tracking

[Screenshots: MLflow experiment tracking UI, captured 2024-08-22]
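
A run like the ones tracked here could be logged along the following lines. This is a hedged sketch rather than the project's actual logging code: the function names, the experiment name, and the config layout are hypothetical. To send runs to DagsHub, one would typically pass a tracking URI of the form `https://dagshub.com/<user>/<repo>.mlflow` and supply credentials through the `MLFLOW_TRACKING_USERNAME` and `MLFLOW_TRACKING_PASSWORD` environment variables; the user and repository parts are left as placeholders because the source does not state them.

```python
import os


def flatten_config(config, prefix=""):
    """Flatten a nested config dict into dotted keys.

    This gives each hyperparameter its own entry when passed to
    mlflow.log_params, instead of one stringified nested dict.
    """
    flat = {}
    for key, value in config.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_config(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def log_training_run(config, history, model_path, tracking_uri=None):
    """Log hyperparameters, per-epoch metrics, and the saved model file.

    history maps a metric name to a list of per-epoch values, for example
    the .history attribute of the object returned by Keras model.fit().
    """
    # Imported lazily so flatten_config can be used without MLflow installed.
    import mlflow

    if tracking_uri:
        mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment("image-captioning")  # hypothetical experiment name

    with mlflow.start_run():
        mlflow.log_params(flatten_config(config))
        for name, values in history.items():
            for epoch, value in enumerate(values):
                mlflow.log_metric(name, float(value), step=epoch)
        # Attach the saved model file to the run if it exists on disk.
        if os.path.exists(model_path):
            mlflow.log_artifact(model_path)
```

For example, `flatten_config({"model": {"embed_dim": 256}, "epochs": 10})` yields the keys `model.embed_dim` and `epochs`, which then appear as separate parameters in the MLflow UI.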