This project implements an image captioning model that combines a CNN with a Transformer. The architecture is as follows:
CNN as Feature Extractor:
- We use EfficientNetB0 as the Convolutional Neural Network (CNN) to extract features from the images.
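A minimal sketch (with random data) of how the CNN's output becomes the Transformer's input. In the real pipeline the feature map would come from Keras, e.g. `tf.keras.applications.EfficientNetB0(include_top=False)`; the shapes below assume a 224x224 input, for which EfficientNetB0's final feature map is 7x7x1280.

```python
import numpy as np

# Stand-in for the EfficientNetB0 feature map (random data here);
# for a 224x224 image the top conv block outputs shape (7, 7, 1280).
feature_map = np.random.rand(7, 7, 1280).astype("float32")

# Flatten the 7x7 spatial grid into 49 "image tokens" of width 1280 --
# the sequence the Transformer encoder consumes.
image_tokens = feature_map.reshape(-1, feature_map.shape[-1])
print(image_tokens.shape)  # (49, 1280)
```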
Transformer Encoder:
- The feature sequence extracted by the CNN is passed to the encoder, which consists of a stack of Transformer layers.
- These layers apply self-attention over the image features, producing the contextualized representation the decoder attends to.
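The core operation of each encoder layer can be sketched as scaled dot-product self-attention. This is a simplified single-head version with identity projections and a toy feature width of 64, not the project's actual layer:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoder_self_attention(x):
    """Single-head self-attention with identity Q/K/V projections:
    every image token attends to every other image token."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)  # (seq, seq) pairwise similarities
    weights = softmax(scores)      # each row is a distribution over tokens
    return weights @ x             # contextualized tokens, same shape as x

tokens = np.random.rand(49, 64)   # 49 image tokens, toy width 64
out = encoder_self_attention(tokens)
```

A real encoder layer adds learned projections, multiple heads, residual connections, layer normalization, and a feed-forward block around this core.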
Transformer Decoder:
- The decoder, also built from Transformer layers, generates captions token by token, conditioning on the features provided by the encoder.
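The two attention steps inside a decoder layer can be sketched as below: causally masked self-attention over the caption tokens generated so far, followed by cross-attention over the encoder's image tokens. As in the encoder sketch, this uses a single head, identity projections, and toy dimensions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, kv, mask=None):
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # hide masked positions
    return softmax(scores) @ kv

T, d = 5, 64                          # 5 caption tokens generated so far
caption = np.random.rand(T, d)
encoder_out = np.random.rand(49, d)   # image tokens from the encoder

causal = np.tril(np.ones((T, T), dtype=bool))  # token t sees tokens <= t
x = attend(caption, caption, mask=causal)      # masked self-attention
x = attend(x, encoder_out)                     # cross-attention over image
```

The causal mask is what lets the decoder be trained on whole captions while still generating left to right at inference time.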
Data:
- Each input image is associated with five captions.
- Data preprocessing and augmentation techniques are applied to improve the model's robustness and performance.
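A sketch of caption preprocessing for the five reference captions per image: lowercase, strip punctuation, and wrap each caption in start/end tokens so the decoder knows where a caption begins and ends. The token names and cleaning rules here are illustrative, not taken from the project:

```python
import re

def preprocess_caption(caption):
    # Illustrative cleaning: lowercase, drop punctuation, collapse spaces.
    caption = caption.lower().strip()
    caption = re.sub(r"[^a-z0-9 ]", "", caption)
    caption = re.sub(r"\s+", " ", caption).strip()
    # Hypothetical start/end markers for the decoder.
    return f"<start> {caption} <end>"

# Each image maps to five captions, all preprocessed the same way.
captions = ["A dog runs.", "The dog is running!", "A brown dog outside.",
            "Dog sprinting on grass.", "A running dog."]
cleaned = [preprocess_caption(c) for c in captions]
print(cleaned[0])  # <start> a dog runs <end>
```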
Model Training:
- Once the model is trained, the results, metrics, and the trained model itself are logged to MLflow and DagsHub for experiment tracking and model management.
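A sketch of what a logging step might look like. The parameter and metric values below are placeholders, not the project's actual results; the commented lines show the corresponding MLflow tracking calls, with the DagsHub URI left as a placeholder:

```python
# Placeholder run outputs -- illustrative values, not real results.
run_params = {"backbone": "EfficientNetB0", "embed_dim": 512, "epochs": 30}
run_metrics = {"train_loss": 1.42, "val_loss": 1.87}

# The corresponding MLflow calls (DagsHub exposes an MLflow endpoint):
# import mlflow
# mlflow.set_tracking_uri("https://dagshub.com/<user>/<repo>.mlflow")
# with mlflow.start_run():
#     mlflow.log_params(run_params)
#     mlflow.log_metrics(run_metrics)
#     mlflow.log_artifact("caption_model.keras")  # the trained model file
```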