Image Captioning with Transformer

This project demonstrates an advanced image captioning system built using a Transformer model. The model is trained on the COCO 2017 dataset and is capable of generating descriptive captions for images. The implementation uses TensorFlow and Keras to build and train the model.

Project Overview

The goal of this project is to generate captions for images using a Transformer-based architecture. The system consists of the following key components (illustrative code sketches for each component follow the list):

  • Dataset Preprocessing: The COCO 2017 dataset is used to train the model. Captions are preprocessed by lowercasing, removing punctuation, and adding start and end tokens.

  • Model Architecture:

    • CNN Encoder: The project uses an InceptionV3 model pre-trained on ImageNet to extract image features. These features are then reshaped and passed to the Transformer encoder.
    • Transformer Encoder and Decoder: The encoder processes the image features, while the decoder generates the caption word by word. Multi-head attention mechanisms are employed in both the encoder and decoder to capture relationships in the data.
    • Embeddings: Word embeddings are used to convert words into dense vectors, and positional embeddings are applied to retain the order of the words.
  • Training Strategy:

    • Checkpointing: Model checkpoints are saved during training to allow resuming from the last saved point.
    • Data Augmentation: Image augmentation techniques such as random flipping, rotation, and contrast adjustment are used to improve the model's robustness.
    • Loss and Metrics: A custom training loop calculates the loss and accuracy during training and validation. The loss function used is sparse categorical cross-entropy.
  • Model Inference: The trained model can generate captions for new images by passing the image through the encoder and generating words sequentially through the decoder.
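
As a rough illustration of the caption preprocessing described above, the sketch below lowercases captions, strips punctuation, and wraps each one in start and end tokens via a Keras TextVectorization layer. The vocabulary size, sequence length, and the `[start]`/`[end]` token spellings are assumptions, not values taken from the notebook.

```python
import re
import string
import tensorflow as tf

def standardize(text):
    # Lowercase, strip punctuation, and wrap the caption in start/end tokens.
    text = tf.strings.lower(text)
    text = tf.strings.regex_replace(text, f"[{re.escape(string.punctuation)}]", "")
    return tf.strings.join(["[start]", text, "[end]"], separator=" ")

# Hypothetical vocabulary size and caption length; adjust to the notebook's settings.
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=10000,
    output_sequence_length=40,
    standardize=standardize,
)
# vectorizer.adapt(caption_dataset)  # caption_dataset: tf.data.Dataset of raw caption strings
```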
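
The CNN encoder step can be sketched as follows: InceptionV3 (pre-trained on ImageNet, classification head removed) produces an 8×8×2048 feature map that is flattened into a sequence of 64 feature vectors for the Transformer encoder. The helper name and the assumption that images are already resized to 299×299 and preprocessed are illustrative, not the notebook's exact pipeline.

```python
import tensorflow as tf

# Pre-trained InceptionV3 without its classification head.
base = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet")
feature_extractor = tf.keras.Model(base.input, base.output)

def extract_features(images):
    # images: batch resized to 299x299 and preprocessed with
    # tf.keras.applications.inception_v3.preprocess_input
    features = feature_extractor(images)                 # (batch, 8, 8, 2048)
    batch = tf.shape(features)[0]
    return tf.reshape(features, (batch, -1, features.shape[-1]))  # (batch, 64, 2048)
```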
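
A common way to combine the word and positional embeddings mentioned above is to sum a learned token embedding with a learned position embedding. The layer below is a generic sketch of that idea, with illustrative dimensions rather than the notebook's exact implementation.

```python
import tensorflow as tf

class TokenAndPositionEmbedding(tf.keras.layers.Layer):
    """Sums learned word embeddings with learned positional embeddings."""
    def __init__(self, vocab_size, max_len, embed_dim):
        super().__init__()
        self.token_emb = tf.keras.layers.Embedding(vocab_size, embed_dim)
        self.pos_emb = tf.keras.layers.Embedding(max_len, embed_dim)

    def call(self, tokens):
        # One position index per time step in the caption.
        positions = tf.range(start=0, limit=tf.shape(tokens)[-1], delta=1)
        return self.token_emb(tokens) + self.pos_emb(positions)
```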
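
Checkpointing and resuming can be handled with tf.train.Checkpoint and tf.train.CheckpointManager, roughly as sketched below. The checkpoint directory, the tracked objects, and the placeholder model are assumptions, not the notebook's actual names.

```python
import tensorflow as tf

# Placeholder model/optimizer so the snippet runs on its own; in the notebook these
# would be the Transformer captioner and its optimizer.
caption_model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam()

ckpt = tf.train.Checkpoint(model=caption_model, optimizer=optimizer)
manager = tf.train.CheckpointManager(ckpt, directory="./checkpoints", max_to_keep=3)

# Restore the latest checkpoint if training was interrupted; a no-op on a fresh run.
ckpt.restore(manager.latest_checkpoint)

# Inside the training loop, save periodically (e.g. once per epoch):
manager.save()
```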
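
The augmentation techniques listed above map directly onto Keras preprocessing layers; the factors below are illustrative, not the notebook's exact settings.

```python
import tensorflow as tf

augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),   # random flipping
    tf.keras.layers.RandomRotation(0.1),        # random rotation
    tf.keras.layers.RandomContrast(0.2),        # contrast adjustment
])

# Applied only during training, e.g.:
# images = augmentation(images, training=True)
```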
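
For the loss, a typical captioning setup uses sparse categorical cross-entropy with padded positions masked out. The masking and the from_logits setting below are assumptions about the notebook's custom training loop, not a transcription of it.

```python
import tensorflow as tf

def masked_loss(y_true, y_pred):
    # Sparse categorical cross-entropy that ignores padded (index 0) positions.
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=False, reduction="none")
    loss = loss_fn(y_true, y_pred)
    mask = tf.cast(y_true != 0, loss.dtype)
    loss *= mask
    return tf.reduce_sum(loss) / tf.reduce_sum(mask)
```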
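
Sequential caption generation at inference time can be sketched as a greedy decoding loop: feed the encoded image and the tokens generated so far to the decoder, pick the most likely next word, and stop at the end token. The function signature, token names, and decoder call below are hypothetical.

```python
import tensorflow as tf

def generate_caption(image, encoder, decoder, vectorizer, max_len=40):
    """Greedy decoding sketch; names are illustrative, not the notebook's API."""
    vocab = vectorizer.get_vocabulary()
    index_to_word = dict(enumerate(vocab))
    word_to_index = {w: i for i, w in index_to_word.items()}

    img_features = encoder(image)                 # encoded image features
    tokens = [word_to_index["[start]"]]
    for _ in range(max_len):
        seq = tf.constant([tokens])
        preds = decoder(seq, img_features)        # assumed shape: (1, len(tokens), vocab)
        next_id = int(tf.argmax(preds[0, -1, :]))
        if index_to_word[next_id] == "[end]":
            break
        tokens.append(next_id)
    return " ".join(index_to_word[t] for t in tokens[1:])
```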

Key Features

  • GPU Memory Management: The implementation limits the amount of GPU memory TensorFlow can allocate to prevent out-of-memory errors (see the sketch after this list).
  • Custom Training Loop: The model is trained using a custom training loop that allows for more flexibility in handling data, applying augmentations, and updating metrics.
  • Checkpointing and Resuming Training: The model saves checkpoints during training and can resume from the latest checkpoint, ensuring progress is not lost due to interruptions.
  • Image Augmentation: To improve generalization, various image augmentation techniques are applied during training.
  • Inference on Custom Images: The model can generate captions for any input image, whether provided via a URL or a local file.
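
GPU memory can be capped with a logical device configuration, for example as below; the 4 GB limit is an assumption, not necessarily the value used in the notebook.

```python
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Cap GPU memory at roughly 4 GB to avoid out-of-memory errors.
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=4096)],
    )
```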

Getting Started

  1. Dataset Preparation: Ensure the COCO 2017 dataset is available and the necessary annotations are loaded and preprocessed.
  2. Model Training: Train the model by running the cells in the notebook. The training process includes saving checkpoints, which can be used to resume training if interrupted.
  3. Inference: Use the trained model to generate captions for new images, provided either via a URL or a local file (an image-loading sketch follows this list).
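
For inference on a URL or a local file, a small loader along the following lines can normalize both cases before the image reaches the encoder. The function name, the JPEG assumption, and the 299×299 InceptionV3 input size are assumptions for illustration.

```python
import tensorflow as tf

def load_image(path_or_url, image_size=(299, 299)):
    # Fetch a remote image if a URL is given, otherwise read a local file.
    if path_or_url.startswith("http"):
        path_or_url = tf.keras.utils.get_file(origin=path_or_url)
    img = tf.io.read_file(path_or_url)
    img = tf.image.decode_jpeg(img, channels=3)     # assumes a JPEG image
    img = tf.image.resize(img, image_size)
    return tf.keras.applications.inception_v3.preprocess_input(img)
```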

Results

The model is capable of generating coherent captions for various images. Example captions generated by the model include:

  • "A man riding a bike down a street"
  • "A boat with a lot of people on it"

Future Work

Potential improvements include:

  • Fine-tuning the model on a larger vocabulary or incorporating external datasets to improve caption quality.
  • Experimenting with different architectures and hyperparameters to enhance model performance.
  • Implementing advanced techniques like beam search during inference to generate more accurate captions.

Conclusion

This project demonstrates the power of Transformers in the field of image captioning. The combination of a pre-trained CNN for feature extraction and a Transformer for caption generation provides a robust framework for generating descriptive image captions.

