This project demonstrates an advanced image captioning system built using a Transformer model. The model is trained on the COCO 2017 dataset and is capable of generating descriptive captions for images. The implementation leverages TensorFlow and Keras for constructing the model and performing the training process.
The goal of this project is to generate captions for images using a Transformer-based architecture. The system consists of the following key components:
- Dataset Preprocessing: The COCO 2017 dataset is used to train the model. Captions are preprocessed by lowercasing, removing punctuation, and adding start and end tokens (a minimal sketch follows this list).
- Model Architecture:
  - CNN Encoder: The project uses an InceptionV3 model pre-trained on ImageNet to extract image features. These features are then reshaped and passed to the Transformer encoder (see the feature-extraction sketch after this list).
  - Transformer Encoder and Decoder: The encoder processes the image features, while the decoder generates the caption word by word. Multi-head attention mechanisms are employed in both the encoder and decoder to capture relationships in the data.
  - Embeddings: Word embeddings are used to convert words into dense vectors, and positional embeddings are applied to retain the order of the words (an embedding-layer sketch also follows this list).
- Training Strategy:
  - Checkpointing: Model checkpoints are saved during training to allow resuming from the last saved point.
  - Data Augmentation: Image augmentation techniques such as random flipping, rotation, and contrast adjustment are used to improve the model's robustness.
  - Loss and Metrics: A custom training loop calculates the loss and accuracy during training and validation. The loss function used is sparse categorical cross-entropy.
- Model Inference: The trained model can generate captions for new images by passing the image through the encoder and generating words sequentially through the decoder.
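As a rough illustration of the caption preprocessing step above, the snippet below lowercases a caption, strips punctuation, and wraps it in start/end tokens. The token strings (`<start>`, `<end>`) and the exact punctuation handling are assumptions, not necessarily what the notebook uses.

```python
import re
import string

def preprocess_caption(caption: str) -> str:
    """Lowercase, strip punctuation, and add start/end tokens (illustrative only)."""
    caption = caption.lower().strip()
    # Drop punctuation; the notebook may use a different character set or regex.
    caption = re.sub(f"[{re.escape(string.punctuation)}]", "", caption)
    # Collapse any whitespace runs left behind.
    caption = re.sub(r"\s+", " ", caption).strip()
    return f"<start> {caption} <end>"

print(preprocess_caption("A man riding a bike down a street."))
# -> <start> a man riding a bike down a street <end>
```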
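The CNN encoder can be pictured as follows: a headless InceptionV3 turns a 299×299 image into an 8×8×2048 feature map, which is flattened into a sequence of 64 feature vectors for the Transformer encoder. This is a sketch of the standard pattern; the exact preprocessing and whether the CNN is frozen are assumptions.

```python
import tensorflow as tf

# InceptionV3 without its classification head, with ImageNet weights.
cnn = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet")
cnn.trainable = False  # treated as a fixed feature extractor in this sketch

def extract_features(images: tf.Tensor) -> tf.Tensor:
    """images: (batch, 299, 299, 3), already passed through
    tf.keras.applications.inception_v3.preprocess_input."""
    fmap = cnn(images)                                    # (batch, 8, 8, 2048)
    batch = tf.shape(fmap)[0]
    return tf.reshape(fmap, (batch, -1, fmap.shape[-1]))  # (batch, 64, 2048)
```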
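For the embedding step, one common pattern (shown here only as a hedged sketch, with placeholder sizes) sums a learned word embedding with a learned positional embedding so the decoder retains word order:

```python
import tensorflow as tf

class TokenAndPositionEmbedding(tf.keras.layers.Layer):
    """Word embedding plus learned positional embedding; sizes are placeholders."""

    def __init__(self, vocab_size=10000, max_len=50, embed_dim=256):
        super().__init__()
        self.token_emb = tf.keras.layers.Embedding(vocab_size, embed_dim)
        self.pos_emb = tf.keras.layers.Embedding(max_len, embed_dim)

    def call(self, token_ids):
        positions = tf.range(tf.shape(token_ids)[-1])     # 0, 1, ..., seq_len - 1
        return self.token_emb(token_ids) + self.pos_emb(positions)
```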
- GPU Memory Management: The implementation restricts TensorFlow to a limited amount of GPU memory to prevent out-of-memory errors (sketched below).
- Custom Training Loop: The model is trained with a custom training loop, which gives more flexibility in handling data, applying augmentations, and updating metrics (see the training-step sketch below).
- Checkpointing and Resuming Training: The model saves checkpoints during training and can resume from the latest checkpoint, so progress is not lost to interruptions (see the checkpoint sketch below).
- Image Augmentation: To improve generalization, various image augmentation techniques are applied during training (see the augmentation sketch below).
- Inference on Custom Images: The model can generate captions for any input image, whether provided via a URL or a local file.
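The GPU memory restriction can be expressed with TensorFlow's logical device configuration; the 4 GB cap below is an arbitrary placeholder, not the limit used in the notebook.

```python
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Cap the first GPU at roughly 4 GB (placeholder value) to avoid OOM errors.
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=4096)],
    )
```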
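A custom training step for this kind of model usually combines teacher forcing with a sparse categorical cross-entropy loss that masks padded positions. The sketch below assumes padding id 0 and a model callable that takes `(image_features, caption_inputs)`; both are assumptions about the notebook's interfaces.

```python
import tensorflow as tf

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction="none")

def masked_loss(y_true, y_pred):
    """Cross-entropy averaged over non-padded positions (padding id assumed to be 0)."""
    loss = loss_object(y_true, y_pred)
    mask = tf.cast(y_true != 0, loss.dtype)
    return tf.reduce_sum(loss * mask) / tf.reduce_sum(mask)

@tf.function
def train_step(model, optimizer, image_features, captions):
    """One step: predict token t+1 from tokens up to t (teacher forcing)."""
    inputs, targets = captions[:, :-1], captions[:, 1:]
    with tf.GradientTape() as tape:
        predictions = model((image_features, inputs), training=True)
        loss = masked_loss(targets, predictions)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```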
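Checkpointing and resuming map naturally onto `tf.train.Checkpoint` and `tf.train.CheckpointManager`; the objects being tracked and the directory name below are placeholders.

```python
import tensorflow as tf

# Placeholders standing in for the Transformer model and its optimizer.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam()

ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(ckpt, "./checkpoints", max_to_keep=3)

# Resume from the latest checkpoint, if any, before training starts.
if manager.latest_checkpoint:
    ckpt.restore(manager.latest_checkpoint)

# Inside the training loop, save periodically (e.g. once per epoch):
manager.save()
```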
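The augmentation pipeline maps onto Keras preprocessing layers; the specific factors below are illustrative guesses.

```python
import tensorflow as tf

# Random flip, rotation, and contrast adjustment; factors are illustrative only.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomContrast(0.2),
])

images = tf.random.uniform((2, 299, 299, 3))  # dummy batch
augmented = augment(images, training=True)    # augmentation only applies in training mode
```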
- Dataset Preparation: Ensure the COCO 2017 dataset is available and the necessary annotations are loaded and preprocessed.
- Model Training: Train the model by running the cells in the notebook. The training process includes saving checkpoints, which can be used to resume training if interrupted.
- Inference: Use the trained model to generate captions for new images. You can provide an image via a URL or a local file.
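One way to accept either a URL or a local file at inference time is sketched below; the resizing to 299×299 matches InceptionV3's expected input, while the helper name and caching behaviour are assumptions rather than the notebook's actual code.

```python
import tensorflow as tf

def load_image(path_or_url: str) -> tf.Tensor:
    """Load an image from a local path or a URL and prepare it for InceptionV3."""
    if path_or_url.startswith(("http://", "https://")):
        # Download to the local Keras cache and use that file.
        path_or_url = tf.keras.utils.get_file("input_image.jpg", origin=path_or_url)
    img = tf.io.read_file(path_or_url)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))                 # InceptionV3 input size
    return tf.keras.applications.inception_v3.preprocess_input(img)
```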
The model is capable of generating coherent captions for various images. Example captions generated by the model include:
- "A man riding a bike down a street"
- "A boat with a lot of people on it"
Potential improvements include:
- Fine-tuning the model on a larger vocabulary or incorporating external datasets to improve caption quality.
- Experimenting with different architectures and hyperparameters to enhance model performance.
- Implementing advanced techniques like beam search during inference to generate more accurate captions.
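Beam search could be layered on top of the trained decoder roughly as follows. The `step_fn` below is a stand-in for a call that returns next-token log-probabilities given the partial caption; the beam width, vocabulary, and token ids are all placeholder values.

```python
import numpy as np

def beam_search(step_fn, start_id, end_id, beam_width=3, max_len=20):
    """Generic beam search over sequences of token ids (illustrative sketch)."""
    beams = [([start_id], 0.0)]                        # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:                      # carry finished beams over unchanged
                candidates.append((seq, score))
                continue
            log_probs = step_fn(seq)                   # next-token log-probabilities
            for tok in np.argsort(log_probs)[-beam_width:]:
                candidates.append((seq + [int(tok)], score + float(log_probs[tok])))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == end_id for seq, _ in beams):
            break
    return beams[0][0]

# Toy usage with a random "decoder" over a 10-token vocabulary (id 9 = end token).
rng = np.random.default_rng(0)
print(beam_search(lambda seq: np.log(rng.dirichlet(np.ones(10))), start_id=0, end_id=9))
```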
This project demonstrates the power of Transformers in the field of image captioning. The combination of a pre-trained CNN for feature extraction and a Transformer for caption generation provides a robust framework for generating descriptive image captions.