The objective of this repository is to implement OpenAI's CLIP paper, *Learning Transferable Visual Models From Natural Language Supervision*, from scratch using PyTorch.
Read Paper: https://arxiv.org/pdf/2103.00020.pdf
- A model designed for learning joint representations of images and text.
- Leverages a shared embedding space, where images and their corresponding textual descriptions are mapped to similar points.
- Uses a contrastive learning objective to train the model: it maximizes the similarity between positive pairs (correct image-text pairs) and minimizes the similarity between negative pairs (incorrect pairs). A minimal sketch of this objective follows below.
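A minimal sketch of this symmetric contrastive objective in PyTorch; the `temperature` value of 0.07 and the use of plain cross-entropy over cosine similarities are assumptions here, not necessarily the exact settings of this repository:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_embeddings, text_embeddings, temperature=0.07):
    # temperature=0.07 is an assumed value, not taken from this repository
    # L2-normalize so that dot products become cosine similarities
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)

    # (batch_size, batch_size) similarity matrix; the diagonal holds the positive pairs
    logits = image_embeddings @ text_embeddings.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: classify the matching text for each image and vice versa
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2
```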
- The distilbert-base-uncased model is used for encoding the texts.
- The resulting text encoder embedding will be of shape `(batch_size, text_embedding)` -> `(32, 768)`
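A minimal sketch of such a text encoder, assuming the Hugging Face `transformers` library and `[CLS]`-token pooling (the pooling strategy is an assumption):

```python
import torch
from transformers import DistilBertModel, DistilBertTokenizer

class TextEncoder(torch.nn.Module):
    def __init__(self, model_name="distilbert-base-uncased"):
        super().__init__()
        self.model = DistilBertModel.from_pretrained(model_name)

    def forward(self, input_ids, attention_mask):
        output = self.model(input_ids=input_ids, attention_mask=attention_mask)
        # Use the hidden state of the [CLS] token as the sentence embedding
        return output.last_hidden_state[:, 0, :]  # (batch_size, 768)

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
batch = tokenizer(["people sitting near the beach"] * 32,
                  padding=True, return_tensors="pt")
encoder = TextEncoder()
print(encoder(batch["input_ids"], batch["attention_mask"]).shape)  # torch.Size([32, 768])
```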
- The pretrained resnet50 model is used for encoding the images.
- The resulting image encoder embedding will be of shape `(batch_size, image_embedding)` -> `(32, 2048)`
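A sketch of the image encoder, assuming `torchvision`'s pretrained ResNet-50 with the final classification layer removed (the repository may load the backbone differently, e.g. via `timm`):

```python
import torch
from torchvision import models

class ImageEncoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the final fully connected layer; keep the 2048-dim pooled features
        self.backbone = torch.nn.Sequential(*list(resnet.children())[:-1])

    def forward(self, x):
        features = self.backbone(x)            # (batch_size, 2048, 1, 1)
        return features.flatten(start_dim=1)   # (batch_size, 2048)

encoder = ImageEncoder()
images = torch.randn(32, 3, 224, 224)  # a dummy batch of 224x224 RGB images
print(encoder(images).shape)  # torch.Size([32, 2048])
```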
The Projection Head plays a crucial role in shaping the representations learned by the model.
- Responsible for reducing the dimensionality of the high-dimensional embeddings produced by the image encoder and text encoder
- By projecting the embeddings into a lower-dimensional space, the model can focus on the features most relevant to the contrastive learning task
- Enhances the discriminative power of the learned representations, helping the model distinguish between positive and negative pairs more effectively during the contrastive learning process. One possible implementation is sketched below.
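One way to realize such a projection head is the sketch below; the 256-dim projection size, GELU activation, dropout rate, and residual connection are assumptions, not necessarily the exact configuration of this repository:

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    # projection_dim=256 and dropout=0.1 are assumed values
    def __init__(self, embedding_dim, projection_dim=256, dropout=0.1):
        super().__init__()
        self.projection = nn.Linear(embedding_dim, projection_dim)
        self.gelu = nn.GELU()
        self.fc = nn.Linear(projection_dim, projection_dim)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(projection_dim)

    def forward(self, x):
        projected = self.projection(x)
        x = self.gelu(projected)
        x = self.fc(x)
        x = self.dropout(x)
        x = x + projected  # residual connection around the non-linear block
        return self.layer_norm(x)

# Separate heads map 768-dim text and 2048-dim image embeddings into one shared space
text_projection = ProjectionHead(embedding_dim=768)
image_projection = ProjectionHead(embedding_dim=2048)
```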
Try CLIP Demo in HuggingFace Spaces: https://huggingface.co/spaces/bala1802/clip_demo
- Prompt: "people sitting near the beach"
- Prompt: "people walking inside the forest"
- Prompt: "playing soccer"