Implementation of OpenAI's CLIP research paper from scratch using PyTorch, enabling cross-modal learning for vision and language tasks.

Introduction

The objective of this repository is to implement OpenAI's CLIP paper, Learning Transferable Visual Models From Natural Language Supervision, from scratch using PyTorch.

Read Paper: https://arxiv.org/pdf/2103.00020.pdf

About CLIP

  • A model designed for learning joint representations of images and text.
  • Leverages a shared embedding space, where images and their corresponding textual descriptions are mapped to similar points.
  • Uses a contrastive learning objective to train the model. It aims to maximize the similarity between positive pairs (correct image-text pairs) and minimize the similarity between negative pairs (incorrect pairs); a sketch of this loss follows the list.
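
For concreteness, here is a minimal sketch of the symmetric contrastive objective described above, assuming both embeddings have already been projected into the shared space; the temperature value and the exact loss formulation used in this repository may differ.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize so the dot product becomes a cosine similarity
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # (batch_size, batch_size) similarity matrix; the diagonal holds the positive pairs
    logits = image_embeds @ text_embeds.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: match each image to its text, and each text to its image
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```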

Code Implementation

Text Encoder

  • The distilbert-base-uncased model is used to embed the texts (a sketch follows below)
  • The resulting text embeddings have shape (batch_size, text_embedding_dim) -> (32, 768)
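
A minimal, self-contained sketch of how distilbert-base-uncased can produce 768-dimensional text embeddings with the Hugging Face transformers library; the repository may wrap the encoder differently (for example, in its own nn.Module), and the caption batch here is only illustrative.

```python
import torch
from transformers import DistilBertModel, DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
text_encoder = DistilBertModel.from_pretrained("distilbert-base-uncased")

captions = ["people sitting near the beach"] * 32  # illustrative batch of 32 texts
batch = tokenizer(captions, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    output = text_encoder(input_ids=batch["input_ids"],
                          attention_mask=batch["attention_mask"])

# Use the hidden state of the [CLS] token as the sentence embedding
text_features = output.last_hidden_state[:, 0, :]
print(text_features.shape)  # torch.Size([32, 768])
```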

Image Encoder

  • A pretrained resnet50 model is used to encode the images (a sketch follows below)
  • The resulting image embeddings have shape (batch_size, image_embedding_dim) -> (32, 2048)
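
A hedged sketch of the image encoder, assuming torchvision's pretrained resnet50 with the classification head removed (the repository may load the backbone through another library instead); the dummy image batch is only for shape checking.

```python
import torch
import torch.nn as nn
from torchvision import models

# Pretrained ResNet-50 with its final classification layer dropped
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
image_encoder = nn.Sequential(*list(resnet.children())[:-1])

images = torch.randn(32, 3, 224, 224)  # dummy batch of 32 RGB images

with torch.no_grad():
    features = image_encoder(images)      # (32, 2048, 1, 1) after global average pooling
    image_features = features.flatten(1)  # (32, 2048)

print(image_features.shape)  # torch.Size([32, 2048])
```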

Projection Head

The Projection Head plays a crucial role in shaping the representations learned by the model.

  • Responsible for reducing the dimensionality of the high-dimensional embeddings produced by the image encoder and the text encoder
  • By projecting the embeddings into a lower-dimensional shared space, the model can focus on the features most relevant to the contrastive learning task
  • Enhances the discriminative power of the learned representations, helping the model distinguish between positive and negative pairs more effectively during the contrastive learning process; a sketch of a typical projection head follows this list
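
A minimal sketch of a projection head in this style: a linear projection followed by a small residual MLP, dropout, and layer normalization. The layer sizes and exact architecture used in this repository may differ; projection_dim=256 is an assumption.

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Project encoder outputs (e.g. 768-d text, 2048-d image) into a shared space."""

    def __init__(self, embedding_dim, projection_dim=256, dropout=0.1):
        super().__init__()
        self.projection = nn.Linear(embedding_dim, projection_dim)
        self.gelu = nn.GELU()
        self.fc = nn.Linear(projection_dim, projection_dim)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(projection_dim)

    def forward(self, x):
        projected = self.projection(x)
        x = self.gelu(projected)
        x = self.fc(x)
        x = self.dropout(x)
        x = x + projected          # residual connection around the small MLP
        return self.layer_norm(x)
```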

Results

Try the CLIP demo in Hugging Face Spaces: https://huggingface.co/spaces/bala1802/clip_demo

  • Prompt: "people sitting near the beach"
    [retrieved image results]
  • Prompt: "people walking inside the forest"
    [retrieved image results]
  • Prompt: "playing soccer"
    [retrieved image results]
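
As a hedged sketch of how a text prompt retrieves matching images at inference time, the helper below ranks pre-computed, projected image embeddings by cosine similarity to the prompt embedding; retrieve_top_k is an illustrative name, not the repository's API.

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(prompt_embed, image_embeds, k=5):
    """Rank candidate images by cosine similarity to a single text prompt."""
    prompt_embed = F.normalize(prompt_embed, dim=-1)   # (1, projection_dim)
    image_embeds = F.normalize(image_embeds, dim=-1)   # (num_images, projection_dim)
    similarities = image_embeds @ prompt_embed.t()     # (num_images, 1)
    values, indices = similarities.squeeze(1).topk(k)
    return indices, values  # indices of the k best-matching images and their scores
```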
