# Automate Fashion Image Captioning using BLIP-2


The fashion industry is worth trillions of dollars. The goal of any company or seller is to help customers find the right product in a huge catalog of items. When customers find the right product, they are likely to add it to their cart, which drives company revenue.
Accurate and engaging descriptions of clothes on shopping websites help customers without fashion knowledge better understand the features (attributes, style, functionality, etc.) of the items and increase online sales by attracting more customers. In addition, customers visiting shopping websites often look for a certain style or type of clothing: they search by describing the item, and the system finds relevant items by computing a similarity score between the query and each item's caption. In such use cases, an accurate description of the clothes is essential.
Manually writing these descriptions is a non-trivial and highly expensive task. Automatic generation of descriptions is therefore an urgent need, and it helps sellers by recommending captions when they upload a product.


## Problem Statement


Given an image of a clothing item, generate a short caption that describes it. Compared with general image captioning datasets (e.g., COCO, Flickr), descriptions of fashion items have unique features that make automatic caption generation challenging. Fashion captioning needs to describe the attributes of the item, whereas general image captioning narrates the objects and their relations in the image.
For example, given an image where a model is wearing a shirt, a general captioning model describes it as "male wearing a white shirt". This is not what we want, since the caption should describe the item itself. In this application, it is much more important to have a model that captions images well than an interpretable model.


## Dataset

FAshion CAptioning Dataset (FACAD) is a fashion captioning dataset consisting of over 993K images.

Properties of FACAD dataset:

  1. Diverse fashion images covering all four seasons, ages (kids and adults), categories (clothing, shoes, bags, accessories, etc.), and angles of the human body (front, back, side, etc.).
  2. It tackles the captioning problem specifically for fashion items.
    • FACAD contains fine-grained descriptions of the attributes of fashion-related items, while MS COCO narrates the objects and their relations in general images. FACAD has longer captions (21 words per sentence on average) compared with 10.4 words per sentence in the MS COCO caption dataset.
    • The expression style of FACAD is enchanting, while that of MS COCO is plain, without rich expressions; e.g. words like "pearly", "so-simple yet so-chic", and "retro flair" are more attractive than plain MS COCO descriptions like "person in a dress".

Source of the dataset
Citation:

  @inproceedings{XuewenECCV20Fashion,
    author    = {Xuewen Yang and Heming Zhang and Di Jin and Yingru Liu and Chi-Hao Wu and Jianchao Tan and Dongliang Xie and Jue Wang and Xin Wang},
    title     = {Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards},
    booktitle = {ECCV},
    year      = {2020}
  }


For this project, only 20k images were used with the pre-trained model, due to resource limitations.
Training on the entire dataset would let the model see more patterns and designs of fashion items, helping it build a larger vocabulary for describing new items.
Each item's caption was built from the item description, color, and brand, because when users search for an item they usually mention a specific color or brand along with the style they want to buy; a preprocessing sketch is shown below.
When cleaning the captions, stemming was not applied, because we want captions with properly formed words. If this were a classification problem, stemming could be applied, since the predicted output would just be 1 or 0, whereas here we want proper words and sentences.
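
The snippet below is a minimal sketch of this preprocessing step. The column names (`description`, `color`, `brand`) and the metadata file name are illustrative assumptions, not the exact FACAD field names.

```python
# Hypothetical preprocessing sketch: build one caption string per item from its
# description, color, and brand, and clean it lightly without stemming.
import re
import pandas as pd

def build_caption(row: pd.Series) -> str:
    # Combine the fields users typically search with.
    parts = [str(row["description"]), str(row["color"]), str(row["brand"])]
    caption = " ".join(p for p in parts if p and p.lower() != "nan")
    # Light cleaning only: lowercase, drop punctuation, collapse whitespace.
    caption = re.sub(r"[^a-z0-9\s]", " ", caption.lower())
    return re.sub(r"\s+", " ", caption).strip()

df = pd.read_csv("facad_metadata.csv")  # assumed metadata file
df["caption"] = df.apply(build_caption, axis=1)
```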


## Solution: Using Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (BLIP-2)


BLIP-2 is a recent, powerful model by Salesforce that is capable of performing visual question answering as well as image captioning.

### Overview of BLIP-2


BLIP-2 introduces a lightweight module called the Querying Transformer (Q-Former) that effectively bridges the vision and language models. Q-Former is a lightweight transformer that uses learnable query vectors to extract visual features from the frozen image encoder.
It acts as an information bottleneck between the frozen image encoder and the frozen Large Language Model (LLM), feeding the most useful visual features to the LLM so it can output the desired text.
BLIP-2 has two main versions, based on the pre-trained LLM used:
  1. Open Pre-trained Transformer (OPT) by Meta. Pre-trained model weights on HuggingFace: Salesforce/blip2-opt-2.7b or Salesforce/blip2-opt-6.7b
  2. Flan-T5 by Google. Pre-trained model weights on HuggingFace: Salesforce/blip2-flan-t5-xl or Salesforce/blip2-flan-t5-xxl

In both versions, the vision encoder used for image feature extraction is the Vision Transformer (large-sized model) by Google.
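
As a quick illustration, the sketch below loads one of these pre-trained checkpoints with HuggingFace transformers and generates a zero-shot caption for a local image; the image path is a placeholder.

```python
# Zero-shot captioning with a pre-trained BLIP-2 checkpoint (OPT-2.7b variant).
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

image = Image.open("fashion_item.jpg").convert("RGB")  # placeholder image path
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```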

### Architecture of BLIP-2 model


### Solution

Fine-tune the pre-trained BLIP-2 model (trained on the Flickr dataset) on the fashion dataset using Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning (PEFT) technique.

The original Salesforce/blip2-opt-2.7b model was too large; it was quite challenging to fit and fine-tune it on a 16GB GPU.

So, for this project the pre-trained model ybelkada/blip2-opt-2.7b-fp16-sharded was downloaded from HuggingFace. This model uses the OPT-2.7b LLM with weights in reduced float16 precision.
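
The sketch below shows one way this setup can be wired together with `peft` and `bitsandbytes`: the fp16-sharded checkpoint is loaded in 8-bit and wrapped with a LoRA adapter. The LoRA hyperparameters and target modules are illustrative, not the exact values used in this project.

```python
# Load the sharded fp16 BLIP-2 checkpoint in 8-bit and attach a LoRA adapter.
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from peft import LoraConfig, get_peft_model

model_id = "ybelkada/blip2-opt-2.7b-fp16-sharded"
processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(
    model_id, load_in_8bit=True, device_map="auto"
)

# Illustrative LoRA configuration targeting the attention projections of the OPT LLM.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA matrices are trainable
```

With this setup only the small LoRA weight matrices are updated during fine-tuning, which is what makes training feasible on a single 16GB GPU.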

## Requirements

Refer to requirements.txt.

## Tech Stack Used

  1. Python 3.8
  2. HuggingFace transformers 🤗
  3. peft
  4. bitsandbytes
  5. Streamlit
  6. HuggingFace Spaces

## Metric

The metrics most commonly used for the image captioning task, which measure the quality of the predicted text against reference texts, are:

  1. Bilingual Evaluation Understudy (BLEU) score: a metric built on precision.

     BLEU = Number of correctly predicted words / Number of total predicted words

  2. Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score: a set of metrics rather than a single one. ROUGE returns recall, precision, and F1-score. In this project the F1-ROUGE score is used. It is built on recall.

     Recall-N-gram = Number of correctly predicted N-grams / Number of total target N-grams

     Precision-N-gram = Number of correctly predicted N-grams / Number of total predicted N-grams

     F1-Score = 2 * (Recall-N-gram * Precision-N-gram) / (Recall-N-gram + Precision-N-gram)
    

Both scores are built on the concept of N-grams, where N words are grouped together and always kept in order. For this project N = 2 is used. Because the caption of a fashion item is essentially a list of attributes, it does not matter in which order the model predicts those attribute words. A sketch of how these metrics can be computed is shown below.
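
The example below uses the HuggingFace `evaluate` library; the captions are made up, and `max_order=2` / the `rouge2` key correspond to the N = 2 setting described above.

```python
# Score predicted captions against reference captions with BLEU and ROUGE.
import evaluate

predictions = ["white cotton shirt with retro flair"]  # made-up example caption
references = [["retro flair white cotton shirt"]]      # made-up reference caption

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

bleu_2 = bleu.compute(predictions=predictions, references=references, max_order=2)["bleu"]
rouge_scores = rouge.compute(predictions=predictions, references=references)

print(f"BLEU@2: {bleu_2:.3f}")
print(f"F1-ROUGE@2: {rouge_scores['rouge2']:.3f}")
```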


## Results


| Dataset | F1-Rouge@1 | F1-Rouge@2 | F1-RougeL | BLEU@1 | BLEU@2 |
| --- | --- | --- | --- | --- | --- |
| Train | 0.45 | 0.16 | 0.44 | 0.42 | 0.26 |
| Valid | 0.42 | 0.13 | 0.41 | 0.39 | 0.22 |
| Test | 0.45 | 0.13 | 0.45 | 0.41 | 0.23 |


## Try it Out

The model is deployed on HuggingFace Spaces. You can check it out here.
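
For reference, a minimal Streamlit app in the spirit of the deployed Space might look like the sketch below; the adapter repo name is a placeholder, not the actual fine-tuned weights.

```python
# Hypothetical Streamlit demo: upload a clothing image and show the generated caption.
import streamlit as st
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from peft import PeftModel

@st.cache_resource
def load_model():
    base_id = "ybelkada/blip2-opt-2.7b-fp16-sharded"
    processor = Blip2Processor.from_pretrained(base_id)
    model = Blip2ForConditionalGeneration.from_pretrained(
        base_id, load_in_8bit=True, device_map="auto"
    )
    # Placeholder adapter repo; replace with the actual fine-tuned LoRA weights.
    model = PeftModel.from_pretrained(model, "your-username/blip2-fashion-lora")
    return processor, model

st.title("Fashion Image Captioning with BLIP-2")
uploaded = st.file_uploader("Upload a clothing image", type=["jpg", "jpeg", "png"])
if uploaded is not None:
    processor, model = load_model()
    image = Image.open(uploaded).convert("RGB")
    st.image(image)
    inputs = processor(images=image, return_tensors="pt").to(model.device)
    generated_ids = model.generate(**inputs, max_new_tokens=30)
    st.write(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```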

## Demo