Add a `price_tags` table to store predictions #611

raphael0202 · 2024-12-08T19:28:49Z

In #526, we've talked about how to speed up price addition using ML/AI (#526 (comment)).

For price tags, the workflow now seems clear:

1. the user uploads their proof
2. an object detection model identifies individual price tags. It returns the coordinates of each detected bounding box, along with a confidence score.
3. we extract information from the price tags: price, EAN or product category, price type, organic or not
4. we let the user (and later other users) fix and validate the prediction

Step (2) is currently in development. Once the proof is uploaded, we would ideally run all ML models as an async job.
For each detected price tag, we need to store some information:

the bounding box of the price tag with respect to the original proof
the detected price
the detected category and/or EAN
organic/non-organic
the price per (per kg or per unit)
origins (for raw products)

I suggest we create a new price_tags table. Storing intermediate data in a new table is necessary to allow async processing: the user won't have to wait for the models to run, and we can distribute the workload (= price validation) between contributors. Besides, it gives us ground-truth data for evaluating and training the models that extract information from price tags.

Schema

id: ID of the price tag
proof_id: FK to the proof
created: creation datetime
updated: datetime of last update
bounding_box: the coordinates of the bounding box, in relative coordinates as (y_min, x_min, y_max, x_max). The origin is the top-left corner of the image.
status: The status of the price tag: either a price is already linked to this price tag (status=1), it is waiting for approval or completion (status=null), or it is invalid (status=0, the information cannot be read or is hidden). Only price_tags with null status will be suggested in a "Hunger Games-like" game.
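
As an illustration of the bounding_box convention above, here is a minimal sketch that converts a relative (y_min, x_min, y_max, x_max) box into pixel coordinates. The names `PixelBox` and `to_pixel_box` are hypothetical and not part of the proposal.

```python
from typing import NamedTuple, Tuple


class PixelBox(NamedTuple):
    top: int
    left: int
    bottom: int
    right: int


def to_pixel_box(
    bounding_box: Tuple[float, float, float, float],
    image_width: int,
    image_height: int,
) -> PixelBox:
    """Convert a relative (y_min, x_min, y_max, x_max) box to pixels.

    The origin is the top-left corner of the image, so y grows downwards.
    """
    y_min, x_min, y_max, x_max = bounding_box
    if not (0.0 <= y_min <= y_max <= 1.0 and 0.0 <= x_min <= x_max <= 1.0):
        raise ValueError(f"invalid relative bounding box: {bounding_box}")
    return PixelBox(
        top=round(y_min * image_height),
        left=round(x_min * image_width),
        bottom=round(y_max * image_height),
        right=round(x_max * image_width),
    )
```

Storing relative coordinates means the same box stays valid if the proof image is resized or served as a thumbnail.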

Fields pre-filled by the extraction model (currently Gemini):

predicted_product_code
predicted_category_tag
predicted_price
predicted_price_per
predicted_price_is_discounted
price_without_discount
predicted_currency
predicted_labels_tags
predicted_origins_tags

Fields that are derived from predictions, validated by the user. These fields are null before the price tag is validated by the user.

price
price_is_discounted
price_without_discount
price_per
product_code
category_tag
currency
labels_tags
origins_tags
models_info (JSONB): extra information about the version of the models that generated the prediction, about which model was used for what, etc.
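
To make the relation between the predicted_* fields and the validated fields concrete, here is a rough sketch of what validation could look like. It only covers a subset of the fields, and `FIELD_PAIRS` and `validate_price_tag` are hypothetical names, not an agreed API.

```python
from typing import Any, Dict, Optional

# Subset of the predicted_* -> validated field pairs listed above.
FIELD_PAIRS = {
    "predicted_price": "price",
    "predicted_currency": "currency",
    "predicted_product_code": "product_code",
    "predicted_category_tag": "category_tag",
}


def validate_price_tag(
    row: Dict[str, Any], corrections: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Return a copy of `row` with the validated fields filled in.

    Each validated field takes the user's correction when there is one,
    and falls back to the model prediction otherwise.
    """
    corrections = corrections or {}
    validated = dict(row)
    for predicted_field, field in FIELD_PAIRS.items():
        validated[field] = corrections.get(field, row.get(predicted_field))
    # Assumption: a price is created and linked at validation time (status=1).
    validated["status"] = 1
    return validated
```

For example, `validate_price_tag(row, {"price": 2.49})` keeps the predicted currency and product code but overrides the predicted price.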

We can add a nullable price_tag_id in the prices table to keep track of the individual price tag that is behind a price.

Workflow

When a new proof is uploaded, the price tag object detector model is run on the image of the proof. We create one element in the price_tags table for each detected price tag (above a fixed threshold).
For each detected price tag, we run the Gemini model on it and save the results in the predicted_* fields.
The status of the price_tag is null by default. Users can validate the extracted data by calling an endpoint to retrieve all price_tags with null status. Once validated, a new price is created and linked to the original price_tag using the price_tag_id foreign key.

raphodn · 2024-12-11T21:27:15Z

Thanks for the detailed issue!

A few remarks:

after having created a ProofPrediction table (in Create a proof_prediction table to store predictions from ML models #511), aren't there similarities we could reuse here? It might make things more complex, but I see it in steps:
- first create the PriceTag table that stores a bounding box and a status. That could already be fed to the user in the frontend to help crowdsource prices (the user needs to type the barcode & price)
- then create a PriceTagPrediction that stores the result of one or multiple models that we run on the image, and that we could re-run in the future if we improve the model, etc. We use it to improve the UI given to the user (no need to type the barcode & price anymore, it's just validation)
I see this (and the already-created ProofPrediction) in a dedicated ml sub-app. The backend should be able to run without these AI models. Even if it becomes "core" to speeding up price collection, we shouldn't make it mandatory.

raphael0202 · 2024-12-12T08:27:17Z

> then create a PriceTagPrediction that stores the result of 1 or multiple models that we run on the image. that we could re-run as well in the future if we improve the model, etc. And we use it to improve the UI given to the user (no need to type the barcode & price anymore, it's just validation)

I get your point, but to me there is a difference between the proof_predictions table and the price_tags one: we have several types of proofs (receipt, price tag), on which we run different models:

one to detect bounding box for price tags
one to classify the proof
(later) one to extract all values for receipts

The fact that we have different models specific to different types of proofs was the reason we created a generic proof_predictions table.
Here, for price tags, the extraction model (currently Gemini) will only deal with price tags. I find it more convenient to have all data in a single table, as otherwise we have to deal in the backend (and the front-end) with possibly multiple predictions of the same model type.
It's something we do in Robotoff, for good reasons (as we can extract the same information type from multiple images), but at the cost of greater complexity.

> I see this (and the already-created ProofPrediction) in a dedicated ml sub-app. The backend should be able to run without these AI. Even if it will become "core" to speeding up price collection, we shouldn't make it mandatory.

I agree that the AI should be optional from the backend side. And the user should be allowed to draw bounding boxes manually using the web app to create new price tags.

raphodn · 2024-12-12T08:46:48Z

> Here, for price tags, the extraction model (currently Gemini) will only deal with price tags. I find it more convenient to have all data in a single table, as otherwise we have to deal in the backend (and the front-end) with possibly multiple predictions of the same model type.

But I don't understand how you can store multiple predictions with your current price_tags model proposal? It's missing a data JSONField, so we could simply plug in the ProofPrediction model (or a dedicated PricePrediction)

raphael0202 · 2024-12-12T09:14:58Z

> But I don't understand how with your current price_tags model proposal you can store multiple predictions ? It's missing a data JSONField, thus we could simply plug in the ProofPrediction model (or a dedicated PricePrediction)

We don't store multiple predictions (e.g. we don't store 2 price predictions from 2 different models). On Robotoff, it turned out after a couple of years that we never needed predictions from 2 models at the same time: when a new model is trained and tested, I just delete all the predictions associated with this model and relaunch the model on all images.

edit: to make things clearer, the current schema of the price_tags table can store predictions coming from two different models that do different things. For example, I plan to add a predicted_blurriness field, which will be predicted by a different model than Gemini.
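
The delete-and-relaunch approach mentioned in this comment can be sketched as follows. This is only an illustration on in-memory records: `rerun_model`, the record layout and the `predict` callable are all hypothetical, not Robotoff or Open Prices code.

```python
from typing import Any, Callable, Dict, Iterable, List


def rerun_model(
    predictions: List[Dict[str, Any]],
    price_tag_ids: Iterable[int],
    model_name: str,
    model_version: str,
    predict: Callable[[int], Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Drop every prediction made by `model_name`, then predict again.

    Predictions from other models are kept untouched, so there is never
    more than one prediction per (price tag, model) pair.
    """
    kept = [p for p in predictions if p["model_name"] != model_name]
    fresh = [
        {
            "price_tag_id": price_tag_id,
            "model_name": model_name,
            "model_version": model_version,
            "data": predict(price_tag_id),
        }
        for price_tag_id in price_tag_ids
    ]
    return kept + fresh
```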

raphodn · 2024-12-12T09:34:23Z

> We don't store multiple predictions (ex: we don't store 2 price prediction by 2 different models)

Ok, but I would be in favor of having the flexibility to store any number of predictions, for instance one coming from Gemini and another coming from our own model, and to have both show up in the UI to help the user, or to help us test/compare while we transition out of GenAI, no? That's why I like the JSONField, where we can have any number of predictions :)

raphael0202 · 2024-12-12T11:15:46Z

If you want to keep the flexibility to have any number of predictions of the same type, it's better to have a PriceTagProofPrediction as you suggested!
I'm down for creating this new table then, if we plan to implement model comparison in the front-end :)

raphael0202 · 2024-12-16T10:45:56Z

Updated schema, after the discussions above:

Schemas

`price_tags` table

id: ID of the price tag
proof_id: FK to the proof
price_id: FK to the price created from this price tag, can be null.
created: creation datetime
updated: datetime of last update
bounding_box: the coordinates of the bounding box, in relative coordinates as (y_min, x_min, y_max, x_max). The origin is the top-left corner of the image. Cannot be null.
status: The status of the price tag: either a price is already linked to this price tag (status=1), it is waiting for approval or completion (status=null), the price or the barcode cannot be read (status=2), or the object was deleted by a user (status=0). Only price_tags with null status will be suggested in a "Hunger Games-like" game.
model_version: the version of the object detector model that created this price tag. If it was created by a human, this field is null.
created_by: the name of the user who created the price tag. If the price tag was created automatically after object detection, this field is null.
updated_by: the name of the user who updated the price tag coordinates. If the price tag was created automatically and never updated, this field is null.
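
The status values above could be modelled like this. It is a sketch only: the enum member names, the `PriceTag` dataclass and `open_for_validation` are assumptions made for illustration, and the datetime fields are left out.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional, Tuple


class PriceTagStatus(IntEnum):
    DELETED = 0          # the object was deleted by a user
    LINKED_TO_PRICE = 1  # a price is already linked to this price tag
    NOT_READABLE = 2     # the price or the barcode cannot be read


@dataclass
class PriceTag:
    id: int
    proof_id: int
    bounding_box: Tuple[float, float, float, float]
    status: Optional[PriceTagStatus] = None  # None: waiting for validation
    price_id: Optional[int] = None
    model_version: Optional[str] = None  # None when created by a human
    created_by: Optional[str] = None     # None when created by the detector
    updated_by: Optional[str] = None


def open_for_validation(price_tags: List[PriceTag]) -> List[PriceTag]:
    # Compare with `is None`: status 0 (deleted) is falsy but not pending.
    return [tag for tag in price_tags if tag.status is None]
```

Since 0 is a meaningful status, any filter on pending price tags has to test for null explicitly rather than for a falsy value.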

`price_tag_predictions` table

id: ID of the price tag prediction
price_tag_id: the ID of the price tag (FK)
type: type of the prediction. Currently, only one value is supported: price_tag_extraction
model_name: name of the model. Currently, there is only one model: gemini
model_version: version of the model. Currently, there is only one version: gemini-1.5-flash
data: JSONB containing prediction data returned by the model. The schema of the dictionary is specific to the model.
created: creation datetime
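
Here is a self-contained sketch of the two tables, to show how several predictions can hang off one price tag. It uses the standard-library sqlite3 module only to stay runnable: the JSONB column is stored as JSON text, the proofs and prices tables are omitted (so proof_id and price_id carry no FK constraint here), and `create_db` and `add_prediction` are hypothetical helpers.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE price_tags (
    id INTEGER PRIMARY KEY,
    proof_id INTEGER NOT NULL,
    price_id INTEGER,
    bounding_box TEXT NOT NULL,
    status INTEGER,
    model_version TEXT,
    created_by TEXT,
    updated_by TEXT,
    created TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE price_tag_predictions (
    id INTEGER PRIMARY KEY,
    price_tag_id INTEGER NOT NULL REFERENCES price_tags(id) ON DELETE CASCADE,
    type TEXT NOT NULL,
    model_name TEXT NOT NULL,
    model_version TEXT NOT NULL,
    data TEXT NOT NULL,
    created TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def create_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript(SCHEMA)
    return conn


def add_prediction(conn, price_tag_id, model_name, model_version, data) -> int:
    """Store one model output for a price tag and return the new row ID."""
    cursor = conn.execute(
        "INSERT INTO price_tag_predictions"
        " (price_tag_id, type, model_name, model_version, data)"
        " VALUES (?, 'price_tag_extraction', ?, ?, ?)",
        (price_tag_id, model_name, model_version, json.dumps(data)),
    )
    return cursor.lastrowid
```

With this layout, a second model's output for the same price tag is simply another row in price_tag_predictions.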

Workflow

When a new proof is uploaded, the price tag object detector model is run on the image of the proof. We create one element in the price_tags table for each detected price tag (above a fixed threshold).
For each detected price tag, we run the Gemini model on it and create a new PriceTagPrediction object in DB linked to the PriceTag.
The status of the price_tag is null by default. Users can validate the extracted data by calling an endpoint to retrieve all price_tags with null status. Once validated, a new price is created and linked to the original price_tag using the price_tag.price_id foreign key.
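
The workflow above can be sketched end to end with plain dictionaries. Everything here is illustrative: the threshold value, the function names and the `extract` callable (standing in for the Gemini call) are assumptions, and no real model is invoked.

```python
from typing import Any, Callable, Dict, List, Tuple

BoundingBox = Tuple[float, float, float, float]
Detection = Tuple[BoundingBox, float]  # (bounding box, confidence score)

# Assumed value: the issue only says "a fixed threshold".
DETECTION_THRESHOLD = 0.5


def create_price_tags(
    proof_id: int,
    detections: List[Detection],
    extract: Callable[[BoundingBox], Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Create one price tag per confident detection, with its prediction."""
    price_tags = []
    for bounding_box, score in detections:
        if score < DETECTION_THRESHOLD:
            continue  # detections below the threshold are ignored
        price_tags.append(
            {
                "proof_id": proof_id,
                "bounding_box": bounding_box,
                "status": None,  # null by default: waiting for validation
                "price_id": None,
                "predictions": [
                    {"type": "price_tag_extraction", "data": extract(bounding_box)}
                ],
            }
        )
    return price_tags


def pending(price_tags: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """What the validation endpoint would return: tags with null status."""
    return [tag for tag in price_tags if tag["status"] is None]


def validate(price_tag: Dict[str, Any], price_id: int) -> None:
    """Link the newly created price and mark the tag as done (status=1)."""
    price_tag["price_id"] = price_id
    price_tag["status"] = 1
```

In the real backend the detection and extraction steps would run as an async job after the upload, so the user does not wait for them.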

Linked issue: #611

github-project-automation bot added this to 💸 Open Prices Dec 8, 2024

github-project-automation bot moved this to Backlog in 💸 Open Prices Dec 8, 2024

raphael0202 added the price tag label Dec 8, 2024

raphael0202 mentioned this issue Dec 11, 2024

New page that lists all Price tag proofs without any prices yet (= open for contribution) openfoodfacts/open-prices-frontend#1115

Closed

raphodn added the machine learning label Dec 11, 2024

github-actions bot added the ⭐ top issue Top issue. label Dec 12, 2024

github-actions bot mentioned this issue Dec 12, 2024

👍 Top Issues Dashboard #229

Open

raphael0202 added a commit that referenced this issue Dec 16, 2024

feat: add price_tag table and routes

5574431

Linked issue: #611

raphael0202 mentioned this issue Dec 16, 2024

feat: add price_tag table and routes #628

Merged

raphael0202 added a commit that referenced this issue Dec 16, 2024

feat: add price_tag table and routes

2b113ea

Linked issue: #611

raphael0202 added a commit that referenced this issue Dec 16, 2024

feat: add price_tag table and routes

a176e4e

Linked issue: #611

raphodn moved this from Backlog to In progress in 💸 Open Prices Dec 17, 2024

raphodn linked a pull request Dec 17, 2024 that will close this issue

feat: add price_tag table and routes #628

Merged

raphodn changed the title ~~Add a price_tags table~~ Add a price_tags table to store predictions Dec 17, 2024

This was linked to pull requests Dec 17, 2024

feat: create price tags from the object detector model #629

Merged

feat: save Gemini prediction in price_tag_predictions table #630

Merged

raphodn mentioned this issue Dec 18, 2024

New price validation assistant page (extracted and predicted from owned proofs) openfoodfacts/open-prices-frontend#1137

Open

This was linked to pull requests Dec 18, 2024

fix: don't create price tags for proofs.type != PRICE_TAG #632

Merged

feat: improve gemini processing #631

Merged

TTalex mentioned this issue Dec 19, 2024

Price deletion should also trigger associated price_tag status update #636

Open

raphodn closed this as completed Dec 19, 2024

github-project-automation bot moved this from In progress to Done in 💸 Open Prices Dec 19, 2024

raphodn linked a pull request Dec 26, 2024 that will close this issue

feat(Price tags): new Proof.ready_for_price_tag_validation field to help filter the frontend UI #656

Merged

This was linked to pull requests Dec 26, 2024

refactor(Price tags): fix stats. move constants. cleanup #653

Merged

refactor(Stats): new PriceTag stats #652

Merged

This was referenced Jan 2, 2025

Proof image / Price tags: barcode scanner library to extract the EAN #670

Open

Detect (and ignore) duplicate prices #422

Open
