
feat: Added origin extraction #890

Closed
wants to merge 21 commits into from

Conversation

ValentinRegnault

@ValentinRegnault ValentinRegnault commented Sep 8, 2022

feat: Add ocr origin extraction

What

  • This pull request adds origin extraction from OCR, for the Eco-Score. It also adds an API endpoint to extract origins from any OCR URL. It uses one regex per language (currently French is supported, and English partially), and this regex is built to match almost any sentence that says something about the origin of an ingredient. See Extract origins of ingredients #306.

  • Explanation of the French regex (the English one follows the same reasoning):
    In French, a sentence always starts with the subject, so we begin with "({INGREDIENTS_FR_JOINED})", which matches any ingredient.
    But there can be several ingredients, separated by 'and' or a comma, as in "Caramel and chocolate are made in France". That's why we add "?(?:et |, ?)?", which means "there may be a space, then an 'and' or a comma, but maybe nothing at all".
    In a sentence like "Made in France" there is no subject, so all of this is optional. It can also be repeated, as in "Caramel, Chocolate and Wheat are from France". That's why we use a "*" quantifier.
    Then there is support for sentences like "100% made in France" or "quinoa is 100% French", with the ?(?:100%)?
    Then comes the verb, which is also optional, as in "french quinoa". In a sentence like "made with french quinoa", the verb "made" isn't captured by the regex, but "french quinoa" is.
    After that we have connectors like "dans, depuis" (= in, from), and also their negations, "hors, en dehors" (= outside). These are also optional, but the regex can match even if they appear in the middle of the sentence.
    They are followed by other connectors like "de la|de l' |de|du" (= "of", adapted to the different genders).
    Then we have the country as a noun, "?(?P<country>{COUNTRIES_FR_JOINED})?", as in "made in France",
    then the country as an adjective, as in "french quinoa" (in French, the adjective goes after the noun): "?(?P<country_adj>{COUNTRIES_ADJECTIVES_FR_JOINED})?",
    and then "?(?P<large_origin>(?:divers(?:es)?|diff[ée]rent(?:es)?|d'autres) ?(?:pays|[ée]tats?|r[ée]gions?|continents?|origines?))?", which supports sentences like "quinoa from different countries".
    Finally, you may have noticed that everything is optional, which means the regex would match empty strings. To avoid that, we add "\b(?<!\w)" at the beginning and "\b(?!\w)" at the end. The idea comes from https://stackoverflow.com/a/50188154/8966453. A simplified sketch of how these pieces fit together is shown after this list.

  • Here are some sentences that match:
    "quinoa de France"
    "quinoa produit en France"
    "quinoa cultivée en dehors de l' Union Européenne"
    "Fabriqué en France"
    "Tout les ingrédients sont fabriqués en France"
    "A partir de quinoa français"
    "quinoa 100% français"
    "Le quinoa a différentes origines"
    "Le quinoa provient de different pays"

  • NOT SUPPORTED:
    "Le quinoa de notre recette provient de different pays" ("de notre recette" is a genetive, hard to support)
    "Nos agriculteurs français produisent le blé de notre recette" (subject isn't the ingredient, hard to support)

  • I added the file data/ocr/countries.json, which is a slightly modified version of the file at https://static.openfoodfacts.net/data/taxonomies/countries.full.json. I added a "nationalities" field to every country, containing the name given to the country's inhabitants; the key is the language and the value is the nationality name. Only a few countries are currently covered, in English and French. This is required to support sentences like "french quinoa". My code also uses the categories.full.json file.
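
As a rough illustration of the regex assembly described above, here is a minimal sketch with tiny, made-up ingredient and country lists (these are not the PR's actual constants, which are built from the taxonomy files):

import re

# Hypothetical, heavily reduced lists; the real code builds these from the taxonomies.
INGREDIENTS_FR_JOINED = "quinoa|caramel|chocolat|blé"
COUNTRIES_FR_JOINED = "France|Pérou|Union Européenne"
COUNTRIES_ADJECTIVES_FR_JOINED = "français(?:e|es)?|péruvien(?:ne|s|nes)?"

ORIGIN_RE_FR = re.compile(
    rf"\b(?<!\w)"
    rf"(?:(?P<ingredients>{INGREDIENTS_FR_JOINED}) ?(?:et |, ?)?)*"  # optional, repeatable subject(s)
    rf" ?(?:100%)?"
    rf" ?(?:provient|produit|cultivée?|fabriqués?|sont)?"  # optional verb (reduced list)
    rf" ?(?:en|de|dans|depuis|hors|en dehors)?"
    rf" ?(?:de la |de l' |de |du )?"
    rf" ?(?P<country>{COUNTRIES_FR_JOINED})?"
    rf" ?(?P<country_adj>{COUNTRIES_ADJECTIVES_FR_JOINED})?"
    rf" ?(?P<large_origin>(?:divers(?:es)?|diff[ée]rent(?:es)?|d'autres) ?"
    rf"(?:pays|[ée]tats?|r[ée]gions?|continents?|origines?))?"
    rf"\b(?!\w)",
    re.IGNORECASE,
)

match = ORIGIN_RE_FR.search("quinoa cultivée en dehors de l' Union Européenne")
print(match.group("ingredients"), match.group("country"))  # quinoa Union Européenne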

Note :

  • This is my first contribution to an open source project; I tried to do things well. All tests pass. I started from the code in nutrient.py as suggested in Extract origins of ingredients #306. I added some constants to settings.py, an API endpoint, and a file in data.

Part of

@teolemon teolemon changed the title from "Added origin extraction" to "feat: Added origin extraction" on Sep 10, 2022
Member

@alexgarel alexgarel left a comment


@Pykorm first of all, a big big thanks for this courageous PR that will really help us improve on the Eco-Score front!

I added quite a few structural comments, because it's always easier to spot improvements when reading code, but I hope it won't discourage you.

I think we can concentrate on producing the Predictions in this PR, and deploy it asap to get predictions.

That said, I wanted to make you aware that, as this is a new type of prediction, we will then have to:

  • add an importer to generate insights
  • add an annotator to apply validated insights on open food facts.

But this is better done in a separate PR.

robotoff/app/api.py (thread resolved)
robotoff/prediction/ocr/origin.py (outdated, thread resolved)
robotoff/prediction/ocr/origin.py (outdated, thread resolved)
robotoff/prediction/ocr/origin.py (outdated, thread resolved)
# (ex: "quinoa comes from outside the E.U", "quinoa has several origins")
UNKNOW_ORIGIN = "unknow origin"
ALL_INGREDIENTS = "all ingredients"

Member

I think we should really enclose all this first part in a class (and find_origin should be a method of this class).

Right now, doing this at module level will really slow down startup, which is something we will pay for a lot when running tests, etc.
We want it to be lazy.
Using a class, we can use some caching mechanism to do it only when needed (or maybe schedule it at startup).

So it would be something like:

class OriginParser:

    def __init__(self):
        self.initialized = False

    def initialize(self):
        """load data needed to parse origins"""
        if self.initialized:
            return  # already done
        self.INGREDIENTS = json.load(…)
        ...
        self.initialized = True

    def find_origin(self, content, …):
        self.initialize()
        ...

ocr_parser = OriginParser()

The last line creates a parser but does not initialize it, so where you now call find_origin you would instead call ocr_parser.find_origin. The initialization will happen only once, on the first OCR analysis request.

Note that with this pattern there is a small risk of a race condition between threads; we may consider handling that too, but I'm not sure it's worth it as a first approximation.
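
If we did want to guard against that race, a minimal sketch could look like the following (double-checked locking; the _init_lock attribute is my addition, not something in the PR):

import threading


class OriginParser:
    def __init__(self):
        self.initialized = False
        self._init_lock = threading.Lock()  # hypothetical guard, not in the PR

    def initialize(self):
        """Load data needed to parse origins, at most once, even with concurrent callers."""
        if self.initialized:  # fast path once loaded, no locking
            return
        with self._init_lock:
            if self.initialized:  # re-check: another thread may have finished first
                return
            # self.INGREDIENTS = json.load(...)  # actual loading elided
            self.initialized = True

    def find_origin(self, content):
        self.initialize()
        ...  # actual parsing elided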

Author

Yep, good idea, I'll be working on it

Comment on lines 167 to 178
origin_index = -1
for index, ing_ori in enumerate(ingredients_origins):
    if ing_ori["origin"] == origin:
        origin_index = index

if origin_index == -1:
    ingredients_origins.append({
        "origin": origin,
        "same_for_all_ingredients": True,  # True unless group "ingredients" matched
        "concerned_ingredients": None
    })
    origin_index = len(ingredients_origins) - 1
Member

Why not use a dict (that you transform into a list of predictions at the end) so that you can index by origin, instead of having to scan the list? It would make the code more readable.

You could even use a collections.defaultdict:

ingredients_origins: Dict[str, Dict[str, Any]] = collections.defaultdict(
    lambda: {"same_for_all_ingredients": True, "concerned_ingredients": None}
)

and here:

origin_data = ingredients_origins[origin]

At the end, you would iterate over items().
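
Putting the pieces together, a rough sketch (the origin/ingredient tag values and the final dict shape are illustrative, not the PR's exact Prediction format):

import collections
from typing import Any, Dict

ingredients_origins: Dict[str, Dict[str, Any]] = collections.defaultdict(
    lambda: {"same_for_all_ingredients": True, "concerned_ingredients": None}
)

# Wherever a match is processed, index directly by origin:
origin_data = ingredients_origins["en:france"]  # created on first access
origin_data["same_for_all_ingredients"] = False
origin_data["concerned_ingredients"] = ["en:quinoa"]

# At the end, turn the dict into a list, e.g. to build the predictions:
results = [{"origin": origin, **data} for origin, data in ingredients_origins.items()]
print(results)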

Author

Yes, that's clearly what should be done; I don't know why I was scanning the list each time.

Member

It's always easier to see this kind of thing as a reviewer!

robotoff/prediction/ocr/origin.py (outdated, thread resolved)
robotoff/prediction/ocr/origin.py (outdated, thread resolved)
Comment on lines 255 to 260
    return next(ingredient_id for ingredient_id, ingredient in INGREDIENTS.items()
                if lang in ingredient["name"]
                and ingredient["name"][lang] == s
                )
except StopIteration:
    return s
Member

As we know we will do this kind of lookup, I would build a dict at initialization time so that we can just check the key directly; it will really improve performance.
So that you can do

return INGREDIENTS_ID[lang].get(s, s)

building the INGREDIENTS_ID is something like:

INGREDIENTS_ID = collections.defaultdict(dict)
for ingredient_id, ingredient in INGREDIENTS.items():
    for lang, name in ingredient["name"].items():
        INGREDIENTS_ID[lang][name] = ingredient_id
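
For example, with a toy INGREDIENTS dict (made-up entries, just to show the .get(s, s) fallback behaviour):

import collections

INGREDIENTS = {
    "en:quinoa": {"name": {"en": "quinoa", "fr": "quinoa"}},
    "en:wheat": {"name": {"en": "wheat", "fr": "blé"}},
}

INGREDIENTS_ID: dict = collections.defaultdict(dict)
for ingredient_id, ingredient in INGREDIENTS.items():
    for lang, name in ingredient["name"].items():
        INGREDIENTS_ID[lang][name] = ingredient_id

print(INGREDIENTS_ID["fr"].get("blé", "blé"))    # en:wheat
print(INGREDIENTS_ID["fr"].get("maïs", "maïs"))  # maïs (unknown name falls back to the input)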

Author

Good idea, I'll do this while making all this a class

tests/unit/prediction/ocr/test_origin.py (outdated, thread resolved)
@alexgarel
Member

I forgot to add two things @Pykorm:

  1. you should use make lint and make checks to keep your code in good shape :-)

  2. it's ok to have the countries JSON in the code for now, but it would be cleaner if:

    1. we add the needed information to the openfoodfacts countries taxonomy as properties
    2. we use the taxonomy JSON through the Taxonomy class in robotoff

    this ensures we benefit from future updates of the taxonomy

@ValentinRegnault
Author

ValentinRegnault commented Sep 14, 2022

Just committed the modifications you suggested. We now have an OriginParser class with lazy initialization.

How can I use the taxonomy properly? To create a store in taxonomy.py I need a URL (and a fallback path). For example, the label taxonomy store points to http://static.openfoodfact.[domain]/labels.full.json
So if I want to implement a new store for countries, the file must exist at some URL, right?

countries.json is a modified version of the file at https://static.openfoodfacts.net/data/taxonomies/countries.full.json. Basically it's the same file with a "nationalities" field added for every country, containing the nationality of that country's inhabitants, by language. It would be great to add these modifications to the file at https://static.openfoodfacts.net/data/taxonomies/countries.full.json, but as far as I understand, it's not robotoff that serves it. So how can I make use of the openfoodfacts taxonomies and make this code cleaner?

@stephanegigandet

Hi @Pykorm, that's very interesting, thank you for your contribution!

In Product Opener (the OFF backend written in Perl, also with a lot of regular expressions), we also have some functions to extract things like "Origine du Cacao : Pérou", but the match is not as complete as the one you made. I'm not using the ingredients and country taxonomies, so it's harder to catch some patterns with multiple ingredients and/or multiple countries.

The corresponding code is here: https://github.com/openfoodfacts/openfoodfacts-server/blob/main/lib/ProductOpener/Ingredients.pm#L1003

@stephanegigandet

Regarding

"nationalities": {
  "en": "Hungarian",
  "fr": "hongrois"
}

We could add something like that to the OFF taxonomies, as properties. We have to think a bit about it, as we may want to cover the nationality names for different genders and numbers (e.g. "grec", "grecque", "grecs", "grecques") so that we match "olives grecques".

One thing I'm wondering about is the performance of a regex with 10k ingredients (and synonyms) + 200 nationalities. I've never used regexes that big, but maybe that's not an issue at all.

@ValentinRegnault
Author

Hello,
You're right, we should care about gender and plural. Maybe we could store the nationalities as regexes, something like this:
{ "fr": "grec(?:que)?s?" }
but it would be hard to create or find a database with all those names.

I'm not worried about regex performance, as there is no better solution. I don't see any way to recognize sentences that say something about an origin without looping over every verb, every country name... Those computations are mandatory. And I think the performance of Python's re module is about the best we could get.
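
For instance, a nationality pattern stored like the one above could be compiled and tested roughly like this (a sketch covering just this one nationality):

import re

NATIONALITY_PATTERNS = {"fr": "grec(?:que)?s?"}  # the pattern proposed above

pattern = re.compile(rf"\b{NATIONALITY_PATTERNS['fr']}\b", re.IGNORECASE)
for text in ["grec", "grecque", "grecs", "grecques", "olives grecques"]:
    print(text, bool(pattern.search(text)))  # True for all five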

temp.txt Outdated
@@ -0,0 +1,339 @@
Angleterre
Collaborator

Is it expected that these two files temp.txt and temp2.txt are included?

Author

Oh no! I just misclicked in VS Code and pushed everything; I thought I had managed to clean up my mistake, but apparently not :/

@@ -2,7 +2,7 @@

# nice way to have our .env in environment for use in makefile
# see https://lithic.tech/blog/2020-05/makefile-dot-env
# Note: this will mask environment variable as opposed to docker-compose priority
# Note: this will mask environment variable as opposed to docker compose priority
Collaborator

@raphael0202 raphael0202 Sep 23, 2022

why have you introduced these changes?

Author

Same thing; I made that change because, for some unknown reason, "docker compose" doesn't exist on my computer and I have to use "docker-compose". This should be removed.

Collaborator

Alright, can you revert to the original version?

@raphael0202
Collaborator

Can you also move countries.json to data/taxonomies? data/ocr is reserved for pattern/blacklist files used in OCR detections.

# OFF_UID=1000
# OFF_GID=1000
OFF_UID=1000
OFF_GID=1000
Collaborator

why have you introduced these changes?

INGREDIENTS = json.load(open(settings.TAXONOMY_CATEGORY_PATH, "r"))

# French ----------------
INGREDIENTS_SYNONIMS_FR = [
Collaborator

INGREDIENTS_SYNONIMS_FR -> INGREDIENTS_SYNONYMS_FR

ingredient
)

if len(ingredients_origins) == 0:
Collaborator

@raphael0202 raphael0202 Sep 23, 2022

These lines are not necessary (the return below works well when ingredients_origins is empty)

Author

You're right, it was useful in older commits and is now useless.


# English -----------------------

INGREDIENTS_SYNONIMS_EN = [
Collaborator

INGREDIENTS_SYNONIMS_EN -> INGREDIENTS_SYNONYMS_EN

@ValentinRegnault
Author

I just committed the changes you suggested, thank you for your review.

.env Outdated
@@ -56,4 +56,4 @@ SENTRY_DSN=
IPC_AUTHKEY=ipc
IPC_HOST=0.0.0.0
IPC_PORT=6650
WORKER_COUNT=8
WORKER_COUNT=8
Collaborator

Can you revert to the original version for this file too?

Author

Yep!

@raphael0202
Collaborator

raphael0202 commented Oct 7, 2022

Sorry for the (very) late reply! Can you give us an estimate of how much time it takes to process one OCR JSON?
It's important to know this before merging in production :)

edit: you can test it on real data with this OCR dump: http://static.openfoodfacts.org/data/ocr_text.jsonl.gz
It's also useful to check that there are few false positives.
We also switched from the Blueprint API format to OpenAPI, which is better supported. The new doc is here; do you mind updating it? Otherwise I can take care of it.

@ValentinRegnault
Author

ValentinRegnault commented Oct 9, 2022

Is it possible to update the doc in this PR? As the PR was created before the api.yml file existed, I can't edit it. If I create it at the same path, will it be merged correctly?

One OCR processing takes about 0.00442 s (≈4.4 ms). This was measured on my laptop, with an
i7-1260P (12th gen) and 16 GB of RAM.
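
For reference, a measurement like this could be reproduced roughly as follows on the OCR dump linked above (a sketch: the JSONL field name and the find_origin entry point are assumptions, not the PR's exact code):

import gzip
import json
import time


def find_origin(text: str):
    """Stand-in for the parser entry point; in practice, call ocr_parser.find_origin."""
    return None


texts = []
with gzip.open("ocr_text.jsonl.gz", "rt", encoding="utf-8") as f:  # dump downloaded locally
    for i, line in enumerate(f):
        if i >= 1000:  # sample the first thousand OCR texts
            break
        texts.append(json.loads(line).get("text", ""))

start = time.perf_counter()
for text in texts:
    find_origin(text)
elapsed = time.perf_counter() - start
print(f"{elapsed / len(texts) * 1000:.2f} ms per OCR text on average")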

@raphael0202
Collaborator

raphael0202 commented Oct 10, 2022

There are some conflicts with master; you need to merge the openfoodfacts/robotoff master branch into your branch first.

One OCR processing takes about 0.00442 s (≈4.4 ms).

Perfect :)

@ValentinRegnault
Author

Sorry, I accidentally closed the PR :/

raphael0202 added a commit that referenced this pull request Oct 12, 2022
Adapted from PR by Pykrom:
#890
@raphael0202
Collaborator

raphael0202 commented Oct 12, 2022

There are still many conflicts (and unwanted changes introduced by the merge). I think it has to do with the fact that you're working on master and not on a feature branch.
I've created a new branch with all your modifications here: https://github.com/openfoodfacts/robotoff/tree/origin-extraction
Is it OK if I merge this one instead?

raphael0202 added a commit that referenced this pull request Oct 12, 2022
Adapted from PR by Pykrom:
#890
@ValentinRegnault
Author

Yes, thank you a lot!
