feat: Added origin extraction #890
Conversation
@Pykorm first of all a big, big thanks for this courageous PR that will really help improve on the Eco-Score front!
I added some fairly structural comments, because it's always easier to see improvements when reading code, but I hope it won't discourage you.
I think we can concentrate on having the Predictions in this PR, and deploy it asap to get predictions.
That said, I wanted to make you aware that, as this is a new type of prediction, we will then have to:
- add an importer to generate insights
- add an annotator to apply validated insights on Open Food Facts.
But this is better done in a separate PR.
robotoff/prediction/ocr/origin.py
Outdated
# (ex: "quinoa comes from outside the E.U", "quinoa has several origins")
UNKNOW_ORIGIN = "unknow origin"
ALL_INGREDIENTS = "all ingredients"
I think we should really enclose all this first part in a class (and find_origin should be a method of this class).
Right now, doing this at module level will really slow down startup, which is something we will pay for a lot when running tests, etc. We want it to be lazy.
Using a class, we can use a caching mechanism to do the work only when needed (or maybe schedule it at startup).
So it would be something like:
class OriginParser:

    def __init__(self):
        self.initialized = False

    def initialize(self):
        """load data needed to parse origins"""
        if self.initialized:
            return  # already done
        self.INGREDIENTS = json.load…
        ...
        self.initialized = True

    def find_origin(self, content…):
        self.initialize()
        ....

ocr_parser = OriginParser()
The last line creates a parser, but does not initialize it; so where you currently call find_origin, you would instead call ocr_parser.find_origin. The initialization will happen only once, on the first OCR analysis request.
Note that with this pattern there is a small risk of a race condition between threads, and we may consider handling it too, but I'm not sure it's worth it as a first approximation.
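For illustration, a minimal runnable sketch of this pattern, with a lock added for the thread-safety concern mentioned above (the data loading is replaced by a stand-in dict; all names are hypothetical):

```python
import threading


class OriginParser:
    """Minimal sketch of the lazy-initialization pattern (hypothetical names)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.initialized = False
        self.ingredients = None

    def initialize(self):
        # Double-checked locking: cheap test first, then re-check under the lock
        # so two threads never both run the expensive loading.
        if self.initialized:
            return
        with self._lock:
            if self.initialized:
                return
            # Stand-in for the real json.load(...) of the taxonomy files
            self.ingredients = {"en:quinoa": {"name": {"fr": "quinoa"}}}
            self.initialized = True

    def find_origin(self, content):
        self.initialize()
        # Real parsing would go here; this sketch returns no predictions
        return []


ocr_parser = OriginParser()  # created eagerly, initialized on first use
```

The lock makes the first-call initialization safe even if two OCR requests arrive at the same time.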
Yep, good idea, I'll be working on it
robotoff/prediction/ocr/origin.py
Outdated
origin_index = -1
for index, ing_ori in enumerate(ingredients_origins):
    if ing_ori["origin"] == origin:
        origin_index = index

if origin_index == -1:
    ingredients_origins.append({
        "origin": origin,
        "same_for_all_ingredients": True,  # True unless group "ingredients" matched
        "concerned_ingredients": None
    })
    origin_index = len(ingredients_origins) - 1
Why not use a dict (which you transform into a list of predictions at the end) so that you can index by origin, instead of having to scan the list? It would make the code more readable.
You could even use a collections.defaultdict:
ingredients_origins: Dict[str, Dict[str, Any]] = collections.defaultdict(
    lambda: {"same_for_all_ingredients": True, "concerned_ingredients": None}
)
and here:
origin_data = ingredients_origins[origin]
At the end you would use an iterator on items()
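Sketched end to end, the suggestion looks like this (the origin and ingredient ids are made-up examples):

```python
import collections
from typing import Any, Dict

# Missing keys get a fresh default entry, so no list scan is needed
ingredients_origins: Dict[str, Dict[str, Any]] = collections.defaultdict(
    lambda: {"same_for_all_ingredients": True, "concerned_ingredients": None}
)

# Accessing a new origin creates its entry on the fly
origin_data = ingredients_origins["en:france"]
origin_data["same_for_all_ingredients"] = False
origin_data["concerned_ingredients"] = ["en:quinoa"]

# At the end, flatten back to the list shape the predictions expect
predictions = [
    {"origin": origin, **data} for origin, data in ingredients_origins.items()
]
```

This replaces both the linear scan and the manual append with a single dict access.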
Yes, it's clearly what should be done; I don't know why I was scanning the list each time.
It's always easier to see this kind of thing as a reviewer!
robotoff/prediction/ocr/origin.py
Outdated
    return next(
        ingredient_id
        for ingredient_id, ingredient in INGREDIENTS.items()
        if lang in ingredient["name"] and ingredient["name"][lang] == s
    )
except StopIteration:
    return s
As we know we will be doing this kind of lookup, I would build a dict at initialization time, so that we can check the key directly; it will really improve performance.
The lookup then becomes:
return INGREDIENTS_ID[lang].get(s, s)
Building INGREDIENTS_ID is something like:
INGREDIENTS_ID = collections.defaultdict(dict)
for ingredient_id, ingredient in INGREDIENTS.items():
    for lang, name in ingredient["name"].items():
        INGREDIENTS_ID[lang][name] = ingredient_id
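Put together with toy data, the inverted index and the O(1) lookup would look like this (the taxonomy entries and the helper name are made up for illustration):

```python
import collections

# Toy stand-in for the taxonomy file loaded at initialization time
INGREDIENTS = {
    "en:quinoa": {"name": {"en": "quinoa", "fr": "quinoa"}},
    "en:wheat": {"name": {"en": "wheat", "fr": "blé"}},
}

# Invert once: lang -> localized name -> ingredient id
INGREDIENTS_ID = collections.defaultdict(dict)
for ingredient_id, ingredient in INGREDIENTS.items():
    for lang, name in ingredient["name"].items():
        INGREDIENTS_ID[lang][name] = ingredient_id


def ingredient_id_for(lang: str, s: str) -> str:
    # Constant-time lookup; fall back to the raw string, as in the original code
    return INGREDIENTS_ID[lang].get(s, s)
```

The inversion cost is paid once, instead of scanning the whole taxonomy on every lookup.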
Good idea, I'll do this while making all this a class
I forgot to add two things @Pykorm:
Co-authored-by: Alex Garel <[email protected]>
Just committed the modifications you suggested. We now have an OriginParser class with lazy initialization.
How can I use the taxonomy properly? To create a store in taxonomy.py, I need a URL (and a fallback path). For example, the label taxonomy store points to http://static.openfoodfact.[domain]/labels.full.json
countries.json is a modified version of the file at https://static.openfoodfacts.net/data/taxonomies/countries.full.json. Basically it's the same file with a "nationalities" field added for every country, which may contain the nationality of that country's inhabitants, by language. It would be great to add these modifications to the file at https://static.openfoodfacts.net/data/taxonomies/countries.full.json, but as far as I understand, it's not Robotoff that serves it. So how can I make use of the Open Food Facts taxonomies, and make this code cleaner?
Hi @Pykorm, that's very interesting, thank you for your contribution! In Product Opener (the OFF backend, written in Perl, also with a lot of regular expressions), we also have some functions to extract things like "Origine du Cacao : Pérou", but the match is not as complete as the one you made. It's not using the ingredients and country taxonomies, so it's harder to see some patterns with multiple ingredients and/or multiple countries. The corresponding code is here: https://github.com/openfoodfacts/openfoodfacts-server/blob/main/lib/ProductOpener/Ingredients.pm#L1003
Regarding
We could add something like that to the OFF taxonomies, as properties. We have to think a bit about it, as we may want to cover the nationality names for different genders and numbers (e.g. "grec", "grecque", "grecs", "grecques") so that we match "olives grecques". One thing I'm wondering about is the performance of a regex with 10k ingredients (and synonyms) + 200 nationalities. I've never used regexes that big, but maybe that's not an issue at all.
Hello, I'm not worried about regex performance, as there is no better solution. I don't see any way to recognize the sentences that say something about an origin without looping over every verb, every country name... Those computations are mandatory, and I think the performance of Python's re module is the best we could get.
temp.txt
Outdated
@@ -0,0 +1,339 @@
Angleterre
Is it expected that these two files temp.txt and temp2.txt are included?
Oh no! I just misclicked in VS Code and pushed everything. I thought I had managed to clean up my mistake, but apparently not :/
@@ -2,7 +2,7 @@

# nice way to have our .env in environment for use in makefile
# see https://lithic.tech/blog/2020-05/makefile-dot-env
# Note: this will mask environment variable as opposed to docker-compose priority
# Note: this will mask environment variable as opposed to docker compose priority
why have you introduced these changes?
Same thing: I made that change because, for some unknown reason, "docker compose" doesn't exist on my computer and I have to use "docker-compose". This should be reverted.
Alright, can you revert to the original version?
Can you also move
# OFF_UID=1000
# OFF_GID=1000
OFF_UID=1000
OFF_GID=1000
why have you introduced these changes?
robotoff/prediction/ocr/origin.py
Outdated
INGREDIENTS = json.load(open(settings.TAXONOMY_CATEGORY_PATH, "r"))

# French ----------------
INGREDIENTS_SYNONIMS_FR = [
INGREDIENTS_SYNONIMS_FR -> INGREDIENTS_SYNONYMS_FR
robotoff/prediction/ocr/origin.py
Outdated
    ingredient
)

if len(ingredients_origins) == 0:
These lines are not necessary (the return below works well when ingredients_origins is empty)
You're right, it was useful in older commits and is now useless.
robotoff/prediction/ocr/origin.py
Outdated
# English -----------------------

INGREDIENTS_SYNONIMS_EN = [
INGREDIENTS_SYNONIMS_EN -> INGREDIENTS_SYNONYMS_EN
I just committed the changes you suggested, thank you for your review.
.env
Outdated
@@ -56,4 +56,4 @@ SENTRY_DSN=
IPC_AUTHKEY=ipc
IPC_HOST=0.0.0.0
IPC_PORT=6650
WORKER_COUNT=8
WORKER_COUNT=8
Can you revert to the original version for this file too?
Yep !
Sorry for the (very) late reply! Can you give us an estimate of how long it takes to process one OCR JSON? edit: you can test it on real data with this OCR dump: http://static.openfoodfacts.org/data/ocr_text.jsonl.gz
Is it possible to update the doc in this PR? As it was created before the file api.yml, I can't edit this file. If I create it at the same path, will it be merged correctly? One OCR processing takes about 0.00442 s (≈4.42 ms). It was run on my laptop, with a
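A simple way to estimate the per-text latency mentioned above (the find_origin callable is a stand-in for the real parser):

```python
import time


def average_latency(find_origin, texts):
    """Return the mean seconds spent per text by find_origin (a stand-in
    here for the real OCR origin parser)."""
    start = time.perf_counter()
    for text in texts:
        find_origin(text)
    elapsed = time.perf_counter() - start
    return elapsed / len(texts)
```

Running it over a few thousand lines of the OCR dump above would give a more robust estimate than a single call.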
There are some conflicts with master, you need to merge openfoodfacts/robotoff master branch into your branch first.
Perfect :)
Force-pushed from 335e563 to 5808112
This commit adds origin extraction in order to compute the Eco-Score
Sorry, I accidentally closed the PR :/
Adapted from PR by Pykrom: #890
There are still many conflicts (and unwanted changes introduced by the merge). I think it has to do with the fact that you're working on master, and not on a feature branch.
Adapted from PR by Pykrom: #890
Yes, thank you a lot!
feat: Add ocr origin extraction
What
This pull request adds origin extraction from OCR, for the Eco-Score. It also adds an API endpoint to extract origins from any OCR URL. It uses one regex per language (currently French, and partially English, are supported), and this regex is made to match almost any sentence that says something about the origin of an ingredient. See Extract origins of ingredients #306.
Explanation of the French regex (the same reasoning applies to English):
- In French, a sentence always starts with the subject, so we first start with "({INGREDIENTS_FR_JOINED})", which matches any ingredient.
- But there can be several ingredients, with 'and' or a comma as separator, as in "Caramel and chocolate are made in France". That's why we add "?(?:et |, ?)?", which means "there may be a space, then an 'et' or a comma, but maybe there is nothing".
- In a sentence like "Made in France" there is no subject, so all of this is optional. It can also be repeated, as in "Caramel, Chocolate and Wheat are from France". That's why we use a "*" quantifier.
- Then there is support for sentences like "100% made in France" or "quinoa is 100% French", with the ?(?:100%)?.
- Then the verb, which is also optional, as in "french quinoa". In a sentence like "made with french quinoa", the verb "made" isn't captured by the regex, but "french quinoa" is.
- After that we have some connectors like "dans, depuis" (= in, from), and also their negations, "hors, en dehors" (= outside). These are also optional, but the regex can match even if they are in the middle of the sentence.
- They are followed by other connectors like "de la|de l' |de|du" (= "of", adapted to the different genders).
- Then we have the country as a noun, "?(?P{COUNTRIES_FR_JOINED})?", as in "made in France".
- Then we have the country as an adjective, as in "french quinoa" (in French, the adjective goes after the noun): "?(?P<country_adj>{COUNTRIES_ADJECTIVES_FR_JOINED})?".
- Then we have "?(?P<large_origin>(?:divers(?:es)?|diff[ée]rent(?:es)?|d'autres) ?(?:pays|[ée]tats?|r[ée]gions?|continents?|origines?))?", which supports sentences like "quinoa from different countries".
- Finally, you may have noticed that everything is optional, which means the regex would match empty strings. To avoid that, we add "\b(?<!\w)" at the beginning and "\b(?!\w)" at the end. The idea comes from https://stackoverflow.com/a/50188154/8966453
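To make the construction concrete, here is a heavily simplified, hypothetical version of the French pattern (only a few ingredients, countries, and verbs; the real regex is built from the full taxonomies and has many more optional groups):

```python
import re

# Tiny made-up vocabularies standing in for the joined taxonomy alternations
INGREDIENTS_FR = ["quinoa", "caramel", "chocolat"]
COUNTRIES_FR = ["France", "Espagne"]

pattern = re.compile(
    r"\b(?<!\w)"                                                 # word-boundary guard described above
    r"(?:(?:" + "|".join(INGREDIENTS_FR) + r") ?(?:et |, ?)?)*"  # optional, repeatable subjects
    r"(?:100% )?"                                                # optional "100%"
    r"(?:produit|fabriqué|cultivé)?e? ?"                         # optional verb (with feminine "e")
    r"(?:en|de la|de l'|de|du) ?"                                # connectors
    r"(?P<country>" + "|".join(COUNTRIES_FR) + r")"              # country as a noun
    r"\b(?!\w)",                                                 # word-boundary guard at the end
    re.IGNORECASE,
)
```

For example, pattern.search("quinoa cultivé en France") matches, with the country group capturing "France", while a sentence with no country mention does not match.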
Here are some sentences that match:
"quinoa de France"
"quinoa produit en France"
"quinoa cultivée en dehors de l' Union Européenne"
"Fabriqué en France"
"Tout les ingrédients sont fabriqués en France"
"A partir de quinoa français"
"quinoa 100% français"
"Le quinoa a différentes origines"
"Le quinoa provient de different pays"
NOT SUPPORTED:
"Le quinoa de notre recette provient de different pays" ("de notre recette" is a genetive, hard to support)
"Nos agriculteurs français produisent le blé de notre recette" (subject isn't the ingredient, hard to support)
I added the file data/ocr/countries.json, which is a slightly modified version of the file at https://static.openfoodfacts.net/data/taxonomies/countries.full.json. I added a "nationalities" field for every country, which contains the name given to its inhabitants; the key is the language and the value is the nationality name. There are currently only a few countries covered, in English and French. This is mandatory to support sentences like "french quinoa". My code also uses the categories.full.json file.
Note :
Part of