
feat: Added origin extraction #890

Closed
wants to merge 21 commits into from

Conversation

ValentinRegnault

@ValentinRegnault ValentinRegnault commented Sep 8, 2022

feat: Add ocr origin extraction

What

  • This pull request adds origin extraction from OCR, for the Eco-Score. It also adds an API endpoint to extract origins from any OCR URL. It uses one regex per language (currently French is supported, and English partially), and this regex is built to match almost any sentence that says something about the origin of an ingredient. See Extract origins of ingredients #306.

  • Explanation of the French regex (the English one follows the same reasoning):
    In French, a sentence always starts with the subject, so we begin with "({INGREDIENTS_FR_JOINED})", which matches any ingredient.
    But there can be several ingredients, separated by 'and' or a comma, as in "Caramel and chocolate are made in France". That's why we add "?(?:et |, ?)?", which means "there may be a space, then an 'and' or a comma, but maybe nothing at all".
    In a sentence like "Made in France" there is no subject, so all of this is optional. It can also be repeated, as in "Caramel, Chocolate and Wheat are from France". That's why we use a "*" quantifier.
    Then there is support for sentences like "100% made in France" or "quinoa is 100% French", with the ?(?:100%)?
    Then comes the verb, which is also optional, as in "french quinoa". In a sentence like "made with french quinoa", the verb "made" isn't captured by the regex, but "french quinoa" is.
    After that we have connectors like "dans, depuis" (= in, from), and also their negations, "hors, en dehors" (= outside). These are also optional, but the regex can match even if they appear in the middle of the sentence.
    They are followed by other connectors like "de la|de l' |de|du" (= "of", adapted to the different genders).
    Then we have the country as a noun, "?(?P<country>{COUNTRIES_FR_JOINED})?", as in "made in France",
    then the country as an adjective, as in "french quinoa" (in French, the adjective goes after the noun): "?(?P<country_adj>{COUNTRIES_ADJECTIVES_FR_JOINED})?",
    and then "?(?P<large_origin>(?:divers(?:es)?|diff[ée]rent(?:es)?|d'autres) ?(?:pays|[ée]tats?|r[ée]gions?|continents?|origines?))?", which supports sentences like "quinoa from different countries".
    Finally, you may have noticed that everything is optional, which means the regex would match empty strings. To avoid that, we add "\b(?<!\w)" at the beginning and "\b(?!\w)" at the end. The idea comes from https://stackoverflow.com/a/50188154/8966453. A simplified sketch of how these pieces fit together is shown after this list.

  • Here are some sentences that match:
    "quinoa de France"
    "quinoa produit en France"
    "quinoa cultivée en dehors de l' Union Européenne"
    "Fabriqué en France"
    "Tout les ingrédients sont fabriqués en France"
    "A partir de quinoa français"
    "quinoa 100% français"
    "Le quinoa a différentes origines"
    "Le quinoa provient de different pays"

  • NOT SUPPORTED:
    "Le quinoa de notre recette provient de different pays" ("de notre recette" is a genetive, hard to support)
    "Nos agriculteurs français produisent le blé de notre recette" (subject isn't the ingredient, hard to support)

  • I added the file data/ocr/countries.json, which is a slightly modified version of the file at https://static.openfoodfacts.net/data/taxonomies/countries.full.json. I added a "nationalities" field to every country, containing the name given to the country's inhabitants; the key is the language and the value is the nationality name. Only a few countries are currently covered, in English and French. This is required to support sentences like "french quinoa". My code also uses the categories.full.json file.
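
As a rough illustration of the regex assembly described above, here is a minimal sketch with tiny, made-up ingredient and country lists (these are not the PR's actual constants, which are built from the taxonomy files):

import re

# Hypothetical, heavily reduced lists; the real code builds these from the taxonomies.
INGREDIENTS_FR_JOINED = "quinoa|caramel|chocolat|blé"
COUNTRIES_FR_JOINED = "France|Pérou|Union Européenne"
COUNTRIES_ADJECTIVES_FR_JOINED = "français(?:e|es)?|péruvien(?:ne|s|nes)?"

ORIGIN_RE_FR = re.compile(
    rf"\b(?<!\w)"
    rf"(?:(?P<ingredients>{INGREDIENTS_FR_JOINED}) ?(?:et |, ?)?)*"  # optional, repeatable subject(s)
    rf" ?(?:100%)?"
    rf" ?(?:provient|produit|cultivée?|fabriqués?|sont)?"  # optional verb (reduced list)
    rf" ?(?:en|de|dans|depuis|hors|en dehors)?"
    rf" ?(?:de la |de l' |de |du )?"
    rf" ?(?P<country>{COUNTRIES_FR_JOINED})?"
    rf" ?(?P<country_adj>{COUNTRIES_ADJECTIVES_FR_JOINED})?"
    rf" ?(?P<large_origin>(?:divers(?:es)?|diff[ée]rent(?:es)?|d'autres) ?"
    rf"(?:pays|[ée]tats?|r[ée]gions?|continents?|origines?))?"
    rf"\b(?!\w)",
    re.IGNORECASE,
)

match = ORIGIN_RE_FR.search("quinoa cultivée en dehors de l' Union Européenne")
print(match.group("ingredients"), match.group("country"))  # quinoa Union Européenne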

Note :

  • This is my first contribution to an open source project; I tried to do things well. All tests pass. I started from the code in nutrient.py as suggested in Extract origins of ingredients #306. I added some constants to settings.py, an API endpoint, and a file in data.

Part of

@teolemon teolemon changed the title from "Added origin extraction" to "feat: Added origin extraction" on Sep 10, 2022
Member

@alexgarel alexgarel left a comment


@Pykorm first of all, a big big thanks for this courageous PR that will really help us improve on the Eco-Score front!

I added quite a few structural comments, because it's always easier to spot improvements when reading code, but I hope it won't discourage you.

I think we can concentrate on producing the Predictions in this PR, and deploy it asap to get predictions.

That said, I wanted to make you aware that, as this is a new type of prediction, we will then have to:

  • add an importer to generate insights
  • add an annotator to apply validated insights on open food facts.

But this is better done in a separate PR.

robotoff/app/api.py (thread resolved)
robotoff/prediction/ocr/origin.py (outdated, thread resolved)
robotoff/prediction/ocr/origin.py (outdated, thread resolved)
robotoff/prediction/ocr/origin.py (outdated, thread resolved)
# (ex: "quinoa comes from outside the E.U", "quinoa has several origins")
UNKNOW_ORIGIN = "unknow origin"
ALL_INGREDIENTS = "all ingredients"

Member

I think we should really enclose all this first part in a class (and find_origin should be a method of this class).

Right now, doing this at module level will really slow down startup, which is something we will pay for a lot when running tests, etc.
We want it to be lazy.
Using a class, we can use some caching mechanism to do it only when needed (or maybe schedule it at startup).

So it would be something like:

class OriginParser:

    def __init__(self):
        self.initialized = False

    def initialize(self):
        """load data needed to parse origins"""
        if self.initialized:
            return  # already done
        self.INGREDIENTS = json.load(…)
        ...
        self.initialized = True

    def find_origin(self, content, …):
        self.initialize()
        ...

ocr_parser = OriginParser()

The last line creates a parser but does not initialize it, so where you now call find_origin you would instead call ocr_parser.find_origin. The initialization will happen only once, on the first OCR analysis request.

Note that with this pattern there is a small risk of a race condition between threads; we may consider handling that too, but I'm not sure it's worth it as a first approximation.
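
If we did want to guard against that race, a minimal sketch could look like the following (double-checked locking; the _init_lock attribute is my addition, not something in the PR):

import threading


class OriginParser:
    def __init__(self):
        self.initialized = False
        self._init_lock = threading.Lock()  # hypothetical guard, not in the PR

    def initialize(self):
        """Load data needed to parse origins, at most once, even with concurrent callers."""
        if self.initialized:  # fast path once loaded, no locking
            return
        with self._init_lock:
            if self.initialized:  # re-check: another thread may have finished first
                return
            # self.INGREDIENTS = json.load(...)  # actual loading elided
            self.initialized = True

    def find_origin(self, content):
        self.initialize()
        ...  # actual parsing elided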

Author

Yep, good idea, I'll be working on it

Comment on lines 167 to 178
origin_index = -1
for index, ing_ori in enumerate(ingredients_origins):
    if ing_ori["origin"] == origin:
        origin_index = index

if origin_index == -1:
    ingredients_origins.append({
        "origin": origin,
        "same_for_all_ingredients": True,  # True unless group "ingredients" matched
        "concerned_ingredients": None
    })
    origin_index = len(ingredients_origins) - 1
Member

Why not use a dict (that you transform into a list of predictions at the end) so that you can index by origin, instead of having to scan the list? It would make the code more readable.

You could even use a collections.defaultdict:

ingredients_origins: Dict[str, Dict[str, Any]] = collections.defaultdict(
    lambda: {"same_for_all_ingredients": True, "concerned_ingredients": None}
)

and here:

origin_data = ingredients_origins[origin]

At the end, you would iterate over items().
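
Putting the pieces together, a rough sketch (the origin/ingredient tag values and the final dict shape are illustrative, not the PR's exact Prediction format):

import collections
from typing import Any, Dict

ingredients_origins: Dict[str, Dict[str, Any]] = collections.defaultdict(
    lambda: {"same_for_all_ingredients": True, "concerned_ingredients": None}
)

# Wherever a match is processed, index directly by origin:
origin_data = ingredients_origins["en:france"]  # created on first access
origin_data["same_for_all_ingredients"] = False
origin_data["concerned_ingredients"] = ["en:quinoa"]

# At the end, turn the dict into a list, e.g. to build the predictions:
results = [{"origin": origin, **data} for origin, data in ingredients_origins.items()]
print(results)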

Author

Yes, that's clearly what should be done; I don't know why I was scanning the list each time.

Member

It's always easier to see this kind of thing as a reviewer!

robotoff/prediction/ocr/origin.py (outdated, thread resolved)
robotoff/prediction/ocr/origin.py (outdated, thread resolved)
Comment on lines 255 to 260
    return next(ingredient_id for ingredient_id, ingredient in INGREDIENTS.items()
                if lang in ingredient["name"]
                and ingredient["name"][lang] == s
                )
except StopIteration:
    return s
Member

As we know we will do this kind of lookup, I would build a dict at initialization time so that we can just check the key directly; it will really improve performance.
So that you can do

return INGREDIENTS_ID[lang].get(s, s)

building the INGREDIENTS_ID is something like:

INGREDIENTS_ID = collections.defaultdict(dict)
for ingredient_id, ingredient in INGREDIENTS.items():
    for lang, name in ingredient["name"].items():
        INGREDIENTS_ID[lang][name] = ingredient_id
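
For example, with a toy INGREDIENTS dict (made-up entries, just to show the .get(s, s) fallback behaviour):

import collections

INGREDIENTS = {
    "en:quinoa": {"name": {"en": "quinoa", "fr": "quinoa"}},
    "en:wheat": {"name": {"en": "wheat", "fr": "blé"}},
}

INGREDIENTS_ID: dict = collections.defaultdict(dict)
for ingredient_id, ingredient in INGREDIENTS.items():
    for lang, name in ingredient["name"].items():
        INGREDIENTS_ID[lang][name] = ingredient_id

print(INGREDIENTS_ID["fr"].get("blé", "blé"))    # en:wheat
print(INGREDIENTS_ID["fr"].get("maïs", "maïs"))  # maïs (unknown name falls back to the input)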

Author

Good idea, I'll do this while making all this a class

tests/unit/prediction/ocr/test_origin.py (outdated, thread resolved)
@alexgarel
Member

I forgot to add two things @Pykorm:

  1. you should use make lint and make checks to keep your code in good shape :-)

  2. it's ok to have the countries JSON in the code for now, but it would be cleaner if:

    1. we add the needed information to the openfoodfacts countries taxonomy as properties
    2. we use the taxonomy JSON through the Taxonomy class in robotoff

    this ensures we benefit from future updates of the taxonomy

@ValentinRegnault
Author

ValentinRegnault commented Sep 14, 2022

Just committed the modifications you suggested. We now have an OriginParser class with lazy initialization.

How can I use the taxonomy properly? To create a store in taxonomy.py I need a URL (and a fallback path). For example, the label taxonomy store points to http://static.openfoodfact.[domain]/labels.full.json
So if I want to implement a new store for countries, the file must exist at some URL, right?

countries.json is a modified version of the file at https://static.openfoodfacts.net/data/taxonomies/countries.full.json. Basically it's the same file with a "nationalities" field added for every country, containing the nationality of that country's inhabitants, by language. It would be great to add these modifications to the file at https://static.openfoodfacts.net/data/taxonomies/countries.full.json, but as far as I understand, it's not robotoff that serves it. So how can I make use of the openfoodfacts taxonomies and make this code cleaner?

@stephanegigandet

Hi @Pykorm, that's very interesting, thank you for your contribution!

In Product Opener (the OFF backend written in Perl, also with a lot of regular expressions), we also have some functions to extract things like "Origine du Cacao : Pérou", but the match is not as complete as the one you made. I'm not using the ingredients and country taxonomies, so it's harder to catch some patterns with multiple ingredients and/or multiple countries.

The corresponding code is here: https://github.com/openfoodfacts/openfoodfacts-server/blob/main/lib/ProductOpener/Ingredients.pm#L1003

@stephanegigandet

Regarding

"nationalities": {
  "en": "Hungarian",
  "fr": "hongrois"
}

We could add something like that to the OFF taxonomies, as properties. We have to think a bit about it, as we may want to cover the nationality names for different genders and numbers (e.g. "grec", "grecque", "grecs", "grecques") so that we match "olives grecques".

One thing I'm wondering about is the performance of a regex with 10k ingredients (and synonyms) + 200 nationalities. I've never used regexes that big, but maybe that's not an issue at all.

@ValentinRegnault
Author

Hello,
You're right, we should care about gender and plural. Maybe we could store the nationalities as regexes, something like this:
{ "fr": "grec(?:que)?s?" }
but it would be hard to create or find a database with all those names.

I'm not worried about regex performance, as there is no better solution. I don't see any way to recognize sentences that say something about an origin without looping over every verb, every country name... Those computations are mandatory. And I think the performance of Python's re module is about the best we could get.
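
For instance, a nationality pattern stored like the one above could be compiled and tested roughly like this (a sketch covering just this one nationality):

import re

NATIONALITY_PATTERNS = {"fr": "grec(?:que)?s?"}  # the pattern proposed above

pattern = re.compile(rf"\b{NATIONALITY_PATTERNS['fr']}\b", re.IGNORECASE)
for text in ["grec", "grecque", "grecs", "grecques", "olives grecques"]:
    print(text, bool(pattern.search(text)))  # True for all five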

temp.txt Outdated
@@ -0,0 +1,339 @@
Angleterre
Collaborator

Is it expected that these two files temp.txt and temp2.txt are included?

Author

Oh no! I just misclicked in VS Code and pushed everything; I thought I had managed to clean up my mistake, but apparently not :/

@@ -2,7 +2,7 @@

# nice way to have our .env in environment for use in makefile
# see https://lithic.tech/blog/2020-05/makefile-dot-env
# Note: this will mask environment variable as opposed to docker-compose priority
# Note: this will mask environment variable as opposed to docker compose priority
Collaborator

@raphael0202 raphael0202 Sep 23, 2022

why have you introduced these changes?

Author

Same thing; I made that change because, for some unknown reason, "docker compose" doesn't exist on my computer and I have to use "docker-compose". This should be removed.

Collaborator

Alright, can you revert to the original version?

@raphael0202
Collaborator

Can you also move countries.json to data/taxonomies? data/ocr is reserved for pattern/blacklist files used in OCR detections.

# OFF_UID=1000
# OFF_GID=1000
OFF_UID=1000
OFF_GID=1000
Collaborator

why have you introduced these changes?

INGREDIENTS = json.load(open(settings.TAXONOMY_CATEGORY_PATH, "r"))

# French ----------------
INGREDIENTS_SYNONIMS_FR = [
Collaborator

INGREDIENTS_SYNONIMS_FR -> INGREDIENTS_SYNONYMS_FR

ingredient
)

if len(ingredients_origins) == 0:
Collaborator

@raphael0202 raphael0202 Sep 23, 2022

These lines are not necessary (the return below works well when ingredients_origins is empty)

Author

You're right, it was useful in older commits and is now useless.


# English -----------------------

INGREDIENTS_SYNONIMS_EN = [
Collaborator

INGREDIENTS_SYNONIMS_EN -> INGREDIENTS_SYNONYMS_EN

@ValentinRegnault
Author

I just committed the changes you suggested, thank you for your review.

.env Outdated
@@ -56,4 +56,4 @@ SENTRY_DSN=
IPC_AUTHKEY=ipc
IPC_HOST=0.0.0.0
IPC_PORT=6650
WORKER_COUNT=8
WORKER_COUNT=8
Collaborator

Can you revert to the original version for this file too?

Author

Yep!

@raphael0202
Collaborator

raphael0202 commented Oct 7, 2022

Sorry for the (very) late reply! Can you give us an estimate of how much time it takes to process one OCR JSON?
It's important to know this before merging in production :)

edit: you can test it on real data with this OCR dump: http://static.openfoodfacts.org/data/ocr_text.jsonl.gz
It's also useful to check that there are few false positives.
We also switched from the Blueprint API format to OpenAPI, which is better supported. The new doc is here; do you mind updating it? Otherwise I can take care of it.

@ValentinRegnault
Author

ValentinRegnault commented Oct 9, 2022

Is it possible to update the doc in this PR? As the PR was created before the api.yml file existed, I can't edit it. If I create it at the same path, will it be merged correctly?

One OCR processing takes about 0.00442 s (≈4.4 ms). This was measured on my laptop, with an
i7-1260P (12th gen) and 16 GB of RAM.
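
For reference, a measurement like this could be reproduced roughly as follows on the OCR dump linked above (a sketch: the JSONL field name and the find_origin entry point are assumptions, not the PR's exact code):

import gzip
import json
import time


def find_origin(text: str):
    """Stand-in for the parser entry point; in practice, call ocr_parser.find_origin."""
    return None


texts = []
with gzip.open("ocr_text.jsonl.gz", "rt", encoding="utf-8") as f:  # dump downloaded locally
    for i, line in enumerate(f):
        if i >= 1000:  # sample the first thousand OCR texts
            break
        texts.append(json.loads(line).get("text", ""))

start = time.perf_counter()
for text in texts:
    find_origin(text)
elapsed = time.perf_counter() - start
print(f"{elapsed / len(texts) * 1000:.2f} ms per OCR text on average")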

@raphael0202
Collaborator

raphael0202 commented Oct 10, 2022

There are some conflicts with master; you need to merge the openfoodfacts/robotoff master branch into your branch first.

One OCR processing takes about 0.00442 s (≈4.4 ms).

Perfect :)

@ValentinRegnault
Author

Sorry, I accidentally closed the PR :/

raphael0202 added a commit that referenced this pull request Oct 12, 2022
Adapted from PR by Pykrom:
#890
@raphael0202
Collaborator

raphael0202 commented Oct 12, 2022

There are still many conflicts (and unwanted changes introduced by the merge). I think it has to do with the fact that you're working on master and not on a feature branch.
I've created a new branch with all your modifications here: https://github.com/openfoodfacts/robotoff/tree/origin-extraction
Is it OK if I merge this one instead?

raphael0202 added a commit that referenced this pull request Oct 12, 2022
Adapted from PR by Pykrom:
#890
@ValentinRegnault
Author

Yes, thank you a lot!
