Add script to remove nearly empty products with quality issues #11058
# This script is run daily to remove empty products (without data or pictures)
# in particular products created by the button to add a product without a barcode

my $cursor = get_products_collection()->query({data_quality_errors_tags => { '$ne' => ''} })->fields({code => 1});
This query is quite broad (any product with data quality errors): it will fetch ~150k products, which you will then read individually. You could move some of the filters into the query itself.
E.g. adding state_tags en:photos-to-be-uploaded would divide the number of results by 10.
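A minimal sketch of that suggestion, not the script's actual code: it only builds the narrower filter and the projection as plain Perl hashes. In the script they would presumably be passed to get_products_collection()->query(...)->fields(...) as in the line under review; the variable names $query and $fields are made up for illustration.

```perl
use strict;
use warnings;

# Filter: the condition already in the PR, plus the state_tags filter
# suggested in the review to cut the result set down.
my $query = {
    data_quality_errors_tags => { '$ne' => '' },
    state_tags               => 'en:photos-to-be-uploaded',
};

# Projection: only the barcode is fetched, as in the PR.
my $fields = { code => 1 };

printf "filter on: %s\n", join(', ', sort keys %$query);
```

Whether state_tags is reliable enough to filter on is a separate question for the maintainers; the sketch only shows where the extra condition would go.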
I tried last_image_t => { '$eq' => '' }
but the query failed to load (a timeout, I guess).
I don't have much trust in state_tags => 'en:photos-to-be-uploaded'
because sometimes (rarely) the computation of state_tags seems to fail. I'd rather trust last_image_t,
which is raw data, not computed, isn't it?
How is this different from https://github.com/openfoodfacts/openfoodfacts-server/blob/main/scripts/remove_empty_products.pl ?
#and (defined $product_ref->{last_image_t})
and ($product_ref->{last_image_t} eq '')
and ($product_ref->{owner} eq '')
and ($product_ref->{creator} ne 'usda-ndb-import')
If you add fields like creator, completeness, etc. to the retrieved fields of the MongoDB query, then you can test their values before you call retrieve_product($code), which saves disk reads.
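A rough sketch of that idea, under stated assumptions: the documents below are made-up stand-ins for what the cursor would return if the projection were { code => 1, creator => 1, owner => 1 }, and retrieve_product_stub is a hypothetical replacement for retrieve_product() that only counts calls instead of reading from disk.

```perl
use strict;
use warnings;

# Hypothetical stand-in for retrieve_product(): counts how often the
# expensive per-product read would happen.
my $disk_reads = 0;
sub retrieve_product_stub {
    my ($code) = @_;
    $disk_reads++;
    return { code => $code };
}

# Made-up documents carrying the projected fields.
my @docs = (
    { code => '0001', creator => 'usda-ndb-import', owner => '' },
    { code => '0002', creator => 'someone',         owner => 'org-x' },
    { code => '0003', creator => 'someone' },    # owner field absent
);

my @candidates;
foreach my $doc (@docs) {
    # Cheap checks on the projected fields first; a missing field is
    # treated as an empty string.
    next if ($doc->{creator} // '') eq 'usda-ndb-import';
    next if ($doc->{owner}   // '') ne '';

    # Only the remaining products need the full read.
    push @candidates, retrieve_product_stub($doc->{code});
}

printf "%d full read(s) for %d document(s)\n", $disk_reads, scalar @docs;
```

The remaining conditions from the script (for example on last_image_t) would still be checked on the fully loaded product, or added to the projection in the same way.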
Understood, but how can I be sure that it won't time out?
) {

print "updating product $code, $product_ref->{creator}, $product_ref->{data_quality_errors_tags}, $product_ref->{completeness}...\n";
#$product_ref->{deleted} = 'on';
The actual save of the product is missing: uncommenting this line alone will not delete the product.
remove_empty_products.pl doesn't save the product either. Is that script unused? Should I open a bug, or should we delete it?
I don't know how to save a product; is the following code OK?
my $comment = "[remove_nearly_empty_products.pl] automatic removal of product with a data quality issue, few information & without images at all (excluding imports)";
store_product($product_ref, $comment );
To delete it, you need both $product_ref->{deleted} = 'on'; and a call to store_product().
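Put together, the two steps might look like the sketch below. It is an illustration only: store_product_stub is a hypothetical stand-in that mirrors the store_product($product_ref, $comment) call quoted earlier in this thread and merely records what would be saved, and the product hash is made up.

```perl
use strict;
use warnings;

# Hypothetical stand-in for store_product(): remembers what was passed
# instead of writing anything.
my @saved;
sub store_product_stub {
    my ($product_ref, $comment) = @_;
    push @saved, { %$product_ref, comment => $comment };
    return;
}

# Made-up product that matched the removal criteria.
my $product_ref = { code => '0003', creator => 'someone' };

my $comment = "[remove_nearly_empty_products.pl] automatic removal of product with a data quality issue, few information & without images at all (excluding imports)";

# Step 1: mark the product as deleted.
$product_ref->{deleted} = 'on';

# Step 2: persist the change; without this call the flag is lost when
# the script exits.
store_product_stub($product_ref, $comment);

print "saved $saved[0]{code} with deleted=$saved[0]{deleted}\n";
```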
No description provided.