Add script to remove nearly empty products with quality issues #11058

Open
wants to merge 3 commits into
base: main

CharlesNepote
Member

No description provided.

# This script is run daily to remove empty products (without data or pictures)
# in particular products created by the button to add a product without a barcode

my $cursor = get_products_collection()->query({data_quality_errors_tags => { '$ne' => ''} })->fields({code => 1});
Contributor

This query is quite broad: it matches any product with data quality errors, so it will fetch about 150k products that you then read individually. You could move some of the filters into the query itself.
For example, adding state_tags en:photos-to-be-uploaded would divide the number of results by 10.
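
A minimal sketch of that suggestion, assuming state_tags is stored as an array of tags (MongoDB matches an array field against a scalar when the array contains that value). Only the filter is built here; the commented-out last line shows where it would plug into the existing query and has not been run against the real database.

```perl
use strict;
use warnings;

# Filter used in the script under review: any product with data quality errors.
my %broad_filter = (data_quality_errors_tags => {'$ne' => ''});

# Narrower filter: same condition, plus the state tag suggested above.
# Assumption: state_tags is an array field, so a scalar value matches
# documents whose array contains that tag.
my %narrow_filter = (%broad_filter, state_tags => 'en:photos-to-be-uploaded');

# Hypothetical usage, mirroring the call already in the script:
# my $cursor = get_products_collection()->query(\%narrow_filter)->fields({code => 1});
```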

Member Author

I tried last_image_t => { '$eq' => '' }, but the query failed to load (a timeout, I guess).

I don't have much trust in state_tags => 'en:photos-to-be-uploaded' because the computation of state_tags sometimes (rarely) seems to fail. I would rather trust last_image_t, which is raw data, not computed, isn't it?

Member

@teolemon left a comment

#and (defined $product_ref->{last_image_t})
and ($product_ref->{last_image_t} eq '')
and ($product_ref->{owner} eq '')
and ($product_ref->{creator} ne 'usda-ndb-import')
Contributor

If you add fields such as creator and completeness to the fields retrieved by the MongoDB query, you can test their values before you call retrieve_product($code), which saves disk reads.
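
A minimal sketch of that pre-filtering idea, using in-memory sample documents in place of the MongoDB cursor. retrieve_product_stub() is a hypothetical stand-in for retrieve_product(), and the field values are made up for illustration.

```perl
use strict;
use warnings;

# Stand-in for the documents a cursor would return when the query projects
# code, creator, owner and last_image_t (sample data, not real products).
my @docs = (
    {code => '0001', creator => 'usda-ndb-import'},
    {code => '0002', creator => 'someone', last_image_t => 1700000000},
    {code => '0003', creator => 'someone'},
    {code => '0004', creator => 'someone', owner => 'org-example'},
);

# Hypothetical stub for retrieve_product(); counts calls to show the saving.
my $disk_reads = 0;

sub retrieve_product_stub {
    my ($code) = @_;
    $disk_reads++;
    return {code => $code};
}

my @candidates;
foreach my $doc (@docs) {
    # Cheap checks on the projected fields first: skip without a disk read.
    next if ($doc->{creator} // '') eq 'usda-ndb-import';
    next if ($doc->{last_image_t} // '') ne '';
    next if ($doc->{owner} // '') ne '';

    # Only the remaining products need the full read.
    my $product_ref = retrieve_product_stub($doc->{code});
    push @candidates, $product_ref->{code};
}

print "candidates: @candidates, disk reads: $disk_reads\n";
```

With this sample data, only one of the four documents reaches the full read; the other three are rejected on the projected fields alone.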

Member Author

Understood, but how can I be sure that it won't time out?

) {

print "updating product $code, $product_ref->{creator}, $product_ref->{data_quality_errors_tags}, $product_ref->{completeness}...\n";
#$product_ref->{deleted} = 'on';
Contributor

The actual save of the product is missing: uncommenting this line alone will not delete the product.

Member Author

remove_empty_products.pl doesn't save the product either. Is that script unused? Should I open a bug, or should we delete it?

I don't know how to save a product. Is the following code OK?

my $comment = "[remove_nearly_empty_products.pl] automatic removal of product with a data quality issue, few information & without images at all (excluding imports)";
store_product($product_ref, $comment );

Contributor

To delete it, you need both $product_ref->{deleted} = 'on'; and a call to store_product().
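
A minimal sketch of those two steps together. store_product_stub() is a hypothetical stand-in that only records what would be saved; its argument list is copied from the snippet above and has not been checked against the real store_product().

```perl
use strict;
use warnings;

# Hypothetical stand-in for store_product(): records what would be saved.
my @saved;

sub store_product_stub {
    my ($product_ref, $comment) = @_;
    push @saved,
        {
        code => $product_ref->{code},
        deleted => $product_ref->{deleted},
        comment => $comment,
        };
    return;
}

my $product_ref = {code => '0003'};    # sample data, not a real product
my $comment = "[remove_nearly_empty_products.pl] automatic removal";

# Step 1: flag the product as deleted.
$product_ref->{deleted} = 'on';

# Step 2: persist the change; without this call the flag is never saved.
store_product_stub($product_ref, $comment);
```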


sonarcloud bot commented Nov 26, 2024

3 participants