
Add row limit to lookup table uploads #35426

Merged 5 commits into master on Dec 4, 2024
Conversation

@gherceg (Contributor) commented Nov 22, 2024

Product Description

Technical Summary

https://dimagi.atlassian.net/browse/SAAS-16223

As discussed in the ticket, no existing lookup table has more than 500k rows, so that seems like a very safe limit to start with. The riskier part of this PR is that it also adds an explicit limit to the WorkbookJSONReader class, currently set to 1 million rows. That limit isn't strictly necessary, but in the spirit of implementing limits, encoding one here seems ideal. On the flip side, it could be viewed as a lazy way of adding a limit to any code that uses this class under the hood (like scheduling).
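A minimal sketch of what such a guard could look like. Only the `WorkbookTooManyRows` exception name is taken from this PR (it appears in the Sentry report below); the constant name, constructor signature, and `check_row_limit` helper are hypothetical illustrations, not the PR's actual code:

```python
MAX_WORKBOOK_ROWS = 1_000_000  # illustrative constant, mirroring the 1 million limit discussed above


class WorkbookTooManyRows(Exception):
    """Raised when an uploaded workbook exceeds the configured row limit."""

    def __init__(self, limit, actual):
        self.limit = limit
        self.actual = actual
        super().__init__(
            f"Workbook contains {actual} rows, exceeding the limit of {limit}")


def check_row_limit(row_count, limit=MAX_WORKBOOK_ROWS):
    """Hypothetical guard called before processing an uploaded workbook."""
    if row_count > limit:
        raise WorkbookTooManyRows(limit, row_count)
```

Placing the check in the shared reader class means every caller (lookup tables, scheduling, etc.) inherits the limit automatically, which is exactly the trade-off the summary describes.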

Feature Flag

Safety Assurance

Safety story

Automated test coverage

QA Plan

No

Rollback instructions

  • This PR can be reverted after deploy with no further considerations

Labels & Review

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change

@dimagimon dimagimon added the Risk: Medium Change affects files that have been flagged as medium risk. label Nov 22, 2024
@gherceg gherceg marked this pull request as ready for review November 22, 2024 22:48
corehq/util/workbook_json/excel.py (review comment, outdated and resolved)
@gherceg (Contributor, Author) commented Dec 3, 2024

I think I actually want to replace the implementation of _max_row here to use dimensions instead.

@gherceg (Contributor, Author) commented Dec 3, 2024

Just kidding. After reading some source code I realized the dimensions weren't going to work the way I had hoped, though there is some room for a deeper optimization. My understanding is that dimensions are set unreliably, depending on the application that created the file in the first place. So we could rework our code to check the total size of the workbook/worksheet when first loading it, using the dimensions if they are set and otherwise manually iterating over rows, whereas we currently just iterate over all rows to determine table sizes.

Update for clarity: For now, just using the existing max_row call available on the worksheet is sufficient.

millerdev
millerdev previously approved these changes Dec 4, 2024
@millerdev millerdev dismissed their stale review December 4, 2024 18:38

Missed Graham's most recent comment. Sounds like this will not work and is incomplete.

@millerdev (Contributor) left a comment

Just kidding. 😄 Apparently this is complete.

@gherceg gherceg merged commit dbe68f4 into master Dec 4, 2024
13 checks passed
@gherceg gherceg deleted the gh/lookup-tables/row-limit branch December 4, 2024 18:46
sentry-io bot commented Dec 5, 2024

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ WorkbookTooManyRows /a/{domain}/settings/users/commcare/upload/
  • ‼️ FixtureUploadError: ['Lookup tables can contain a maximum of 500000 rows. The uploaded file contains 564182 rows.'] corehq.apps.fixtures.tasks.fixture_upload_async
