Improving datamap performance #5601

Status: Open (4 commits, base: main)

@galvana (Contributor) commented Dec 12, 2024

Closes LA-207

Description Of Changes

Removes the dataset and undeclared_data_categories properties from the System and PrivacyDeclaration models. The new approach loads the data categories up front using Postgres JSON operators, which is considerably faster because it avoids loading large datasets. Beyond speeding up the data map report, any API call that returns Systems or PrivacyDeclarations also benefits from the removal of the dataset property.

Code Changes

  • Removed the expensive dataset property from System and PrivacyDeclaration
  • Updated the logic for undeclared_data_categories to use the pre-loaded data categories map

Steps to Confirm

  1. The tests should pass for this PR. Steps to confirm can be found in the corresponding Fidesplus PR: https://github.com/ethyca/fidesplus/pull/1765

Pre-Merge Checklist

  • Issue requirements met
  • All CI pipelines succeeded
  • CHANGELOG.md updated
  • Followup issues:
    • Followup issues created (include link)
    • No followup issues
  • Database migrations:
    • Ensure that your downrev is up to date with the latest revision on main
    • Ensure that your downgrade() migration is correct and works
      • If a downgrade migration is not possible for this change, please call this out in the PR description!
    • No migrations
  • Documentation:
    • Documentation complete, PR opened in fidesdocs
    • Documentation issue created in fidesdocs
    • If there are any new client scopes created as part of the pull request, remember to update public-facing documentation that references our scope registry
    • No documentation updates required

vercel bot commented Dec 12, 2024

1 skipped deployment: fides-plus-nightly (status: Ignored), updated Dec 14, 2024 2:08am UTC

[
    func.jsonb_array_elements_text(
        text(
            "jsonb_path_query(collections::jsonb, '$.** ? (@.data_categories != null).data_categories')"
Contributor Author

This is the key change: the JSONB path returns all the data_categories at all levels of nesting. The create_data_categories_property function creates the dataset_data_categories column property for System and PrivacyDeclaration.
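
To illustrate what "all levels of nesting" means, here is a rough pure-Python analogue of the traversal the JSONB path performs. It is only a sketch and not the code in this PR: the function name iter_data_categories and the sample collections structure are hypothetical, and the actual work happens inside Postgres via jsonb_path_query.

```python
from typing import Any, Iterator


def iter_data_categories(node: Any) -> Iterator[str]:
    """Yield every entry of every non-null "data_categories" list, at any depth.

    Hypothetical helper that mimics the intent of the JSON path
    '$.** ? (@.data_categories != null).data_categories'.
    """
    if isinstance(node, dict):
        categories = node.get("data_categories")
        if categories is not None:
            yield from categories
        # Keep descending: nested fields may carry their own categories.
        for value in node.values():
            yield from iter_data_categories(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_data_categories(item)


# Hypothetical dataset collections with categories at three nesting levels.
collections = [
    {
        "name": "users",
        "data_categories": ["user.behavior"],
        "fields": [
            {"name": "id", "data_categories": ["user.unique_id"]},
            {
                "name": "address",
                "fields": [
                    {
                        "name": "street",
                        "data_categories": ["user.contact.address.street"],
                    }
                ],
            },
        ],
    }
]

all_categories = set(iter_data_categories(collections))
```

Doing the same walk in the database means only the flattened category strings are returned, rather than whole dataset rows.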

Comment on lines -408 to -413
-    datasets = relationship(
-        "Dataset",
-        primaryjoin="foreign(Dataset.fides_key)==any_(System.dataset_references)",
-        lazy="selectin",
-        uselist=True,
-        viewonly=True,
Contributor Author

No more datasets! This was only used for undeclared_data_categories, so we can get rid of it.

-    system_dataset_data_categories = set()
-    for dataset in self.datasets:
-        system_dataset_data_categories.update(dataset.field_data_categories)
+    system_dataset_data_categories = set(self.dataset_data_categories)
Contributor Author

This is the optimization in use: no more iterating over datasets.
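
As a simplified sketch of how the pre-loaded categories could feed the undeclared check, the snippet below takes a plain set difference. The function name find_undeclared and the exact-match comparison are assumptions for illustration only; the real undeclared_data_categories logic is not shown in this diff and may differ (for example, in how parent and child categories are treated).

```python
from typing import Iterable, Set


def find_undeclared(
    dataset_data_categories: Iterable[str],
    declared_data_categories: Iterable[str],
) -> Set[str]:
    """Return categories present in the datasets but not declared.

    Hypothetical helper: uses exact string matching only.
    """
    return set(dataset_data_categories) - set(declared_data_categories)


# Categories as they might arrive from the pre-loaded column property.
dataset_data_categories = [
    "user.behavior",
    "user.unique_id",
    "user.contact.address.street",
]
# Categories declared on the privacy declaration (hypothetical values).
declared = {"user.unique_id"}

undeclared = find_undeclared(dataset_data_categories, declared)
```

Because the input is already a flat list of strings, no Dataset objects need to be loaded or iterated.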

Comment on lines 9 to 15
assert set(
    privacy_declaration_with_multiple_dataset_references.dataset_data_categories
) == {
    "user.behavior",
    "user.unique_id",
    "user.contact.address.street",
}
Contributor Author

Check out the privacy_declaration_with_multiple_dataset_references fixture to see the different levels of nesting for the data categories.

):
    assert (
-        privacy_declaration_with_dataset_references.undeclared_data_categories
+        privacy_declaration_with_single_dataset_reference.undeclared_data_categories
Contributor Author

The rest of the tests for undeclared_data_categories were already here

@galvana galvana marked this pull request as ready for review December 12, 2024 20:26
cypress bot commented Dec 12, 2024

fides    Run #11487

Run status: Passed #11487
Run duration: 00m 47s
Project: fides
Branch Review: refs/pull/5601/merge
Commit: 50ccad590a (Merge 5d6126f26f6d1403078b5429c6b82638884adec6 into 0030db7816dfbdbadf22843f6cf8...)
Committer: Adrian Galvan

Test results
Failures: 0
Flaky: 0
Pending: 0
Skipped: 0
Passing: 4

@andres-torres-marroquin (Contributor) left a comment

It looks clean, so I'll leave my approval. I'm waiting for the fidesplus PR so I can test it myself.

@andres-torres-marroquin (Contributor)

Some tests are still failing: https://github.com/ethyca/fides/actions/runs/12304095531/job/34341403381?pr=5601#step:8:926

codecov bot commented Dec 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.11%. Comparing base (7e1832c) to head (667432d).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5601   +/-   ##
=======================================
  Coverage   87.11%   87.11%           
=======================================
  Files         388      388           
  Lines       23906    23902    -4     
  Branches     2585     2583    -2     
=======================================
- Hits        20826    20823    -3     
  Misses       2522     2522           
+ Partials      558      557    -1     
