fix(ingest/powerbi): use dataset workspace id as key for parent container #8994

looppi · 2023-10-12T10:10:25Z

In our production environment, I noticed that all the datasets were under a single workspace container, which was not the case. On closer inspection, I found out that if there are entities from multiple workspaces in a single "batch", then all the datasets would get the last workspace key reference which was processed.

Here, I created a very basic workaround for this issue, piggybacking the workspace reference from the moment, when it was added to the processed_datasets set and using that reference when making connection between workspace container and dataset.

The problem happens only when both extract_datasets_to_containers and extract_workspaces_to_containers are enabled. As if one of the features is disabled, there's no relation between the two containers. The iteration here emitted wrong workspace references to the dataset containers as it used self.workspace_key instead of a key which matched the datasets origin workspace.

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

…iner

looppi · 2023-10-12T12:42:55Z

I managed to reproduce the issue in the test case by allowing multiple workspaces to be scanned at test_workspace_container. The bug is that the bad container references are emitted in the end of ingestion.

looppi · 2023-10-25T08:05:13Z

I just noticed that this "fix" is going to introduce another bug. I think we need to discuss further how to proceed.

Problem is that Workspace container generation doesn't use platform_instance in URN generation, but my fix does use it. This fix will result in broken container references if platform_instance is configured.

I'm going to try to figure out a way, where there would be only one place for workspace container key generation and it would also fix the issue of the workspace keys getting mixed.

In every case, changing the workspace key generation would be major as it creates new resources for every new user.

hsheth2 · 2023-10-27T23:35:44Z

@looppi honestly having a bit of trouble following the issue explanation. It kinda sounds like we previously weren't respecting the platform_instance value set in the config when generating workspace container urns, and that sounds like a bug. I'd be ok with a breaking change to powerbi container urns, so long the change doesn't have any impact if platform_instance was not set in the config (which should be the standard behavior for our urn guid generation).

Also, it feels like the powerbi code is more complex than it needs to be, but that's a separate thing

looppi · 2023-11-01T09:13:11Z

I found it quite difficult to describe the issue. You guessed correctly, the previous implementation wasn't respecting platform_instance in every use case. Previously the platform_instance was used when generating references to the container, but not when actually creating the container.

I made the change in this PR to respect the platform_instance value in every case.

I actually found a more elegant way to get rid of the "workspace reference leakage" issue, for which I originally created this PR. Just set the processed_datasets to empty set every time workspace changes. The implementation already runs the code for every workspace so this might have been just an oversight from earlier changes.

looppi · 2023-11-01T09:14:41Z

metadata-ingestion/src/datahub/ingestion/source/powerbi/powerbi.py

@@ -743,6 +735,7 @@ def generate_container_for_workspace(
    ) -> Iterable[MetadataWorkUnit]:
        self.workspace_key = workspace.get_workspace_key(
            platform_name=self.__config.platform_name,
+            platform_instance=self.__config.platform_instance,


This was missing, which meant that if platform_instance was set, the actual container metadata had a different urn as the references which were supposed to point to the same container.

looppi · 2023-11-01T09:15:53Z

metadata-ingestion/src/datahub/ingestion/source/powerbi/powerbi.py

@@ -1238,6 +1229,8 @@ def extract_independent_datasets(
    def get_workspace_workunit(
        self, workspace: powerbi_data_classes.Workspace
    ) -> Iterable[MetadataWorkUnit]:
+        self.mapper.processed_datasets = set()


Work around the leakage of processed datasets over workspaces by setting the processed_datasets empty for every single workspace instance.

hsheth2

This code mostly looks good, and is a lot cleaner

One thing that I just noticed - we make the assumption that parent container are emitted before their children in a few places downstream (particularly when generating "browse paths v2" for the containers). In effect, that makes us generate browsePathV2 aspects incorrectly if you're using both extract_workspaces_to_containers and extract_datasets_to_containers (both the workspace and the dataset should show up in the browse path v2, but in the golden file here only the dataset shows up)

I'd still be happy to accept this PR as-is because it's still a net improvement, but I was wondering if this is something you'd like to take a stab at

looppi · 2023-11-02T07:12:32Z

I'll try to do this, I'll get back to you when I have something to show!

looppi · 2023-11-07T09:27:41Z

metadata-ingestion/src/datahub/ingestion/source/powerbi/powerbi.py

+            wu.metadata
+            for wu in container_work_units
+            if isinstance(wu.metadata, MetadataChangeProposalWrapper)
+        ]


This is kind of ugly, but I couldn't find any other implementation for gen_containers and decided that for this implementation it's easier to unwrap the metadata from the work unit rather than create another implementation for container creation.

this is fine - we should probably make gen_containers return MCPs instead of workunits

hsheth2

LGTM!

hsheth2 · 2023-11-09T20:55:37Z

metadata-ingestion/src/datahub/ingestion/source/powerbi/powerbi.py

+            wu.metadata
+            for wu in container_work_units
+            if isinstance(wu.metadata, MetadataChangeProposalWrapper)
+        ]


this is fine - we should probably make gen_containers return MCPs instead of workunits

fix(ingest/powerbi): use dataset workspace id as key for parent conta…

ea488fa

…iner

github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Oct 12, 2023

vercel bot deployed to Preview October 12, 2023 10:33 View deployment

fix(ingest/powerbi): add tests to check container references

1c79098

looppi marked this pull request as ready for review October 12, 2023 10:41

vercel bot deployed to Preview October 12, 2023 11:07 View deployment

fix(powerbi/ingest): force multiple workspaces for test

6869a13

vercel bot deployed to Preview October 12, 2023 12:32 View deployment

hsheth2 approved these changes Oct 23, 2023

View reviewed changes

hsheth2 added the merge-pending-ci A PR that has passed review and should be merged once CI is green. label Oct 23, 2023

looppi added 2 commits October 26, 2023 12:19

fix(ingest/powerbi): use only single reference to workspace urn

a4f7354

Merge branch 'master' into fix-powerbi-dataset-container-handling

660a409

vercel bot deployed to Preview October 26, 2023 09:59 View deployment

fix(ingest/powerbi): fix mypy errors

0b6adf2

vercel bot deployed to Preview October 26, 2023 10:25 View deployment

hsheth2 removed the merge-pending-ci A PR that has passed review and should be merged once CI is green. label Oct 31, 2023

maggiehays added the hacktoberfest-accepted Acceptance for hacktoberfest https://hacktoberfest.com/participation/ label Oct 31, 2023

fix(ingest/powerbi): refactor for simpler approach

9b884a1

looppi commented Nov 1, 2023

View reviewed changes

vercel bot deployed to Preview November 1, 2023 09:22 View deployment

hsheth2 reviewed Nov 2, 2023

View reviewed changes

fix(ingest/powerbi): reorganize dataset container emit order

8f9b395

looppi commented Nov 7, 2023

View reviewed changes

vercel bot deployed to Preview November 7, 2023 09:32 View deployment

fix(ingest/powerbi): cleanup unused code

a0a9e7b

vercel bot deployed to Preview November 7, 2023 10:29 View deployment

fix(ingest/powerbi): flake8 fix

98c4404

vercel bot deployed to Preview November 7, 2023 12:49 View deployment

Merge branch 'master' into fix-powerbi-dataset-container-handling

94e48b7

vercel bot deployed to Preview November 8, 2023 18:43 View deployment

hsheth2 approved these changes Nov 9, 2023

View reviewed changes

hsheth2 added the merge-pending-ci A PR that has passed review and should be merged once CI is green. label Nov 9, 2023

hsheth2 merged commit 1077138 into datahub-project:master Nov 10, 2023
50 checks passed

looppi mentioned this pull request Dec 8, 2023

fix(ingest/powerbi): fix powerbi chart input handling #9415

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(ingest/powerbi): use dataset workspace id as key for parent container #8994

fix(ingest/powerbi): use dataset workspace id as key for parent container #8994

looppi commented Oct 12, 2023 •

edited

Loading

looppi commented Oct 12, 2023

looppi commented Oct 25, 2023 •

edited

Loading

hsheth2 commented Oct 27, 2023

looppi commented Nov 1, 2023

looppi Nov 1, 2023

looppi Nov 1, 2023

hsheth2 left a comment

looppi commented Nov 2, 2023

looppi Nov 7, 2023

hsheth2 Nov 9, 2023

hsheth2 left a comment

hsheth2 Nov 9, 2023

fix(ingest/powerbi): use dataset workspace id as key for parent container #8994

fix(ingest/powerbi): use dataset workspace id as key for parent container #8994

Conversation

looppi commented Oct 12, 2023 • edited Loading

Checklist

looppi commented Oct 12, 2023

looppi commented Oct 25, 2023 • edited Loading

hsheth2 commented Oct 27, 2023

looppi commented Nov 1, 2023

looppi Nov 1, 2023

Choose a reason for hiding this comment

looppi Nov 1, 2023

Choose a reason for hiding this comment

hsheth2 left a comment

Choose a reason for hiding this comment

looppi commented Nov 2, 2023

looppi Nov 7, 2023

Choose a reason for hiding this comment

hsheth2 Nov 9, 2023

Choose a reason for hiding this comment

hsheth2 left a comment

Choose a reason for hiding this comment

hsheth2 Nov 9, 2023

Choose a reason for hiding this comment

looppi commented Oct 12, 2023 •

edited

Loading

looppi commented Oct 25, 2023 •

edited

Loading