Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(ingest/powerbi): use dataset workspace id as key for parent container #8994

Merged

Conversation

looppi
Copy link
Contributor

@looppi looppi commented Oct 12, 2023

In our production environment, I noticed that all the datasets were under a single workspace container, which was not the case. On closer inspection, I found out that if there are entities from multiple workspaces in a single "batch", then all the datasets would get the last workspace key reference which was processed.

Here, I created a very basic workaround for this issue, piggybacking the workspace reference from the moment, when it was added to the processed_datasets set and using that reference when making connection between workspace container and dataset.

The problem happens only when both extract_datasets_to_containers and extract_workspaces_to_containers are enabled. As if one of the features is disabled, there's no relation between the two containers. The iteration here emitted wrong workspace references to the dataset containers as it used self.workspace_key instead of a key which matched the datasets origin workspace.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Oct 12, 2023
@looppi looppi marked this pull request as ready for review October 12, 2023 10:41
@looppi
Copy link
Contributor Author

looppi commented Oct 12, 2023

I managed to reproduce the issue in the test case by allowing multiple workspaces to be scanned at test_workspace_container. The bug is that the bad container references are emitted in the end of ingestion.

@hsheth2 hsheth2 added the merge-pending-ci A PR that has passed review and should be merged once CI is green. label Oct 23, 2023
@looppi
Copy link
Contributor Author

looppi commented Oct 25, 2023

I just noticed that this "fix" is going to introduce another bug. I think we need to discuss further how to proceed.

Problem is that Workspace container generation doesn't use platform_instance in URN generation, but my fix does use it. This fix will result in broken container references if platform_instance is configured.

I'm going to try to figure out a way, where there would be only one place for workspace container key generation and it would also fix the issue of the workspace keys getting mixed.

In every case, changing the workspace key generation would be major as it creates new resources for every new user.

@hsheth2
Copy link
Collaborator

hsheth2 commented Oct 27, 2023

@looppi honestly having a bit of trouble following the issue explanation. It kinda sounds like we previously weren't respecting the platform_instance value set in the config when generating workspace container urns, and that sounds like a bug. I'd be ok with a breaking change to powerbi container urns, so long the change doesn't have any impact if platform_instance was not set in the config (which should be the standard behavior for our urn guid generation).

Also, it feels like the powerbi code is more complex than it needs to be, but that's a separate thing

@hsheth2 hsheth2 removed the merge-pending-ci A PR that has passed review and should be merged once CI is green. label Oct 31, 2023
@maggiehays maggiehays added the hacktoberfest-accepted Acceptance for hacktoberfest https://hacktoberfest.com/participation/ label Oct 31, 2023
@looppi
Copy link
Contributor Author

looppi commented Nov 1, 2023

I found it quite difficult to describe the issue. You guessed correctly, the previous implementation wasn't respecting platform_instance in every use case. Previously the platform_instance was used when generating references to the container, but not when actually creating the container.

I made the change in this PR to respect the platform_instance value in every case.

I actually found a more elegant way to get rid of the "workspace reference leakage" issue, for which I originally created this PR. Just set the processed_datasets to empty set every time workspace changes. The implementation already runs the code for every workspace so this might have been just an oversight from earlier changes.

@@ -743,6 +735,7 @@ def generate_container_for_workspace(
) -> Iterable[MetadataWorkUnit]:
self.workspace_key = workspace.get_workspace_key(
platform_name=self.__config.platform_name,
platform_instance=self.__config.platform_instance,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was missing, which meant that if platform_instance was set, the actual container metadata had a different urn as the references which were supposed to point to the same container.

@@ -1238,6 +1229,8 @@ def extract_independent_datasets(
def get_workspace_workunit(
self, workspace: powerbi_data_classes.Workspace
) -> Iterable[MetadataWorkUnit]:
self.mapper.processed_datasets = set()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Work around the leakage of processed datasets over workspaces by setting the processed_datasets empty for every single workspace instance.

Copy link
Collaborator

@hsheth2 hsheth2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code mostly looks good, and is a lot cleaner

One thing that I just noticed - we make the assumption that parent container are emitted before their children in a few places downstream (particularly when generating "browse paths v2" for the containers). In effect, that makes us generate browsePathV2 aspects incorrectly if you're using both extract_workspaces_to_containers and extract_datasets_to_containers (both the workspace and the dataset should show up in the browse path v2, but in the golden file here only the dataset shows up)

I'd still be happy to accept this PR as-is because it's still a net improvement, but I was wondering if this is something you'd like to take a stab at

@looppi
Copy link
Contributor Author

looppi commented Nov 2, 2023

I'll try to do this, I'll get back to you when I have something to show!

wu.metadata
for wu in container_work_units
if isinstance(wu.metadata, MetadataChangeProposalWrapper)
]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is kind of ugly, but I couldn't find any other implementation for gen_containers and decided that for this implementation it's easier to unwrap the metadata from the work unit rather than create another implementation for container creation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is fine - we should probably make gen_containers return MCPs instead of workunits

Copy link
Collaborator

@hsheth2 hsheth2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

wu.metadata
for wu in container_work_units
if isinstance(wu.metadata, MetadataChangeProposalWrapper)
]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is fine - we should probably make gen_containers return MCPs instead of workunits

@hsheth2 hsheth2 added the merge-pending-ci A PR that has passed review and should be merged once CI is green. label Nov 9, 2023
@hsheth2 hsheth2 merged commit 1077138 into datahub-project:master Nov 10, 2023
50 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hacktoberfest-accepted Acceptance for hacktoberfest https://hacktoberfest.com/participation/ ingestion PR or Issue related to the ingestion of metadata merge-pending-ci A PR that has passed review and should be merged once CI is green.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants