fix(ingest/powerbi): use dataset workspace id as key for parent container #8994
Conversation
I managed to reproduce the issue in the test case by allowing multiple workspaces to be scanned.
I just noticed that this "fix" is going to introduce another bug. I think we need to discuss further how to proceed. The problem is that workspace container generation doesn't use the same key generation. I'm going to try to figure out a way where there would be only one place for workspace container key generation, which would also fix the issue of the workspace keys getting mixed. In any case, changing the workspace key generation would be a major change, as it creates new resources for every user.
@looppi honestly having a bit of trouble following the issue explanation. It kinda sounds like we previously weren't respecting the platform_instance config? Also, it feels like the powerbi code is more complex than it needs to be, but that's a separate thing.
I found it quite difficult to describe the issue. You guessed correctly, the previous implementation wasn't respecting platform_instance. I made the change in this PR to respect the platform_instance config. I actually found a more elegant way to get rid of the "workspace reference leakage" issue, for which I originally created this PR: just reset the processed_datasets set for every workspace.
```diff
@@ -743,6 +735,7 @@ def generate_container_for_workspace(
 ) -> Iterable[MetadataWorkUnit]:
     self.workspace_key = workspace.get_workspace_key(
         platform_name=self.__config.platform_name,
+        platform_instance=self.__config.platform_instance,
```
This was missing, which meant that if platform_instance was set, the actual container metadata had a different urn than the references that were supposed to point to the same container.
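To make the urn mismatch concrete, here is a minimal sketch (not the actual DataHub internals; field names are illustrative) of why both the container emitter and the reference sites must build the key from the same fields: the urn is derived from a guid over the key's fields, so omitting platform_instance on one side yields a different urn.

```python
import hashlib
import json


def container_urn(key_fields: dict) -> str:
    # Illustrative only: derive a guid from the key's fields, so any
    # difference in the fields produces a different container urn.
    guid = hashlib.md5(
        json.dumps(key_fields, sort_keys=True).encode()
    ).hexdigest()
    return f"urn:li:container:{guid}"


# Hypothetical workspace key fields (names chosen for illustration).
with_instance = container_urn(
    {"platform": "powerbi", "platform_instance": "prod", "workspace": "ws-1"}
)
without_instance = container_urn(
    {"platform": "powerbi", "workspace": "ws-1"}
)

# The container aspect and the references to it no longer line up:
assert with_instance != without_instance
```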
```diff
@@ -1238,6 +1229,8 @@ def extract_independent_datasets(
 def get_workspace_workunit(
     self, workspace: powerbi_data_classes.Workspace
 ) -> Iterable[MetadataWorkUnit]:
+    self.mapper.processed_datasets = set()
```
Work around the leakage of processed datasets across workspaces by resetting processed_datasets to an empty set for every workspace instance.
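A toy sketch of how that leakage can manifest (the class and method here are stand-ins, not the real mapper): when the set survives across workspaces, a dataset in a later workspace that collides with one already seen gets silently skipped, so the per-workspace reset restores correct behavior.

```python
class Mapper:
    # Stand-in for the real mapper: processed_datasets persists for the
    # lifetime of the object unless explicitly reset per workspace.
    def __init__(self):
        self.processed_datasets = set()

    def extract(self, datasets):
        # Only emit datasets not already marked as processed.
        new = [d for d in datasets if d not in self.processed_datasets]
        self.processed_datasets.update(new)
        return new


mapper = Mapper()
assert mapper.extract(["sales", "hr"]) == ["sales", "hr"]

# Second workspace happens to contain a dataset with the same name;
# without a reset it leaks state from the first workspace and is skipped:
assert mapper.extract(["sales", "finance"]) == ["finance"]

# The workaround: reset the set at the start of each workspace.
mapper.processed_datasets = set()
assert mapper.extract(["sales", "finance"]) == ["sales", "finance"]
```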
This code mostly looks good, and is a lot cleaner
One thing that I just noticed - we make the assumption that parent containers are emitted before their children in a few places downstream (particularly when generating "browse paths v2" for the containers). In effect, that makes us generate browsePathV2 aspects incorrectly if you're using both extract_workspaces_to_containers and extract_datasets_to_containers (both the workspace and the dataset should show up in the browse path v2, but in the golden file here only the dataset shows up)
I'd still be happy to accept this PR as-is because it's still a net improvement, but I was wondering if this is something you'd like to take a stab at
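A small sketch of why emission order matters here (this is a simplified model, not the actual browse-path code): if the path builder can only walk parent links for containers it has already seen, a parent emitted after its child is absent from the child's computed path.

```python
def browse_path_v2(container, known_parents):
    # Toy model: walk parent links upward, but only through containers
    # already emitted (known_parents maps child -> parent).
    path = []
    node = container
    while node is not None:
        path.append(node)
        node = known_parents.get(node)
    return list(reversed(path))


# Workspace container emitted before the dataset container: full path.
emitted = {"dataset-container": "workspace"}
assert browse_path_v2("dataset-container", emitted) == [
    "workspace",
    "dataset-container",
]

# If the workspace container had not been emitted yet, the parent link is
# missing at path-computation time and only the dataset shows up:
assert browse_path_v2("dataset-container", {}) == ["dataset-container"]
```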
I'll try to do this, I'll get back to you when I have something to show!
```python
[
    wu.metadata
    for wu in container_work_units
    if isinstance(wu.metadata, MetadataChangeProposalWrapper)
]
```
This is kind of ugly, but I couldn't find any other implementation for gen_containers
and decided that for this implementation it's easier to unwrap the metadata from the work unit rather than create another implementation for container creation.
this is fine - we should probably make gen_containers return MCPs instead of workunits
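For context, a self-contained sketch of the unwrapping pattern being discussed, using stand-in classes rather than the real DataHub types: the isinstance filter keeps only MCP payloads and drops any legacy metadata wrapped in a work unit.

```python
from dataclasses import dataclass


@dataclass
class MetadataChangeProposalWrapper:
    # Stand-in for the real MCP class.
    aspect: str


@dataclass
class MetadataWorkUnit:
    # Stand-in: a work unit wraps either an MCP or a legacy payload.
    metadata: object


container_work_units = [
    MetadataWorkUnit(MetadataChangeProposalWrapper("containerProperties")),
    MetadataWorkUnit("legacy-payload"),  # non-MCP, filtered out below
]

# Unwrap only the MCPs from the work units returned by gen_containers.
mcps = [
    wu.metadata
    for wu in container_work_units
    if isinstance(wu.metadata, MetadataChangeProposalWrapper)
]
assert len(mcps) == 1
assert mcps[0].aspect == "containerProperties"
```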
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
In our production environment, I noticed that all the datasets were under a single workspace container, which was not actually the case. On closer inspection, I found out that if there are entities from multiple workspaces in a single "batch", then all the datasets would get the workspace key reference of the last workspace processed.
Here, I created a very basic workaround for this issue, piggybacking the workspace reference from the moment the dataset was added to the processed_datasets set, and using that reference when making the connection between the workspace container and the dataset.

The problem happens only when both extract_datasets_to_containers and extract_workspaces_to_containers are enabled, since if one of the features is disabled, there's no relation between the two containers. The iteration here emitted wrong workspace references to the dataset containers, as it used self.workspace_key instead of a key which matched the dataset's origin workspace.
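The bug described above can be reproduced with a toy model (class and method names are hypothetical, not the actual powerbi source code): a single mutable self.workspace_key is overwritten per workspace, so when the batch is flushed every dataset references whichever workspace was processed last.

```python
class Source:
    # Toy reproduction of the bug: one shared workspace_key is read only
    # at flush time, after it has been overwritten by later workspaces.
    def __init__(self):
        self.workspace_key = None
        self.batch = []  # datasets collected across workspaces

    def process_workspace(self, workspace_key, datasets):
        self.workspace_key = workspace_key
        self.batch.extend(datasets)

    def flush_buggy(self):
        # Bug: every dataset gets the *current* workspace key.
        return {d: self.workspace_key for d in self.batch}

    def process_workspace_fixed(self, workspace_key, datasets):
        # The fix's idea: record the originating workspace key alongside
        # each dataset at the moment it is processed.
        self.batch.extend((d, workspace_key) for d in datasets)


src = Source()
src.process_workspace("ws-A", ["ds1"])
src.process_workspace("ws-B", ["ds2"])
assert src.flush_buggy() == {"ds1": "ws-B", "ds2": "ws-B"}  # ds1 is wrong

fixed = Source()
fixed.process_workspace_fixed("ws-A", ["ds1"])
fixed.process_workspace_fixed("ws-B", ["ds2"])
assert dict(fixed.batch) == {"ds1": "ws-A", "ds2": "ws-B"}
```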