feat(ingest): support view lineage for all sqlalchemy sources #9039

mayurinehate
Collaborator

@mayurinehate mayurinehate commented Oct 18, 2023

Additional Changes:

  1. Support incremental lineage for all sqlalchemy sources
  2. Keep column level lineage enabled and incremental lineage disabled by default
  3. Monkey-patch the Hive dialect to fetch Hive view definitions for lineage extraction
  4. Fix incremental_lineage_helper for empty upstreams
  5. Support postgres-like partial view definitions

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

Pending Followup Changes:
1. Support postgres-like partial view definitions
@mayurinehate mayurinehate force-pushed the master+ing-200-view-lineage-for-sql-sources branch from dd85c59 to 75f63a2 October 19, 2023 13:47
@mayurinehate mayurinehate marked this pull request as ready for review October 19, 2023 13:47
@@ -283,7 +283,7 @@ class VersionedConfig(ConfigModel):

 class LineageConfig(ConfigModel):
     incremental_lineage: bool = Field(
-        default=True,
+        default=False,
Collaborator Author

Incremental lineage requires a DataHubGraph, which is available by default only when using the DataHub REST sink. We plan to keep this default enabled in managed ingestion.

graph, urn, lineage_aspect, wu.metadata.systemMetadata
)
elif lineage_aspect.upstreams:
yield _convert_upstream_lineage_to_patch(
Collaborator Author

@mayurinehate mayurinehate Oct 19, 2023

If a table-level, upstream-only lineage aspect has empty upstreams, we ignore it as part of incremental lineage.

return self._create_upstream_lineage_workunit(
dataset_identifier, upstreams, fine_upstreams
)
if upstreams:
Collaborator Author

We should not emit upstream lineage if no upstreams are found.
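
A minimal sketch of the rule discussed in the two comments above, assuming a simplified stand-in for the lineage aspect. `LineageAspect` and `drop_empty_lineage` are hypothetical names used only for illustration; they are not DataHub's classes or helpers.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class LineageAspect:
    """Hypothetical stand-in for an upstream lineage aspect."""

    dataset_urn: str
    upstreams: List[str] = field(default_factory=list)


def drop_empty_lineage(aspects: Iterable[LineageAspect]) -> Iterator[LineageAspect]:
    """Yield only the aspects that carry at least one upstream."""
    for aspect in aspects:
        if aspect.upstreams:
            yield aspect


aspects = [
    LineageAspect("urn:li:dataset:view_a", ["urn:li:dataset:table_x"]),
    LineageAspect("urn:li:dataset:view_b"),  # no upstreams were found
]
emitted = list(drop_empty_lineage(aspects))
```

Only `view_a` survives the filter, so an empty aspect never reaches the sink.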

@@ -90,19 +96,39 @@ def dbapi_get_columns_patched(self, connection, table_name, schema=None, **kw):
logger.warning(f"Failed to patch method due to {e}")


try:
from pyhive.sqlalchemy_hive import HiveDialect
Collaborator Author

Alternatively, we could move this code to the Acryl PyHive fork: https://github.com/acryldata/PyHive

Patching here seemed simpler and easier to test end to end. Open to suggestions.

Collaborator

This seems fine for now, and we can fix it up when we refactor sql common next week.

self._view_definition_cache = FileBackedDict[str]()
else:
self._view_definition_cache = {}

Collaborator Author

Moved the use_file_backed_cache-related logic here from the Teradata source.

Collaborator

Now that we've fixed the file-backed dict to support Windows, imo we can probably drop this config flag.

Collaborator Author

Can you point me to the PR that fixes this?
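
For readers unfamiliar with the flag under discussion, here is a rough sketch of choosing between a file-backed cache and a plain in-memory dict. `SqliteBackedDict` and `make_view_definition_cache` are hypothetical stand-ins built on the standard library; they are not DataHub's `FileBackedDict` and make no claim about how it works internally.

```python
import os
import sqlite3
import tempfile
from collections.abc import MutableMapping
from typing import Iterator


class SqliteBackedDict(MutableMapping):
    """Hypothetical file-backed mapping that stores str -> str in a temp SQLite file."""

    def __init__(self) -> None:
        fd, self._path = tempfile.mkstemp(suffix=".db")
        os.close(fd)
        self._conn = sqlite3.connect(self._path)
        self._conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")

    def __getitem__(self, key: str) -> str:
        row = self._conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __setitem__(self, key: str, value: str) -> None:
        self._conn.execute(
            "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value)
        )

    def __delitem__(self, key: str) -> None:
        if key not in self:
            raise KeyError(key)
        self._conn.execute("DELETE FROM kv WHERE key = ?", (key,))

    def __iter__(self) -> Iterator[str]:
        return iter([row[0] for row in self._conn.execute("SELECT key FROM kv")])

    def __len__(self) -> int:
        return self._conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0]

    def close(self) -> None:
        """Close the connection and delete the temporary file."""
        self._conn.close()
        os.remove(self._path)


def make_view_definition_cache(use_file_backed_cache: bool) -> MutableMapping:
    """Return a disk-backed mapping when the flag is set, else a plain dict."""
    return SqliteBackedDict() if use_file_backed_cache else {}


cache = make_view_definition_cache(use_file_backed_cache=True)
cache["db.schema.my_view"] = "SELECT 1"
```

Because both branches expose the same mapping interface, callers do not need to know which one they received, which is also why dropping the flag in favor of a single implementation is a plausible simplification.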

- primarily to reduce adverse effect on other sources, such as dbt
which have their own flavour of incremental lineage implementation
HiveDialect.get_view_definition = get_view_definition_patched
except ModuleNotFoundError:
pass
except Exception as e:
Collaborator

Failure to patch should cause the source to fail to load, right?

Collaborator Author

Right. Let me remove this exception handling.

Collaborator Author

done
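
The outcome of this thread can be illustrated with a small sketch: skip the patch silently only when the optional module is not installed, and let every other failure propagate so the source fails to load. `patch_method`, `FakeDialect`, and `fake_dialect_module` are hypothetical names; the example registers a fake module rather than importing pyhive, so it says nothing about the real `HiveDialect` internals.

```python
import importlib
import sys
import types
from typing import Any, Callable


def patch_method(
    module_name: str, class_name: str, method_name: str, replacement: Callable[..., Any]
) -> bool:
    """Replace class_name.method_name if module_name can be imported.

    Returns False when the module is not installed. Any other error,
    such as a missing class, is raised to the caller.
    """
    try:
        module = importlib.import_module(module_name)
    except ModuleNotFoundError:
        return False
    cls = getattr(module, class_name)  # AttributeError propagates on purpose
    setattr(cls, method_name, replacement)
    return True


class FakeDialect:
    """Hypothetical dialect whose view lookup is not implemented."""

    def get_view_definition(self, connection, view_name, schema=None, **kw):
        raise NotImplementedError


# Register a fake module so the example does not depend on any third-party package.
fake_module = types.ModuleType("fake_dialect_module")
fake_module.FakeDialect = FakeDialect
sys.modules["fake_dialect_module"] = fake_module


def get_view_definition_patched(self, connection, view_name, schema=None, **kw):
    # Placeholder body; a real patch would query the database for the view text.
    return f"-- definition of {view_name} (placeholder)"


patched = patch_method(
    "fake_dialect_module", "FakeDialect", "get_view_definition", get_view_definition_patched
)
missing = patch_method(
    "module_that_is_not_installed_xyz", "Anything", "method", get_view_definition_patched
)
```

Here `patched` is `True` and `missing` is `False`, while a wrong class name would raise `AttributeError` instead of being swallowed.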

@@ -71,6 +71,10 @@ def __init__(self, config, ctx, platform):
super().__init__(config, ctx, platform)
self.config: TwoTierSQLAlchemyConfig = config

def get_db_schema(self, dataset_identifier: str) -> Tuple[Optional[str], str]:
Collaborator

When we do the refactoring, let's put identifiers in a real data class instead of joining into a string in one place and then splitting in other places
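
One possible shape for that refactoring, offered as a sketch only: `DatasetIdentifier` and its field names are assumptions made for illustration and do not come from this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetIdentifier:
    """Hypothetical identifier type that keeps its parts separate."""

    table: str
    schema: str
    database: Optional[str] = None  # None for two-tier sources

    def qualified_name(self) -> str:
        """Join the parts only at the point where a string is required."""
        parts = [self.database, self.schema, self.table]
        return ".".join(p for p in parts if p is not None)


two_tier = DatasetIdentifier(table="orders", schema="sales")
three_tier = DatasetIdentifier(table="orders", schema="public", database="shop")
```

With the parts held as fields, a method like `get_db_schema` could read `database` and `schema` directly instead of splitting a dotted string, which also avoids ambiguity when a name itself contains a dot.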

@hsheth2 hsheth2 added merge-pending-ci A PR that has passed review and should be merged once CI is green. and removed merge-pending-ci A PR that has passed review and should be merged once CI is green. labels Oct 23, 2023
@hsheth2
Collaborator

hsheth2 commented Oct 23, 2023

There's one remaining thing around handling partial view definitions

@hsheth2 hsheth2 added the merge-pending-ci A PR that has passed review and should be merged once CI is green. label Oct 25, 2023
@hsheth2
Collaborator

hsheth2 commented Oct 25, 2023

@mayurinehate Seeing a real issue in the smoke test - looks like sql_common needs a dependency on sqlparse

tests/test_stateful_ingestion.py:5: in <module>
    from datahub.ingestion.source.sql.mysql import MySQLConfig, MySQLSource
../metadata-ingestion/src/datahub/ingestion/source/sql/mysql.py:18: in <module>
    from datahub.ingestion.source.sql.sql_common import (
../metadata-ingestion/src/datahub/ingestion/source/sql/sql_common.py:34: in <module>
    from datahub.emitter.sql_parsing_builder import SqlParsingBuilder
../metadata-ingestion/src/datahub/emitter/sql_parsing_builder.py:11: in <module>
    from datahub.ingestion.source.usage.usage_common import BaseUsageConfig, UsageAggregator
../metadata-ingestion/src/datahub/ingestion/source/usage/usage_common.py:41: in <module>
    from datahub.utilities.sql_formatter import format_sql_query, trim_query
../metadata-ingestion/src/datahub/utilities/sql_formatter.py:4: in <module>
    import sqlparse
E   ModuleNotFoundError: No module named 'sqlparse'

@maggiehays maggiehays added the hacktoberfest-accepted Acceptance for hacktoberfest https://hacktoberfest.com/participation/ label Oct 26, 2023
fallback to native postgres view lineage extraction for failed views
@mayurinehate
Collaborator Author

There's one remaining thing around handling partial view definitions

This is now added in this PR.

@mayurinehate Seeing a real issue in the smoke test - looks like sql_common needs a dependency on sqlparse

Fixed it.

@mayurinehate mayurinehate requested a review from hsheth2 October 26, 2023 04:18
@hsheth2
Collaborator

hsheth2 commented Oct 26, 2023

The cypress test failure appears unrelated (glossary/glossary_navigation.js), so merging through

@hsheth2 hsheth2 merged commit f402090 into datahub-project:master Oct 26, 2023
52 of 53 checks passed
Labels
  • hacktoberfest-accepted: Acceptance for hacktoberfest https://hacktoberfest.com/participation/
  • ingestion: PR or Issue related to the ingestion of metadata
  • merge-pending-ci: A PR that has passed review and should be merged once CI is green.
3 participants