feat(ingestion/hive): Add lineage functionality for hive tables from/to file storage #11841

Merged
merged 39 commits into from
Dec 21, 2024


acrylJonny
Contributor

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata community-contribution PR or Issue raised by member(s) of DataHub Community labels Nov 12, 2024
@acrylJonny acrylJonny marked this pull request as draft November 12, 2024 21:39
@acrylJonny
Contributor Author

@acrylJonny, could you please elaborate on how your PR works, with examples if possible? We are also interested in Hive lineage. Thanks.

@deepgarg-visa this adds the underlying storage locations (e.g. S3, ABS, HDFS) as lineage for the Hive table, with the option of emitting them as either upstream or downstream lineage. This mirrors the glue_s3_lineage_direction and emit_s3_lineage options in the AWS Glue connector. It is disabled by default, as with the Glue connector.
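
To make the direction option concrete, here is a minimal sketch of how the two settings could decide the orientation of a lineage edge. It is not the connector's actual code: the class, the `emit_lineage` flag name, the function, and the default direction are assumptions made for illustration. Only `hive_storage_lineage_direction`, the upstream/downstream choice, and the disabled-by-default behaviour come from this PR.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class StorageLineageSketchConfig:
    """Hypothetical stand-in for the connector config (not the real class)."""

    # Assumed flag name; the PR only shows that the flag defaults to False.
    emit_lineage: bool = False
    # Field name taken from the PR diff; the default value here is an assumption.
    hive_storage_lineage_direction: str = "upstream"


def lineage_edge(
    config: StorageLineageSketchConfig,
    hive_table: str,
    storage_location: str,
) -> Optional[Tuple[str, str]]:
    """Return an (upstream, downstream) pair, or None when emission is disabled."""
    if not config.emit_lineage:
        return None
    direction = config.hive_storage_lineage_direction.lower()
    if direction == "upstream":
        # Storage feeds the Hive table.
        return (storage_location, hive_table)
    if direction == "downstream":
        # The Hive table writes out to storage.
        return (hive_table, storage_location)
    raise ValueError(f"Unsupported lineage direction: {direction!r}")
```

With the defaults, `lineage_edge` returns `None`, matching the disabled-by-default behaviour described above. Setting the flag and choosing "downstream" swaps the pair so that the Hive table becomes the upstream entity.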

@acrylJonny acrylJonny closed this Dec 6, 2024
@acrylJonny acrylJonny reopened this Dec 6, 2024
@acrylJonny
Contributor Author

Reopening this PR, as it was closed in error.

@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Dec 6, 2024
default=False,
description="Whether to emit storage-to-Hive lineage",
)
hive_storage_lineage_direction: str = Field(
Contributor

Isn't storage always upstream of the hive dataset?

Contributor

or a sibling?

Contributor Author

@acrylJonny acrylJonny Dec 17, 2024

This is for parity with the glue_s3_lineage_direction parameter in the Glue config; it was added to keep the two connectors consistent.
If it should be removed, we might want to revisit the Glue connector as well.

@treff7es treff7es self-assigned this Dec 17, 2024
@datahub-cyborg datahub-cyborg bot added merge-pending-ci A PR that has passed review and should be merged once CI is green. and removed needs-review Label for PRs that need review from a maintainer. labels Dec 17, 2024
@anshbansal anshbansal merged commit 8350a4e into datahub-project:master Dec 21, 2024
74 checks passed
Labels
ingestion PR or Issue related to the ingestion of metadata merge-pending-ci A PR that has passed review and should be merged once CI is green.

4 participants