
feat(ingestion): Adding config option to auto lowercase dataset urns #8928

Merged: 15 commits, Oct 12, 2023

Conversation

treff7es
Contributor

@treff7es treff7es commented Oct 2, 2023

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Oct 2, 2023
@@ -57,6 +57,11 @@ class FlagsConfig(ConfigModel):
         ),
     )
 
+    auto_lowercase_urns: bool = Field(
+        default=False,
+        description="Wether to lowercase dataset entity urns.",
Collaborator

Typo: Wether -> whether

@pedro93 pedro93 requested a review from hsheth2 October 2, 2023 15:49
urn = Urn.create_from_string(wu.get_urn())
if urn.get_type() == DatasetUrn.ENTITY_TYPE:
    dataset_urn = DatasetUrn.create_from_string(str(urn))
    lowercased_urn = DatasetUrn.create_from_ids(
Collaborator

This only fixes the entity's own urn; it won't fix the dataset urns referenced in lineage edges or other aspects.

Let's use the lowercase_dataset_urns helper method instead

def lowercase_dataset_urns(model: DictWrapper) -> None:
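
The point of the comment above can be illustrated with a small standalone sketch. It is not DataHub's `lowercase_dataset_urns` implementation: the names `lowercase_dataset_name` and `lowercase_everywhere` are hypothetical, the record is a plain dict rather than a `DictWrapper`, and it assumes that "lowercasing a dataset urn" means lowercasing only the dataset name while leaving the platform and environment parts alone.

```python
from typing import Any

DATASET_PREFIX = "urn:li:dataset:("


def lowercase_dataset_name(urn: str) -> str:
    """Lowercase only the name part of a dataset urn; leave other strings as-is."""
    if not (urn.startswith(DATASET_PREFIX) and urn.endswith(")")):
        return urn
    body = urn[len(DATASET_PREFIX):-1]
    if body.count(",") < 2:
        return urn  # not in the platform,name,env shape this sketch expects
    platform, rest = body.split(",", 1)
    name, env = rest.rsplit(",", 1)
    return f"{DATASET_PREFIX}{platform},{name.lower()},{env})"


def lowercase_everywhere(obj: Any) -> Any:
    """Return a copy of a nested dict/list with every dataset urn lowercased."""
    if isinstance(obj, str):
        return lowercase_dataset_name(obj)
    if isinstance(obj, dict):
        return {key: lowercase_everywhere(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [lowercase_everywhere(value) for value in obj]
    return obj


# Hypothetical record: the entity urn plus an upstream reference and an owner.
record = {
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,DB.Schema.Orders,PROD)",
    "upstreams": [
        "urn:li:dataset:(urn:li:dataPlatform:snowflake,DB.Schema.Raw_Orders,PROD)"
    ],
    "owner": "urn:li:corpuser:JDoe",
}
fixed = lowercase_everywhere(record)
```

Rewriting only `entityUrn` would leave the upstream reference in mixed case, so the lineage edge would point at a urn that no longer matches the lowercased entity. Walking the whole structure keeps both sides consistent, and non-dataset urns such as the owner are left untouched.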


Collaborator

@hsheth2 hsheth2 left a comment

The usage of lowercase_dataset_urns looks good, but we do want to move the config to be per-source

@treff7es treff7es changed the title from "feat(ingestion): Adding flag to auto lowercase dataset urns" to "feat(ingestion): Adding config option to auto lowercase dataset urns" Oct 3, 2023
metadata-ingestion/src/datahub/ingestion/api/source.py (outdated)
        yield wu
    except Exception:
        logger.warning(
            f"Failed to lowercase urns for {wu} the exception was: {traceback.format_exc()}"
Collaborator

Suggested change
- f"Failed to lowercase urns for {wu} the exception was: {traceback.format_exc()}"
+ f"Failed to lowercase urns for {wu}: {e}", exc_info=True
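
A minimal sketch of the logging pattern in the suggested change above, using only the standard library. The function name `lowercase_or_passthrough` and the logger name are hypothetical, and yielding the original item after a failure is an assumption of this sketch, not something the PR snippet confirms. Note that the suggested message references `e`, so the handler has to read `except Exception as e`.

```python
import logging
from typing import Any, Callable, Iterable, Iterator

logger = logging.getLogger("lowercase_sketch")


def lowercase_or_passthrough(
    items: Iterable[Any], transform: Callable[[Any], Any]
) -> Iterator[Any]:
    """Apply transform to each item; on failure, log a warning and keep the item."""
    for item in items:
        try:
            yield transform(item)
        except Exception as e:
            # exc_info=True makes the logging module attach the traceback,
            # so there is no need to call traceback.format_exc() by hand.
            logger.warning(f"Failed to lowercase urns for {item}: {e}", exc_info=True)
            yield item  # assumption: pass the unchanged item through
```

Passing `exc_info=True` keeps the message short and leaves the traceback formatting to whichever handler is configured.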

@@ -57,6 +57,11 @@ class FlagsConfig(ConfigModel):
         ),
     )
 
+    auto_lowercase_urns: bool = Field(
+        default=False,
+        description="Whether to lowercase dataset entity urns.",
Collaborator

remove this

@@ -32,7 +36,7 @@ def list_urns_with_path(

     if isinstance(model, MetadataChangeProposalWrapper):
         if model.entityUrn:
-            urns.append((model.entityUrn, ["urn"]))
+            urns.append((model.entityUrn, ["entityUrn"]))
Collaborator

nice catch
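
Why the path string matters can be shown with a small sketch. It assumes that each `(urn, path)` pair is later used to write the transformed urn back by attribute name; the classes and functions below (`ProposalSketch`, `list_urns_with_path_sketch`, `transform_urns_sketch`) are hypothetical stand-ins, not DataHub's code.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ProposalSketch:
    """Hypothetical stand-in for a proposal object with an entityUrn field."""

    entityUrn: Optional[str] = None


def list_urns_with_path_sketch(model: ProposalSketch) -> List[Tuple[str, List[str]]]:
    """Return each urn together with the attribute path where it lives."""
    urns: List[Tuple[str, List[str]]] = []
    if model.entityUrn:
        urns.append((model.entityUrn, ["entityUrn"]))
    return urns


def transform_urns_sketch(model: ProposalSketch, func: Callable[[str], str]) -> None:
    """Write func(urn) back to the location named by each path."""
    for old_urn, path in list_urns_with_path_sketch(model):
        target = model
        for key in path[:-1]:
            target = getattr(target, key)
        if not hasattr(target, path[-1]):
            raise AttributeError(f"No attribute {path[-1]!r} to write back to")
        setattr(target, path[-1], func(old_urn))


proposal = ProposalSketch(entityUrn="urn:li:dataset:(X,Name,PROD)")
transform_urns_sketch(proposal, str.lower)
```

In this sketch a path of `["urn"]` would name an attribute that does not exist on the object, so the write-back could not reach `entityUrn`.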

class LowerCaseDatasetUrnConfigMixin(ConfigModel):
    convert_urns_to_lowercase: bool = Field(
        default=False,
        description="Whether to convert dataset urns to lowercase.",
Collaborator

@pedro93 pedro93 Oct 4, 2023

Question: This property isn't specific to just dataset urns, is it?
We should ideally have a consistent property across all sources that lowercases any urns it generates or references, to ensure we have clean lineage across all sources.

Contributor Author

Currently, it only lowercases dataset urns.

Collaborator

Is that a design decision? What is the reasoning behind it?

@hsheth2
Collaborator

hsheth2 commented Oct 4, 2023

@treff7es looks like CI is failing here

@treff7es
Contributor Author

treff7es commented Oct 5, 2023

> @treff7es looks like CI is failing here

shoot, sorry, I fixed it and now it is green

Collaborator

@hsheth2 hsheth2 left a comment

CI is green, but there's a merge conflict now :(

and self.ctx.pipeline_config.source.config
and hasattr(
Collaborator

Now that I'm looking at this again, it's actually kind of tricky: pipeline_config.source.config could be either the pydantic config object or the raw config dict. We need to handle both cases here, so we might need to restore the .get(...) logic you had earlier.

Contributor Author

Now I handle both cases.
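
One way to handle both shapes discussed in this thread is sketched below. This is not the merged code: the helper name `should_lowercase_urns` is hypothetical, and it only assumes that the option is called `convert_urns_to_lowercase`, as in the mixin shown earlier, and that the config is either a dict or an object exposing that attribute.

```python
from types import SimpleNamespace
from typing import Any


def should_lowercase_urns(source_config: Any) -> bool:
    """Read convert_urns_to_lowercase from a raw dict or a parsed config object."""
    if isinstance(source_config, dict):
        # Raw config dict: the key may be absent.
        return bool(source_config.get("convert_urns_to_lowercase", False))
    # Parsed config object (or None): the attribute may be absent.
    return bool(getattr(source_config, "convert_urns_to_lowercase", False))


# Stand-ins for the two shapes: a raw dict and an attribute-style object.
raw_config = {"convert_urns_to_lowercase": True}
parsed_config = SimpleNamespace(convert_urns_to_lowercase=True)
```

Defaulting to `False` in both branches means sources that do not define the option keep their existing behavior.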

@treff7es treff7es merged commit c381806 into datahub-project:master Oct 12, 2023
@maggiehays maggiehays added the hacktoberfest-accepted Acceptance for hacktoberfest https://hacktoberfest.com/participation/ label Oct 31, 2023