
fix(ingest/databricks): Fix profiling #12060

Merged: 34 commits merged into datahub-project:master from fix_databricks_profiling on Dec 20, 2024

Conversation

Collaborator

@skrydal commented Dec 9, 2024

This PR introduces a WorkUnit processor that attempts to "trim" the datasetProfile and schemaMetadata aspects when they risk exceeding the 16 MB payload size limit - a problem we have seen several times already. (A rough sketch of the idea follows the notes below.)

  1. The processor is quite intrusive and verbose - we can discuss whether the verbosity or the algorithm should be adjusted.
  2. For now the processor simply logs a warning and trims the aspect, but we may want to consider more severe actions, up to and including failing the entire ingestion with a clear message.
  3. There is also an open question of whether this processor should be enabled for all sources - I have added it only for the Databricks source, since it was the one causing problems due to large field samples. Such cases could also be avoided if we decided not to profile complex fields at all, though that requires more changes to the code.
  4. I have also added a sanity check to the REST emitter - at the moment of sending the aspect we know the exact size of the payload, so I think it is reasonable to log it and print a warning when it exceeds known limits.
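
To make the idea concrete, here is a minimal, hypothetical sketch of the trimming logic; the FieldProfile type and the constant below are illustrative stand-ins, not the PR's actual implementation:

from dataclasses import dataclass, field
from typing import List

INGEST_MAX_PAYLOAD_BYTES = 16 * 1024 * 1024  # assumed 16 MB backend payload limit

@dataclass
class FieldProfile:  # stand-in for the profiled-field entries of datasetProfile
    fieldPath: str
    sampleValues: List[str] = field(default_factory=list)

def trim_sample_values(fields: List[FieldProfile], budget: int = INGEST_MAX_PAYLOAD_BYTES) -> None:
    """Drop sample values once the running total would push the aspect over the budget."""
    total = 0
    for f in fields:
        values_len = sum(len(v) for v in f.sampleValues)
        if total + values_len > budget:
            # Over budget: drop this field's samples but keep the field itself.
            f.sampleValues = []
        else:
            total += values_len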

@github-actions bot added the ingestion label Dec 9, 2024
@skrydal marked this pull request as ready for review December 17, 2024 13:19
@datahub-cyborg bot added the needs-review label Dec 17, 2024
Contributor

@sgomezvillamor left a comment


LGTM! 👍 IMO we can move forward with it as is. I left some comments for future improvements, if needed. Of course, you might want to wait for other approvals too! 😅

@@ -338,8 +346,15 @@ def emit_usage(self, usageStats: UsageAggregation) -> None:

    def _emit_generic(self, url: str, payload: str) -> None:
        curl_command = make_curl_command(self._session, "POST", url, payload)
        payload_size = len(payload)
        if payload_size > INGEST_MAX_PAYLOAD_BYTES:
            # since we know the total payload size here, we could simply avoid
            # sending such a payload at all and report a warning; with the
            # current approach we are going to cause the whole ingestion to fail
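
The hunk above is cut off right after the new comment; a minimal sketch of what the warning branch could look like (the exact message and structure in the PR may differ):

import logging

logger = logging.getLogger(__name__)
INGEST_MAX_PAYLOAD_BYTES = 16 * 1024 * 1024  # assumed backend limit

def warn_if_oversized(payload: str) -> None:
    # We know the full payload size client-side, so we can at least warn
    # before the server rejects the request.
    payload_size = len(payload)
    if payload_size > INGEST_MAX_PAYLOAD_BYTES:
        logger.warning(
            f"Payload size {payload_size} exceeds the known limit of "
            f"{INGEST_MAX_PAYLOAD_BYTES} bytes; the request will likely fail"
        )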
Contributor


Hey, I'm a bit confused about this comment.
If we send a payload that's too big, it'll only fail on the backend, with no impact on the ingestion pipeline, right?
That's how it works both before and after the changes in this PR, correct?

Collaborator Author


The ingestion will be marked as failed if the payload size limit is exceeded (GMS will return 400).

Contributor


Got it!

The backend would ideally respond with 413 Content Too Large, and then the ingestor could easily skip such payloads: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/413
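
For illustration, a hypothetical sketch of how the client could skip oversized payloads if the backend returned 413 (not current behavior; emit_or_skip is an invented helper):

import requests

def emit_or_skip(session: requests.Session, url: str, payload: str) -> bool:
    """Return True if accepted, False if skipped due to 413 Content Too Large."""
    response = session.post(url, data=payload)
    if response.status_code == 413:
        # Payload too large: skip this record instead of failing the whole run.
        return False
    response.raise_for_status()
    return True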

Collaborator Author


Agreed. Whether to skip such an error, though - that's another question.

Contributor


When it comes to truncating the profile, the warnings in the ingestion reports might be enough. But with the schema it could be a bigger deal, so it might be good to show in the UI when the schema has been truncated. We could either extend the SchemaMetadata model or just add a fixed field to flag that some fields are missing.

This is just my opinion, and it might need input from the product/UX folks to decide.

Of course, we could tackle this in later PRs to keep things moving for users affected by the issue.

Collaborator Author


I agree it would be good to introduce a flag in the aspect itself to indicate that truncation happened. We already actively truncate columns in the BigQuery source when there are more than 300 (the threshold is configurable, though). Users were confused by this, since no information about the trimming appeared in the UI.
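
As a hypothetical illustration of such a flag (this field does not exist in the current SchemaMetadata model):

from dataclasses import dataclass, field
from typing import List

@dataclass
class SchemaMetadataSketch:  # illustrative stand-in, not the real aspect class
    fields: List[str] = field(default_factory=list)
    fieldsTruncated: bool = False  # set when fields were dropped to fit size limits

def truncate_schema(schema: SchemaMetadataSketch, max_fields: int) -> None:
    if len(schema.fields) > max_fields:
        schema.fields = schema.fields[:max_fields]
        schema.fieldsTruncated = True  # makes the trimming visible downstream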

logger.debug(
    f"Field {field.fieldPath} has {len(field.sampleValues)} sample values, taking total bytes {values_len}"
)
if sample_fields_size + values_len > INGEST_MAX_PAYLOAD_BYTES:
Contributor


Different sink types might have their own limits; the limit could even be a configuration parameter for the sink rather than sticking with INGEST_MAX_PAYLOAD_BYTES.

Again, just a possible future improvement!
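
A sketch of what a per-sink limit could look like as a configuration parameter; the class and field names here are hypothetical:

from pydantic import BaseModel

class RestSinkConfigSketch(BaseModel):
    server: str
    # Per-sink override instead of the hard-coded INGEST_MAX_PAYLOAD_BYTES.
    max_payload_bytes: int = 16 * 1024 * 1024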

@datahub-cyborg bot added the pending-submitter-merge label and removed the needs-review label Dec 17, 2024
f"Field {field.fieldPath} has {len(field.sampleValues)} sample values, taking total bytes {values_len}"
)
if sample_fields_size + values_len > INGEST_MAX_PAYLOAD_BYTES:
logger.warning(
Collaborator


Ideally we'd pass in the source reporter and then call report.warning(...); this would show up as a warning in the final ingestion report in the UI.
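
A rough sketch of what that could look like; the report.warning signature shown here is illustrative:

INGEST_MAX_PAYLOAD_BYTES = 16 * 1024 * 1024  # assumed backend limit

def check_and_report(field, values_len: int, sample_fields_size: int, report) -> None:
    if sample_fields_size + values_len > INGEST_MAX_PAYLOAD_BYTES:
        # Routed through the source report so it surfaces as a warning
        # in the final ingestion report in the UI.
        report.warning(
            message="Dropping field sample values to keep the profile under the payload size limit",
            context=field.fieldPath,
        )
        field.sampleValues = []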

Collaborator Author


@hsheth2 I adjusted the code to use the source report.

@anshbansal merged commit e52a4de into datahub-project:master Dec 20, 2024
73 checks passed
@skrydal deleted the fix_databricks_profiling branch December 20, 2024 16:44