fix(ingest/databricks): Fix profiling #12060
Conversation
LGTM! 👍 IMO we can move forward with it as is. I left some comments for future improvements, if needed. Of course, you might want to wait for other approvals too! 😅
@@ -338,8 +346,15 @@ def emit_usage(self, usageStats: UsageAggregation) -> None:

def _emit_generic(self, url: str, payload: str) -> None:
    curl_command = make_curl_command(self._session, "POST", url, payload)
    payload_size = len(payload)
    if payload_size > INGEST_MAX_PAYLOAD_BYTES:
        # since we know total payload size here, we could simply avoid sending such payload at all
        # and report a warning, with current approach we are going to cause whole ingestion to fail
Hey, I’m a bit confused about this comment.
If we send a payload that's too big, it’ll fail on the backend only, with no impact on the ingestion pipeline, right?
This is how it works both before and after the changes in this PR, correct?
The ingestion will be marked as failed if the payload size is exceeded (GMS will return 400).
Got it!
The backend would be better off responding with 413 Content Too Large, and then the ingestor could easily skip such payloads: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/413
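For illustration, the emitter could then tolerate that status instead of failing the run. This is a minimal sketch, not DataHub's actual emitter code; the function name is made up for the example.

import logging

import requests

logger = logging.getLogger(__name__)

def post_and_tolerate_413(session: requests.Session, url: str, payload: str) -> None:
    response = session.post(url, data=payload)
    if response.status_code == 413:
        # Server signalled "Content Too Large": record a warning and move on
        # instead of failing the whole ingestion run.
        logger.warning("Server rejected %d-byte payload with 413; skipping it", len(payload))
        return
    response.raise_for_status()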
Agreed. Whether to skip such an error, though, is another question.
When it comes to truncating the profile, the warnings in the ingestion reports might be enough. But with schema, it could be a bigger deal, so it might be good to show in the UI if the schema’s been cut. We could either extend the SchemaMetadata model or add a dedicated field to flag that some fields are missing.
This is just my opinion, and it might need input from the product/UX folks to decide.
Of course, we could tackle this in later PRs to keep things moving for users affected by the issue.
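Purely as an illustration of that idea, a boolean flag on a schema-like model could carry the signal to the UI. The names below are hypothetical and not part of this PR or the DataHub model.

from dataclasses import dataclass, field
from typing import List

@dataclass
class SchemaSnapshot:
    field_paths: List[str] = field(default_factory=list)
    fields_truncated: bool = False  # hypothetical flag surfaced to the UI

def truncate_schema(snapshot: SchemaSnapshot, max_fields: int) -> None:
    # Drop fields beyond the cap and record that the schema shown is partial.
    if len(snapshot.field_paths) > max_fields:
        snapshot.field_paths = snapshot.field_paths[:max_fields]
        snapshot.fields_truncated = True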
I agree it would be good to introduce a flag in the aspect itself to indicate that truncation happened. We are actively truncating columns in the BigQuery source if there are more than 300 (it is configurable, though). Users were confused by this since no information about this trimming appeared in the UI.
metadata-ingestion/tests/unit/api/source_helpers/test_ensure_aspect_size.py
logger.debug(
    f"Field {field.fieldPath} has {len(field.sampleValues)} sample values, taking total bytes {values_len}"
)
if sample_fields_size + values_len > INGEST_MAX_PAYLOAD_BYTES:
Different sink types might have their own limits. The limit could even be a configuration parameter for the sink rather than sticking with INGEST_MAX_PAYLOAD_BYTES.
Again, just a possible future improvement!
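As a sketch of that suggestion, the cap could be exposed as a sink config option. The class and option names below are hypothetical, not existing DataHub settings.

from pydantic import BaseModel

class SinkPayloadConfig(BaseModel):
    # Default roughly in line with the current hard-coded constant;
    # individual sinks could override it.
    max_payload_bytes: int = 15 * 1024 * 1024

# Example: a deployment behind a stricter proxy could lower the cap.
config = SinkPayloadConfig(max_payload_bytes=8 * 1024 * 1024)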
f"Field {field.fieldPath} has {len(field.sampleValues)} sample values, taking total bytes {values_len}" | ||
) | ||
if sample_fields_size + values_len > INGEST_MAX_PAYLOAD_BYTES: | ||
logger.warning( |
Ideally we'd pass the source reporter, and then we'd do report.warning(...); this would show up as a warning in the final ingestion report in the UI.
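A rough illustration of that suggestion follows; the report type and its warning() signature are assumed for the sketch, not taken from the PR.

from typing import List, Protocol

class ReportLike(Protocol):
    # Stand-in for the source report; the real interface may differ.
    def warning(self, message: str, context: str) -> None: ...

def trim_field_samples(
    field_path: str, sample_values: List[str], budget: int, report: ReportLike
) -> List[str]:
    size = sum(len(v) for v in sample_values)
    if size > budget:
        # Recorded on the report so it shows up in the final ingestion
        # summary, not only in the logs.
        report.warning("Dropping sample values to stay under the payload limit", context=field_path)
        return []
    return sample_values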
@hsheth2 adjusted the code to use the source report
This PR introduces a WorkUnit processor which attempts to "trim" datasetProfile and schemaMetadata aspects when there is a risk of them exceeding the 16MB size limit; this is a problem we have seen several times already. For now it is applied to the databricks source, as it was the one causing problems due to big field samples. In general, such cases could also be avoided if we decided not to profile complex fields at all, but that requires more changes in the code.
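As a rough sketch of the trimming idea (simplified to work on plain dicts rather than the real aspect classes, with an assumed value for the cap; this is not the PR's actual implementation):

import logging
from typing import Any, Dict, List

logger = logging.getLogger(__name__)

INGEST_MAX_PAYLOAD_BYTES = 15 * 1024 * 1024  # assumed cap, kept under the 16MB limit

def ensure_profile_size(field_profiles: List[Dict[str, Any]]) -> None:
    """Clear sampleValues on fields once their accumulated size would exceed the cap."""
    total_size = 0
    for field in field_profiles:
        values_len = sum(len(str(v)) for v in field.get("sampleValues", []))
        if total_size + values_len > INGEST_MAX_PAYLOAD_BYTES:
            logger.warning(
                "Dropping sample values for %s to keep the profile under the size limit",
                field.get("fieldPath"),
            )
            field["sampleValues"] = []
        else:
            total_size += values_len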