
fix(ingest/databricks): Fix profiling #12060

Merged: 34 commits merged into datahub-project:master from fix_databricks_profiling on Dec 20, 2024

Conversation

Collaborator

@skrydal commented Dec 9, 2024

This PR introduces a WorkUnit processor that attempts to "trim" the datasetProfile and schemaMetadata aspects when they risk exceeding the 16 MB payload size limit - a problem we have seen several times already. (A rough sketch of the idea follows the notes below.)

  1. The processor is quite intrusive and verbose - we can discuss whether the verbosity or the algorithm should be adjusted.
  2. For now the processor simply logs a warning and trims the aspect, but we may want to consider more severe actions, up to and including failing the entire ingestion with a clear message.
  3. There is also an open question of whether this processor should be enabled for all sources - I have added it only for the Databricks source, since it was the one causing problems due to large field samples. Such cases could also be avoided if we decided not to profile complex fields at all, though that requires more changes to the code.
  4. I have also added a sanity check to the REST emitter - at the moment of sending the aspect we know the exact size of the payload, so I think it is reasonable to log it and print a warning when it exceeds known limits.
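
To make the idea concrete, here is a minimal, hypothetical sketch of the trimming logic; the FieldProfile type and the constant below are illustrative stand-ins, not the PR's actual implementation:

from dataclasses import dataclass, field
from typing import List

INGEST_MAX_PAYLOAD_BYTES = 16 * 1024 * 1024  # assumed 16 MB backend payload limit

@dataclass
class FieldProfile:  # stand-in for the profiled-field entries of datasetProfile
    fieldPath: str
    sampleValues: List[str] = field(default_factory=list)

def trim_sample_values(fields: List[FieldProfile], budget: int = INGEST_MAX_PAYLOAD_BYTES) -> None:
    """Drop sample values once the running total would push the aspect over the budget."""
    total = 0
    for f in fields:
        values_len = sum(len(v) for v in f.sampleValues)
        if total + values_len > budget:
            # Over budget: drop this field's samples but keep the field itself.
            f.sampleValues = []
        else:
            total += values_len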

@github-actions bot added the ingestion label Dec 9, 2024
@skrydal marked this pull request as ready for review December 17, 2024 13:19
@datahub-cyborg bot added the needs-review label Dec 17, 2024
Contributor

@sgomezvillamor left a comment


LGTM! 👍 IMO we can move forward with it as is. I left some comments for future improvements, if needed. Of course, you might want to wait for other approvals too! 😅

@@ -338,8 +346,15 @@ def emit_usage(self, usageStats: UsageAggregation) -> None:

    def _emit_generic(self, url: str, payload: str) -> None:
        curl_command = make_curl_command(self._session, "POST", url, payload)
        payload_size = len(payload)
        if payload_size > INGEST_MAX_PAYLOAD_BYTES:
            # since we know the total payload size here, we could simply avoid
            # sending such a payload at all and report a warning; with the
            # current approach we are going to cause the whole ingestion to fail
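
The hunk above is cut off right after the new comment; a minimal sketch of what the warning branch could look like (the exact message and structure in the PR may differ):

import logging

logger = logging.getLogger(__name__)
INGEST_MAX_PAYLOAD_BYTES = 16 * 1024 * 1024  # assumed backend limit

def warn_if_oversized(payload: str) -> None:
    # We know the full payload size client-side, so we can at least warn
    # before the server rejects the request.
    payload_size = len(payload)
    if payload_size > INGEST_MAX_PAYLOAD_BYTES:
        logger.warning(
            f"Payload size {payload_size} exceeds the known limit of "
            f"{INGEST_MAX_PAYLOAD_BYTES} bytes; the request will likely fail"
        )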
Contributor


Hey, I'm a bit confused about this comment.
If we send a payload that's too big, it'll only fail on the backend, with no impact on the ingestion pipeline, right?
That's how it works both before and after the changes in this PR, correct?

Collaborator Author


The ingestion will be marked as failed if the payload size limit is exceeded (GMS will return 400).

Contributor


Got it!

The backend would ideally respond with 413 Content Too Large, and then the ingestor could easily skip such payloads: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/413
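
For illustration, a hypothetical sketch of how the client could skip oversized payloads if the backend returned 413 (not current behavior; emit_or_skip is an invented helper):

import requests

def emit_or_skip(session: requests.Session, url: str, payload: str) -> bool:
    """Return True if accepted, False if skipped due to 413 Content Too Large."""
    response = session.post(url, data=payload)
    if response.status_code == 413:
        # Payload too large: skip this record instead of failing the whole run.
        return False
    response.raise_for_status()
    return True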

Collaborator Author


Agreed. Whether to skip such an error, though - that's another question.

Contributor


When it comes to truncating the profile, the warnings in the ingestion reports might be enough. But with the schema it could be a bigger deal, so it might be good to show in the UI when the schema has been truncated. We could either extend the SchemaMetadata model or just add a fixed field to flag that some fields are missing.

This is just my opinion, and it might need input from the product/UX folks to decide.

Of course, we could tackle this in later PRs to keep things moving for users affected by the issue.

Collaborator Author


I agree it would be good to introduce a flag in the aspect itself to indicate that truncation happened. We already actively truncate columns in the BigQuery source when there are more than 300 (the threshold is configurable, though). Users were confused by this, since no information about the trimming appeared in the UI.
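
As a hypothetical illustration of such a flag (this field does not exist in the current SchemaMetadata model):

from dataclasses import dataclass, field
from typing import List

@dataclass
class SchemaMetadataSketch:  # illustrative stand-in, not the real aspect class
    fields: List[str] = field(default_factory=list)
    fieldsTruncated: bool = False  # set when fields were dropped to fit size limits

def truncate_schema(schema: SchemaMetadataSketch, max_fields: int) -> None:
    if len(schema.fields) > max_fields:
        schema.fields = schema.fields[:max_fields]
        schema.fieldsTruncated = True  # makes the trimming visible downstream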

logger.debug(
    f"Field {field.fieldPath} has {len(field.sampleValues)} sample values, taking total bytes {values_len}"
)
if sample_fields_size + values_len > INGEST_MAX_PAYLOAD_BYTES:
Contributor


Different sink types might have their own limits; the limit could even be a configuration parameter for the sink rather than sticking with INGEST_MAX_PAYLOAD_BYTES.

Again, just a possible future improvement!
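
A sketch of what a per-sink limit could look like as a configuration parameter; the class and field names here are hypothetical:

from pydantic import BaseModel

class RestSinkConfigSketch(BaseModel):
    server: str
    # Per-sink override instead of the hard-coded INGEST_MAX_PAYLOAD_BYTES.
    max_payload_bytes: int = 16 * 1024 * 1024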

@datahub-cyborg bot added the pending-submitter-merge label and removed the needs-review label Dec 17, 2024
f"Field {field.fieldPath} has {len(field.sampleValues)} sample values, taking total bytes {values_len}"
)
if sample_fields_size + values_len > INGEST_MAX_PAYLOAD_BYTES:
logger.warning(
Collaborator


Ideally we'd pass in the source reporter and then call report.warning(...); this would show up as a warning in the final ingestion report in the UI.
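
A rough sketch of what that could look like; the report.warning signature shown here is illustrative:

INGEST_MAX_PAYLOAD_BYTES = 16 * 1024 * 1024  # assumed backend limit

def check_and_report(field, values_len: int, sample_fields_size: int, report) -> None:
    if sample_fields_size + values_len > INGEST_MAX_PAYLOAD_BYTES:
        # Routed through the source report so it surfaces as a warning
        # in the final ingestion report in the UI.
        report.warning(
            message="Dropping field sample values to keep the profile under the payload size limit",
            context=field.fieldPath,
        )
        field.sampleValues = []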

Collaborator Author


@hsheth2 I adjusted the code to use the source report.

@anshbansal merged commit e52a4de into datahub-project:master Dec 20, 2024
73 checks passed
@skrydal deleted the fix_databricks_profiling branch December 20, 2024 16:44