fix(docs): Add improvements in examples for PATCH documentation (datahub-project#12165)

Co-authored-by: John Joyce <[email protected]>
3 people authored and yoonhyejin committed Dec 23, 2024
1 parent d95fb66 commit ed33292
Showing 14 changed files with 321 additions and 148 deletions.
110 changes: 78 additions & 32 deletions docs/advanced/patch.md
@@ -1,69 +1,120 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# But First, Semantics: Upsert versus Patch
# Emitting Patch Updates to DataHub

## Why Would You Use Patch

By default, most of the SDK tutorials and API-s involve applying full upserts at the aspect level. This means that typically, when you want to change one field within an aspect without modifying others, you need to do a read-modify-write to not overwrite existing fields.
To support these scenarios, DataHub supports PATCH based operations so that targeted changes to single fields or values within arrays of fields are possible without impacting other existing metadata.
By default, most of the SDK tutorials and APIs involve applying full upserts at the aspect level, i.e. replacing the aspect entirely.
This means that when you want to change even a single field within an aspect without modifying others, you need to do a read-modify-write to avoid overwriting existing fields.
To support these scenarios, DataHub supports `PATCH` operations, which apply targeted changes to individual fields, or to individual values within array fields, without impacting other existing metadata.
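
The difference between the two semantics can be illustrated with a minimal sketch that uses plain Python dictionaries rather than the DataHub SDK. The `stored_aspect` dictionary and both helper functions are hypothetical stand-ins for an aspect held by the server; they are not DataHub APIs.

```python
import copy

# Hypothetical stand-in for an aspect stored on the server (not a real DataHub model).
stored_aspect = {
    "description": "Users created per day",
    "customProperties": {"cluster_name": "primary", "retention_time": "2 years"},
}


def upsert(stored: dict, new_aspect: dict) -> dict:
    """Full upsert: the new aspect replaces the stored one entirely."""
    return copy.deepcopy(new_aspect)


def patch_add_custom_property(stored: dict, key: str, value: str) -> dict:
    """Targeted patch: only the named custom property changes."""
    result = copy.deepcopy(stored)
    result.setdefault("customProperties", {})[key] = value
    return result


# An upsert that only carries the changed field drops everything else.
after_upsert = upsert(stored_aspect, {"customProperties": {"owner_team": "growth"}})

# A patch leaves the untouched fields in place.
after_patch = patch_add_custom_property(stored_aspect, "owner_team", "growth")

print(after_upsert)
print(after_patch)
```

With upsert semantics, the caller would first have to read the stored aspect and merge the change into it to avoid losing `description` and the existing properties. With patch semantics, that merge happens on the server side.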

:::note

Currently, PATCH support is only available for a selected set of aspects, so before pinning your hopes on using PATCH as a way to make modifications to aspect values, confirm whether your aspect supports PATCH semantics. The complete list of Aspects that are supported are maintained [here](https://github.com/datahub-project/datahub/blob/9588440549f3d99965085e97b214a7dabc181ed2/entity-registry/src/main/java/com/linkedin/metadata/models/registry/template/AspectTemplateEngine.java#L24). In the near future, we do have plans to automatically support PATCH semantics for aspects by default.
Currently, PATCH support is only available for a selected set of aspects, so before pinning your hopes on using PATCH as a way to make modifications to aspect values, confirm whether your aspect supports PATCH semantics. The complete list of supported aspects is maintained [here](https://github.com/datahub-project/datahub/blob/9588440549f3d99965085e97b214a7dabc181ed2/entity-registry/src/main/java/com/linkedin/metadata/models/registry/template/AspectTemplateEngine.java#L24).

:::

## How To Use Patch
## How To Use Patches

Examples for using Patch are sprinkled throughout the API guides.
Here's how to find the appropriate classes for the language of your choice.


<Tabs>
<TabItem value="Java" label="Java SDK">
<TabItem value="Python" label="Python SDK" default>

The Java Patch builders are aspect-oriented and located in the [datahub-client](https://github.com/datahub-project/datahub/tree/master/metadata-integration/java/datahub-client/src/main/java/datahub/client/patch) module under the `datahub.client.patch` namespace.
The Python Patch builders are entity-oriented and located in the [metadata-ingestion](https://github.com/datahub-project/datahub/tree/9588440549f3d99965085e97b214a7dabc181ed2/metadata-ingestion/src/datahub/specific) module under the `datahub.specific` namespace.
Patch builder helper classes exist for:

Here are a few illustrative examples using the Java Patch builders:
- [Datasets](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/specific/dataset.py)
- [Charts](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/specific/chart.py)
- [Dashboards](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/specific/dashboard.py)
- [Data Jobs (Tasks)](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/specific/datajob.py)
- [Data Products](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/specific/dataproduct.py)

And we are gladly accepting contributions for Containers, Data Flows (Pipelines), Tags, Glossary Terms, Domains, and ML Models.

### Add Custom Properties
### Add & Remove Owners for Dataset

```java
{{ inline /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/DatasetCustomPropertiesAdd.java show_path_as_comment }}
To add & remove specific owners for a dataset:

```python
{{ inline /metadata-ingestion/examples/library/dataset_add_owner_patch.py show_path_as_comment }}
```

### Add and Remove Custom Properties
### Add & Remove Tags for Dataset

```java
{{ inline /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/DatasetCustomPropertiesAddRemove.java show_path_as_comment }}
To add & remove specific tags for a dataset:

```python
{{ inline /metadata-ingestion/examples/library/dataset_add_tag_patch.py show_path_as_comment }}
```

### Add Data Job Lineage
And for a specific schema field within the Dataset:

```java
{{ inline /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/DataJobLineageAdd.java show_path_as_comment }}
```python
{{ inline /metadata-ingestion/examples/library/dataset_field_add_tag_patch.py show_path_as_comment }}
```

</TabItem>
<TabItem value="Python" label="Python SDK" default>
### Add & Remove Glossary Terms for Dataset

To add & remove specific glossary terms for a dataset:

```python
{{ inline /metadata-ingestion/examples/library/dataset_add_glossary_term_patch.py show_path_as_comment }}
```

And for a specific schema field within the Dataset:

```python
{{ inline /metadata-ingestion/examples/library/dataset_field_add_glossary_term_patch.py show_path_as_comment }}
```

### Add & Remove Structured Properties for Dataset

The Python Patch builders are entity-oriented and located in the [metadata-ingestion](https://github.com/datahub-project/datahub/tree/9588440549f3d99965085e97b214a7dabc181ed2/metadata-ingestion/src/datahub/specific) module and located in the `datahub.specific` module.
To add & remove structured properties for a dataset:

Here are a few illustrative examples using the Python Patch builders:
```python
{{ inline /metadata-ingestion/examples/library/dataset_add_structured_properties_patch.py show_path_as_comment }}
```

### Add Properties to Dataset
### Add & Remove Upstream Lineage for Dataset

To add & remove a lineage edge connecting a dataset to its upstream or input at both the dataset and schema field level:

```python
{{ inline /metadata-ingestion/examples/library/dataset_add_properties.py show_path_as_comment }}
{{ inline /metadata-ingestion/examples/library/dataset_add_upstream_lineage_patch.py show_path_as_comment }}
```

### Add & Remove Read-Only Custom Properties for Dataset

To add & remove specific custom properties for a dataset:

```python
{{ inline /metadata-ingestion/examples/library/dataset_add_remove_custom_properties_patch.py show_path_as_comment }}
```

</TabItem>
<TabItem value="Java" label="Java SDK">

The Java Patch builders are aspect-oriented and located in the [datahub-client](https://github.com/datahub-project/datahub/tree/master/metadata-integration/java/datahub-client/src/main/java/datahub/client/patch) module under the `datahub.client.patch` namespace.

### Add & Remove Read-Only Custom Properties

```java
{{ inline /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/DatasetCustomPropertiesAddRemove.java show_path_as_comment }}
```

### Add Data Job Lineage

```java
{{ inline /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/DataJobLineageAdd.java show_path_as_comment }}
```

</TabItem>
</Tabs>


## How Patch works
## Advanced: How Patch works

To understand how patching works, it's important to understand a bit about our [models](../what/aspect.md). Entities are comprised of Aspects
which can be reasoned about as JSON representations of the object models. To be able to patch these we utilize [JsonPatch](https://jsonpatch.com/). The components of a JSON Patch are the path, operation, and value.
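
As a rough illustration of those three components, the sketch below builds a one-operation JSON Patch document in the generic RFC 6902 shape using only the standard library. The tag URN, the path, and the value are made-up examples and are not guaranteed to match the exact schema of any DataHub aspect.

```python
import json

# One JSON Patch operation:
#   "op"    is the operation to perform,
#   "path"  locates the target inside the JSON document,
#   "value" is what gets stored at that path.
# The URN, path and value below are illustrative assumptions only.
add_tag_op = {
    "op": "add",
    "path": "/tags/urn:li:tag:example",
    "value": {"tag": "urn:li:tag:example"},
}

# A JSON Patch document is an ordered list of such operations.
patch_document = [add_tag_op]

print(json.dumps(patch_document, indent=2))
```
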
@@ -73,9 +124,6 @@ which can be reasoned about as JSON representations of the object models. To be
The JSON path refers to a value within the schema. This can be a single field or can be an entire object reference depending on what the path is.
For our patches we are primarily targeting single fields or even single array elements within a field. To be able to target array elements by id, we go through a translation process
of the schema to transform arrays into maps. This allows a path to reference a particular array element by key rather than by index, for example a specific tag urn being added to a dataset.
This is important to note that for some fields in our schema that are arrays which do not necessarily restrict uniqueness, this puts a uniqueness constraint on the key.
The key for objects stored in arrays is determined manually by examining the schema and a long term goal is to make these keys annotation driven to reduce the amount of code needed to support
additional aspects to be patched. There is a generic patch endpoint, but it requires any array field keys to be specified at request time, putting a lot of burden on the API user.
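
The array-to-map translation described above can be sketched in plain Python. This is a simplified illustration, not DataHub's implementation: the `tags_array` data, the choice of `tag` as the key field, and both helper functions are assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical array-valued field: a list of tag associations inside an aspect.
tags_array: List[dict] = [
    {"tag": "urn:li:tag:pii"},
    {"tag": "urn:li:tag:deprecated"},
]


def array_to_map(items: List[dict], key_field: str) -> Dict[str, dict]:
    """Re-key an array by one of its fields so elements can be addressed by key."""
    return {item[key_field]: item for item in items}


def map_to_array(keyed: Dict[str, dict]) -> List[dict]:
    """Convert the keyed form back into an array, preserving insertion order."""
    return list(keyed.values())


keyed_tags = array_to_map(tags_array, key_field="tag")

# Writing by key cannot create duplicates: re-adding an existing tag overwrites it.
keyed_tags["urn:li:tag:pii"] = {"tag": "urn:li:tag:pii"}
keyed_tags["urn:li:tag:finance"] = {"tag": "urn:li:tag:finance"}

print(map_to_array(keyed_tags))
```

Because the map is keyed by URN, a path can point at one element directly, and the same key can never appear twice. That is the uniqueness constraint mentioned above.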

#### Examples

@@ -87,8 +135,7 @@ Breakdown:
* `/upstreams` -> References the upstreams field of the UpstreamLineage aspect, this is an array of Upstream objects where the key is the Urn
* `/urn:...` -> The dataset to be targeted by the operation


A patch path for targeting a fine grained lineage upstream:
A patch path for targeting a fine-grained lineage upstream:

`/fineGrainedLineages/TRANSFORM/urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD),foo)/urn:li:query:queryId/urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created_upstream,PROD),bar)`
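
To see how such a path decomposes, the sketch below splits it into segments with a small hypothetical helper, `split_patch_path`. The unescaping follows the generic JSON Pointer rule (RFC 6901); whether DataHub's builders escape URNs this way is not confirmed here.

```python
def split_patch_path(path: str) -> list:
    """Split a JSON Pointer style path into segments.

    Per RFC 6901, '~1' stands for '/' and '~0' stands for '~' inside a segment.
    """
    if not path.startswith("/"):
        raise ValueError("expected a path starting with '/'")
    return [seg.replace("~1", "/").replace("~0", "~") for seg in path[1:].split("/")]


# The fine-grained lineage path from the text, split over several lines for readability.
path = (
    "/fineGrainedLineages/TRANSFORM/"
    "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD),foo)/"
    "urn:li:query:queryId/"
    "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created_upstream,PROD),bar)"
)

segments = split_patch_path(path)
for segment in segments:
    print(segment)
```

The URNs contain colons, commas and parentheses but no `/`, so each one survives as a single segment and acts as a key.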

@@ -118,7 +165,6 @@ using adds, but generally the most useful use case for patch is to add elements

Remove operations require the specified path to be present; if it is not, an error is thrown. Otherwise they operate as one would expect: the specified path is removed from the aspect.
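
The remove behaviour can be sketched with a small helper over nested dictionaries. This is a simplified model rather than DataHub's code: `remove_at_path` and the `aspect` dictionary are hypothetical, and a `KeyError` stands in for whatever error the server actually reports.

```python
def remove_at_path(document: dict, path: str) -> None:
    """Remove the value at a '/'-separated path, raising if the path does not exist."""
    *parents, leaf = path.lstrip("/").split("/")
    node = document
    for key in parents:
        if not isinstance(node, dict) or key not in node:
            raise KeyError(f"path not found: {path}")
        node = node[key]
    if not isinstance(node, dict) or leaf not in node:
        raise KeyError(f"path not found: {path}")
    del node[leaf]


# Hypothetical aspect content used only for this illustration.
aspect = {"customProperties": {"cluster_name": "primary", "retention_time": "2 years"}}

# The path exists, so the entry is removed.
remove_at_path(aspect, "/customProperties/retention_time")
print(aspect)

# The path no longer exists, so a second remove fails.
try:
    remove_at_path(aspect, "/customProperties/retention_time")
except KeyError as err:
    print("remove failed:", err)
```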


### Value

Value is the actual information that will be stored at a path. If the path references an object then this will include the JSON key value pairs for that object.
4 changes: 2 additions & 2 deletions docs/api/tutorials/custom-properties.md
@@ -74,7 +74,7 @@ The following code adds custom properties `cluster_name` and `retention_time` to
<TabItem value="python" label="Python" default>

```python
{{ inline /metadata-ingestion/examples/library/dataset_add_properties.py show_path_as_comment }}
{{ inline /metadata-ingestion/examples/library/dataset_add_custom_properties_patch.py show_path_as_comment }}
```

</TabItem>
@@ -128,7 +128,7 @@ The following code shows you how can add and remove custom properties in the sam
<TabItem value="python" label="Python" default>

```python
{{ inline /metadata-ingestion/examples/library/dataset_add_remove_properties.py show_path_as_comment }}
{{ inline /metadata-ingestion/examples/library/dataset_add_remove_custom_properties_patch.py show_path_as_comment }}
```

</TabItem>
@@ -0,0 +1,19 @@
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.ingestion.graph.client import DataHubGraph, DataHubGraphConfig
from datahub.specific.dataset import DatasetPatchBuilder

# Create DataHub Client
datahub_client = DataHubGraph(DataHubGraphConfig(server="http://localhost:8080"))

# Create Dataset URN
dataset_urn = make_dataset_urn(platform="hive", name="fct_users_created", env="PROD")

# Create Dataset Patch to Add Custom Properties
patch_builder = DatasetPatchBuilder(dataset_urn)
patch_builder.add_custom_property("cluster_name", "datahubproject.acryl.io")
patch_builder.add_custom_property("retention_time", "2 years")
patch_mcps = patch_builder.build()

# Emit Dataset Patch
for patch_mcp in patch_mcps:
    datahub_client.emit(patch_mcp)
@@ -0,0 +1,22 @@
from datahub.emitter.mce_builder import make_dataset_urn, make_term_urn
from datahub.ingestion.graph.client import DataHubGraph, DataHubGraphConfig
from datahub.metadata.schema_classes import GlossaryTermAssociationClass
from datahub.specific.dataset import DatasetPatchBuilder

# Create DataHub Client
datahub_client = DataHubGraph(DataHubGraphConfig(server="http://localhost:8080"))

# Create Dataset URN
dataset_urn = make_dataset_urn(
    platform="snowflake", name="fct_users_created", env="PROD"
)

# Create Dataset Patch to Add + Remove Glossary Terms on the Dataset
patch_builder = DatasetPatchBuilder(dataset_urn)
patch_builder.add_term(GlossaryTermAssociationClass(make_term_urn("term-to-add-id")))
patch_builder.remove_term(make_term_urn("term-to-remove-id"))
patch_mcps = patch_builder.build()

# Emit Dataset Patch
for patch_mcp in patch_mcps:
    datahub_client.emit(patch_mcp)
24 changes: 24 additions & 0 deletions metadata-ingestion/examples/library/dataset_add_owner_patch.py
@@ -0,0 +1,24 @@
from datahub.emitter.mce_builder import make_dataset_urn, make_group_urn, make_user_urn
from datahub.ingestion.graph.client import DataHubGraph, DataHubGraphConfig
from datahub.metadata.schema_classes import OwnerClass, OwnershipTypeClass
from datahub.specific.dataset import DatasetPatchBuilder

# Create DataHub Client
datahub_client = DataHubGraph(DataHubGraphConfig(server="http://localhost:8080"))

# Create Dataset URN
dataset_urn = make_dataset_urn(
    platform="snowflake", name="fct_users_created", env="PROD"
)

# Create Dataset Patch to Add + Remove Owners
patch_builder = DatasetPatchBuilder(dataset_urn)
patch_builder.add_owner(
    OwnerClass(make_user_urn("user-to-add-id"), OwnershipTypeClass.TECHNICAL_OWNER)
)
patch_builder.remove_owner(make_group_urn("group-to-remove-id"))
patch_mcps = patch_builder.build()

# Emit Dataset Patch
for patch_mcp in patch_mcps:
    datahub_client.emit(patch_mcp)
44 changes: 0 additions & 44 deletions metadata-ingestion/examples/library/dataset_add_properties.py

This file was deleted.

@@ -0,0 +1,19 @@
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.ingestion.graph.client import DataHubGraph, DataHubGraphConfig
from datahub.specific.dataset import DatasetPatchBuilder

# Create DataHub Client
datahub_client = DataHubGraph(DataHubGraphConfig(server="http://localhost:8080"))

# Create Dataset URN
dataset_urn = make_dataset_urn(platform="hive", name="fct_users_created", env="PROD")

# Create Dataset Patch to Add + Remove Custom Properties
patch_builder = DatasetPatchBuilder(dataset_urn)
patch_builder.add_custom_property("cluster_name", "datahubproject.acryl.io")
patch_builder.remove_custom_property("retention_time")
patch_mcps = patch_builder.build()

# Emit Dataset Patch
for patch_mcp in patch_mcps:
    datahub_client.emit(patch_mcp)

This file was deleted.

This file was deleted.
