+COPY /truststore-directory /certificates
+```
+
+Building this Dockerfile will result in your own custom docker image on your local machine.
+You will then be able to tag it, publish it to your own registry, etc.
+
+#### Option b) Mount truststore from your host machine using a docker volume
+
Adapt your docker-compose.yml to include a new volume mount in the `datahub-frontend-react` container:
+
```yaml
+ datahub-frontend-react:
+ # ...
+ volumes:
+ # ...
+ - /truststore-directory:/certificates
+```
+
+### Reference new truststore
+
+Add the following environment values to the `datahub-frontend-react` container:
+
+```
SSL_TRUSTSTORE_FILE=path/to/truststore.jks (e.g. /certificates/truststore.jks)
+SSL_TRUSTSTORE_TYPE=jks
+SSL_TRUSTSTORE_PASSWORD=MyTruststorePassword
+```
+
+Once these steps are done, your frontend container will use the new truststore when validating SSL/HTTPS connections.
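
For reference, here is a minimal `docker-compose.yml` sketch that combines the volume mount and environment values described above. The service is assumed to otherwise match your existing `datahub-frontend-react` definition, and the truststore filename is an example:

```yaml
  datahub-frontend-react:
    # ... existing image/port configuration ...
    volumes:
      # Mount the host directory containing your truststore
      - /truststore-directory:/certificates
    environment:
      - SSL_TRUSTSTORE_FILE=/certificates/truststore.jks
      - SSL_TRUSTSTORE_TYPE=jks
      - SSL_TRUSTSTORE_PASSWORD=MyTruststorePassword
```
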
diff --git a/docs/authentication/guides/sso/configure-oidc-react-azure.md b/docs/authentication/guides/sso/configure-oidc-react-azure.md
index d1859579678821..177387327c0e8e 100644
--- a/docs/authentication/guides/sso/configure-oidc-react-azure.md
+++ b/docs/authentication/guides/sso/configure-oidc-react-azure.md
@@ -32,7 +32,11 @@ Azure supports more than one redirect URI, so both can be configured at the same
At this point, your app registration should look like the following:
-![azure-setup-app-registration](img/azure-setup-app-registration.png)
+
+
+
+
+
e. Click **Register**.
@@ -40,7 +44,11 @@ e. Click **Register**.
Once registration is done, you will land on the app registration **Overview** tab. On the left-side navigation bar, click on **Authentication** under **Manage** and add extra redirect URIs if need be (if you want to support both local testing and Azure deployments).
-![azure-setup-authentication](img/azure-setup-authentication.png)
+
+
+
+
+
Click **Save**.
@@ -51,7 +59,11 @@ Select **Client secrets**, then **New client secret**. Type in a meaningful des
**IMPORTANT:** Copy the `value` of your newly created secret since Azure will never display its value afterwards.
-![azure-setup-certificates-secrets](img/azure-setup-certificates-secrets.png)
+
+
+
+
+
### 4. Configure API permissions
@@ -66,7 +78,11 @@ Click on **Add a permission**, then from the **Microsoft APIs** tab select **Mic
At this point, you should be looking at a screen like the following:
-![azure-setup-api-permissions](img/azure-setup-api-permissions.png)
+
+
+
+
+
### 5. Obtain Application (Client) ID
diff --git a/docs/authentication/guides/sso/configure-oidc-react-google.md b/docs/authentication/guides/sso/configure-oidc-react-google.md
index 474538097aae20..af62185e6e7872 100644
--- a/docs/authentication/guides/sso/configure-oidc-react-google.md
+++ b/docs/authentication/guides/sso/configure-oidc-react-google.md
@@ -31,7 +31,11 @@ Note that in order to complete this step you should be logged into a Google acco
c. Fill out the details in the App Information & Domain sections. Make sure the 'Application Home Page' provided matches where DataHub is deployed
at your organization.
-![google-setup-1](img/google-setup-1.png)
+
+
+
+
+
Once you've completed this, **Save & Continue**.
@@ -70,7 +74,11 @@ f. You will now receive a pair of values, a client id and a client secret. Bookm
At this point, you should be looking at a screen like the following:
-![google-setup-2](img/google-setup-2.png)
+
+
+
+
+
Success!
diff --git a/docs/authentication/guides/sso/configure-oidc-react-okta.md b/docs/authentication/guides/sso/configure-oidc-react-okta.md
index cfede999f1e700..320b887a28f163 100644
--- a/docs/authentication/guides/sso/configure-oidc-react-okta.md
+++ b/docs/authentication/guides/sso/configure-oidc-react-okta.md
@@ -69,8 +69,16 @@ for example, `https://dev-33231928.okta.com/.well-known/openid-configuration`.
At this point, you should be looking at a screen like the following:
-![okta-setup-1](img/okta-setup-1.png)
-![okta-setup-2](img/okta-setup-2.png)
+
+
+
+
+
+
+
+
+
+
Success!
@@ -96,7 +104,11 @@ Replacing the placeholders above with the client id & client secret received fro
>
> By default, we assume that the groups will appear in a claim named "groups". This can be customized using the `AUTH_OIDC_GROUPS_CLAIM` container configuration.
>
-> ![okta-setup-2](img/okta-setup-groups-claim.png)
+>
+
+
+
+
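
As a sketch, enabling group extraction and overriding the claim name could look like the following in a docker-compose file. The service name and the use of compose here are assumptions; the variable names come from the frontend environment-variable reference:

```yaml
  datahub-frontend-react:
    environment:
      # Provision groups from the group claim in the ID token
      - AUTH_OIDC_EXTRACT_GROUPS_ENABLED=true
      # Only needed if your IdP emits groups under a different claim name
      - AUTH_OIDC_GROUPS_CLAIM=groups
```
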
### 5. Restart `datahub-frontend-react` docker container
diff --git a/docs/authentication/guides/sso/configure-oidc-react.md b/docs/authentication/guides/sso/configure-oidc-react.md
index b7efb94f842d62..512d6adbf916fc 100644
--- a/docs/authentication/guides/sso/configure-oidc-react.md
+++ b/docs/authentication/guides/sso/configure-oidc-react.md
@@ -26,7 +26,7 @@ please see [this guide](../jaas.md) to mount a custom user.props file for a JAAS
To configure OIDC in React, you will most often need to register yourself as a client with your identity provider (Google, Okta, etc). Each provider may
have their own instructions. Provided below are links to examples for Okta, Google, Azure AD, & Keycloak.
-- [Registering an App in Okta](https://developer.okta.com/docs/guides/add-an-external-idp/apple/register-app-in-okta/)
+- [Registering an App in Okta](https://developer.okta.com/docs/guides/add-an-external-idp/openidconnect/main/)
- [OpenID Connect in Google Identity](https://developers.google.com/identity/protocols/oauth2/openid-connect)
- [OpenID Connect authentication with Azure Active Directory](https://docs.microsoft.com/en-us/azure/active-directory/fundamentals/auth-oidc)
- [Keycloak - Securing Applications and Services Guide](https://www.keycloak.org/docs/latest/securing_apps/)
@@ -72,6 +72,7 @@ AUTH_OIDC_BASE_URL=your-datahub-url
- `AUTH_OIDC_CLIENT_SECRET`: Unique client secret received from identity provider
- `AUTH_OIDC_DISCOVERY_URI`: Location of the identity provider OIDC discovery API. Suffixed with `.well-known/openid-configuration`
- `AUTH_OIDC_BASE_URL`: The base URL of your DataHub deployment, e.g. https://yourorgdatahub.com (prod) or http://localhost:9002 (testing)
+- `AUTH_SESSION_TTL_HOURS`: The number of hours before a user will be prompted to log in again. Session tokens are stateless, so this setting determines how long a session token remains valid; once that time has passed, the token can no longer be used. A configuration sketch including this value is shown below.
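
As a sketch, these values are typically passed to the `datahub-frontend-react` container as environment variables; in a docker-compose file that might look like the following (all values are placeholders for your own deployment):

```yaml
  datahub-frontend-react:
    environment:
      - AUTH_OIDC_CLIENT_ID=your-client-id
      - AUTH_OIDC_CLIENT_SECRET=your-client-secret
      - AUTH_OIDC_DISCOVERY_URI=https://your-idp.com/.well-known/openid-configuration
      - AUTH_OIDC_BASE_URL=https://yourorgdatahub.com
      - AUTH_SESSION_TTL_HOURS=24
```
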
Providing these configs will cause DataHub to delegate authentication to your identity
provider, requesting the "oidc email profile" scopes and parsing the "preferred_username" claim from
diff --git a/docs/authentication/guides/sso/img/azure-setup-api-permissions.png b/docs/authentication/guides/sso/img/azure-setup-api-permissions.png
deleted file mode 100755
index 4964b7d48ffec2..00000000000000
Binary files a/docs/authentication/guides/sso/img/azure-setup-api-permissions.png and /dev/null differ
diff --git a/docs/authentication/guides/sso/img/azure-setup-app-registration.png b/docs/authentication/guides/sso/img/azure-setup-app-registration.png
deleted file mode 100755
index ffb23a7e3ddec5..00000000000000
Binary files a/docs/authentication/guides/sso/img/azure-setup-app-registration.png and /dev/null differ
diff --git a/docs/authentication/guides/sso/img/azure-setup-authentication.png b/docs/authentication/guides/sso/img/azure-setup-authentication.png
deleted file mode 100755
index 2d27ec88fb40b9..00000000000000
Binary files a/docs/authentication/guides/sso/img/azure-setup-authentication.png and /dev/null differ
diff --git a/docs/authentication/guides/sso/img/azure-setup-certificates-secrets.png b/docs/authentication/guides/sso/img/azure-setup-certificates-secrets.png
deleted file mode 100755
index db6585d84d8eeb..00000000000000
Binary files a/docs/authentication/guides/sso/img/azure-setup-certificates-secrets.png and /dev/null differ
diff --git a/docs/authentication/guides/sso/img/google-setup-1.png b/docs/authentication/guides/sso/img/google-setup-1.png
deleted file mode 100644
index 88c674146f1e44..00000000000000
Binary files a/docs/authentication/guides/sso/img/google-setup-1.png and /dev/null differ
diff --git a/docs/authentication/guides/sso/img/google-setup-2.png b/docs/authentication/guides/sso/img/google-setup-2.png
deleted file mode 100644
index 850512b891d5f3..00000000000000
Binary files a/docs/authentication/guides/sso/img/google-setup-2.png and /dev/null differ
diff --git a/docs/authentication/guides/sso/img/okta-setup-1.png b/docs/authentication/guides/sso/img/okta-setup-1.png
deleted file mode 100644
index 3949f18657c5ec..00000000000000
Binary files a/docs/authentication/guides/sso/img/okta-setup-1.png and /dev/null differ
diff --git a/docs/authentication/guides/sso/img/okta-setup-2.png b/docs/authentication/guides/sso/img/okta-setup-2.png
deleted file mode 100644
index fa6ea4d9918948..00000000000000
Binary files a/docs/authentication/guides/sso/img/okta-setup-2.png and /dev/null differ
diff --git a/docs/authentication/guides/sso/img/okta-setup-groups-claim.png b/docs/authentication/guides/sso/img/okta-setup-groups-claim.png
deleted file mode 100644
index ed35426685e467..00000000000000
Binary files a/docs/authentication/guides/sso/img/okta-setup-groups-claim.png and /dev/null differ
diff --git a/docs/authentication/personal-access-tokens.md b/docs/authentication/personal-access-tokens.md
index 0188aab49444ea..dc57a989a4e0c8 100644
--- a/docs/authentication/personal-access-tokens.md
+++ b/docs/authentication/personal-access-tokens.md
@@ -71,7 +71,11 @@ curl 'http://localhost:8080/entities/urn:li:corpuser:datahub' -H 'Authorization:
Since authorization happens at the GMS level, ingestion is also protected behind access tokens. To use them, simply add a `token` to the sink config property as seen below:
-![](../imgs/ingestion-with-token.png)
+
+
+
+
+
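
A minimal recipe sketch illustrating where the token goes (the `mysql` source and all values are placeholders; the relevant part is the `token` entry under the sink config):

```yaml
source:
  type: mysql
  config:
    host_port: localhost:3306

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
    # Personal Access Token generated from the DataHub UI
    token: "<your-personal-access-token>"
```
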
:::note
diff --git a/docs/authorization/access-policies-guide.md b/docs/authorization/access-policies-guide.md
index 5820e513a83e30..1eabb64d2878f6 100644
--- a/docs/authorization/access-policies-guide.md
+++ b/docs/authorization/access-policies-guide.md
@@ -110,10 +110,13 @@ In the second step, we can simply select the Privileges that this Platform Polic
| Manage Tags | Allow the actor to create and remove any Tags |
| Manage Public Views | Allow the actor to create, edit, and remove any public (shared) Views. |
| Manage Ownership Types | Allow the actor to create, edit, and remove any Ownership Types. |
+| Manage Platform Settings | (Acryl DataHub only) Allow the actor to manage global integrations and notification settings |
+| Manage Monitors | (Acryl DataHub only) Allow the actor to create, remove, start, or stop any entity assertion monitors |
| Restore Indices API[^1] | Allow the actor to restore indices for a set of entities via API |
| Enable/Disable Writeability API[^1] | Allow the actor to enable or disable GMS writeability for use in data migrations |
| Apply Retention API[^1] | Allow the actor to apply aspect retention via API |
+
[^1]: Only active if REST_API_AUTHORIZATION_ENABLED environment flag is enabled
#### Step 3: Choose Policy Actors
@@ -204,8 +207,15 @@ The common Metadata Privileges, which span across entity types, include:
| Edit Status | Allow actor to edit the status of an entity (soft deleted or not). |
| Edit Domain | Allow actor to edit the Domain of an entity. |
| Edit Deprecation | Allow actor to edit the Deprecation status of an entity. |
-| Edit Assertions | Allow actor to add and remove assertions from an entity. |
-| Edit All | Allow actor to edit any information about an entity. Super user privileges. Controls the ability to ingest using API when REST API Authorization is enabled. |
+| Edit Lineage | Allow actor to edit custom lineage edges for the entity. |
+| Edit Data Product | Allow actor to edit the data product that an entity is part of. |
+| Propose Tags | (Acryl DataHub only) Allow actor to propose new Tags for the entity. |
+| Propose Glossary Terms | (Acryl DataHub only) Allow actor to propose new Glossary Terms for the entity. |
+| Propose Documentation | (Acryl DataHub only) Allow actor to propose new Documentation for the entity. |
+| Manage Tag Proposals | (Acryl DataHub only) Allow actor to accept or reject proposed Tags for the entity. |
+| Manage Glossary Terms Proposals | (Acryl DataHub only) Allow actor to accept or reject proposed Glossary Terms for the entity. |
+| Manage Documentation Proposals | (Acryl DataHub only) Allow actor to accept or reject proposed Documentation for the entity. |
+| Edit Entity | Allow actor to edit any information about an entity. Super user privileges. Controls the ability to ingest using API when REST API Authorization is enabled. |
| Get Timeline API[^1] | Allow actor to get the timeline of an entity via API. |
| Get Entity API[^1] | Allow actor to get an entity via API. |
| Get Timeseries Aspect API[^1] | Allow actor to get a timeseries aspect via API. |
@@ -225,10 +235,19 @@ The common Metadata Privileges, which span across entity types, include:
| Dataset | Edit Dataset Queries | Allow actor to edit the Highlighted Queries on the Queries tab of the dataset. |
| Dataset | View Dataset Usage | Allow actor to access usage metadata about a dataset both in the UI and in the GraphQL API. This includes example queries, number of queries, etc. Also applies to REST APIs when REST API Authorization is enabled. |
| Dataset | View Dataset Profile | Allow actor to access a dataset's profile both in the UI and in the GraphQL API. This includes snapshot statistics like #rows, #columns, null percentage per field, etc. |
+| Dataset | Edit Assertions | Allow actor to change the assertions associated with a dataset. |
+| Dataset | Edit Incidents | (Acryl DataHub only) Allow actor to change the incidents associated with a dataset. |
+| Dataset | Edit Monitors | (Acryl DataHub only) Allow actor to change the assertion monitors associated with a dataset. |
| Tag | Edit Tag Color | Allow actor to change the color of a Tag. |
| Group | Edit Group Members | Allow actor to add and remove members to a group. |
+| Group | Edit Contact Information | Allow actor to change the email and Slack handle associated with the group. |
+| Group | Manage Group Subscriptions | (Acryl DataHub only) Allow actor to subscribe the group to entities. |
+| Group | Manage Group Notifications | (Acryl DataHub only) Allow actor to change notification settings for the group. |
| User | Edit User Profile | Allow actor to change the user's profile including display name, bio, title, profile image, etc. |
| User + Group | Edit Contact Information | Allow actor to change the contact information such as email & chat handles. |
+| Term Group | Manage Direct Glossary Children | Allow actor to change the direct child Term Groups or Terms of the group. |
+| Term Group | Manage All Glossary Children | Allow actor to change any direct or indirect child Term Groups or Terms of the group. |
+
> **Still have questions about Privileges?** Let us know in [Slack](https://slack.datahubproject.io)!
diff --git a/docs/cli.md b/docs/cli.md
index eb8bb406b01074..267f289d9f54a6 100644
--- a/docs/cli.md
+++ b/docs/cli.md
@@ -547,7 +547,7 @@ Old Entities Migrated = {'urn:li:dataset:(urn:li:dataPlatform:hive,logging_event
### Using docker
[![Docker Hub](https://img.shields.io/docker/pulls/acryldata/datahub-ingestion?style=plastic)](https://hub.docker.com/r/acryldata/datahub-ingestion)
-[![datahub-ingestion docker](https://github.com/acryldata/datahub/actions/workflows/docker-ingestion.yml/badge.svg)](https://github.com/acryldata/datahub/actions/workflows/docker-ingestion.yml)
+[![datahub-ingestion docker](https://github.com/acryldata/datahub/workflows/datahub-ingestion%20docker/badge.svg)](https://github.com/acryldata/datahub/actions/workflows/docker-ingestion.yml)
If you don't want to install locally, you can alternatively run metadata ingestion within a Docker container.
We have prebuilt images available on [Docker hub](https://hub.docker.com/r/acryldata/datahub-ingestion). All plugins will be installed and enabled automatically.
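
For example, a hedged docker-compose sketch for running a recipe with this image; the image tag, mount path, and the assumption that the image's entrypoint is the `datahub` CLI should be adjusted to your setup:

```yaml
  datahub-ingestion:
    image: acryldata/datahub-ingestion:head
    volumes:
      # Mount your ingestion recipe into the container
      - ./recipe.yml:/recipe.yml
    command: ["ingest", "-c", "/recipe.yml"]
```
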
diff --git a/docs/components.md b/docs/components.md
index ef76729bb37fbf..b59dabcf999cce 100644
--- a/docs/components.md
+++ b/docs/components.md
@@ -6,7 +6,11 @@ title: "Components"
The DataHub platform consists of the components shown in the following diagram.
-![DataHub Component Overview](./imgs/datahub-components.png)
+
+
+
+
+
## Metadata Store
diff --git a/docs/demo/DataHub-UIOverview.pdf b/docs/demo/DataHub-UIOverview.pdf
deleted file mode 100644
index cd6106e84ac236..00000000000000
Binary files a/docs/demo/DataHub-UIOverview.pdf and /dev/null differ
diff --git a/docs/demo/DataHub_-_Powering_LinkedIn_Metadata.pdf b/docs/demo/DataHub_-_Powering_LinkedIn_Metadata.pdf
deleted file mode 100644
index 71498045f9b5bf..00000000000000
Binary files a/docs/demo/DataHub_-_Powering_LinkedIn_Metadata.pdf and /dev/null differ
diff --git a/docs/demo/Data_Discoverability_at_SpotHero.pdf b/docs/demo/Data_Discoverability_at_SpotHero.pdf
deleted file mode 100644
index 83e37d8606428a..00000000000000
Binary files a/docs/demo/Data_Discoverability_at_SpotHero.pdf and /dev/null differ
diff --git a/docs/demo/Datahub_-_Strongly_Consistent_Secondary_Indexing.pdf b/docs/demo/Datahub_-_Strongly_Consistent_Secondary_Indexing.pdf
deleted file mode 100644
index 2d6a33a464650e..00000000000000
Binary files a/docs/demo/Datahub_-_Strongly_Consistent_Secondary_Indexing.pdf and /dev/null differ
diff --git a/docs/demo/Datahub_at_Grofers.pdf b/docs/demo/Datahub_at_Grofers.pdf
deleted file mode 100644
index c29cece9e250ac..00000000000000
Binary files a/docs/demo/Datahub_at_Grofers.pdf and /dev/null differ
diff --git a/docs/demo/Designing_the_next_generation_of_metadata_events_for_scale.pdf b/docs/demo/Designing_the_next_generation_of_metadata_events_for_scale.pdf
deleted file mode 100644
index 0d067eef28d03b..00000000000000
Binary files a/docs/demo/Designing_the_next_generation_of_metadata_events_for_scale.pdf and /dev/null differ
diff --git a/docs/demo/Metadata_Use-Cases_at_LinkedIn_-_Lightning_Talk.pdf b/docs/demo/Metadata_Use-Cases_at_LinkedIn_-_Lightning_Talk.pdf
deleted file mode 100644
index 382754f863c8a3..00000000000000
Binary files a/docs/demo/Metadata_Use-Cases_at_LinkedIn_-_Lightning_Talk.pdf and /dev/null differ
diff --git a/docs/demo/Saxo Bank Data Workbench.pdf b/docs/demo/Saxo Bank Data Workbench.pdf
deleted file mode 100644
index c43480d32b8f24..00000000000000
Binary files a/docs/demo/Saxo Bank Data Workbench.pdf and /dev/null differ
diff --git a/docs/demo/Taming the Data Beast Using DataHub.pdf b/docs/demo/Taming the Data Beast Using DataHub.pdf
deleted file mode 100644
index d0062465d92200..00000000000000
Binary files a/docs/demo/Taming the Data Beast Using DataHub.pdf and /dev/null differ
diff --git a/docs/demo/Town_Hall_Presentation_-_12-2020_-_UI_Development_Part_2.pdf b/docs/demo/Town_Hall_Presentation_-_12-2020_-_UI_Development_Part_2.pdf
deleted file mode 100644
index fb7bd2b693e877..00000000000000
Binary files a/docs/demo/Town_Hall_Presentation_-_12-2020_-_UI_Development_Part_2.pdf and /dev/null differ
diff --git a/docs/demo/ViasatMetadataJourney.pdf b/docs/demo/ViasatMetadataJourney.pdf
deleted file mode 100644
index ccffd18a06d187..00000000000000
Binary files a/docs/demo/ViasatMetadataJourney.pdf and /dev/null differ
diff --git a/docs/deploy/aws.md b/docs/deploy/aws.md
index 7b01ffa02a7446..228fcb51d1a28f 100644
--- a/docs/deploy/aws.md
+++ b/docs/deploy/aws.md
@@ -201,7 +201,11 @@ Provision a MySQL database in AWS RDS that shares the VPC with the kubernetes cl
the VPC of the kubernetes cluster. Once the database is provisioned, you should be able to see the following page. Take
a note of the endpoint marked by the red box.
-![AWS RDS](../imgs/aws/aws-rds.png)
+
+
+
+
+
First, add the DB password to kubernetes by running the following.
@@ -234,7 +238,11 @@ Provision an elasticsearch domain running elasticsearch version 7.10 or above th
cluster or has VPC peering set up between the VPC of the kubernetes cluster. Once the domain is provisioned, you should
be able to see the following page. Take a note of the endpoint marked by the red box.
-![AWS Elasticsearch Service](../imgs/aws/aws-elasticsearch.png)
+
+
+
+
+
Update the elasticsearch settings under global in the values.yaml as follows.
@@ -330,7 +338,11 @@ Provision an MSK cluster that shares the VPC with the kubernetes cluster or has
the kubernetes cluster. Once the domain is provisioned, click on the “View client information” button in the “Cluster
Summary” section. You should see a page like below. Take a note of the endpoints marked by the red boxes.
-![AWS MSK](../imgs/aws/aws-msk.png)
+
+
+
+
+
Update the kafka settings under global in the values.yaml as follows.
diff --git a/docs/deploy/confluent-cloud.md b/docs/deploy/confluent-cloud.md
index d93ffcceaecee1..794b55d4686bfb 100644
--- a/docs/deploy/confluent-cloud.md
+++ b/docs/deploy/confluent-cloud.md
@@ -24,7 +24,11 @@ decommissioned.
To create the topics, navigate to your **Cluster** and click "Create Topic". Feel free to tweak the default topic configurations to
match your preferences.
-![CreateTopic](../imgs/confluent-create-topic.png)
+
+
+
+
+
## Step 2: Configure DataHub Container to use Confluent Cloud Topics
@@ -140,12 +144,20 @@ and another for the user info used for connecting to the schema registry. You'll
select "Clients" -> "Configure new Java Client". You should see a page like the following:
-![Config](../imgs/confluent-cloud-config.png)
+
+
+
+
+
You'll want to generate both a Kafka Cluster API Key & a Schema Registry key. Once you do so, you should see the config
automatically populate with your new secrets:
-![Config](../imgs/confluent-cloud-config-2.png)
+
+
+
+
+
You'll need to copy the values of `sasl.jaas.config` and `basic.auth.user.info`
for the next step.
diff --git a/docs/deploy/environment-vars.md b/docs/deploy/environment-vars.md
index af4ae09c009fd0..0689db9b173310 100644
--- a/docs/deploy/environment-vars.md
+++ b/docs/deploy/environment-vars.md
@@ -19,7 +19,7 @@ DataHub works.
| Variable | Default | Unit/Type | Components | Description |
|------------------------------------|---------|-----------|-------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| `ASYNC_INGESTION_DEFAULT` | `false` | boolean | [`GMS`] | Asynchronously process ingestProposals by writing the ingestion MCP to Kafka. Typically enabled with standalone consumers. |
+| `ASYNC_INGEST_DEFAULT` | `false` | boolean | [`GMS`] | Asynchronously process ingestProposals by writing the ingestion MCP to Kafka. Typically enabled with standalone consumers. |
| `MCP_CONSUMER_ENABLED` | `true` | boolean | [`GMS`, `MCE Consumer`] | When running in standalone mode, disabled on `GMS` and enabled on separate `MCE Consumer`. |
| `MCL_CONSUMER_ENABLED` | `true` | boolean | [`GMS`, `MAE Consumer`] | When running in standalone mode, disabled on `GMS` and enabled on separate `MAE Consumer`. |
| `PE_CONSUMER_ENABLED` | `true` | boolean | [`GMS`, `MAE Consumer`] | When running in standalone mode, disabled on `GMS` and enabled on separate `MAE Consumer`. |
@@ -79,8 +79,9 @@ Simply replace the dot, `.`, with an underscore, `_`, and convert to uppercase.
## Frontend
-| Variable | Default | Unit/Type | Components | Description |
-|------------------------------------|----------|-----------|--------------|-------------------------------------------------------------------------------------------------------------------------------------|
-| `AUTH_VERBOSE_LOGGING` | `false` | boolean | [`Frontend`] | Enable verbose authentication logging. Enabling this will leak sensisitve information in the logs. Disable when finished debugging. |
-| `AUTH_OIDC_GROUPS_CLAIM` | `groups` | string | [`Frontend`] | Claim to use as the user's group. |
-| `AUTH_OIDC_EXTRACT_GROUPS_ENABLED` | `false` | boolean | [`Frontend`] | Auto-provision the group from the user's group claim. |
+| Variable | Default | Unit/Type | Components | Description |
+|------------------------------------|----------|-----------|--------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `AUTH_VERBOSE_LOGGING`             | `false`  | boolean   | [`Frontend`] | Enable verbose authentication logging. Enabling this will leak sensitive information in the logs. Disable when finished debugging.                                                                                                                         |
+| `AUTH_OIDC_GROUPS_CLAIM` | `groups` | string | [`Frontend`] | Claim to use as the user's group. |
+| `AUTH_OIDC_EXTRACT_GROUPS_ENABLED` | `false` | boolean | [`Frontend`] | Auto-provision the group from the user's group claim. |
+| `AUTH_SESSION_TTL_HOURS`           | `24`     | string    | [`Frontend`] | The number of hours a user session is valid. [User session tokens are stateless and will become invalid after this time](https://www.playframework.com/documentation/2.8.x/SettingsSession#Session-Timeout-/-Expiration), requiring the user to log in again. |
\ No newline at end of file
diff --git a/docs/deploy/gcp.md b/docs/deploy/gcp.md
index 3713d69f90636c..0cd3d92a8f3cd3 100644
--- a/docs/deploy/gcp.md
+++ b/docs/deploy/gcp.md
@@ -65,16 +65,28 @@ the GKE page on [GCP website](https://console.cloud.google.com/kubernetes/discov
Once all deploy is successful, you should see a page like below in the "Services & Ingress" tab on the left.
-![Services and Ingress](../imgs/gcp/services_ingress.png)
+
+
+
+
+
Tick the checkbox for datahub-datahub-frontend and click the "CREATE INGRESS" button. You should land on the following page.
-![Ingress1](../imgs/gcp/ingress1.png)
+
+
+
+
+
Type in an arbitrary name for the ingress and click on the second step "Host and path rules". You should land on the
following page.
-![Ingress2](../imgs/gcp/ingress2.png)
+
+
+
+
+
Select "datahub-datahub-frontend" in the dropdown menu for backends, and then click on "ADD HOST AND PATH RULE" button.
In the second row that got created, add in the host name of choice (here gcp.datahubproject.io) and select
@@ -83,14 +95,22 @@ In the second row that got created, add in the host name of choice (here gcp.dat
This step adds the rule allowing requests from the host name of choice to get routed to datahub-frontend service. Click
on step 3 "Frontend configuration". You should land on the following page.
-![Ingress3](../imgs/gcp/ingress3.png)
+
+
+
+
+
Choose HTTPS in the dropdown menu for protocol. To enable SSL, you need to add a certificate. If you do not have one,
you can click "CREATE A NEW CERTIFICATE" and input the host name of choice. GCP will create a certificate for you.
Now press "CREATE" button on the left to create ingress! After around 5 minutes, you should see the following.
-![Ingress Ready](../imgs/gcp/ingress_ready.png)
+
+
+
+
+
In your domain provider, add an A record for the host name set above using the IP address on the ingress page (noted
with the red box). Once DNS updates, you should be able to access DataHub through the host name!!
@@ -98,5 +118,9 @@ with the red box). Once DNS updates, you should be able to access DataHub throug
Note, ignore the warning icon next to ingress. It takes about ten minutes for ingress to check that the backend service
is ready and show a check mark as follows. However, ingress is fully functional once you see the above page.
-![Ingress Final](../imgs/gcp/ingress_final.png)
+
+
+
+
+
diff --git a/docs/dev-guides/timeline.md b/docs/dev-guides/timeline.md
index 966e659b909915..829aef1d3eefa1 100644
--- a/docs/dev-guides/timeline.md
+++ b/docs/dev-guides/timeline.md
@@ -14,7 +14,11 @@ The Timeline API is available in server versions `0.8.28` and higher. The `cli`
## Entity Timeline Conceptually
For the visually inclined, here is a conceptual diagram that illustrates how to think about the entity timeline with categorical changes overlaid on it.
-![../imgs/timeline/timeline-conceptually.png](../imgs/timeline/timeline-conceptually.png)
+
+
+
+
+
## Change Event
Each modification is modeled as a
@@ -228,8 +232,16 @@ http://localhost:8080/openapi/timeline/v1/urn%3Ali%3Adataset%3A%28urn%3Ali%3Adat
The API is browsable via the UI through the dropdown.
Here are a few screenshots showing how to navigate to it. You can try out the API and send example requests.
-![../imgs/timeline/dropdown-apis.png](../imgs/timeline/dropdown-apis.png)
-![../imgs/timeline/swagger-ui.png](../imgs/timeline/swagger-ui.png)
+
+
+
+
+
+
+
+
+
+
# Future Work
diff --git a/docs/docker/development.md b/docs/docker/development.md
index 2153aa9dc613f1..91a303744a03bd 100644
--- a/docs/docker/development.md
+++ b/docs/docker/development.md
@@ -92,7 +92,11 @@ Environment variables control the debugging ports for GMS and the frontend.
The screenshot shows an example configuration for IntelliJ using the default GMS debugging port of 5001.
-![](../imgs/development/intellij-remote-debug.png)
+
+
+
+
+
## Tips for People New To Docker
diff --git a/docs/domains.md b/docs/domains.md
index c846a753417c59..1b2ebc9d47f397 100644
--- a/docs/domains.md
+++ b/docs/domains.md
@@ -22,20 +22,20 @@ You can create this privileges by creating a new [Metadata Policy](./authorizati
To create a Domain, first navigate to the **Domains** tab in the top-right menu of DataHub.
-
+
Once you're on the Domains page, you'll see a list of all the Domains that have been created on DataHub. Additionally, you can
view the number of entities inside each Domain.
-
+
To create a new Domain, click '+ New Domain'.
-
+
Inside the form, you can choose a name for your Domain. Most often, this will align with your business units or groups, for example
@@ -48,7 +48,7 @@ for the Domain. This option is useful if you intend to refer to Domains by a com
key to be human-readable. Proceed with caution: once you select a custom id, it cannot be easily changed.
-
+
By default, you don't need to worry about this. DataHub will auto-generate a unique Domain id for you.
@@ -64,7 +64,7 @@ To assign an asset to a Domain, simply navigate to the asset's profile page. At
see a 'Domain' section. Click 'Set Domain', and then search for the Domain you'd like to add to. When you're done, click 'Add'.
-
+
To remove an asset from a Domain, click the 'x' icon on the Domain tag.
@@ -149,27 +149,27 @@ source:
Once you've created a Domain, you can use the search bar to find it.
-
+
Clicking on the search result will take you to the Domain's profile, where you
can edit its description, add / remove owners, and view the assets inside the Domain.
-
+
Once you've added assets to a Domain, you can filter search results to limit to those Assets
within a particular Domain using the left-side search filters.
-
+
On the homepage, you'll also find a list of the most popular Domains in your organization.
-
+
## Additional Resources
@@ -242,7 +242,6 @@ DataHub supports Tags, Glossary Terms, & Domains as distinct types of Metadata t
- **Tags**: Informal, loosely controlled labels that serve as a tool for search & discovery. Assets may have multiple tags. No formal, central management.
- **Glossary Terms**: A controlled vocabulary, with optional hierarchy. Terms are typically used to standardize types of leaf-level attributes (i.e. schema fields) for governance. E.g. (EMAIL_PLAINTEXT)
- **Domains**: A set of top-level categories. Usually aligned to business units / disciplines to which the assets are most relevant. Central or distributed management. Single Domain assignment per data asset.
-
*Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!*
### Related Features
diff --git a/docs/glossary/business-glossary.md b/docs/glossary/business-glossary.md
index faab6f12fc55e7..e10cbed30b9132 100644
--- a/docs/glossary/business-glossary.md
+++ b/docs/glossary/business-glossary.md
@@ -31,59 +31,103 @@ In order to view a Business Glossary, users must have the Platform Privilege cal
Once granted this privilege, you can access your Glossary by clicking the dropdown at the top of the page called **Govern** and then clicking **Glossary**:
-![](../imgs/glossary/glossary-button.png)
+
+
+
+
+
You are now at the root of your Glossary and should see all Terms and Term Groups with no parents assigned to them. You should also notice a hierarchy navigator on the left where you can easily check out the structure of your Glossary!
-![](../imgs/glossary/root-glossary.png)
+
+
+
+
+
## Creating a Term or Term Group
There are two ways to create Terms and Term Groups through the UI. First, you can create directly from the Glossary home page by clicking the menu dots on the top right and selecting your desired option:
-![](../imgs/glossary/root-glossary-create.png)
+
+
+
+
+
You can also create Terms or Term Groups directly from a Term Group's page. In order to do that you need to click the menu dots on the top right and select what you want:
-![](../imgs/glossary/create-from-node.png)
+
+
+
+
+
Note that the modal that pops up will automatically set the current Term Group you are in as the **Parent**. You can easily change this by selecting the input and navigating through your Glossary to find your desired Term Group. In addition, you could start typing the name of a Term Group to see it appear by searching. You can also leave this input blank in order to create a Term or Term Group with no parent.
-![](../imgs/glossary/create-modal.png)
+
+
+
+
+
## Editing a Term or Term Group
In order to edit a Term or Term Group, you first need to go to the page of the Term or Term Group you want to edit. Then simply click the edit icon right next to the name to open up an inline editor. Change the text and it will save when you click outside or hit Enter.
-![](../imgs/glossary/edit-term.png)
+
+
+
+
+
## Moving a Term or Term Group
Once a Term or Term Group has been created, you can always move it to be under a different Term Group parent. In order to do this, click the menu dots on the top right of either entity and select **Move**.
-![](../imgs/glossary/move-term-button.png)
+
+
+
+
+
This will open a modal where you can navigate through your Glossary to find your desired Term Group.
-![](../imgs/glossary/move-term-modal.png)
+
+
+
+
+
## Deleting a Term or Term Group
In order to delete a Term or Term Group, you need to go to the entity page of what you want to delete then click the menu dots on the top right. From here you can select **Delete** followed by confirming through a separate modal. **Note**: at the moment we only support deleting Term Groups that do not have any children. Until cascade deleting is supported, you will have to delete all children first, then delete the Term Group.
-![](../imgs/glossary/delete-button.png)
+
+
+
+
+
## Adding a Term to an Entity
Once you've defined your Glossary, you can begin attaching terms to data assets. To add a Glossary Term to an asset, go to the entity page of your asset and find the **Add Terms** button on the right sidebar.
-![](../imgs/glossary/add-term-to-entity.png)
+
+
+
+
+
In the modal that pops up you can select the Term you care about in one of two ways:
- Search for the Term by name in the input
- Navigate through the Glossary dropdown that appears after clicking into the input
-![](../imgs/glossary/add-term-modal.png)
+
+
+
+
+
## Privileges
diff --git a/docs/how/add-new-aspect.md b/docs/how/add-new-aspect.md
index 6ea7256ed75cc0..d1fe567018903b 100644
--- a/docs/how/add-new-aspect.md
+++ b/docs/how/add-new-aspect.md
@@ -1,20 +1,20 @@
# How to add a new metadata aspect?
Adding a new metadata [aspect](../what/aspect.md) is one of the most common ways to extend an existing [entity](../what/entity.md).
-We'll use the [CorpUserEditableInfo](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/identity/CorpUserEditableInfo.pdl) as an example here.
+We'll use `CorpUserEditableInfo` as an example here.
1. Add the aspect model to the corresponding namespace (e.g. [`com.linkedin.identity`](https://github.com/datahub-project/datahub/tree/master/metadata-models/src/main/pegasus/com/linkedin/identity))
-2. Extend the entity's aspect union to include the new aspect (e.g. [`CorpUserAspect`](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/metadata/aspect/CorpUserAspect.pdl))
+2. Extend the entity's aspect union to include the new aspect.
3. Rebuild the rest.li [IDL & snapshot](https://linkedin.github.io/rest.li/modeling/compatibility_check) by running the following command from the project root
```
./gradlew :metadata-service:restli-servlet-impl:build -Prest.model.compatibility=ignore
```
-4. To surface the new aspect at the top-level [resource endpoint](https://linkedin.github.io/rest.li/user_guide/restli_server#writing-resources), extend the resource data model (e.g. [`CorpUser`](https://github.com/datahub-project/datahub/blob/master/gms/api/src/main/pegasus/com/linkedin/identity/CorpUser.pdl)) with an optional field (e.g. [`editableInfo`](https://github.com/datahub-project/datahub/blob/master/gms/api/src/main/pegasus/com/linkedin/identity/CorpUser.pdl#L21)). You'll also need to extend the `toValue` & `toSnapshot` methods of the top-level resource (e.g. [`CorpUsers`](https://github.com/datahub-project/datahub/blob/master/gms/impl/src/main/java/com/linkedin/metadata/resources/identity/CorpUsers.java)) to convert between the snapshot & value models.
+4. To surface the new aspect at the top-level [resource endpoint](https://linkedin.github.io/rest.li/user_guide/restli_server#writing-resources), extend the resource data model with an optional field. You'll also need to extend the `toValue` & `toSnapshot` methods of the top-level resource (e.g. [`CorpUsers`](https://github.com/datahub-project/datahub/blob/master/gms/impl/src/main/java/com/linkedin/metadata/resources/identity/CorpUsers.java)) to convert between the snapshot & value models.
-5. (Optional) If there's need to update the aspect via API (instead of/in addition to MCE), add a [sub-resource](https://linkedin.github.io/rest.li/user_guide/restli_server#sub-resources) endpoint for the new aspect (e.g. [`CorpUsersEditableInfoResource`](https://github.com/datahub-project/datahub/blob/master/gms/impl/src/main/java/com/linkedin/metadata/resources/identity/CorpUsersEditableInfoResource.java)). The sub-resource endpiont also allows you to retrieve previous versions of the aspect as well as additional metadata such as the audit stamp.
+5. (Optional) If there's a need to update the aspect via API (instead of/in addition to MCE), add a [sub-resource](https://linkedin.github.io/rest.li/user_guide/restli_server#sub-resources) endpoint for the new aspect (e.g. `CorpUsersEditableInfoResource`). The sub-resource endpoint also allows you to retrieve previous versions of the aspect as well as additional metadata such as the audit stamp.
-6. After rebuilding & restarting [gms](https://github.com/datahub-project/datahub/tree/master/gms), [mce-consumer-job](https://github.com/datahub-project/datahub/tree/master/metadata-jobs/mce-consumer-job) & [mae-consumer-job](https://github.com/datahub-project/datahub/tree/master/metadata-jobs/mae-consumer-job),
+6. After rebuilding & restarting gms, [mce-consumer-job](https://github.com/datahub-project/datahub/tree/master/metadata-jobs/mce-consumer-job) & [mae-consumer-job](https://github.com/datahub-project/datahub/tree/master/metadata-jobs/mae-consumer-job),
you should be able to start emitting [MCE](../what/mxe.md) with the new aspect and have it automatically ingested & stored in DB.
diff --git a/docs/how/configuring-authorization-with-apache-ranger.md b/docs/how/configuring-authorization-with-apache-ranger.md
index 26d3be6d358b2e..46f9432e6c18a7 100644
--- a/docs/how/configuring-authorization-with-apache-ranger.md
+++ b/docs/how/configuring-authorization-with-apache-ranger.md
@@ -67,7 +67,11 @@ Now, you should have the DataHub plugin registered with Apache Ranger. Next, we'
The **DATAHUB** plugin and **ranger_datahub** service are shown in the screenshot below:
- ![Privacera Portal DATAHUB screenshot](../imgs/apache-ranger/datahub-plugin.png)
+
+
+
+
+
4. Create a new policy under service **ranger_datahub** - this will be used to control DataHub authorization.
5. Create a test user & assign them to a policy. We'll use the `datahub` user, which is the default root user inside DataHub.
@@ -80,7 +84,11 @@ Now, you should have the DataHub plugin registered with Apache Ranger. Next, we'
DataHub platform access policy screenshot:
- ![Privacera Portal DATAHUB screenshot](../imgs/apache-ranger/datahub-platform-access-policy.png)
+
+
+
+
+
Once we've created our first policy, we can set up DataHub to start authorizing requests using Ranger policies.
@@ -178,7 +186,11 @@ then follow the below sections to undo the configuration steps you have performe
The **ranger_datahub** service is shown in the screenshot below:
- ![Privacera Portal DATAHUB screenshot](../imgs/apache-ranger/datahub-plugin.png)
+
+
+
+
+
2. Delete the **datahub** plugin: Execute the below curl command to delete the **datahub** plugin.
   Replace the variables with the corresponding values in the curl command.
diff --git a/docs/how/delete-metadata.md b/docs/how/delete-metadata.md
index acbb573020be08..f720a66ce57652 100644
--- a/docs/how/delete-metadata.md
+++ b/docs/how/delete-metadata.md
@@ -43,6 +43,9 @@ datahub delete --platform snowflake
# Filters can be combined, which will select entities that match all filters.
datahub delete --platform looker --entity-type chart
datahub delete --platform bigquery --env PROD
+
+# You can also do recursive deletes for container and dataPlatformInstance entities.
+datahub delete --urn "urn:li:container:f76..." --recursive
```
When performing hard deletes, you can optionally add the `--only-soft-deleted` flag to only hard delete entities that were previously soft deleted.
@@ -122,6 +125,14 @@ datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted
datahub delete --platform snowflake --env DEV
```
+#### Delete everything within a specific Snowflake DB
+
+```shell
+# You can find your container urn by navigating to the relevant
+# DB in the DataHub UI and clicking the "copy urn" button.
+datahub delete --urn "urn:li:container:77644901c4f574845578ebd18b7c14fa" --recursive
+```
+
#### Delete all BigQuery datasets in the PROD environment
```shell
@@ -129,6 +140,13 @@ datahub delete --platform snowflake --env DEV
datahub delete --env PROD --entity-type dataset --platform bigquery
```
+#### Delete everything within a MySQL platform instance
+
+```shell
+# The instance name comes from the `platform_instance` config option in the ingestion recipe.
+datahub delete --urn 'urn:li:dataPlatformInstance:(urn:li:dataPlatform:mysql,my_instance_name)' --recursive
+```
+
#### Delete all pipelines and tasks from Airflow
```shell
@@ -138,6 +156,7 @@ datahub delete --platform "airflow"
#### Delete all containers for a particular platform
```shell
+# Note: this will leave S3 datasets intact.
datahub delete --entity-type container --platform s3
```
diff --git a/docs/how/search.md b/docs/how/search.md
index bf1c8e8632e243..6a5e85e547fc50 100644
--- a/docs/how/search.md
+++ b/docs/how/search.md
@@ -2,14 +2,6 @@ import FeatureAvailability from '@site/src/components/FeatureAvailability';
# About DataHub Search
-
-
-
-
The **search bar** is an important mechanism for discovering data assets in DataHub. From the search bar, you can find Datasets, Columns, Dashboards, Charts, Data Pipelines, and more. Simply type in a term and press 'enter'.
diff --git a/docs/how/updating-datahub.md b/docs/how/updating-datahub.md
index 2b6fd5571cc9ec..9b19291ee246ae 100644
--- a/docs/how/updating-datahub.md
+++ b/docs/how/updating-datahub.md
@@ -5,16 +5,42 @@ This file documents any backwards-incompatible changes in DataHub and assists pe
## Next
### Breaking Changes
+- #8810 - Removed support for SQLAlchemy 1.3.x. Only SQLAlchemy 1.4.x is supported now.
### Potential Downtime
+### Deprecations
+
+### Other Notable Changes
+
+## 0.11.0
+
+### Breaking Changes
+
+### Potential Downtime
+- #8611 Search improvements require reindexing indices. A `system-update` job will run which sets indices to read-only and creates a backup/clone of each index. During the reindexing, new components will be prevented from starting up until the reindex completes. The logs of this job will indicate the % complete per index. Depending on index sizes and infrastructure, this process can take anywhere from 5 minutes to several hours; as a rough estimate, allow 1 hour for every 2.3 million entities.
+
### Deprecations
- #8525: In the LDAP ingestor, the `manager_pagination_enabled` option was changed to the more general `pagination_enabled`.
+- MAE Events are no longer produced. MAE events have been deprecated for over a year.
### Other Notable Changes
+- In this release we now enable you to create and delete pinned announcements on your DataHub homepage! If you have the “Manage Home Page Posts” platform privilege you’ll see a new section in settings called “Home Page Posts” where you can create and delete text posts and link posts that your users see on the home page.
+- The new search and browse experience, which was first made available in the previous release behind a feature flag, is now on by default. Check out our release notes for v0.10.5 to get more information and documentation on this new Browse experience.
+- In addition to the ranking changes mentioned above, this release includes changes to how search results are highlighted so you can understand why they match your query. You can also sort your results alphabetically or by last-updated time, in addition to relevance. This release also suggests a correction if your query contains a typo.
- #8300: Clickhouse source now inherits from TwoTierSQLAlchemy. Previously the container hierarchy was platform_instance -> container -> container db (None) -> container schema; now it is platform_instance -> container database.
- #8300: Added `uri_opts` argument; now we can pass any options to the clickhouse client.
+- #8659: BigQuery ingestion no longer creates DataPlatformInstance aspects by default.
+ This will only affect users that were depending on this aspect for custom functionality,
+ and can be enabled via the `include_data_platform_instance` config option.
+- OpenAPI entity and aspect endpoints have been expanded to improve the developer experience when using this API, with additional aspects to be added in the near future.
+- The CLI now supports recursive deletes.
+- Batching of default aspects on initial ingestion (SQL)
+- Improvements to multi-threading. Ingestion recipes, if previously reduced to 1 thread, can be restored to the 15 thread default.
+- Gradle 7 upgrade moderately improves build speed
+- DataHub Ingestion slim images reduced in size by 2GB+
+- Glue Schema Registry fixed
## 0.10.5
diff --git a/docs/imgs/add-schema-tag.png b/docs/imgs/add-schema-tag.png
deleted file mode 100644
index b6fd273389c904..00000000000000
Binary files a/docs/imgs/add-schema-tag.png and /dev/null differ
diff --git a/docs/imgs/add-tag-search.png b/docs/imgs/add-tag-search.png
deleted file mode 100644
index a129f5eba4271b..00000000000000
Binary files a/docs/imgs/add-tag-search.png and /dev/null differ
diff --git a/docs/imgs/add-tag.png b/docs/imgs/add-tag.png
deleted file mode 100644
index 386b4cdcd99113..00000000000000
Binary files a/docs/imgs/add-tag.png and /dev/null differ
diff --git a/docs/imgs/added-tag.png b/docs/imgs/added-tag.png
deleted file mode 100644
index 96ae48318a35a1..00000000000000
Binary files a/docs/imgs/added-tag.png and /dev/null differ
diff --git a/docs/imgs/airflow/connection_error.png b/docs/imgs/airflow/connection_error.png
deleted file mode 100644
index c2f3344b8cc452..00000000000000
Binary files a/docs/imgs/airflow/connection_error.png and /dev/null differ
diff --git a/docs/imgs/airflow/datahub_lineage_view.png b/docs/imgs/airflow/datahub_lineage_view.png
deleted file mode 100644
index c7c774c203d2f2..00000000000000
Binary files a/docs/imgs/airflow/datahub_lineage_view.png and /dev/null differ
diff --git a/docs/imgs/airflow/datahub_pipeline_entity.png b/docs/imgs/airflow/datahub_pipeline_entity.png
deleted file mode 100644
index 715baefd784ca4..00000000000000
Binary files a/docs/imgs/airflow/datahub_pipeline_entity.png and /dev/null differ
diff --git a/docs/imgs/airflow/datahub_pipeline_view.png b/docs/imgs/airflow/datahub_pipeline_view.png
deleted file mode 100644
index 5b3afd13c4ce69..00000000000000
Binary files a/docs/imgs/airflow/datahub_pipeline_view.png and /dev/null differ
diff --git a/docs/imgs/airflow/datahub_task_view.png b/docs/imgs/airflow/datahub_task_view.png
deleted file mode 100644
index 66b3487d87319d..00000000000000
Binary files a/docs/imgs/airflow/datahub_task_view.png and /dev/null differ
diff --git a/docs/imgs/airflow/entity_page_screenshot.png b/docs/imgs/airflow/entity_page_screenshot.png
deleted file mode 100644
index a782969a1f17b1..00000000000000
Binary files a/docs/imgs/airflow/entity_page_screenshot.png and /dev/null differ
diff --git a/docs/imgs/airflow/find_the_dag.png b/docs/imgs/airflow/find_the_dag.png
deleted file mode 100644
index 37cda041e4b750..00000000000000
Binary files a/docs/imgs/airflow/find_the_dag.png and /dev/null differ
diff --git a/docs/imgs/airflow/finding_failed_log.png b/docs/imgs/airflow/finding_failed_log.png
deleted file mode 100644
index 96552ba1e19839..00000000000000
Binary files a/docs/imgs/airflow/finding_failed_log.png and /dev/null differ
diff --git a/docs/imgs/airflow/paused_dag.png b/docs/imgs/airflow/paused_dag.png
deleted file mode 100644
index c314de5d38d750..00000000000000
Binary files a/docs/imgs/airflow/paused_dag.png and /dev/null differ
diff --git a/docs/imgs/airflow/successful_run.png b/docs/imgs/airflow/successful_run.png
deleted file mode 100644
index b997cc7210ff6b..00000000000000
Binary files a/docs/imgs/airflow/successful_run.png and /dev/null differ
diff --git a/docs/imgs/airflow/trigger_dag.png b/docs/imgs/airflow/trigger_dag.png
deleted file mode 100644
index a44999c929d4e2..00000000000000
Binary files a/docs/imgs/airflow/trigger_dag.png and /dev/null differ
diff --git a/docs/imgs/airflow/unpaused_dag.png b/docs/imgs/airflow/unpaused_dag.png
deleted file mode 100644
index 8462562f31d973..00000000000000
Binary files a/docs/imgs/airflow/unpaused_dag.png and /dev/null differ
diff --git a/docs/imgs/apache-ranger/datahub-platform-access-policy.png b/docs/imgs/apache-ranger/datahub-platform-access-policy.png
deleted file mode 100644
index 7e3ff6fd372a9d..00000000000000
Binary files a/docs/imgs/apache-ranger/datahub-platform-access-policy.png and /dev/null differ
diff --git a/docs/imgs/apache-ranger/datahub-plugin.png b/docs/imgs/apache-ranger/datahub-plugin.png
deleted file mode 100644
index 5dd044c0146570..00000000000000
Binary files a/docs/imgs/apache-ranger/datahub-plugin.png and /dev/null differ
diff --git a/docs/imgs/apis/postman-graphql.png b/docs/imgs/apis/postman-graphql.png
deleted file mode 100644
index 1cffd226fdf772..00000000000000
Binary files a/docs/imgs/apis/postman-graphql.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/column-description-added.png b/docs/imgs/apis/tutorials/column-description-added.png
deleted file mode 100644
index ed8cbd3bf56220..00000000000000
Binary files a/docs/imgs/apis/tutorials/column-description-added.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/column-level-lineage-added.png b/docs/imgs/apis/tutorials/column-level-lineage-added.png
deleted file mode 100644
index 6092436e0a6a83..00000000000000
Binary files a/docs/imgs/apis/tutorials/column-level-lineage-added.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/custom-properties-added.png b/docs/imgs/apis/tutorials/custom-properties-added.png
deleted file mode 100644
index a7e85d875045c9..00000000000000
Binary files a/docs/imgs/apis/tutorials/custom-properties-added.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/datahub-main-ui.png b/docs/imgs/apis/tutorials/datahub-main-ui.png
deleted file mode 100644
index b058e2683a8513..00000000000000
Binary files a/docs/imgs/apis/tutorials/datahub-main-ui.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/dataset-created.png b/docs/imgs/apis/tutorials/dataset-created.png
deleted file mode 100644
index 086dd8b7c9b16e..00000000000000
Binary files a/docs/imgs/apis/tutorials/dataset-created.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/dataset-deleted.png b/docs/imgs/apis/tutorials/dataset-deleted.png
deleted file mode 100644
index d94ad7e85195fa..00000000000000
Binary files a/docs/imgs/apis/tutorials/dataset-deleted.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/dataset-description-added.png b/docs/imgs/apis/tutorials/dataset-description-added.png
deleted file mode 100644
index 41aa9f109115b2..00000000000000
Binary files a/docs/imgs/apis/tutorials/dataset-description-added.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/dataset-properties-added-removed.png b/docs/imgs/apis/tutorials/dataset-properties-added-removed.png
deleted file mode 100644
index 9eb0284776f13c..00000000000000
Binary files a/docs/imgs/apis/tutorials/dataset-properties-added-removed.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/dataset-properties-added.png b/docs/imgs/apis/tutorials/dataset-properties-added.png
deleted file mode 100644
index e0d2acbb66eb5e..00000000000000
Binary files a/docs/imgs/apis/tutorials/dataset-properties-added.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/dataset-properties-before.png b/docs/imgs/apis/tutorials/dataset-properties-before.png
deleted file mode 100644
index b4915121a8c650..00000000000000
Binary files a/docs/imgs/apis/tutorials/dataset-properties-before.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/dataset-properties-replaced.png b/docs/imgs/apis/tutorials/dataset-properties-replaced.png
deleted file mode 100644
index 8624689c20ada4..00000000000000
Binary files a/docs/imgs/apis/tutorials/dataset-properties-replaced.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/deprecation-updated.png b/docs/imgs/apis/tutorials/deprecation-updated.png
deleted file mode 100644
index 06fedf746f694d..00000000000000
Binary files a/docs/imgs/apis/tutorials/deprecation-updated.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/domain-added.png b/docs/imgs/apis/tutorials/domain-added.png
deleted file mode 100644
index cb2002ec9ab4df..00000000000000
Binary files a/docs/imgs/apis/tutorials/domain-added.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/domain-created.png b/docs/imgs/apis/tutorials/domain-created.png
deleted file mode 100644
index cafab2a5e8d5cb..00000000000000
Binary files a/docs/imgs/apis/tutorials/domain-created.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/domain-removed.png b/docs/imgs/apis/tutorials/domain-removed.png
deleted file mode 100644
index 1b21172be11d23..00000000000000
Binary files a/docs/imgs/apis/tutorials/domain-removed.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/feature-added-to-model.png b/docs/imgs/apis/tutorials/feature-added-to-model.png
deleted file mode 100644
index 311506e4b27839..00000000000000
Binary files a/docs/imgs/apis/tutorials/feature-added-to-model.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/feature-table-created.png b/docs/imgs/apis/tutorials/feature-table-created.png
deleted file mode 100644
index 0541cbe572435f..00000000000000
Binary files a/docs/imgs/apis/tutorials/feature-table-created.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/group-upserted.png b/docs/imgs/apis/tutorials/group-upserted.png
deleted file mode 100644
index 5283f6273f02a6..00000000000000
Binary files a/docs/imgs/apis/tutorials/group-upserted.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/lineage-added.png b/docs/imgs/apis/tutorials/lineage-added.png
deleted file mode 100644
index b381498bad5ac4..00000000000000
Binary files a/docs/imgs/apis/tutorials/lineage-added.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/model-group-added-to-model.png b/docs/imgs/apis/tutorials/model-group-added-to-model.png
deleted file mode 100644
index 360b7fbb2d9220..00000000000000
Binary files a/docs/imgs/apis/tutorials/model-group-added-to-model.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/model-group-created.png b/docs/imgs/apis/tutorials/model-group-created.png
deleted file mode 100644
index 2e0fdcea4803f8..00000000000000
Binary files a/docs/imgs/apis/tutorials/model-group-created.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/owner-added.png b/docs/imgs/apis/tutorials/owner-added.png
deleted file mode 100644
index 6508c231cfb4ba..00000000000000
Binary files a/docs/imgs/apis/tutorials/owner-added.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/owner-removed.png b/docs/imgs/apis/tutorials/owner-removed.png
deleted file mode 100644
index a7b6567888caf0..00000000000000
Binary files a/docs/imgs/apis/tutorials/owner-removed.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/sample-ingestion.png b/docs/imgs/apis/tutorials/sample-ingestion.png
deleted file mode 100644
index 40aa0469048417..00000000000000
Binary files a/docs/imgs/apis/tutorials/sample-ingestion.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/tag-added.png b/docs/imgs/apis/tutorials/tag-added.png
deleted file mode 100644
index fd99a04f6cceba..00000000000000
Binary files a/docs/imgs/apis/tutorials/tag-added.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/tag-created.png b/docs/imgs/apis/tutorials/tag-created.png
deleted file mode 100644
index 99e3fea8a14e16..00000000000000
Binary files a/docs/imgs/apis/tutorials/tag-created.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/tag-removed.png b/docs/imgs/apis/tutorials/tag-removed.png
deleted file mode 100644
index 31a267549843e5..00000000000000
Binary files a/docs/imgs/apis/tutorials/tag-removed.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/term-added.png b/docs/imgs/apis/tutorials/term-added.png
deleted file mode 100644
index 62e285a92e7af0..00000000000000
Binary files a/docs/imgs/apis/tutorials/term-added.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/term-created.png b/docs/imgs/apis/tutorials/term-created.png
deleted file mode 100644
index deff0179b155ee..00000000000000
Binary files a/docs/imgs/apis/tutorials/term-created.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/term-removed.png b/docs/imgs/apis/tutorials/term-removed.png
deleted file mode 100644
index dbf9f35f093399..00000000000000
Binary files a/docs/imgs/apis/tutorials/term-removed.png and /dev/null differ
diff --git a/docs/imgs/apis/tutorials/user-upserted.png b/docs/imgs/apis/tutorials/user-upserted.png
deleted file mode 100644
index 38c5bbb9ad8280..00000000000000
Binary files a/docs/imgs/apis/tutorials/user-upserted.png and /dev/null differ
diff --git a/docs/imgs/aws/aws-elasticsearch.png b/docs/imgs/aws/aws-elasticsearch.png
deleted file mode 100644
index e16d5eee26fd85..00000000000000
Binary files a/docs/imgs/aws/aws-elasticsearch.png and /dev/null differ
diff --git a/docs/imgs/aws/aws-msk.png b/docs/imgs/aws/aws-msk.png
deleted file mode 100644
index 96a3173747007e..00000000000000
Binary files a/docs/imgs/aws/aws-msk.png and /dev/null differ
diff --git a/docs/imgs/aws/aws-rds.png b/docs/imgs/aws/aws-rds.png
deleted file mode 100644
index ab329952c77560..00000000000000
Binary files a/docs/imgs/aws/aws-rds.png and /dev/null differ
diff --git a/docs/imgs/browse-domains.png b/docs/imgs/browse-domains.png
deleted file mode 100644
index 41444470517d2a..00000000000000
Binary files a/docs/imgs/browse-domains.png and /dev/null differ
diff --git a/docs/imgs/cancelled-ingestion.png b/docs/imgs/cancelled-ingestion.png
deleted file mode 100644
index 0c4af7b66a8ff2..00000000000000
Binary files a/docs/imgs/cancelled-ingestion.png and /dev/null differ
diff --git a/docs/imgs/confluent-cloud-config-2.png b/docs/imgs/confluent-cloud-config-2.png
deleted file mode 100644
index 543101154f42cf..00000000000000
Binary files a/docs/imgs/confluent-cloud-config-2.png and /dev/null differ
diff --git a/docs/imgs/confluent-cloud-config.png b/docs/imgs/confluent-cloud-config.png
deleted file mode 100644
index a2490eab5c6a77..00000000000000
Binary files a/docs/imgs/confluent-cloud-config.png and /dev/null differ
diff --git a/docs/imgs/confluent-create-topic.png b/docs/imgs/confluent-create-topic.png
deleted file mode 100644
index 1972bb3770388f..00000000000000
Binary files a/docs/imgs/confluent-create-topic.png and /dev/null differ
diff --git a/docs/imgs/create-domain.png b/docs/imgs/create-domain.png
deleted file mode 100644
index 1db2090fca6b89..00000000000000
Binary files a/docs/imgs/create-domain.png and /dev/null differ
diff --git a/docs/imgs/create-new-ingestion-source-button.png b/docs/imgs/create-new-ingestion-source-button.png
deleted file mode 100644
index c425f0837c51d3..00000000000000
Binary files a/docs/imgs/create-new-ingestion-source-button.png and /dev/null differ
diff --git a/docs/imgs/create-secret.png b/docs/imgs/create-secret.png
deleted file mode 100644
index a0cc63e3b4892f..00000000000000
Binary files a/docs/imgs/create-secret.png and /dev/null differ
diff --git a/docs/imgs/custom-ingestion-cli-version.png b/docs/imgs/custom-ingestion-cli-version.png
deleted file mode 100644
index 43d4736684abb1..00000000000000
Binary files a/docs/imgs/custom-ingestion-cli-version.png and /dev/null differ
diff --git a/docs/imgs/datahub-architecture.png b/docs/imgs/datahub-architecture.png
deleted file mode 100644
index 236f939f74198b..00000000000000
Binary files a/docs/imgs/datahub-architecture.png and /dev/null differ
diff --git a/docs/imgs/datahub-architecture.svg b/docs/imgs/datahub-architecture.svg
deleted file mode 100644
index 842194a5e377ce..00000000000000
--- a/docs/imgs/datahub-architecture.svg
+++ /dev/null
@@ -1 +0,0 @@
-
\ No newline at end of file
diff --git a/docs/imgs/datahub-components.png b/docs/imgs/datahub-components.png
deleted file mode 100644
index 8b7d0e5330275a..00000000000000
Binary files a/docs/imgs/datahub-components.png and /dev/null differ
diff --git a/docs/imgs/datahub-logo-color-mark.svg b/docs/imgs/datahub-logo-color-mark.svg
deleted file mode 100644
index a984092952bae2..00000000000000
--- a/docs/imgs/datahub-logo-color-mark.svg
+++ /dev/null
@@ -1 +0,0 @@
-
\ No newline at end of file
diff --git a/docs/imgs/datahub-metadata-ingestion-framework.png b/docs/imgs/datahub-metadata-ingestion-framework.png
deleted file mode 100644
index 1319329710906d..00000000000000
Binary files a/docs/imgs/datahub-metadata-ingestion-framework.png and /dev/null differ
diff --git a/docs/imgs/datahub-metadata-model.png b/docs/imgs/datahub-metadata-model.png
deleted file mode 100644
index 59449cd0d4ef59..00000000000000
Binary files a/docs/imgs/datahub-metadata-model.png and /dev/null differ
diff --git a/docs/imgs/datahub-sequence-diagram.png b/docs/imgs/datahub-sequence-diagram.png
deleted file mode 100644
index b5a8f8a9c25ce2..00000000000000
Binary files a/docs/imgs/datahub-sequence-diagram.png and /dev/null differ
diff --git a/docs/imgs/datahub-serving.png b/docs/imgs/datahub-serving.png
deleted file mode 100644
index 67a2f8eb3f0856..00000000000000
Binary files a/docs/imgs/datahub-serving.png and /dev/null differ
diff --git a/docs/imgs/development/intellij-remote-debug.png b/docs/imgs/development/intellij-remote-debug.png
deleted file mode 100644
index 32a41a75d1dc38..00000000000000
Binary files a/docs/imgs/development/intellij-remote-debug.png and /dev/null differ
diff --git a/docs/imgs/domain-entities.png b/docs/imgs/domain-entities.png
deleted file mode 100644
index 5766d051fa209f..00000000000000
Binary files a/docs/imgs/domain-entities.png and /dev/null differ
diff --git a/docs/imgs/domains-tab.png b/docs/imgs/domains-tab.png
deleted file mode 100644
index 20be5b103fdcaa..00000000000000
Binary files a/docs/imgs/domains-tab.png and /dev/null differ
diff --git a/docs/imgs/entity-registry-diagram.png b/docs/imgs/entity-registry-diagram.png
deleted file mode 100644
index 08cb5edd8e13f2..00000000000000
Binary files a/docs/imgs/entity-registry-diagram.png and /dev/null differ
diff --git a/docs/imgs/entity.png b/docs/imgs/entity.png
deleted file mode 100644
index cfe9eb38b2921e..00000000000000
Binary files a/docs/imgs/entity.png and /dev/null differ
diff --git a/docs/imgs/example-mysql-recipe.png b/docs/imgs/example-mysql-recipe.png
deleted file mode 100644
index 9cb2cbb169a569..00000000000000
Binary files a/docs/imgs/example-mysql-recipe.png and /dev/null differ
diff --git a/docs/imgs/failed-ingestion.png b/docs/imgs/failed-ingestion.png
deleted file mode 100644
index 4f9de8eb002d2f..00000000000000
Binary files a/docs/imgs/failed-ingestion.png and /dev/null differ
diff --git a/docs/imgs/feature-create-new-tag.gif b/docs/imgs/feature-create-new-tag.gif
deleted file mode 100644
index 57b8ad852dd5b2..00000000000000
Binary files a/docs/imgs/feature-create-new-tag.gif and /dev/null differ
diff --git a/docs/imgs/feature-datahub-analytics.png b/docs/imgs/feature-datahub-analytics.png
deleted file mode 100644
index 7fe66b84682f9a..00000000000000
Binary files a/docs/imgs/feature-datahub-analytics.png and /dev/null differ
diff --git a/docs/imgs/feature-rich-documentation.gif b/docs/imgs/feature-rich-documentation.gif
deleted file mode 100644
index 48ad7956700226..00000000000000
Binary files a/docs/imgs/feature-rich-documentation.gif and /dev/null differ
diff --git a/docs/imgs/feature-tag-browse.gif b/docs/imgs/feature-tag-browse.gif
deleted file mode 100644
index e70a30db7d3ba9..00000000000000
Binary files a/docs/imgs/feature-tag-browse.gif and /dev/null differ
diff --git a/docs/imgs/feature-validation-timeseries.png b/docs/imgs/feature-validation-timeseries.png
deleted file mode 100644
index 28ce1daec5f32e..00000000000000
Binary files a/docs/imgs/feature-validation-timeseries.png and /dev/null differ
diff --git a/docs/imgs/feature-view-entitiy-details-via-lineage-vis.gif b/docs/imgs/feature-view-entitiy-details-via-lineage-vis.gif
deleted file mode 100644
index aad77df3735747..00000000000000
Binary files a/docs/imgs/feature-view-entitiy-details-via-lineage-vis.gif and /dev/null differ
diff --git a/docs/imgs/gcp/ingress1.png b/docs/imgs/gcp/ingress1.png
deleted file mode 100644
index 4cb49834af5b60..00000000000000
Binary files a/docs/imgs/gcp/ingress1.png and /dev/null differ
diff --git a/docs/imgs/gcp/ingress2.png b/docs/imgs/gcp/ingress2.png
deleted file mode 100644
index cdf2446b0e923b..00000000000000
Binary files a/docs/imgs/gcp/ingress2.png and /dev/null differ
diff --git a/docs/imgs/gcp/ingress3.png b/docs/imgs/gcp/ingress3.png
deleted file mode 100644
index cc3745ad97f5bd..00000000000000
Binary files a/docs/imgs/gcp/ingress3.png and /dev/null differ
diff --git a/docs/imgs/gcp/ingress_final.png b/docs/imgs/gcp/ingress_final.png
deleted file mode 100644
index a30ca744c49f76..00000000000000
Binary files a/docs/imgs/gcp/ingress_final.png and /dev/null differ
diff --git a/docs/imgs/gcp/ingress_ready.png b/docs/imgs/gcp/ingress_ready.png
deleted file mode 100644
index d14016e420fd3d..00000000000000
Binary files a/docs/imgs/gcp/ingress_ready.png and /dev/null differ
diff --git a/docs/imgs/gcp/services_ingress.png b/docs/imgs/gcp/services_ingress.png
deleted file mode 100644
index 1d9ff2b313715c..00000000000000
Binary files a/docs/imgs/gcp/services_ingress.png and /dev/null differ
diff --git a/docs/imgs/glossary/add-term-modal.png b/docs/imgs/glossary/add-term-modal.png
deleted file mode 100644
index e32a9cb8d648c6..00000000000000
Binary files a/docs/imgs/glossary/add-term-modal.png and /dev/null differ
diff --git a/docs/imgs/glossary/add-term-to-entity.png b/docs/imgs/glossary/add-term-to-entity.png
deleted file mode 100644
index 7487a68c0d7559..00000000000000
Binary files a/docs/imgs/glossary/add-term-to-entity.png and /dev/null differ
diff --git a/docs/imgs/glossary/create-from-node.png b/docs/imgs/glossary/create-from-node.png
deleted file mode 100644
index 70638d083343c2..00000000000000
Binary files a/docs/imgs/glossary/create-from-node.png and /dev/null differ
diff --git a/docs/imgs/glossary/create-modal.png b/docs/imgs/glossary/create-modal.png
deleted file mode 100644
index e84fb5a36e2d40..00000000000000
Binary files a/docs/imgs/glossary/create-modal.png and /dev/null differ
diff --git a/docs/imgs/glossary/delete-button.png b/docs/imgs/glossary/delete-button.png
deleted file mode 100644
index 3e0cc2a5b0a54a..00000000000000
Binary files a/docs/imgs/glossary/delete-button.png and /dev/null differ
diff --git a/docs/imgs/glossary/edit-term.png b/docs/imgs/glossary/edit-term.png
deleted file mode 100644
index 62b0e425c8c4f3..00000000000000
Binary files a/docs/imgs/glossary/edit-term.png and /dev/null differ
diff --git a/docs/imgs/glossary/glossary-button.png b/docs/imgs/glossary/glossary-button.png
deleted file mode 100644
index e4b8fd23935877..00000000000000
Binary files a/docs/imgs/glossary/glossary-button.png and /dev/null differ
diff --git a/docs/imgs/glossary/move-term-button.png b/docs/imgs/glossary/move-term-button.png
deleted file mode 100644
index df03c820340eff..00000000000000
Binary files a/docs/imgs/glossary/move-term-button.png and /dev/null differ
diff --git a/docs/imgs/glossary/move-term-modal.png b/docs/imgs/glossary/move-term-modal.png
deleted file mode 100644
index 0fda501911b2b0..00000000000000
Binary files a/docs/imgs/glossary/move-term-modal.png and /dev/null differ
diff --git a/docs/imgs/glossary/root-glossary-create.png b/docs/imgs/glossary/root-glossary-create.png
deleted file mode 100644
index c91f397eb6213c..00000000000000
Binary files a/docs/imgs/glossary/root-glossary-create.png and /dev/null differ
diff --git a/docs/imgs/glossary/root-glossary.png b/docs/imgs/glossary/root-glossary.png
deleted file mode 100644
index 1296c16b0dc3d1..00000000000000
Binary files a/docs/imgs/glossary/root-glossary.png and /dev/null differ
diff --git a/docs/imgs/ingestion-architecture.png b/docs/imgs/ingestion-architecture.png
deleted file mode 100644
index fc7bc74acacfaf..00000000000000
Binary files a/docs/imgs/ingestion-architecture.png and /dev/null differ
diff --git a/docs/imgs/ingestion-logs.png b/docs/imgs/ingestion-logs.png
deleted file mode 100644
index 42211be7379d6e..00000000000000
Binary files a/docs/imgs/ingestion-logs.png and /dev/null differ
diff --git a/docs/imgs/ingestion-privileges.png b/docs/imgs/ingestion-privileges.png
deleted file mode 100644
index 8e23868309676c..00000000000000
Binary files a/docs/imgs/ingestion-privileges.png and /dev/null differ
diff --git a/docs/imgs/ingestion-tab.png b/docs/imgs/ingestion-tab.png
deleted file mode 100644
index 046068c63bdb7b..00000000000000
Binary files a/docs/imgs/ingestion-tab.png and /dev/null differ
diff --git a/docs/imgs/ingestion-with-token.png b/docs/imgs/ingestion-with-token.png
deleted file mode 100644
index 5e1a2cce036f7a..00000000000000
Binary files a/docs/imgs/ingestion-with-token.png and /dev/null differ
diff --git a/docs/imgs/invite-users-button.png b/docs/imgs/invite-users-button.png
deleted file mode 100644
index a5d07a1c1e7e75..00000000000000
Binary files a/docs/imgs/invite-users-button.png and /dev/null differ
diff --git a/docs/imgs/invite-users-popup.png b/docs/imgs/invite-users-popup.png
deleted file mode 100644
index 621b1521eae752..00000000000000
Binary files a/docs/imgs/invite-users-popup.png and /dev/null differ
diff --git a/docs/imgs/lineage.png b/docs/imgs/lineage.png
deleted file mode 100644
index 7488c1e04c31b2..00000000000000
Binary files a/docs/imgs/lineage.png and /dev/null differ
diff --git a/docs/imgs/list-domains.png b/docs/imgs/list-domains.png
deleted file mode 100644
index 98a28130f8c990..00000000000000
Binary files a/docs/imgs/list-domains.png and /dev/null differ
diff --git a/docs/imgs/locust-example.png b/docs/imgs/locust-example.png
deleted file mode 100644
index bbae3e0ca19d07..00000000000000
Binary files a/docs/imgs/locust-example.png and /dev/null differ
diff --git a/docs/imgs/metadata-model-chart.png b/docs/imgs/metadata-model-chart.png
deleted file mode 100644
index 2fb74836549063..00000000000000
Binary files a/docs/imgs/metadata-model-chart.png and /dev/null differ
diff --git a/docs/imgs/metadata-model-to-fork-or-not-to.png b/docs/imgs/metadata-model-to-fork-or-not-to.png
deleted file mode 100644
index f9d89d555196d1..00000000000000
Binary files a/docs/imgs/metadata-model-to-fork-or-not-to.png and /dev/null differ
diff --git a/docs/imgs/metadata-modeling.png b/docs/imgs/metadata-modeling.png
deleted file mode 100644
index cbad7613e04e43..00000000000000
Binary files a/docs/imgs/metadata-modeling.png and /dev/null differ
diff --git a/docs/imgs/metadata-service-auth.png b/docs/imgs/metadata-service-auth.png
deleted file mode 100644
index 15a3ac51876c23..00000000000000
Binary files a/docs/imgs/metadata-service-auth.png and /dev/null differ
diff --git a/docs/imgs/metadata-serving.png b/docs/imgs/metadata-serving.png
deleted file mode 100644
index 54b928a0cff52e..00000000000000
Binary files a/docs/imgs/metadata-serving.png and /dev/null differ
diff --git a/docs/imgs/metadata.png b/docs/imgs/metadata.png
deleted file mode 100644
index 45bb0cdce12e95..00000000000000
Binary files a/docs/imgs/metadata.png and /dev/null differ
diff --git a/docs/imgs/name-ingestion-source.png b/docs/imgs/name-ingestion-source.png
deleted file mode 100644
index bde12082484738..00000000000000
Binary files a/docs/imgs/name-ingestion-source.png and /dev/null differ
diff --git a/docs/imgs/no-code-after.png b/docs/imgs/no-code-after.png
deleted file mode 100644
index c0eee88625ace9..00000000000000
Binary files a/docs/imgs/no-code-after.png and /dev/null differ
diff --git a/docs/imgs/no-code-before.png b/docs/imgs/no-code-before.png
deleted file mode 100644
index 50315578b1804a..00000000000000
Binary files a/docs/imgs/no-code-before.png and /dev/null differ
diff --git a/docs/imgs/platform-instances-for-ingestion.png b/docs/imgs/platform-instances-for-ingestion.png
deleted file mode 100644
index 740249a805fb85..00000000000000
Binary files a/docs/imgs/platform-instances-for-ingestion.png and /dev/null differ
diff --git a/docs/imgs/quickstart-ingestion-config.png b/docs/imgs/quickstart-ingestion-config.png
deleted file mode 100644
index de51777ccddc3a..00000000000000
Binary files a/docs/imgs/quickstart-ingestion-config.png and /dev/null differ
diff --git a/docs/imgs/reset-credentials-screen.png b/docs/imgs/reset-credentials-screen.png
deleted file mode 100644
index 4b680837b77ab1..00000000000000
Binary files a/docs/imgs/reset-credentials-screen.png and /dev/null differ
diff --git a/docs/imgs/reset-user-password-button.png b/docs/imgs/reset-user-password-button.png
deleted file mode 100644
index 5b1f3ee153d072..00000000000000
Binary files a/docs/imgs/reset-user-password-button.png and /dev/null differ
diff --git a/docs/imgs/reset-user-password-popup.png b/docs/imgs/reset-user-password-popup.png
deleted file mode 100644
index ac2456dde4d4d3..00000000000000
Binary files a/docs/imgs/reset-user-password-popup.png and /dev/null differ
diff --git a/docs/imgs/running-ingestion.png b/docs/imgs/running-ingestion.png
deleted file mode 100644
index a03fb444a029ed..00000000000000
Binary files a/docs/imgs/running-ingestion.png and /dev/null differ
diff --git a/docs/imgs/s3-ingestion/10_outputs.png b/docs/imgs/s3-ingestion/10_outputs.png
deleted file mode 100644
index e0d1ed3376ade9..00000000000000
Binary files a/docs/imgs/s3-ingestion/10_outputs.png and /dev/null differ
diff --git a/docs/imgs/s3-ingestion/1_crawler-info.png b/docs/imgs/s3-ingestion/1_crawler-info.png
deleted file mode 100644
index 12882473920479..00000000000000
Binary files a/docs/imgs/s3-ingestion/1_crawler-info.png and /dev/null differ
diff --git a/docs/imgs/s3-ingestion/2_crawler-type.png b/docs/imgs/s3-ingestion/2_crawler-type.png
deleted file mode 100644
index 4898438417913c..00000000000000
Binary files a/docs/imgs/s3-ingestion/2_crawler-type.png and /dev/null differ
diff --git a/docs/imgs/s3-ingestion/3_data-store.png b/docs/imgs/s3-ingestion/3_data-store.png
deleted file mode 100644
index d29e4b1be05d65..00000000000000
Binary files a/docs/imgs/s3-ingestion/3_data-store.png and /dev/null differ
diff --git a/docs/imgs/s3-ingestion/4_data-store-2.png b/docs/imgs/s3-ingestion/4_data-store-2.png
deleted file mode 100644
index c0a6f140bedb22..00000000000000
Binary files a/docs/imgs/s3-ingestion/4_data-store-2.png and /dev/null differ
diff --git a/docs/imgs/s3-ingestion/5_iam.png b/docs/imgs/s3-ingestion/5_iam.png
deleted file mode 100644
index 73a631cb74f560..00000000000000
Binary files a/docs/imgs/s3-ingestion/5_iam.png and /dev/null differ
diff --git a/docs/imgs/s3-ingestion/6_schedule.png b/docs/imgs/s3-ingestion/6_schedule.png
deleted file mode 100644
index c5df59348fbc69..00000000000000
Binary files a/docs/imgs/s3-ingestion/6_schedule.png and /dev/null differ
diff --git a/docs/imgs/s3-ingestion/7_output.png b/docs/imgs/s3-ingestion/7_output.png
deleted file mode 100644
index 6201fa40bcfb33..00000000000000
Binary files a/docs/imgs/s3-ingestion/7_output.png and /dev/null differ
diff --git a/docs/imgs/s3-ingestion/8_review.png b/docs/imgs/s3-ingestion/8_review.png
deleted file mode 100644
index 2d27e79c2128b8..00000000000000
Binary files a/docs/imgs/s3-ingestion/8_review.png and /dev/null differ
diff --git a/docs/imgs/s3-ingestion/9_run.png b/docs/imgs/s3-ingestion/9_run.png
deleted file mode 100644
index 2b0644f6ad0384..00000000000000
Binary files a/docs/imgs/s3-ingestion/9_run.png and /dev/null differ
diff --git a/docs/imgs/schedule-ingestion.png b/docs/imgs/schedule-ingestion.png
deleted file mode 100644
index 0e6ec8e268c32a..00000000000000
Binary files a/docs/imgs/schedule-ingestion.png and /dev/null differ
diff --git a/docs/imgs/schema-blame-blame-activated.png b/docs/imgs/schema-blame-blame-activated.png
deleted file mode 100644
index 363466c39aedfb..00000000000000
Binary files a/docs/imgs/schema-blame-blame-activated.png and /dev/null differ
diff --git a/docs/imgs/schema-history-audit-activated.png b/docs/imgs/schema-history-audit-activated.png
deleted file mode 100644
index f59676b9b8a8fd..00000000000000
Binary files a/docs/imgs/schema-history-audit-activated.png and /dev/null differ
diff --git a/docs/imgs/schema-history-latest-version.png b/docs/imgs/schema-history-latest-version.png
deleted file mode 100644
index 0a54df4d520d53..00000000000000
Binary files a/docs/imgs/schema-history-latest-version.png and /dev/null differ
diff --git a/docs/imgs/schema-history-older-version.png b/docs/imgs/schema-history-older-version.png
deleted file mode 100644
index 8d295f176104f7..00000000000000
Binary files a/docs/imgs/schema-history-older-version.png and /dev/null differ
diff --git a/docs/imgs/search-by-domain.png b/docs/imgs/search-by-domain.png
deleted file mode 100644
index 4b92e589591877..00000000000000
Binary files a/docs/imgs/search-by-domain.png and /dev/null differ
diff --git a/docs/imgs/search-domain.png b/docs/imgs/search-domain.png
deleted file mode 100644
index b1359e07d5fc21..00000000000000
Binary files a/docs/imgs/search-domain.png and /dev/null differ
diff --git a/docs/imgs/search-tag.png b/docs/imgs/search-tag.png
deleted file mode 100644
index cf4b6b629d1e23..00000000000000
Binary files a/docs/imgs/search-tag.png and /dev/null differ
diff --git a/docs/imgs/select-platform-template.png b/docs/imgs/select-platform-template.png
deleted file mode 100644
index 4f78e2b7309edc..00000000000000
Binary files a/docs/imgs/select-platform-template.png and /dev/null differ
diff --git a/docs/imgs/set-domain-id.png b/docs/imgs/set-domain-id.png
deleted file mode 100644
index 3e1dde4ae51ee1..00000000000000
Binary files a/docs/imgs/set-domain-id.png and /dev/null differ
diff --git a/docs/imgs/set-domain.png b/docs/imgs/set-domain.png
deleted file mode 100644
index 1c4460e747835d..00000000000000
Binary files a/docs/imgs/set-domain.png and /dev/null differ
diff --git a/docs/imgs/successful-ingestion.png b/docs/imgs/successful-ingestion.png
deleted file mode 100644
index fa8dbdff7501ed..00000000000000
Binary files a/docs/imgs/successful-ingestion.png and /dev/null differ
diff --git a/docs/imgs/timeline/dropdown-apis.png b/docs/imgs/timeline/dropdown-apis.png
deleted file mode 100644
index f7aba08bbc061f..00000000000000
Binary files a/docs/imgs/timeline/dropdown-apis.png and /dev/null differ
diff --git a/docs/imgs/timeline/swagger-ui.png b/docs/imgs/timeline/swagger-ui.png
deleted file mode 100644
index e52a57e8ca6706..00000000000000
Binary files a/docs/imgs/timeline/swagger-ui.png and /dev/null differ
diff --git a/docs/imgs/timeline/timeline-conceptually.png b/docs/imgs/timeline/timeline-conceptually.png
deleted file mode 100644
index 70bd843bf8aed7..00000000000000
Binary files a/docs/imgs/timeline/timeline-conceptually.png and /dev/null differ
diff --git a/docs/imgs/user-sign-up-screen.png b/docs/imgs/user-sign-up-screen.png
deleted file mode 100644
index 88c2589203bd18..00000000000000
Binary files a/docs/imgs/user-sign-up-screen.png and /dev/null differ
diff --git a/docs/lineage/airflow.md b/docs/lineage/airflow.md
index 21d59b777dd7c6..49de5352f6d58c 100644
--- a/docs/lineage/airflow.md
+++ b/docs/lineage/airflow.md
@@ -65,7 +65,7 @@ lazy_load_plugins = False
| datahub.capture_executions | true | If true, we'll capture task runs in DataHub in addition to DAG definitions. |
| datahub.graceful_exceptions | true | If set to true, most runtime errors in the lineage backend will be suppressed and will not cause the overall task to fail. Note that configuration issues will still throw exceptions. |
-5. Configure `inlets` and `outlets` for your Airflow operators. For reference, look at the sample DAG in [`lineage_backend_demo.py`](../../metadata-ingestion/src/datahub_provider/example_dags/lineage_backend_demo.py), or reference [`lineage_backend_taskflow_demo.py`](../../metadata-ingestion/src/datahub_provider/example_dags/lineage_backend_taskflow_demo.py) if you're using the [TaskFlow API](https://airflow.apache.org/docs/apache-airflow/stable/concepts/taskflow.html).
+5. Configure `inlets` and `outlets` for your Airflow operators. For reference, look at the sample DAG in [`lineage_backend_demo.py`](../../metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/lineage_backend_demo.py), or reference [`lineage_backend_taskflow_demo.py`](../../metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/lineage_backend_taskflow_demo.py) if you're using the [TaskFlow API](https://airflow.apache.org/docs/apache-airflow/stable/concepts/taskflow.html).
6. [optional] Learn more about [Airflow lineage](https://airflow.apache.org/docs/apache-airflow/stable/lineage.html), including shorthand notation and some automation.
### How to validate installation
@@ -160,14 +160,14 @@ pip install acryl-datahub[airflow,datahub-kafka]
- `capture_executions` (defaults to false): If true, it captures task runs as DataHub DataProcessInstances.
- `graceful_exceptions` (defaults to true): If set to true, most runtime errors in the lineage backend will be suppressed and will not cause the overall task to fail. Note that configuration issues will still throw exceptions.
-4. Configure `inlets` and `outlets` for your Airflow operators. For reference, look at the sample DAG in [`lineage_backend_demo.py`](../../metadata-ingestion/src/datahub_provider/example_dags/lineage_backend_demo.py), or reference [`lineage_backend_taskflow_demo.py`](../../metadata-ingestion/src/datahub_provider/example_dags/lineage_backend_taskflow_demo.py) if you're using the [TaskFlow API](https://airflow.apache.org/docs/apache-airflow/stable/concepts/taskflow.html).
+4. Configure `inlets` and `outlets` for your Airflow operators. For reference, look at the sample DAG in [`lineage_backend_demo.py`](../../metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/lineage_backend_demo.py), or reference [`lineage_backend_taskflow_demo.py`](../../metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/lineage_backend_taskflow_demo.py) if you're using the [TaskFlow API](https://airflow.apache.org/docs/apache-airflow/stable/concepts/taskflow.html).
5. [optional] Learn more about [Airflow lineage](https://airflow.apache.org/docs/apache-airflow/stable/lineage.html), including shorthand notation and some automation.
## Emitting lineage via a separate operator
Take a look at this sample DAG:
-- [`lineage_emission_dag.py`](../../metadata-ingestion/src/datahub_provider/example_dags/lineage_emission_dag.py) - emits lineage using the DatahubEmitterOperator.
+- [`lineage_emission_dag.py`](../../metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/lineage_emission_dag.py) - emits lineage using the DatahubEmitterOperator.
In order to use this example, you must first configure the Datahub hook. Like in ingestion, we support a Datahub REST hook and a Kafka-based hook. See step 1 above for details.
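For illustration, here is a minimal sketch of the `inlets`/`outlets` pattern the updated example-DAG paths point at. It assumes the `acryl-datahub[airflow]` integration and its `Dataset` entity helper (exposed as `datahub_provider.entities` in older releases); the platform and table names are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Dataset helper shipped with the DataHub Airflow integration
# (older releases expose it under datahub_provider.entities).
from datahub_provider.entities import Dataset

with DAG(
    dag_id="transform_clicks",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    # The lineage backend reads inlets/outlets from the operator and
    # emits dataset-level lineage to DataHub when the task runs.
    transform = BashOperator(
        task_id="transform",
        bash_command="echo 'run transformation'",
        inlets=[Dataset("snowflake", "analytics.raw.clicks")],
        outlets=[Dataset("snowflake", "analytics.derived.daily_clicks")],
    )
```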
diff --git a/docs/links.md b/docs/links.md
index f175262b9b5d93..45ba391e557cdb 100644
--- a/docs/links.md
+++ b/docs/links.md
@@ -39,7 +39,7 @@
* [Creating Notebook-based Dynamic Dashboards](https://towardsdatascience.com/creating-notebook-based-dynamic-dashboards-91f936adc6f3)
## Talks & Presentations
-* [DataHub: Powering LinkedIn's Metadata](demo/DataHub_-_Powering_LinkedIn_Metadata.pdf) @ [Budapest Data Forum 2020](https://budapestdata.hu/2020/en/)
+* [DataHub: Powering LinkedIn's Metadata](https://github.com/acryldata/static-assets-test/raw/master/imgs/demo/DataHub_-_Powering_LinkedIn_Metadata.pdf) @ [Budapest Data Forum 2020](https://budapestdata.hu/2020/en/)
* [Taming the Data Beast Using DataHub](https://www.youtube.com/watch?v=bo4OhiPro7Y) @ [Data Engineering Melbourne Meetup November 2020](https://www.meetup.com/Data-Engineering-Melbourne/events/kgnvlrybcpbjc/)
* [Metadata Management And Integration At LinkedIn With DataHub](https://www.dataengineeringpodcast.com/datahub-metadata-management-episode-147/) @ [Data Engineering Podcast](https://www.dataengineeringpodcast.com)
* [The evolution of metadata: LinkedIn’s story](https://speakerdeck.com/shirshanka/the-evolution-of-metadata-linkedins-journey-strata-nyc-2019) @ [Strata Data Conference 2019](https://conferences.oreilly.com/strata/strata-ny-2019.html)
diff --git a/docs/managed-datahub/chrome-extension.md b/docs/managed-datahub/chrome-extension.md
index a614327c7fd29d..0aa0860d03b67a 100644
--- a/docs/managed-datahub/chrome-extension.md
+++ b/docs/managed-datahub/chrome-extension.md
@@ -10,7 +10,11 @@ import FeatureAvailability from '@site/src/components/FeatureAvailability';
In order to use the Acryl DataHub Chrome extension, you need to download it onto your browser from the Chrome web store [here](https://chrome.google.com/webstore/detail/datahub-chrome-extension/aoenebhmfokhglijmoacfjcnebdpchfj).
-![](imgs/saas/chrome-store-extension-screenshot.png)
+
+
+
+
+
Simply click "Add to Chrome" then "Add extension" on the ensuing popup.
@@ -20,11 +24,19 @@ Once you have your extension installed, you'll need to configure it to work with
1. Click the extension button on the right of your browser's address bar to view all of your installed extensions. Click on the newly installed DataHub extension.
-![](imgs/saas/extension_open_popup.png)
+
+
+
+
+
2. Fill in your DataHub domain and click "Continue" in the extension popup that appears.
-![](imgs/saas/extension_enter_domain.png)
+
+
+
+
+
If your organization uses standard SaaS domains for Looker, you should be ready to go!
@@ -34,11 +46,19 @@ Some organizations have custom SaaS domains for Looker and some Acryl DataHub de
1. Click on the extension button and select your DataHub extension to open the popup again. Now click the settings icon in order to open the configurations page.
-![](imgs/saas/extension_open_options_page.png)
+
+
+
+
+
2. Fill out and save any custom configurations you have in the **TOOL CONFIGURATIONS** section. Here you can configure a custom domain, a Platform Instance associated with that domain, and the Environment set on your DataHub assets. If you don't have a custom domain but do have a custom Platform Instance or Environment, feel free to leave the domain field empty.
-![](imgs/saas/extension_custom_configs.png)
+
+
+
+
+
## Using the Extension
@@ -52,7 +72,11 @@ Once you have everything configured on your extension, it's time to use it!
4. Click the Acryl DataHub extension button on the bottom right of your page to open a drawer where you can now see additional information about this asset right from your DataHub instance.
-![](imgs/saas/extension_view_in_looker.png)
+
+
+
+
+
## Advanced: Self-Hosted DataHub
diff --git a/docs/managed-datahub/datahub-api/graphql-api/getting-started.md b/docs/managed-datahub/datahub-api/graphql-api/getting-started.md
index 3c57b0a21d96e4..736bf6fea6811d 100644
--- a/docs/managed-datahub/datahub-api/graphql-api/getting-started.md
+++ b/docs/managed-datahub/datahub-api/graphql-api/getting-started.md
@@ -10,7 +10,11 @@ For a full reference to the Queries & Mutations available for consumption, check
### Connecting to the API
-![](../../imgs/saas/image-(3).png)
+
+
+
+
+
When you generate the token you will see an example of `curl` command which you can use to connect to the GraphQL API.
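For readers who prefer a scripted client over `curl`, a minimal sketch of the same token-authenticated call in Python might look like the following. The host and token are placeholders, and the `/api/graphql` path and `me` query reflect DataHub's standard GraphQL endpoint; verify both against your instance.

```python
import requests

# Placeholder host and token - substitute your own values.
DATAHUB_HOST = "https://your-account.acryl.io"
API_TOKEN = "<personal-access-token>"

# A simple introspective query to confirm the token works.
query = """
query {
  me {
    corpUser {
      username
    }
  }
}
"""

response = requests.post(
    f"{DATAHUB_HOST}/api/graphql",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"query": query},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```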
diff --git a/docs/managed-datahub/datahub-api/graphql-api/incidents-api-beta.md b/docs/managed-datahub/datahub-api/graphql-api/incidents-api-beta.md
index 89bacb2009e494..16d83d2f575752 100644
--- a/docs/managed-datahub/datahub-api/graphql-api/incidents-api-beta.md
+++ b/docs/managed-datahub/datahub-api/graphql-api/incidents-api-beta.md
@@ -404,7 +404,11 @@ You can configure Acryl to send slack notifications to a specific channel when i
These notifications are also able to tag the immediate asset's owners, along with the owners of downstream assets consuming it.
-![](../../imgs/saas/Screen-Shot-2022-03-22-at-6.46.41-PM.png)
+
+
+
+
+
To do so, simply follow the [Slack Integration Guide](docs/managed-datahub/saas-slack-setup.md) and contact your Acryl customer success team to enable the feature!
diff --git a/docs/managed-datahub/imgs/saas/DataHub-Architecture.png b/docs/managed-datahub/imgs/saas/DataHub-Architecture.png
deleted file mode 100644
index 95b3ab0b06ad64..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/DataHub-Architecture.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-01-13-at-7.45.56-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-01-13-at-7.45.56-PM.png
deleted file mode 100644
index 721989a6c37e11..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-01-13-at-7.45.56-PM.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-01-24-at-4.35.17-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-01-24-at-4.35.17-PM.png
deleted file mode 100644
index dffac92f257c7b..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-01-24-at-4.35.17-PM.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-01-24-at-4.37.22-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-01-24-at-4.37.22-PM.png
deleted file mode 100644
index ff0c29de1fbad3..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-01-24-at-4.37.22-PM.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-03-07-at-10.23.31-AM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-03-07-at-10.23.31-AM.png
deleted file mode 100644
index 070bfd9f6b8975..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-03-07-at-10.23.31-AM.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-03-22-at-6.43.25-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-03-22-at-6.43.25-PM.png
deleted file mode 100644
index b4bb4e2ba60edc..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-03-22-at-6.43.25-PM.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-03-22-at-6.44.15-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-03-22-at-6.44.15-PM.png
deleted file mode 100644
index b0397afd1b3a40..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-03-22-at-6.44.15-PM.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-03-22-at-6.46.41-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-03-22-at-6.46.41-PM.png
deleted file mode 100644
index 9258badb6f0889..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-03-22-at-6.46.41-PM.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-05-at-4.52.55-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-05-at-4.52.55-PM.png
deleted file mode 100644
index 386b4cdcd99113..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-05-at-4.52.55-PM.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-05-at-4.56.50-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-05-at-4.56.50-PM.png
deleted file mode 100644
index a129f5eba4271b..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-05-at-4.56.50-PM.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-05-at-4.58.46-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-05-at-4.58.46-PM.png
deleted file mode 100644
index 96ae48318a35a1..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-05-at-4.58.46-PM.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-05-at-5.01.16-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-05-at-5.01.16-PM.png
deleted file mode 100644
index b6fd273389c904..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-05-at-5.01.16-PM.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-05-at-5.03.36-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-05-at-5.03.36-PM.png
deleted file mode 100644
index 0acd4e75bc6d2c..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-05-at-5.03.36-PM.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-13-at-2.34.24-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-13-at-2.34.24-PM.png
deleted file mode 100644
index 364b9292cfaab7..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-04-13-at-2.34.24-PM.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-06-13-at-7.56.16-AM-(1).png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-06-13-at-7.56.16-AM-(1).png
deleted file mode 100644
index 6a12dc545ec62c..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-06-13-at-7.56.16-AM-(1).png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-06-13-at-7.56.16-AM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-06-13-at-7.56.16-AM.png
deleted file mode 100644
index 6a12dc545ec62c..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-06-13-at-7.56.16-AM.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-06-13-at-8.02.55-AM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-06-13-at-8.02.55-AM.png
deleted file mode 100644
index 83645e00d724a4..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-06-13-at-8.02.55-AM.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-06-24-at-11.02.47-AM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-06-24-at-11.02.47-AM.png
deleted file mode 100644
index a2f239ce847e07..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-06-24-at-11.02.47-AM.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-06-24-at-12.59.38-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-06-24-at-12.59.38-PM.png
deleted file mode 100644
index e31d4b089d9292..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-06-24-at-12.59.38-PM.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-22-at-11.21.42-AM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-22-at-11.21.42-AM.png
deleted file mode 100644
index c003581c9d1b63..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-22-at-11.21.42-AM.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-22-at-11.22.23-AM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-22-at-11.22.23-AM.png
deleted file mode 100644
index 660dd121dd0a41..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-22-at-11.22.23-AM.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-22-at-11.23.08-AM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-22-at-11.23.08-AM.png
deleted file mode 100644
index 07e3c71dba262f..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-22-at-11.23.08-AM.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-22-at-11.47.57-AM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-22-at-11.47.57-AM.png
deleted file mode 100644
index 579e7f62af7085..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-22-at-11.47.57-AM.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-29-at-6.07.25-PM-(1).png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-29-at-6.07.25-PM-(1).png
deleted file mode 100644
index f85f4d5c79bfb9..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-29-at-6.07.25-PM-(1).png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-29-at-6.07.25-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-29-at-6.07.25-PM.png
deleted file mode 100644
index f85f4d5c79bfb9..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2022-08-29-at-6.07.25-PM.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2023-01-19-at-4.16.52-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2023-01-19-at-4.16.52-PM.png
deleted file mode 100644
index cb8b7470cd957d..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2023-01-19-at-4.16.52-PM.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2023-01-19-at-4.23.32-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2023-01-19-at-4.23.32-PM.png
deleted file mode 100644
index 1de51e33d87c23..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2023-01-19-at-4.23.32-PM.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2023-01-19-at-5.12.47-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2023-01-19-at-5.12.47-PM.png
deleted file mode 100644
index df687dabe345c4..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2023-01-19-at-5.12.47-PM.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2023-01-19-at-5.12.56-PM-(1).png b/docs/managed-datahub/imgs/saas/Screen-Shot-2023-01-19-at-5.12.56-PM-(1).png
deleted file mode 100644
index a8d9ee37c7a558..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2023-01-19-at-5.12.56-PM-(1).png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Screen-Shot-2023-01-19-at-5.12.56-PM.png b/docs/managed-datahub/imgs/saas/Screen-Shot-2023-01-19-at-5.12.56-PM.png
deleted file mode 100644
index a8d9ee37c7a558..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Screen-Shot-2023-01-19-at-5.12.56-PM.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Untitled(1).png b/docs/managed-datahub/imgs/saas/Untitled(1).png
deleted file mode 100644
index 87846e7897f6ed..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Untitled(1).png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Untitled-(2)-(1).png b/docs/managed-datahub/imgs/saas/Untitled-(2)-(1).png
deleted file mode 100644
index 7715bf4a51fbe4..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Untitled-(2)-(1).png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Untitled-(2).png b/docs/managed-datahub/imgs/saas/Untitled-(2).png
deleted file mode 100644
index a01a1af370442d..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Untitled-(2).png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Untitled-(3).png b/docs/managed-datahub/imgs/saas/Untitled-(3).png
deleted file mode 100644
index 02d84b326896c8..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Untitled-(3).png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Untitled-(4).png b/docs/managed-datahub/imgs/saas/Untitled-(4).png
deleted file mode 100644
index a01a1af370442d..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Untitled-(4).png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/Untitled.png b/docs/managed-datahub/imgs/saas/Untitled.png
deleted file mode 100644
index a01a1af370442d..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/Untitled.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/chrome-store-extension-screenshot.png b/docs/managed-datahub/imgs/saas/chrome-store-extension-screenshot.png
deleted file mode 100644
index e00a4d57f32ddc..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/chrome-store-extension-screenshot.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/extension_custom_configs.png b/docs/managed-datahub/imgs/saas/extension_custom_configs.png
deleted file mode 100644
index b3d70dfac00ff4..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/extension_custom_configs.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/extension_developer_mode.png b/docs/managed-datahub/imgs/saas/extension_developer_mode.png
deleted file mode 100644
index e740d15912e174..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/extension_developer_mode.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/extension_enter_domain.png b/docs/managed-datahub/imgs/saas/extension_enter_domain.png
deleted file mode 100644
index 3304fa168beaf1..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/extension_enter_domain.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/extension_load_unpacked.png b/docs/managed-datahub/imgs/saas/extension_load_unpacked.png
deleted file mode 100644
index 8f56705cd91769..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/extension_load_unpacked.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/extension_open_options_page.png b/docs/managed-datahub/imgs/saas/extension_open_options_page.png
deleted file mode 100644
index c1366d5673b599..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/extension_open_options_page.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/extension_open_popup.png b/docs/managed-datahub/imgs/saas/extension_open_popup.png
deleted file mode 100644
index 216056b847fb51..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/extension_open_popup.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/extension_view_in_looker.png b/docs/managed-datahub/imgs/saas/extension_view_in_looker.png
deleted file mode 100644
index bf854b3e840f7b..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/extension_view_in_looker.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/home-(1).png b/docs/managed-datahub/imgs/saas/home-(1).png
deleted file mode 100644
index 88cf2017dd7e71..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/home-(1).png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/home.png b/docs/managed-datahub/imgs/saas/home.png
deleted file mode 100644
index 8ad63deec75c9b..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/home.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/image-(1).png b/docs/managed-datahub/imgs/saas/image-(1).png
deleted file mode 100644
index c1a249125fcf7c..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/image-(1).png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/image-(10).png b/docs/managed-datahub/imgs/saas/image-(10).png
deleted file mode 100644
index a580fdc3d67309..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/image-(10).png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/image-(11).png b/docs/managed-datahub/imgs/saas/image-(11).png
deleted file mode 100644
index ee95eb43842723..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/image-(11).png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/image-(12).png b/docs/managed-datahub/imgs/saas/image-(12).png
deleted file mode 100644
index bbd8e6a66cf85b..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/image-(12).png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/image-(13).png b/docs/managed-datahub/imgs/saas/image-(13).png
deleted file mode 100644
index bbd8e6a66cf85b..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/image-(13).png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/image-(14).png b/docs/managed-datahub/imgs/saas/image-(14).png
deleted file mode 100644
index a580fdc3d67309..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/image-(14).png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/image-(15).png b/docs/managed-datahub/imgs/saas/image-(15).png
deleted file mode 100644
index f282e2d92c1a1d..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/image-(15).png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/image-(16).png b/docs/managed-datahub/imgs/saas/image-(16).png
deleted file mode 100644
index 1340c77bd648c8..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/image-(16).png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/image-(17).png b/docs/managed-datahub/imgs/saas/image-(17).png
deleted file mode 100644
index 6eee2fb2d821fe..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/image-(17).png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/image-(2).png b/docs/managed-datahub/imgs/saas/image-(2).png
deleted file mode 100644
index cf475edd7b95da..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/image-(2).png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/image-(3).png b/docs/managed-datahub/imgs/saas/image-(3).png
deleted file mode 100644
index b08818ff3e97c6..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/image-(3).png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/image-(4).png b/docs/managed-datahub/imgs/saas/image-(4).png
deleted file mode 100644
index a580fdc3d67309..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/image-(4).png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/image-(5).png b/docs/managed-datahub/imgs/saas/image-(5).png
deleted file mode 100644
index 48438c6001e4f5..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/image-(5).png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/image-(6).png b/docs/managed-datahub/imgs/saas/image-(6).png
deleted file mode 100644
index 54e569e853f246..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/image-(6).png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/image-(7).png b/docs/managed-datahub/imgs/saas/image-(7).png
deleted file mode 100644
index 6e89e5881cfa78..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/image-(7).png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/image-(8).png b/docs/managed-datahub/imgs/saas/image-(8).png
deleted file mode 100644
index ee0a3c89d58faa..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/image-(8).png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/image-(9).png b/docs/managed-datahub/imgs/saas/image-(9).png
deleted file mode 100644
index 301ca98593ef9c..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/image-(9).png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/image.png b/docs/managed-datahub/imgs/saas/image.png
deleted file mode 100644
index a1cfc3e74c5dd2..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/image.png and /dev/null differ
diff --git a/docs/managed-datahub/imgs/saas/settings.png b/docs/managed-datahub/imgs/saas/settings.png
deleted file mode 100644
index ca99984abbbc99..00000000000000
Binary files a/docs/managed-datahub/imgs/saas/settings.png and /dev/null differ
diff --git a/docs/managed-datahub/integrations/oidc-sso-integration.md b/docs/managed-datahub/integrations/oidc-sso-integration.md
index 6a9e085186b446..ec4ca311a0de54 100644
--- a/docs/managed-datahub/integrations/oidc-sso-integration.md
+++ b/docs/managed-datahub/integrations/oidc-sso-integration.md
@@ -42,4 +42,8 @@ To enable the OIDC integration, start by navigating to **Settings > Platform > S
4. If there are any advanced settings you would like to configure, click on the **Advanced** button. These come with defaults, so only input settings here if there is something you need changed from the default configuration.
5. Click **Update** to save your settings.
-![](../imgs/saas/image-(10).png)
+
+
+
+
+
diff --git a/docs/managed-datahub/metadata-ingestion-with-acryl/ingestion.md b/docs/managed-datahub/metadata-ingestion-with-acryl/ingestion.md
index 95ca6e5e33e160..0444d15b3627cb 100644
--- a/docs/managed-datahub/metadata-ingestion-with-acryl/ingestion.md
+++ b/docs/managed-datahub/metadata-ingestion-with-acryl/ingestion.md
@@ -56,9 +56,17 @@ In Acryl DataHub deployments, you _must_ use a sink of type `datahub-rest`, whic
2. **token**: a unique API key used to authenticate requests to your instance's REST API
The token can be retrieved by logging in as an admin. You can go to the Settings page and generate a Personal Access Token with your desired expiration date.
-![](../imgs/saas/home-(1).png)
-![](../imgs/saas/settings.png)
+
+
+
+
+
+
+
+
+
+
To configure your instance of DataHub as the destination for ingestion, set the "server" field of your recipe to point to your Acryl instance's domain suffixed by the path `/gms`, as shown below.
A complete example of a DataHub recipe file, which reads from MySQL and writes into a DataHub instance:
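As a companion to the recipe-based setup described above, the same `datahub-rest` server and token can be exercised directly from Python. This is a minimal sketch using the DataHub SDK's REST emitter; the host, token, and dataset name are placeholders.

```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# The same server/token values you would place in the recipe's datahub-rest sink.
emitter = DatahubRestEmitter(
    gms_server="https://your-account.acryl.io/gms",
    token="<personal-access-token>",
)

# Emit a tiny piece of metadata to verify the connection end to end.
mcp = MetadataChangeProposalWrapper(
    entityUrn=make_dataset_urn(platform="mysql", name="db.schema.table", env="PROD"),
    aspect=DatasetPropertiesClass(description="Connectivity smoke test"),
)
emitter.emit(mcp)
```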
diff --git a/docs/managed-datahub/observe/freshness-assertions.md b/docs/managed-datahub/observe/freshness-assertions.md
index 54b3134151d3a2..c5d4ca9081b43d 100644
--- a/docs/managed-datahub/observe/freshness-assertions.md
+++ b/docs/managed-datahub/observe/freshness-assertions.md
@@ -59,7 +59,7 @@ Tables.
For example, imagine that we work for a company with a Snowflake Table that stores user clicks collected from our e-commerce website.
This table is updated with new data on a specific cadence: once per hour (in practice, daily or even weekly cadences are also common).
In turn, there is a downstream Business Analytics Dashboard in Looker that shows important metrics like
-the number of people clicking our "Daily Sale" banners, and this dashboard pulls is generated from data stored in our "clicks" table.
+the number of people clicking our "Daily Sale" banners, and this dashboard is generated from data stored in our "clicks" table.
It is important that our clicks Table continues to be updated each hour because if it stops being updated, it could mean
that our downstream metrics dashboard becomes incorrect. And the risk of this situation is obvious: our organization
may make bad decisions based on incomplete information.
@@ -122,8 +122,12 @@ Change Source types vary by the platform, but generally fall into these categori
is higher than the previously observed value, in order to determine whether the Table has been changed within a given period of time.
Note that this approach is only supported if the Change Window does not use a fixed interval.
- Using the final 2 approaches - column value queries - to determine whether a Table has changed useful because it can be customized to determine whether
- specific types of important changes have been made to a given Table.
+ - **DataHub Operation**: A DataHub "Operation" aspect contains timeseries information used to describe changes made to an entity. Using this
+ option avoids contacting your data platform, and instead uses the DataHub Operation metadata to evaluate Freshness Assertions.
+ This relies on Operations being reported to DataHub, either via ingestion or via use of the DataHub APIs (see [Report Operation via API](#reporting-operations-via-api)).
+ Note if you have not configured an ingestion source through DataHub, then this may be the only option available.
+
+ Using either of the column value approaches (**Last Modified Column** or **High Watermark Column**) to determine whether a Table has changed can be useful because it can be customized to determine whether specific types of important changes have been made to a given Table.
Because it does not involve system warehouse tables, it is also easily portable across Data Warehouse and Data Lake providers.
Freshness Assertions also have an off switch: they can be started or stopped at any time with the click of button.
@@ -178,7 +182,7 @@ _Check whether the table has changed in a specific window of time_
7. (Optional) Click **Advanced** to customize the evaluation **source**. This is the mechanism that will be used to evaluate
-the check. Each Data Platform supports different options including Audit Log, Information Schema, Last Modified Column, and High Watermark Column.
+the check. Each Data Platform supports different options including Audit Log, Information Schema, Last Modified Column, High Watermark Column, and DataHub Operation.
@@ -189,11 +193,12 @@ the check. Each Data Platform supports different options including Audit Log, In
- **Last Modified Column**: Check for the presence of rows using a "Last Modified Time" column, which should reflect the time at which a given row was last changed in the table, to
determine whether the table changed within the evaluation period.
- **High Watermark Column**: Monitor changes to a continuously-increasing "high watermark" column value to determine whether a table
- has been changed. This option is particularly useful for tables that grow consistently with time, for example fact or event (e.g. click-strea) tables. It is not available
+ has been changed. This option is particularly useful for tables that grow consistently with time, for example fact or event (e.g. click-stream) tables. It is not available
when using a fixed lookback period.
+- **DataHub Operation**: Use DataHub Operations to determine whether the table changed within the evaluation period.
-8. Click **Next**
-9. Configure actions that should be taken when the Freshness Assertion passes or fails
+1. Click **Next**
+2. Configure actions that should be taken when the Freshness Assertion passes or fails
@@ -280,7 +285,7 @@ Note that to create or delete Assertions and Monitors for a specific entity on D
In order to create a Freshness Assertion that is being monitored on a specific **Evaluation Schedule**, you'll need to use 2
GraphQL mutation queries to create a Freshness Assertion entity and create an Assertion Monitor entity responsible for evaluating it.
-Start by creating the Freshness Assertion entity using the `createFreshnessAssertion` query and hang on to the 'urn' field of the Assertion entit y
+Start by creating the Freshness Assertion entity using the `createFreshnessAssertion` query and hang on to the 'urn' field of the Assertion entity
you get back. Then continue by creating a Monitor entity using the `createAssertionMonitor`.
##### Examples
@@ -291,10 +296,10 @@ To create a Freshness Assertion Entity that checks whether a table has been upda
mutation createFreshnessAssertion {
createFreshnessAssertion(
input: {
- entityUrn: ""
- type: DATASET_CHANGE
+ entityUrn: "",
+ type: DATASET_CHANGE,
schedule: {
- type: FIXED_INTERVAL
+ type: FIXED_INTERVAL,
fixedInterval: { unit: HOUR, multiple: 8 }
}
}
@@ -337,6 +342,28 @@ After creating the monitor, the new assertion will start to be evaluated every 8
You can delete assertions along with their monitors using GraphQL mutations: `deleteAssertion` and `deleteMonitor`.
+### Reporting Operations via API
+
+DataHub Operations can be used to capture changes made to entities. This is useful for cases where the underlying data platform does not provide a mechanism
+to capture changes, or where the data platform's mechanism is not reliable. In order to report an operation, you can use the `reportOperation` GraphQL mutation.
+
+
+##### Examples
+```graphql
+mutation reportOperation {
+ reportOperation(
+ input: {
+ urn: "",
+ operationType: INSERT,
+ sourceType: DATA_PLATFORM,
+ timestampMillis: 1693252366489
+ }
+ )
+}
+```
+
+Use the `timestampMillis` field to specify the time at which the operation occurred. If no value is provided, the current time will be used.
+
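+If you prefer to script this rather than use the interactive GraphiQL UI, below is a minimal sketch that sends the same `reportOperation` mutation over HTTP using Python's `requests` library. The `/api/graphql` endpoint path, the token, and the dataset urn are placeholder assumptions - adapt them to your Acryl instance.
+
+```python
+# Minimal sketch: call the reportOperation mutation with a Personal Access Token.
+# The endpoint path and placeholder values below are assumptions - adapt them.
+import requests
+
+GRAPHQL_URL = "https://your-account-id.acryl.io/api/graphql"  # assumed endpoint path
+TOKEN = "<your-personal-access-token>"  # hypothetical placeholder
+DATASET_URN = "<your-dataset-urn>"      # hypothetical placeholder
+
+query = """
+mutation reportOperation {
+  reportOperation(
+    input: {
+      urn: "%s",
+      operationType: INSERT,
+      sourceType: DATA_PLATFORM,
+      timestampMillis: 1693252366489
+    }
+  )
+}
+""" % DATASET_URN
+
+# Send the mutation with the Authorization header described in the Tips section.
+response = requests.post(
+    GRAPHQL_URL,
+    json={"query": query},
+    headers={"Authorization": f"Bearer {TOKEN}"},
+)
+response.raise_for_status()
+print(response.json())
+```
+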
### Tips
:::info
diff --git a/docs/managed-datahub/observe/volume-assertions.md b/docs/managed-datahub/observe/volume-assertions.md
new file mode 100644
index 00000000000000..5f5aff33a5ce21
--- /dev/null
+++ b/docs/managed-datahub/observe/volume-assertions.md
@@ -0,0 +1,355 @@
+---
+description: This page provides an overview of working with DataHub Volume Assertions
+---
+import FeatureAvailability from '@site/src/components/FeatureAvailability';
+
+
+# Volume Assertions
+
+
+
+
+> ⚠️ The **Volume Assertions** feature is currently in private beta, part of the **Acryl Observe** module, and may only be available to a
+> limited set of design partners.
+>
+> If you are interested in trying it and providing feedback, please reach out to your Acryl Customer Success
+> representative.
+
+## Introduction
+
+Can you remember a time when the meaning of a Data Warehouse Table that you depended on fundamentally changed, with little or no notice?
+If the answer is yes, how did you find out? We'll take a guess - someone looking at an internal reporting dashboard or, worse, a user of your product sounded an alarm when
+a number looked a bit out of the ordinary. Perhaps your table initially tracked purchases made on your company's e-commerce web store, but suddenly began to include purchases made
+through your company's new mobile app.
+
+There are many reasons why an important Table on Snowflake, Redshift, or BigQuery may change in its meaning - application code bugs, new feature rollouts,
+changes to key metric definitions, etc. Oftentimes, these changes break important assumptions made about the data used in building key downstream data products
+like reporting dashboards or data-driven product features.
+
+What if you could reduce the time to detect these incidents, so that the people responsible for the data were made aware of data
+issues _before_ anyone else? With Acryl DataHub **Volume Assertions**, you can.
+
+Acryl DataHub allows users to define expectations about the normal volume, or size, of a particular warehouse Table,
+and then monitor those expectations over time as the table grows and changes.
+
+In this article, we'll cover the basics of monitoring Volume Assertions - what they are, how to configure them, and more - so that you and your team can
+start building trust in your most important data assets.
+
+Let's get started!
+
+## Support
+
+Volume Assertions are currently supported for:
+
+1. Snowflake
+2. Redshift
+3. BigQuery
+
+Note that an Ingestion Source _must_ be configured with the data platform of your choice in Acryl DataHub's **Ingestion**
+tab.
+
+> Note that Volume Assertions are not yet supported if you are connecting to your warehouse
+> using the DataHub CLI or a Remote Ingestion Executor.
+
+## What is a Volume Assertion?
+
+A **Volume Assertion** is a configurable Data Quality rule used to monitor a Data Warehouse Table
+for unexpected or sudden changes in "volume", or row count. Volume Assertions can be particularly useful when you have frequently-changing
+Tables which have a relatively stable pattern of growth or decline.
+
+For example, imagine that we work for a company with a Snowflake Table that stores user clicks collected from our e-commerce website.
+This table is updated with new data on a specific cadence: once per hour (In practice, daily or even weekly are also common).
+In turn, there is a downstream Business Analytics Dashboard in Looker that shows important metrics like
+the number of people clicking our "Daily Sale" banners, and this dashboard is generated from data stored in our "clicks" table.
+It is important that our clicks Table is updated with the correct number of rows each hour; otherwise, our
+downstream metrics dashboard could become incorrect. The risk of this situation is obvious: our organization
+may make bad decisions based on incomplete information.
+
+In such cases, we can use a **Volume Assertion** that checks whether the Snowflake "clicks" Table is growing in an expected
+way, and that there are no sudden increases or sudden decreases in the rows being added or removed from the table.
+If too many rows are added or removed within an hour, we can notify key stakeholders and begin to root cause the issue before it impacts downstream consumers of the data.
+
+### Anatomy of a Volume Assertion
+
+At the most basic level, **Volume Assertions** consist of a few important parts:
+
+1. An **Evaluation Schedule**
+2. A **Volume Condition**
+3. A **Volume Source**
+
+In this section, we'll give an overview of each.
+
+#### 1. Evaluation Schedule
+
+The **Evaluation Schedule**: This defines how often to check a given warehouse Table for its volume. This should usually
+be configured to match the expected change frequency of the Table, although it can also be checked less frequently depending
+on the requirements. You can also specify specific days of the week, hours in the day, or even
+minutes in an hour.
+
+
+#### 2. Volume Condition
+
+The **Volume Condition**: This defines the type of condition that we'd like to monitor, or when the Assertion
+should result in failure.
+
+There are 2 different categories of conditions: **Total** Volume and **Change** Volume.
+
+_Total_ volume conditions are those which are defined against the point-in-time total row count for a table. They allow you to specify conditions like:
+
+1. **Table has too many rows**: The table should always have less than 1000 rows
+2. **Table has too few rows**: The table should always have more than 1000 rows
+3. **Table row count is outside a range**: The table should always have between 1000 and 2000 rows.
+
+_Change_ volume conditions are those which are defined against the growth or decline rate of a table, measured between subsequent checks
+of the table volume. They allow you to specify conditions like:
+
+1. **Table growth is too fast**: When the table volume is checked, it should have < 1000 more rows than it had during the previous check.
+2. **Table growth is too slow**: When the table volume is checked, it should have > 1000 more rows than it had during the previous check.
+3. **Table growth is outside a range**: When the table volume is checked, it should have between 1000 and 2000 more rows than it had during the previous check.
+
+For change volume conditions, both _absolute_ row count deltas and relative percentage deltas are supported for identifying
+tables that are following an abnormal pattern of growth.
+
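+To make the absolute-versus-relative distinction concrete, here is a small illustrative sketch of how a change condition could be evaluated between two consecutive checks. This is not Acryl's implementation; the function name, thresholds, and row counts are hypothetical.
+
+```python
+# Illustrative only - not Acryl's implementation. Shows how an absolute and a
+# relative (percentage) growth threshold could be checked between two runs.
+
+def change_condition_passes(previous_rows: int, current_rows: int,
+                            max_abs_delta: int, max_pct_delta: float) -> bool:
+    """Return True if growth since the previous check stays within both thresholds."""
+    abs_delta = current_rows - previous_rows
+    pct_delta = (abs_delta / previous_rows * 100) if previous_rows else float("inf")
+    return abs(abs_delta) <= max_abs_delta and abs(pct_delta) <= max_pct_delta
+
+# Example: the table grew by 1,500 rows (~1.5%) since the previous check.
+print(change_condition_passes(100_000, 101_500, max_abs_delta=2_000, max_pct_delta=5.0))
+```
+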
+
+#### 3. Volume Source
+
+The **Volume Source**: This is the mechanism that Acryl DataHub should use to determine the table volume (row count). The supported
+source types vary by the platform, but generally fall into these categories:
+
+- **Information Schema**: A system Table that is exposed by the Data Warehouse which contains live information about the Databases
+ and Tables stored inside the Data Warehouse, including their row count. It is usually efficient to check, but can in some cases be slightly delayed to update
+ once a change has been made to a table.
+
+- **Query**: A `COUNT(*)` query is used to retrieve the latest row count for a table, with optional SQL filters applied (depending on platform).
+ This can be less efficient to check depending on the size of the table. Because this approach does not involve
+ system warehouse tables, it is also easily portable across Data Warehouse and Data Lake providers.
+
+- **DataHub Dataset Profile**: The DataHub Dataset Profile aspect is used to retrieve the latest row count information for a table.
+ Using this option avoids contacting your data platform, and instead uses the DataHub Dataset Profile metadata to evaluate Volume Assertions.
+ Note that if you have not configured an ingestion source through DataHub, this may be the only option available.
+
+Volume Assertions also have an off switch: they can be started or stopped at any time with the click of a button.
+
+
+## Creating a Volume Assertion
+
+### Prerequisites
+
+1. **Permissions**: To create or delete Volume Assertions for a specific entity on DataHub, you'll need to be granted the
+ `Edit Assertions` and `Edit Monitors` privileges for the entity. This is granted to Entity owners by default.
+
+2. **Data Platform Connection**: In order to create a Volume Assertion, you'll need to have an **Ingestion Source** configured to your
+ Data Platform: Snowflake, BigQuery, or Redshift under the **Integrations** tab.
+
+Once these are in place, you're ready to create your Volume Assertions!
+
+### Steps
+
+1. Navigate to the Table you want to monitor for volume
+2. Click the **Validations** tab
+
+
+
+
+
+3. Click **+ Create Assertion**
+
+
+
+
+
+4. Choose **Volume**
+
+5. Configure the evaluation **schedule**. This is the frequency at which the assertion will be evaluated to produce a pass or fail result, and the times
+ when the table volume will be checked.
+
+6. Configure the evaluation **condition type**. This determines the cases in which the new assertion will fail when it is evaluated.
+
+
+
+
+
+7. (Optional) Click **Advanced** to customize the volume **source**. This is the mechanism that will be used to obtain the table
+ row count metric. Each Data Platform supports different options including Information Schema, Query, and DataHub Dataset Profile.
+
+
+
+
+
+- **Information Schema**: Check the Data Platform system metadata tables to determine the table row count.
+- **Query**: Issue a `COUNT(*)` query to the table to determine the row count.
+- **DataHub Dataset Profile**: Use the DataHub Dataset Profile metadata to determine the row count.
+
+8. Click **Next**
+9. Configure actions that should be taken when the Volume Assertion passes or fails
+
+
+
+
+
+- **Raise incident**: Automatically raise a new DataHub `Volume` Incident for the Table whenever the Volume Assertion is failing. This
+ may indicate that the Table is unfit for consumption. Configure Slack Notifications under **Settings** to be notified when
+ an incident is created due to an Assertion failure.
+- **Resolve incident**: Automatically resolve any incidents that were raised due to failures in this Volume Assertion. Note that
+ any other incidents will not be impacted.
+
+10. Click **Save**.
+
+And that's it! DataHub will now begin to monitor your Volume Assertion for the table.
+
+To view the time of the next Volume Assertion evaluation, simply click **Volume** and then click on your
+new Assertion:
+
+
+
+
+
+Once your assertion has run, you will begin to see Success or Failure status for the Table.
+
+
+
+
+
+
+## Stopping a Volume Assertion
+
+In order to temporarily stop the evaluation of a Volume Assertion:
+
+1. Navigate to the **Validations** tab of the Table with the assertion
+2. Click **Volume** to open the Volume Assertions list
+3. Click the three-dot menu on the right side of the assertion you want to disable
+4. Click **Stop**
+
+
+
+
+
+To resume the Volume Assertion, simply click **Turn On**.
+
+
+
+
+
+
+## Smart Assertions ⚡
+
+As part of the **Acryl Observe** module, Acryl DataHub also provides **Smart Assertions** out of the box. These are
+dynamic, AI-powered Volume Assertions that you can use to monitor the volume of important warehouse Tables, without
+requiring any manual setup.
+
+If Acryl DataHub is able to detect a pattern in the volume of a Snowflake, Redshift, or BigQuery Table, you'll find
+a recommended Smart Assertion under the `Validations` tab on the Table profile page:
+
+
+
+
+
+In order to enable it, simply click **Turn On**. From this point forward, the Smart Assertion will check for changes on a cadence
+based on the Table history.
+
+Don't need it anymore? Smart Assertions can just as easily be turned off by clicking the three-dot "more" button and then **Stop**.
+
+
+## Creating Volume Assertions via API
+
+Under the hood, Acryl DataHub implements Volume Assertion Monitoring using two "entity" concepts:
+
+- **Assertion**: The specific expectation for volume, e.g. "The table should always contain between 1000 and 2000 rows"
+ or "The table should not grow by more than 1000 rows between checks". This is the "what".
+
+- **Monitor**: The process responsible for evaluating the Assertion on a given evaluation schedule and using specific
+ mechanisms. This is the "how".
+
+Note that to create or delete Assertions and Monitors for a specific entity on DataHub, you'll need the
+`Edit Assertions` and `Edit Monitors` privileges for it.
+
+#### GraphQL
+
+In order to create a Volume Assertion that is being monitored on a specific **Evaluation Schedule**, you'll need to use 2
+GraphQL mutation queries to create a Volume Assertion entity and create an Assertion Monitor entity responsible for evaluating it.
+
+Start by creating the Volume Assertion entity using the `createVolumeAssertion` query and hang on to the 'urn' field of the Assertion entity
+you get back. Then continue by creating a Monitor entity using the `createAssertionMonitor`.
+
+##### Examples
+
+To create a Volume Assertion Entity that checks whether a table's total row count falls within an expected range:
+
+```graphql
+mutation createVolumeAssertion {
+ createVolumeAssertion(
+ input: {
+ entityUrn: "",
+ type: ROW_COUNT_TOTAL,
+ rowCountTotal: {
+ operator: BETWEEN,
+ parameters: {
+ minValue: {
+ value: 10,
+ type: NUMBER
+ },
+ maxValue: {
+ value: 20,
+ type: NUMBER
+ }
+ }
+ }
+ }
+ ) {
+ urn
+}
+}
+```
+
+The assertion above specifies that the row count total should always fall between 10 and 20.
+
+The supported volume assertion types are `ROW_COUNT_TOTAL` and `ROW_COUNT_CHANGE`. Other (e.g. incrementing segment) types are not yet supported.
+The supported operator types are `GREATER_THAN`, `GREATER_THAN_OR_EQUAL_TO`, `LESS_THAN`, `LESS_THAN_OR_EQUAL_TO`, and `BETWEEN` (requires minValue, maxValue).
+The supported parameter type is `NUMBER`.
+
+To create an Assertion Monitor Entity that evaluates the volume assertion every 8 hours using the Information Schema:
+
+```graphql
+mutation createAssertionMonitor {
+ createAssertionMonitor(
+ input: {
+ entityUrn: "",
+ assertionUrn: "",
+ schedule: {
+ cron: "0 */8 * * *",
+ timezone: "America/Los_Angeles"
+ },
+ parameters: {
+ type: DATASET_VOLUME,
+ datasetVolumeParameters: {
+ sourceType: INFORMATION_SCHEMA,
+ }
+ }
+ }
+ ) {
+ urn
+ }
+}
+```
+
+This entity defines _when_ to run the check (Using CRON format - every 8th hour) and _how_ to run the check (using the Information Schema).
+
+After creating the monitor, the new assertion will start to be evaluated every 8 hours in your selected timezone.
+
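+When scripting this flow, you typically chain the two mutations: create the assertion, capture the returned `urn`, and pass it to `createAssertionMonitor`. Below is a minimal sketch of that flow using Python's `requests` library; the `/api/graphql` endpoint path, the token, and the dataset urn are placeholder assumptions to adapt to your instance.
+
+```python
+# Minimal sketch: create a Volume Assertion, capture its urn, then create the
+# Assertion Monitor that evaluates it every 8 hours. The endpoint path, token,
+# and dataset urn are assumed placeholders - adapt them to your instance.
+import requests
+
+GRAPHQL_URL = "https://your-account-id.acryl.io/api/graphql"  # assumed endpoint path
+HEADERS = {"Authorization": "Bearer <your-personal-access-token>"}  # placeholder
+
+create_assertion = """
+mutation createVolumeAssertion {
+  createVolumeAssertion(
+    input: {
+      entityUrn: "<your-dataset-urn>",
+      type: ROW_COUNT_TOTAL,
+      rowCountTotal: {
+        operator: BETWEEN,
+        parameters: {
+          minValue: { value: 10, type: NUMBER },
+          maxValue: { value: 20, type: NUMBER }
+        }
+      }
+    }
+  ) {
+    urn
+  }
+}
+"""
+
+# Create the assertion and hang on to the urn it returns.
+resp = requests.post(GRAPHQL_URL, json={"query": create_assertion}, headers=HEADERS)
+resp.raise_for_status()
+assertion_urn = resp.json()["data"]["createVolumeAssertion"]["urn"]
+
+create_monitor = """
+mutation createAssertionMonitor {
+  createAssertionMonitor(
+    input: {
+      entityUrn: "<your-dataset-urn>",
+      assertionUrn: "%s",
+      schedule: { cron: "0 */8 * * *", timezone: "America/Los_Angeles" },
+      parameters: {
+        type: DATASET_VOLUME,
+        datasetVolumeParameters: { sourceType: INFORMATION_SCHEMA }
+      }
+    }
+  ) {
+    urn
+  }
+}
+""" % assertion_urn
+
+# Create the monitor that evaluates the assertion on the configured schedule.
+resp = requests.post(GRAPHQL_URL, json={"query": create_monitor}, headers=HEADERS)
+resp.raise_for_status()
+print(resp.json()["data"]["createAssertionMonitor"]["urn"])
+```
+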
+You can delete assertions along with their monitors using GraphQL mutations: `deleteAssertion` and `deleteMonitor`.
+
+### Tips
+
+:::info
+**Authorization**
+
+Remember to always provide a DataHub Personal Access Token when calling the GraphQL API. To do so, just add the 'Authorization' header as follows:
+
+```
+Authorization: Bearer
+```
+
+**Exploring GraphQL API**
+
+Also, remember that you can play with an interactive version of the Acryl GraphQL API at `https://your-account-id.acryl.io/api/graphiql`
+:::
diff --git a/docs/managed-datahub/operator-guide/setting-up-remote-ingestion-executor-on-aws.md b/docs/managed-datahub/operator-guide/setting-up-remote-ingestion-executor-on-aws.md
index d389ec97d05502..b8fb0ea9e80f17 100644
--- a/docs/managed-datahub/operator-guide/setting-up-remote-ingestion-executor-on-aws.md
+++ b/docs/managed-datahub/operator-guide/setting-up-remote-ingestion-executor-on-aws.md
@@ -17,11 +17,19 @@ Acryl DataHub comes packaged with an Acryl-managed ingestion executor, which is
For example, if an ingestion source is not publicly accessible via the internet, e.g. hosted privately within a specific AWS account, then the Acryl executor will be unable to extract metadata from it.
-![Option 1: Acryl-hosted ingestion runner](../imgs/saas/image-(12).png)
+
+
+
+
+
To accommodate these cases, Acryl supports configuring a remote ingestion executor which can be deployed inside of your AWS account. This setup allows you to continue leveraging the Acryl DataHub console to create, schedule, and run metadata ingestion, all while retaining network and credential isolation.
-![Option 2: Customer-hosted ingestion runner](../imgs/saas/image-(6).png)
+
+
+
+
+
## Deploying a Remote Ingestion Executor
1. **Provide AWS Account Id**: Provide Acryl Team with the id of the AWS in which the remote executor will be hosted. This will be used to grant access to private Acryl containers and create a unique SQS queue which your remote agent will subscribe to. The account id can be provided to your Acryl representative via Email or [One Time Secret](https://onetimesecret.com/).
@@ -40,23 +48,39 @@ To accommodate these cases, Acryl supports configuring a remote ingestion execut
Note that the only external secret provider that is currently supported is AWS Secrets Manager.
-![](../imgs/saas/Screen-Shot-2023-01-19-at-5.12.47-PM.png)
-![](../imgs/saas/Screen-Shot-2023-01-19-at-5.12.56-PM.png)
+
+
+
+
+
+
+
+
+
+
3. **Test the Executor:** To test your remote executor:
1. Create a new Ingestion Source by clicking '**Create new Source**' the '**Ingestion**' tab of the DataHub console. Configure your Ingestion Recipe as though you were running it from inside of your environment.
2. When working with "secret" fields (passwords, keys, etc), you can refer to any "self-managed" secrets by name: `${SECRET_NAME}:`
- ![Using a secret called BQ_DEPLOY_KEY which is managed in AWS secrets manager](../imgs/saas/Screen-Shot-2023-01-19-at-4.16.52-PM.png)
+
+
+
+
+
3. In the 'Finish Up' step, click '**Advanced'**.
4. Update the '**Executor Id**' form field to be '**remote**'. This indicates that you'd like to use the remote executor.
5. Click '**Done**'.
Now, simple click '**Execute**' to test out the remote executor. If your remote executor is configured properly, you should promptly see the ingestion task state change to 'Running'.
-![](../imgs/saas/Screen-Shot-2022-03-07-at-10.23.31-AM.png)
+
+
+
+
+
## Updating a Remote Ingestion Executor
In order to update the executor, ie. to deploy a new container version, you'll need to update the CloudFormation Stack to re-deploy the CloudFormation template with a new set of parameters.
### Steps - AWS Console
@@ -66,7 +90,11 @@ In order to update the executor, ie. to deploy a new container version, you'll n
4. Select **Replace Current Template**
5. Select **Upload a template file**
6. Upload a copy of the Acryl Remote Executor [CloudFormation Template](https://raw.githubusercontent.com/acryldata/datahub-cloudformation/master/Ingestion/templates/python.ecs.template.yaml)
-![](../imgs/saas/Screen-Shot-2023-01-19-at-4.23.32-PM.png)
+
+
+
+
+
7. Click **Next**
8. Change parameters based on your modifications (e.g. ImageTag, etc)
9. Click **Next**
diff --git a/docs/managed-datahub/release-notes/v_0_2_11.md b/docs/managed-datahub/release-notes/v_0_2_11.md
new file mode 100644
index 00000000000000..1f420908487127
--- /dev/null
+++ b/docs/managed-datahub/release-notes/v_0_2_11.md
@@ -0,0 +1,73 @@
+# v0.2.11
+---
+
+Release Availability Date
+---
+14-Sep-2023
+
+Recommended CLI/SDK
+---
+- `v0.11.0` with release notes at https://github.com/acryldata/datahub/releases/tag/v0.10.5.5
+- [Deprecation] In the LDAP ingestor, the `manager_pagination_enabled` option has been replaced by the more general `pagination_enabled`
+
+If you are using an older CLI/SDK version, please upgrade it. This applies to all CLI/SDK usage, whether through your terminal, GitHub Actions, Airflow, the Python SDK, or the Java SDK. We strongly recommend upgrading, as we continuously push fixes to the CLI and it helps us support you better.
+
+Special Notes
+---
+- The deployment process for this release will involve downtime, during which the system will be in read-only mode. A rough estimate: 1 hour of downtime for every 2.3 million entities (includes soft-deleted entities).
+
+
+## Release Changelog
+---
+- Since `v0.2.10` these changes from OSS DataHub https://github.com/datahub-project/datahub/compare/2b0952195b7895df0a2bf92b28e71aac18217781...75252a3d9f6a576904be5a0790d644b9ae2df6ac have been pulled in.
+- Misc fixes & features
+ - Proposals
+ - Group names shown correctly for proposal Inbox
+ - Metadata tests
+ - Deprecate/Un-deprecate actions available in Metadata tests
+ - Last Observed (in underlying sql) available as a filter in metadata tests
+ - [Breaking change] Renamed the `__lastUpdated` filter to `__created` to correctly represent what it was. This was not surfaced in the UI, but if you were using it, it needs to be renamed. The Acryl Customer Success team will keep an eye out to proactively find and bring this up if you are affected.
+ - Robustness improvements to metadata test runs
+ - Copy URN support for metadata tests, allowing easier filtering when iterating over metadata test results via our APIs.
+ - A lot more fixes to subscriptions, notifications and Observability (Beta).
+ - Some performance improvements to lineage queries
+
+## Some notable features in this SaaS release
+- We now enable you to create and delete pinned announcements on your DataHub homepage! If you have the “Manage Home Page Posts” platform privilege, you’ll see a new section in settings called “Home Page Posts” where you can create and delete text posts and link posts that your users see on the home page.
+- Improvements to search experience
+
+
+
+- The CLI now supports recursive deletes
+- New subscriptions feature will be widely rolled out this release
+
+
+
+- We will be enabling these features selectively. If you are interested in trying it and providing feedback, please reach out to your Acryl Customer Success representative.
+ - Acryl Observe Freshness Assertions available in private beta as shared [here](../observe/freshness-assertions.md).
diff --git a/docs/modeling/extending-the-metadata-model.md b/docs/modeling/extending-the-metadata-model.md
index f47630f44e772a..be2d7d795de701 100644
--- a/docs/modeling/extending-the-metadata-model.md
+++ b/docs/modeling/extending-the-metadata-model.md
@@ -11,7 +11,11 @@ these two concepts prior to making changes.
## To fork or not to fork?
An important question that will arise once you've decided to extend the metadata model is whether you need to fork the main repo or not. Use the diagram below to understand how to make this decision.
-![Metadata Model To Fork or Not](../imgs/metadata-model-to-fork-or-not-to.png)
+
+
+
+
+
The green lines represent pathways that will lead to lesser friction for you to maintain your code long term. The red lines represent higher risk of conflicts in the future. We are working hard to move the majority of model extension use-cases to no-code / low-code pathways to ensure that you can extend the core metadata model without having to maintain a custom fork of DataHub.
@@ -20,7 +24,7 @@ We will refer to the two options as the **open-source fork** and **custom reposi
## This Guide
This guide will outline what the experience of adding a new Entity should look like through a real example of adding the
-Dashboard Entity. If you want to extend an existing Entity, you can skip directly to [Step 3](#step_3).
+Dashboard Entity. If you want to extend an existing Entity, you can skip directly to [Step 3](#step-3-define-custom-aspects-or-attach-existing-aspects-to-your-entity).
At a high level, an entity is made up of:
@@ -78,14 +82,14 @@ Because they are aspects, keys need to be annotated with an @Aspect annotation,
can be a part of.
The key can also be annotated with the two index annotations: @Relationship and @Searchable. This instructs DataHub
-infra to use the fields in the key to create relationships and index fields for search. See [Step 3](#step_3) for more details on
+infra to use the fields in the key to create relationships and index fields for search. See [Step 3](#step-3-define-custom-aspects-or-attach-existing-aspects-to-your-entity) for more details on
the annotation model.
**Constraints**: Note that each field in a Key Aspect MUST be of String or Enum type.
### Step 2: Create the new entity with its key aspect
-Define the entity within an `entity-registry.yml` file. Depending on your approach, the location of this file may vary. More on that in steps [4](#step_4) and [5](#step_5).
+Define the entity within an `entity-registry.yml` file. Depending on your approach, the location of this file may vary. More on that in steps [4](#step-4-choose-a-place-to-store-your-model-extension) and [5](#step-5-attaching-your-non-key-aspects-to-the-entity).
Example:
```yaml
@@ -208,11 +212,11 @@ After you create your Aspect, you need to attach to all the entities that it app
**Constraints**: Note that all aspects MUST be of type Record.
-### Step 4: Choose a place to store your model extension
+### Step 4: Choose a place to store your model extension
At the beginning of this document, we walked you through a flow-chart that should help you decide whether you need to maintain a fork of the open source DataHub repo for your model extensions, or whether you can just use a model extension repository that can stay independent of the DataHub repo. Depending on what path you took, the place you store your aspect model files (the .pdl files) and the entity-registry files (the yaml file called `entity-registry.yaml` or `entity-registry.yml`) will vary.
-- Open source Fork: Aspect files go under [`metadata-models`](../../metadata-models) module in the main repo, entity registry goes into [`metadata-models/src/main/resources/entity-registry.yml`](../../metadata-models/src/main/resources/entity-registry.yml). Read on for more details in [Step 5](#step_5).
+- Open source Fork: Aspect files go under [`metadata-models`](../../metadata-models) module in the main repo, entity registry goes into [`metadata-models/src/main/resources/entity-registry.yml`](../../metadata-models/src/main/resources/entity-registry.yml). Read on for more details in [Step 5](#step-5-attaching-your-non-key-aspects-to-the-entity).
- Custom repository: Read the [metadata-models-custom](../../metadata-models-custom/README.md) documentation to learn how to store and version your aspect models and registry.
### Step 5: Attaching your non-key Aspect(s) to the Entity
diff --git a/docs/modeling/metadata-model.md b/docs/modeling/metadata-model.md
index 704fce14123294..a8958985a0a724 100644
--- a/docs/modeling/metadata-model.md
+++ b/docs/modeling/metadata-model.md
@@ -30,7 +30,11 @@ Conceptually, metadata is modeled using the following abstractions
Here is an example graph consisting of 3 types of entity (CorpUser, Chart, Dashboard), 2 types of relationship (OwnedBy, Contains), and 3 types of metadata aspect (Ownership, ChartInfo, and DashboardInfo).
-![metadata-modeling](../imgs/metadata-model-chart.png)
+
+
+
+
+
## The Core Entities
@@ -73,7 +77,11 @@ to the YAML configuration, instead of creating new Snapshot / Aspect files.
## Exploring DataHub's Metadata Model
To explore the current DataHub metadata model, you can inspect this high-level picture that shows the different entities and edges between them showing the relationships between them.
-![Metadata Model Graph](../imgs/datahub-metadata-model.png)
+
+
+
+
+
To navigate the aspect model for specific entities and explore relationships using the `foreign-key` concept, you can view them in our demo environment or navigate the auto-generated docs in the **Metadata Modeling/Entities** section on the left.
@@ -425,7 +433,7 @@ aggregation query against a timeseries aspect.
The *@TimeseriesField* and the *@TimeseriesFieldCollection* are two new annotations that can be attached to a field of
a *Timeseries aspect* that allows it to be part of an aggregatable query. The kinds of aggregations allowed on these
annotated fields depends on the type of the field, as well as the kind of aggregation, as
-described [here](#Performing-an-aggregation-on-a-Timeseries-aspect).
+described [here](#performing-an-aggregation-on-a-timeseries-aspect).
* `@TimeseriesField = {}` - this annotation can be used with any type of non-collection type field of the aspect such as
primitive types and records (see the fields *stat*, *strStat* and *strArray* fields
@@ -507,7 +515,7 @@ my_emitter = DatahubRestEmitter("http://localhost:8080")
my_emitter.emit(mcpw)
```
-###### Performing an aggregation on a Timeseries aspect.
+###### Performing an aggregation on a Timeseries aspect
Aggreations on timeseries aspects can be performed by the GMS REST API for `/analytics?action=getTimeseriesStats` which
accepts the following params.
diff --git a/docs/platform-instances.md b/docs/platform-instances.md
index c6bfe3315de980..0f4515aedae549 100644
--- a/docs/platform-instances.md
+++ b/docs/platform-instances.md
@@ -1,44 +1,48 @@
-# Working With Platform Instances
-
-DataHub's metadata model for Datasets supports a three-part key currently:
-- Data Platform (e.g. urn:li:dataPlatform:mysql)
-- Name (e.g. db.schema.name)
-- Env or Fabric (e.g. DEV, PROD, etc.)
-
-This naming scheme unfortunately does not allow for easy representation of the multiplicity of platforms (or technologies) that might be deployed at an organization within the same environment or fabric. For example, an organization might have multiple Redshift instances in Production and would want to see all the data assets located in those instances inside the DataHub metadata repository.
-
-As part of the `v0.8.24+` releases, we are unlocking the first phase of supporting Platform Instances in the metadata model. This is done via two main additions:
-- The `dataPlatformInstance` aspect that has been added to Datasets which allows datasets to be associated to an instance of a platform
-- Enhancements to all ingestion sources that allow them to attach a platform instance to the recipe that changes the generated urns to go from `urn:li:dataset:(urn:li:dataPlatform:,,ENV)` format to `urn:li:dataset:(urn:li:dataPlatform:,,ENV)` format. Sources that produce lineage to datasets in other platforms (e.g. Looker, Superset etc) also have specific configuration additions that allow the recipe author to specify the mapping between a platform and the instance name that it should be mapped to.
-
-![./imgs/platform-instances-for-ingestion.png](./imgs/platform-instances-for-ingestion.png)
-
-## Naming Platform Instances
-
-When configuring a platform instance, choose an instance name that is understandable and will be stable for the foreseeable future. e.g. `core_warehouse` or `finance_redshift` are allowed names, as are pure guids like `a37dc708-c512-4fe4-9829-401cd60ed789`. Remember that whatever instance name you choose, you will need to specify it in more than one recipe to ensure that the identifiers produced by different sources will line up.
-
-## Enabling Platform Instances
-
-Read the Ingestion source specific guides for how to enable platform instances in each of them.
-The general pattern is to add an additional optional configuration parameter called `platform_instance`.
-
-e.g. here is how you would configure a recipe to ingest a mysql instance that you want to call `core_finance`
-```yaml
-source:
- type: mysql
- config:
- # Coordinates
- host_port: localhost:3306
- platform_instance: core_finance
- database: dbname
-
- # Credentials
- username: root
- password: example
-
-sink:
- # sink configs
-```
-
-
-##
+# Working With Platform Instances
+
+DataHub's metadata model for Datasets supports a three-part key currently:
+- Data Platform (e.g. urn:li:dataPlatform:mysql)
+- Name (e.g. db.schema.name)
+- Env or Fabric (e.g. DEV, PROD, etc.)
+
+This naming scheme unfortunately does not allow for easy representation of the multiplicity of platforms (or technologies) that might be deployed at an organization within the same environment or fabric. For example, an organization might have multiple Redshift instances in Production and would want to see all the data assets located in those instances inside the DataHub metadata repository.
+
+As part of the `v0.8.24+` releases, we are unlocking the first phase of supporting Platform Instances in the metadata model. This is done via two main additions:
+- The `dataPlatformInstance` aspect that has been added to Datasets which allows datasets to be associated to an instance of a platform
+- Enhancements to all ingestion sources that allow them to attach a platform instance to the recipe that changes the generated urns to go from `urn:li:dataset:(urn:li:dataPlatform:<platform>,<table_name>,ENV)` format to `urn:li:dataset:(urn:li:dataPlatform:<platform>,<platform_instance>.<table_name>,ENV)` format. Sources that produce lineage to datasets in other platforms (e.g. Looker, Superset etc) also have specific configuration additions that allow the recipe author to specify the mapping between a platform and the instance name that it should be mapped to.
+
+
+
+
+
+
+
+## Naming Platform Instances
+
+When configuring a platform instance, choose an instance name that is understandable and will be stable for the foreseeable future. e.g. `core_warehouse` or `finance_redshift` are allowed names, as are pure guids like `a37dc708-c512-4fe4-9829-401cd60ed789`. Remember that whatever instance name you choose, you will need to specify it in more than one recipe to ensure that the identifiers produced by different sources will line up.
+
+## Enabling Platform Instances
+
+Read the Ingestion source specific guides for how to enable platform instances in each of them.
+The general pattern is to add an additional optional configuration parameter called `platform_instance`.
+
+e.g. here is how you would configure a recipe to ingest a mysql instance that you want to call `core_finance`
+```yaml
+source:
+ type: mysql
+ config:
+ # Coordinates
+ host_port: localhost:3306
+ platform_instance: core_finance
+ database: dbname
+
+ # Credentials
+ username: root
+ password: example
+
+sink:
+ # sink configs
+```
+
+
+##
diff --git a/docs/quickstart.md b/docs/quickstart.md
index cd91dc8d1ac84a..29b22b54dc87a3 100644
--- a/docs/quickstart.md
+++ b/docs/quickstart.md
@@ -1,219 +1,218 @@
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
# DataHub Quickstart Guide
+:::tip Managed DataHub
+
This guide provides instructions on deploying the open source DataHub locally.
-If you're interested in a managed version, [Acryl Data](https://www.acryldata.io/product) provides a fully managed, premium version of DataHub.
+If you're interested in a managed version, [Acryl Data](https://www.acryldata.io/product) provides a fully managed, premium version of DataHub.
+**[Get Started with Managed DataHub](./managed-datahub/welcome-acryl.md)**
-
-Get Started with Managed DataHub
-
+:::
-## Deploying DataHub
+## Prerequisites
-To deploy a new instance of DataHub, perform the following steps.
+- Install **Docker** and **Docker Compose** v2 for your platform.
-1. Install Docker and Docker Compose v2 for your platform.
+ | Platform | Application |
+ | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
+ | Windows | [Docker Desktop](https://www.docker.com/products/docker-desktop/) |
+ | Mac | [Docker Desktop](https://www.docker.com/products/docker-desktop/) |
+ | Linux | [Docker for Linux](https://docs.docker.com/desktop/install/linux-install/) and [Docker Compose](https://docs.docker.com/compose/install/linux/) |
-- On Windows or Mac, install [Docker Desktop](https://www.docker.com/products/docker-desktop/).
-- On Linux, install [Docker for Linux](https://docs.docker.com/desktop/install/linux-install/) and [Docker Compose](https://docs.docker.com/compose/install/linux/).
+- **Launch the Docker engine** from the command line or the desktop app.
+- Ensure you have **Python 3.7+** installed & configured. (Check using `python3 --version`).
-:::note
+:::note Docker Resource Allocation
-Make sure to allocate enough hardware resources for Docker engine.
+Make sure to allocate enough hardware resources for Docker engine.
Tested & confirmed config: 2 CPUs, 8GB RAM, 2GB Swap area, and 10GB disk space.
:::
-2. Launch the Docker Engine from command line or the desktop app.
-
-3. Install the DataHub CLI
-
- a. Ensure you have Python 3.7+ installed & configured. (Check using `python3 --version`).
-
- b. Run the following commands in your terminal
+## Install the DataHub CLI
- ```sh
- python3 -m pip install --upgrade pip wheel setuptools
- python3 -m pip install --upgrade acryl-datahub
- datahub version
- ```
+
+
- If you're using poetry, run the following command.
-
- ```sh
- poetry add acryl-datahub
- datahub version
- ```
+```bash
+python3 -m pip install --upgrade pip wheel setuptools
+python3 -m pip install --upgrade acryl-datahub
+datahub version
+```
-:::note
+:::note Command Not Found
-If you see "command not found", try running cli commands with the prefix 'python3 -m' instead like `python3 -m datahub version`
+If you see `command not found`, try running CLI commands with the `python3 -m` prefix, e.g. `python3 -m datahub version`.
Note that DataHub CLI does not support Python 2.x.
:::
-4. To deploy a DataHub instance locally, run the following CLI command from your terminal
-
- ```
- datahub docker quickstart
- ```
-
- This will deploy a DataHub instance using [docker-compose](https://docs.docker.com/compose/).
- If you are curious, the `docker-compose.yaml` file is downloaded to your home directory under the `.datahub/quickstart` directory.
-
- If things go well, you should see messages like the ones below:
-
- ```
- Fetching docker-compose file https://raw.githubusercontent.com/datahub-project/datahub/master/docker/quickstart/docker-compose-without-neo4j-m1.quickstart.yml from GitHub
- Pulling docker images...
- Finished pulling docker images!
-
- [+] Running 11/11
- ⠿ Container zookeeper Running 0.0s
- ⠿ Container elasticsearch Running 0.0s
- ⠿ Container broker Running 0.0s
- ⠿ Container schema-registry Running 0.0s
- ⠿ Container elasticsearch-setup Started 0.7s
- ⠿ Container kafka-setup Started 0.7s
- ⠿ Container mysql Running 0.0s
- ⠿ Container datahub-gms Running 0.0s
- ⠿ Container mysql-setup Started 0.7s
- ⠿ Container datahub-datahub-actions-1 Running 0.0s
- ⠿ Container datahub-frontend-react Running 0.0s
- .......
- ✔ DataHub is now running
- Ingest some demo data using `datahub docker ingest-sample-data`,
- or head to http://localhost:9002 (username: datahub, password: datahub) to play around with the frontend.
- Need support? Get in touch on Slack: https://slack.datahubproject.io/
- ```
-
- Upon completion of this step, you should be able to navigate to the DataHub UI
- at [http://localhost:9002](http://localhost:9002) in your browser. You can sign in using `datahub` as both the
- username and password.
-
-:::note
-
-On Mac computers with Apple Silicon (M1, M2 etc.), you might see an error like `no matching manifest for linux/arm64/v8 in the manifest list entries`, this typically means that the datahub cli was not able to detect that you are running it on Apple Silicon. To resolve this issue, override the default architecture detection by issuing `datahub docker quickstart --arch m1`
+
+
-:::
+```bash
+poetry add acryl-datahub
+poetry shell
+datahub version
+```
-5. To ingest the sample metadata, run the following CLI command from your terminal
+
+
- ```bash
- datahub docker ingest-sample-data
- ```
+## Start DataHub
-:::note
+Run the following CLI command from your terminal.
-If you've enabled [Metadata Service Authentication](authentication/introducing-metadata-service-authentication.md), you'll need to provide a Personal Access Token
-using the `--token ` parameter in the command.
+```bash
+datahub docker quickstart
+```
-:::
+This will deploy a DataHub instance using [docker-compose](https://docs.docker.com/compose/).
+If you are curious, the `docker-compose.yaml` file is downloaded to your home directory under the `.datahub/quickstart` directory.
+
+If things go well, you should see messages like the ones below:
+
+```shell-session
+Fetching docker-compose file https://raw.githubusercontent.com/datahub-project/datahub/master/docker/quickstart/docker-compose-without-neo4j-m1.quickstart.yml from GitHub
+Pulling docker images...
+Finished pulling docker images!
+
+[+] Running 11/11
+⠿ Container zookeeper Running 0.0s
+⠿ Container elasticsearch Running 0.0s
+⠿ Container broker Running 0.0s
+⠿ Container schema-registry Running 0.0s
+⠿ Container elasticsearch-setup Started 0.7s
+⠿ Container kafka-setup Started 0.7s
+⠿ Container mysql Running 0.0s
+⠿ Container datahub-gms Running 0.0s
+⠿ Container mysql-setup Started 0.7s
+⠿ Container datahub-datahub-actions-1 Running 0.0s
+⠿ Container datahub-frontend-react Running 0.0s
+.......
+✔ DataHub is now running
+Ingest some demo data using `datahub docker ingest-sample-data`,
+or head to http://localhost:9002 (username: datahub, password: datahub) to play around with the frontend.
+Need support? Get in touch on Slack: https://slack.datahubproject.io/
+```
-That's it! Now feel free to play around with DataHub!
+:::note Mac M1/M2
-## Troubleshooting Issues
+On Mac computers with Apple Silicon (M1, M2 etc.), you might see an error like `no matching manifest for linux/arm64/v8 in the manifest list entries`.
+This typically means that the datahub cli was not able to detect that you are running it on Apple Silicon.
+To resolve this issue, override the default architecture detection by issuing `datahub docker quickstart --arch m1`.
-Please refer to [Quickstart Debugging Guide](./troubleshooting/quickstart.md).
+:::
-## Next Steps
+### Sign In
-### Ingest Metadata
+Upon completion of this step, you should be able to navigate to the DataHub UI at [http://localhost:9002](http://localhost:9002) in your browser.
+You can sign in using the default credentials below.
-To start pushing your company's metadata into DataHub, take a look at [UI-based Ingestion Guide](./ui-ingestion.md), or to run ingestion using the cli, look at the [Metadata Ingestion Guide](../metadata-ingestion/README.md).
+```
+username: datahub
+password: datahub
+```
-### Invite Users
+To change the default credentials, please refer to [Change the default user datahub in quickstart](authentication/changing-default-credentials.md#quickstart).
-To add users to your deployment to share with your team check out our [Adding Users to DataHub](authentication/guides/add-users.md)
+### Ingest Sample Data
-### Enable Authentication
+To ingest the sample metadata, run the following CLI command from your terminal.
-To enable SSO, check out [Configuring OIDC Authentication](authentication/guides/sso/configure-oidc-react.md) or [Configuring JaaS Authentication](authentication/guides/jaas.md).
+```bash
+datahub docker ingest-sample-data
+```
-To enable backend Authentication, check out [authentication in DataHub's backend](authentication/introducing-metadata-service-authentication.md#configuring-metadata-service-authentication).
+:::note Token Authentication
-### Change the Default `datahub` User Credentials
+If you've enabled [Metadata Service Authentication](authentication/introducing-metadata-service-authentication.md), you'll need to provide a Personal Access Token
+using the `--token ` parameter in the command.
-:::note
-Please note that deleting the `Data Hub` user in the UI **WILL NOT** disable the default user. You will still be able to log in using the default 'datahub:datahub' credentials. To safely delete the default credentials, please follow the guide provided below.
:::
-Please refer to [Change the default user datahub in quickstart](authentication/changing-default-credentials.md#quickstart).
-
-### Move to Production
+That's it! Now feel free to play around with DataHub!
-We recommend deploying DataHub to production using Kubernetes. We provide helpful [Helm Charts](https://artifacthub.io/packages/helm/datahub/datahub) to help you quickly get up and running. Check out [Deploying DataHub to Kubernetes](./deploy/kubernetes.md) for a step-by-step walkthrough.
+---
-The `quickstart` method of running DataHub is intended for local development and a quick way to experience the features that DataHub has to offer. It is not
-intended for a production environment. This recommendation is based on the following points.
+## Common Operations
-#### Default Credentials
+### Stop DataHub
-`quickstart` uses docker-compose configuration which includes default credentials for both DataHub, and it's underlying
-prerequisite data stores, such as MySQL. Additionally, other components are unauthenticated out of the box. This is a
-design choice to make development easier and is not best practice for a production environment.
-
-#### Exposed Ports
+To stop DataHub's quickstart, you can issue the following command.
-DataHub's services, and it's backend data stores use the docker default behavior of binding to all interface addresses.
-This makes it useful for development but is not recommended in a production environment.
+```bash
+datahub docker quickstart --stop
+```
-#### Performance & Management
+### Reset DataHub
-* `quickstart` is limited by the resources available on a single host, there is no ability to scale horizontally.
-* Rollout of new versions requires downtime.
-* The configuration is largely pre-determined and not easily managed.
-* `quickstart`, by default, follows the most recent builds forcing updates to the latest released and unreleased builds.
+To cleanse DataHub of all of its state (e.g. before ingesting your own), you can use the CLI `nuke` command.
-## Other Common Operations
+```bash
+datahub docker nuke
+```
-### Stopping DataHub
+### Upgrade DataHub
-To stop DataHub's quickstart, you can issue the following command.
+If you have been testing DataHub locally and a new version has been released, you can upgrade by simply issuing the quickstart command again. It will pull down newer images and restart your instance without losing any data.
-```
-datahub docker quickstart --stop
+```bash
+datahub docker quickstart
```
-### Resetting DataHub (a.k.a factory reset)
+### Customize installation
-To cleanse DataHub of all of its state (e.g. before ingesting your own), you can use the CLI `nuke` command.
+If you would like to customize the DataHub installation further, please download the [docker-compose.yaml](https://raw.githubusercontent.com/datahub-project/datahub/master/docker/quickstart/docker-compose-without-neo4j-m1.quickstart.yml) used by the cli tool, modify it as necessary and deploy DataHub by passing the downloaded docker-compose file:
-```
-datahub docker nuke
+```bash
+datahub docker quickstart --quickstart-compose-file
```
-### Backing up your DataHub Quickstart (experimental)
+### Back up DataHub
-The quickstart image is not recommended for use as a production instance. See [Moving to production](#move-to-production) for recommendations on setting up your production cluster. However, in case you want to take a backup of your current quickstart state (e.g. you have a demo to your company coming up and you want to create a copy of the quickstart data so you can restore it at a future date), you can supply the `--backup` flag to quickstart.
+The quickstart image is not recommended for use as a production instance.
+However, in case you want to take a backup of your current quickstart state (e.g. you have a demo to your company coming up and you want to create a copy of the quickstart data so you can restore it at a future date), you can supply the `--backup` flag to quickstart.
-```
+
+
+
+```bash
datahub docker quickstart --backup
```
-will take a backup of your MySQL image and write it by default to your `~/.datahub/quickstart/` directory as the file `backup.sql`. You can customize this by passing a `--backup-file` argument.
-e.g.
+This will take a backup of your MySQL image and write it by default to your `~/.datahub/quickstart/` directory as the file `backup.sql`.
+
+
+
+```bash
+datahub docker quickstart --backup --backup-file
```
-datahub docker quickstart --backup --backup-file /home/my_user/datahub_backups/quickstart_backup_2002_22_01.sql
-```
-:::note
+You can customize the backup file path by passing a `--backup-file` argument.
+
+
+
+
+:::caution
Note that the Quickstart backup does not include any timeseries data (dataset statistics, profiles, etc.), so you will lose that information if you delete all your indexes and restore from this backup.
:::
-### Restoring your DataHub Quickstart (experimental)
+### Restore DataHub
As you might imagine, these backups are restore-able. The following section describes a few different options you have to restore your backup.
-#### Restoring a backup (primary + index) [most common]
+
+
To restore a previous backup, run the following command:
-```
+```bash
datahub docker quickstart --restore
```
@@ -221,38 +220,71 @@ This command will pick up the `backup.sql` file located under `~/.datahub/quicks
To supply a specific backup file, use the `--restore-file` option.
-```
+```bash
datahub docker quickstart --restore --restore-file /home/my_user/datahub_backups/quickstart_backup_2002_22_01.sql
```
-#### Restoring only the index [to deal with index out of sync / corruption issues]
+
+
Another situation that can come up is the index can get corrupt, or be missing some update. In order to re-bootstrap the index from the primary store, you can run this command to sync the index with the primary store.
-```
+```bash
datahub docker quickstart --restore-indices
```
-#### Restoring a backup (primary but NO index) [rarely used]
+
+
+
Sometimes, you might want to just restore the state of your primary database (MySQL), but not re-index the data. To do this, you have to explicitly disable the restore-indices capability.
-```
+```bash
datahub docker quickstart --restore --no-restore-indices
```
-### Upgrading your local DataHub
+
+
-If you have been testing DataHub locally, a new version of DataHub got released and you want to try the new version then you can just issue the quickstart command again. It will pull down newer images and restart your instance without losing any data.
+---
-```
-datahub docker quickstart
-```
+## Next Steps
-### Customization
+- [Quickstart Debugging Guide](./troubleshooting/quickstart.md)
+- [Ingest metadata through the UI](./ui-ingestion.md)
+- [Ingest metadata through the CLI](../metadata-ingestion/README.md)
+- [Add Users to DataHub](authentication/guides/add-users.md)
+- [Configure OIDC Authentication](authentication/guides/sso/configure-oidc-react.md)
+- [Configure JaaS Authentication](authentication/guides/jaas.md)
+- [Configure authentication in DataHub's backend](authentication/introducing-metadata-service-authentication.md#configuring-metadata-service-authentication).
+- [Change the default user datahub in quickstart](authentication/changing-default-credentials.md#quickstart)
-If you would like to customize the DataHub installation further, please download the [docker-compose.yaml](https://raw.githubusercontent.com/datahub-project/datahub/master/docker/quickstart/docker-compose-without-neo4j-m1.quickstart.yml) used by the cli tool, modify it as necessary and deploy DataHub by passing the downloaded docker-compose file:
+### Move To Production
-```
-datahub docker quickstart --quickstart-compose-file
-```
+:::caution
+
+Quickstart is not intended for a production environment. We recommend deploying DataHub to production using Kubernetes.
+We provide helpful [Helm Charts](https://artifacthub.io/packages/helm/datahub/datahub) to help you quickly get up and running.
+Check out [Deploying DataHub to Kubernetes](./deploy/kubernetes.md) for a step-by-step walkthrough.
+
+:::
+
+The `quickstart` method of running DataHub is intended for local development and a quick way to experience the features that DataHub has to offer.
+It is not intended for a production environment. This recommendation is based on the following points.
+
+#### Default Credentials
+
+`quickstart` uses a docker-compose configuration which includes default credentials for both DataHub and its underlying
+prerequisite data stores, such as MySQL. Additionally, other components are unauthenticated out of the box. This is a
+design choice to make development easier and is not best practice for a production environment.
+
+#### Exposed Ports
+
+DataHub's services and its backend data stores use the Docker default behavior of binding to all interface addresses.
+This makes it useful for development but is not recommended in a production environment.
+
+#### Performance & Management
+
+`quickstart` is limited by the resources available on a single host; there is no ability to scale horizontally.
+Rollout of new versions often requires downtime and the configuration is largely pre-determined and not easily managed.
+Lastly, by default, `quickstart` follows the most recent builds forcing updates to the latest released and unreleased builds.
diff --git a/docs/schema-history.md b/docs/schema-history.md
index 9fc9ec1af52bbc..120d041960186e 100644
--- a/docs/schema-history.md
+++ b/docs/schema-history.md
@@ -23,20 +23,32 @@ must have the **View Entity Page** privilege, or be assigned to **any** DataHub
You can view the Schema History for a Dataset by navigating to that Dataset's Schema Tab. As long as that Dataset has more than
one version, you can view what a Dataset looked like at any given version by using the version selector.
Here's an example from DataHub's official Demo environment with the
-[Snowflake pets dataset](https://demo.datahubproject.io/dataset/urn:li:dataset:(urn:li:dataPlatform:snowflake,long_tail_companions.adoption.pets,PROD)/Schema?is_lineage_mode=false).
+Snowflake pets dataset.
+
+
+
+
+
-![](./imgs/schema-history-latest-version.png)
If you click on an older version in the selector, you'll be able to see what the schema looked like back then. Notice
the changes here to the glossary terms for the `status` field, and to the descriptions for the `created_at` and `updated_at`
fields.
-![](./imgs/schema-history-older-version.png)
+
+
+
+
+
In addition to this, you can also toggle the Audit view that shows you when the most recent changes were made to each field.
You can activate this by clicking on the Audit icon you see above the top right of the table.
-![](./imgs/schema-history-audit-activated.png)
+
+
+
+
+
You can see here that some of these fields were added at the oldest dataset version, while some were added only at this latest
version. Some fields were even modified and had a type change at the latest version!
diff --git a/docs/tags.md b/docs/tags.md
index 945b514dc7b473..cb08c9fafea490 100644
--- a/docs/tags.md
+++ b/docs/tags.md
@@ -27,25 +27,25 @@ You can create these privileges by creating a new [Metadata Policy](./authorizat
To add a tag at the dataset or container level, simply navigate to the page for that entity and click on the **Add Tag** button.
-
+
Type in the name of the tag you want to add. You can add a new tag, or add a tag that already exists (the autocomplete will pull up the tag if it already exists).
-
+
Click on the "Add" button and you'll see the tag has been added!
-
+
If you would like to add a tag at the schema level, hover over the "Tags" column for a schema until the "Add Tag" button shows up, and then follow the same flow as above.
-
+
### Removing a Tag
@@ -57,7 +57,7 @@ To remove a tag, simply click on the "X" button in the tag. Then click "Yes" whe
You can search for a tag in the search bar, and even filter entities by the presence of a specific tag.
-
+
## Additional Resources
diff --git a/docs/townhall-history.md b/docs/townhall-history.md
index 1da490ca6fa692..d92905af0cd72c 100644
--- a/docs/townhall-history.md
+++ b/docs/townhall-history.md
@@ -328,7 +328,7 @@ November Town Hall (in December!)
* Welcome - 5 mins
* Latest React App Demo! ([video](https://www.youtube.com/watch?v=RQBEJhcen5E)) by John Joyce and Gabe Lyons - 5 mins
-* Use-Case: DataHub at Geotab ([slides](https://docs.google.com/presentation/d/1qcgO3BW5NauuG0HnPqrxGcujsK-rJ1-EuU-7cbexkqE/edit?usp=sharing),[video](https://www.youtube.com/watch?v=boyjT2OrlU4)) by [John Yoon](https://www.linkedin.com/in/yhjyoon/) - 15 mins
+* Use-Case: DataHub at Geotab ([video](https://www.youtube.com/watch?v=boyjT2OrlU4)) by [John Yoon](https://www.linkedin.com/in/yhjyoon/) - 15 mins
* Tech Deep Dive: Tour of new pull-based Python Ingestion scripts ([slides](https://docs.google.com/presentation/d/15Xay596WDIhzkc5c8DEv6M-Bv1N4hP8quup1tkws6ms/edit#slide=id.gb478361595_0_10),[video](https://www.youtube.com/watch?v=u0IUQvG-_xI)) by [Harshal Sheth](https://www.linkedin.com/in/hsheth2/) - 15 mins
* General Q&A from sign up sheet, slack, and participants - 15 mins
* Closing remarks - 5 mins
@@ -343,8 +343,7 @@ Agenda
- Announcements - 2 mins
- Community Updates ([video](https://youtu.be/r862MZTLAJ0?t=99)) - 10 mins
-- Use-Case: DataHub at Viasat ([slides](demo/ViasatMetadataJourney.pdf),[video](https://youtu.be/2SrDAJnzkjE)) by [Anna Kepler](https://www.linkedin.com/in/akepler) - 15 mins
-- Tech Deep Dive: GraphQL + React RFCs readout and discussion ([slides](https://docs.google.com/presentation/d/e/2PACX-1vRtnINnpi6PvFw7-5iW8PSQoT9Kdf1O_0YW7QAr1_mSdJMNftYFTVCjKL-e3fpe8t6IGkha8UpdmoOI/pub?start=false&loop=false&delayms=3000) ,[video](https://www.youtube.com/watch?v=PrBaFrb7pqA)) by [John Joyce](https://www.linkedin.com/in/john-joyce-759883aa) and [Arun Vasudevan](https://www.linkedin.com/in/arun-vasudevan-55117368/) - 15 mins
+- Use-Case: DataHub at Viasat ([slides](https://github.com/acryldata/static-assets-test/raw/master/imgs/demo/ViasatMetadataJourney.pdf),[video](https://youtu.be/2SrDAJnzkjE)) by [Anna Kepler](https://www.linkedin.com/in/akepler) - 15 mins
+- Tech Deep Dive: GraphQL + React RFCs readout and discussion ([slides](https://docs.google.com/presentation/d/e/2PACX-1vRtnINnpi6PvFw7-5iW8PSQoT9Kdf1O_0YW7QAr1_mSdJMNftYFTVCjKL-e3fpe8t6IGkha8UpdmoOI/pub?start=false&loop=false&delayms=3000),[video](https://www.youtube.com/watch?v=PrBaFrb7pqA)) by [John Joyce](https://www.linkedin.com/in/john-joyce-759883aa) and [Arun Vasudevan](https://www.linkedin.com/in/arun-vasudevan-55117368/) - 15 mins
- General Q&A from sign up sheet, slack, and participants - 15 mins
- Closing remarks - 3 mins
- General Q&A from sign up sheet, slack, and participants - 15 mins
@@ -356,8 +355,8 @@ Agenda
Agenda
- Quick intro - 5 mins
-- [Why did Grofers choose DataHub for their data catalog?](demo/Datahub_at_Grofers.pdf) by [Shubham Gupta](https://www.linkedin.com/in/shubhamg931/) - 15 minutes
-- [DataHub UI development - Part 2](demo/Town_Hall_Presentation_-_12-2020_-_UI_Development_Part_2.pdf) by [Charlie Tran](https://www.linkedin.com/in/charlie-tran/) (LinkedIn) - 20 minutes
+- [Why did Grofers choose DataHub for their data catalog?](https://github.com/acryldata/static-assets-test/raw/master/imgs/demo/Datahub_at_Grofers.pdf) by [Shubham Gupta](https://www.linkedin.com/in/shubhamg931/) - 15 minutes
+- [DataHub UI development - Part 2](https://github.com/acryldata/static-assets-test/raw/master/imgs/demo/Town_Hall_Presentation_-_12-2020_-_UI_Development_Part_2.pdf) by [Charlie Tran](https://www.linkedin.com/in/charlie-tran/) (LinkedIn) - 20 minutes
- General Q&A from sign up sheet, slack, and participants - 15 mins
- Closing remarks - 5 minutes
@@ -368,9 +367,9 @@ Agenda
Agenda
- Quick intro - 5 mins
-- [Lightning talk on Metadata use-cases at LinkedIn](demo/Metadata_Use-Cases_at_LinkedIn_-_Lightning_Talk.pdf) by [Shirshanka Das](https://www.linkedin.com/in/shirshankadas/) (LinkedIn) - 5 mins
-- [Strongly Consistent Secondary Index (SCSI) in GMA](demo/Datahub_-_Strongly_Consistent_Secondary_Indexing.pdf), an upcoming feature by [Jyoti Wadhwani](https://www.linkedin.com/in/jyotiwadhwani/) (LinkedIn) - 15 minutes
-- [DataHub UI overview](demo/DataHub-UIOverview.pdf) by [Ignacio Bona](https://www.linkedin.com/in/ignaciobona) (LinkedIn) - 20 minutes
+- [Lightning talk on Metadata use-cases at LinkedIn](https://github.com/acryldata/static-assets-test/raw/master/imgs/demo/Metadata_Use-Cases_at_LinkedIn_-_Lightning_Talk.pdf) by [Shirshanka Das](https://www.linkedin.com/in/shirshankadas/) (LinkedIn) - 5 mins
+- [Strongly Consistent Secondary Index (SCSI) in GMA](https://github.com/acryldata/static-assets-test/raw/master/imgs/demo/Datahub_-_Strongly_Consistent_Secondary_Indexing.pdf), an upcoming feature by [Jyoti Wadhwani](https://www.linkedin.com/in/jyotiwadhwani/) (LinkedIn) - 15 minutes
+- [DataHub UI overview](https://github.com/acryldata/static-assets-test/raw/master/imgs/demo/DataHub-UIOverview.pdf) by [Ignacio Bona](https://www.linkedin.com/in/ignaciobona) (LinkedIn) - 20 minutes
- General Q&A from sign up sheet, slack, and participants - 10 mins
- Closing remarks - 5 minutes
@@ -382,8 +381,8 @@ Agenda
Agenda
- Quick intro - 5 mins
-- [Data Discoverability at SpotHero](demo/Data_Discoverability_at_SpotHero.pdf) by [Maggie Hays](https://www.linkedin.com/in/maggie-hays/) (SpotHero) - 20 mins
-- [Designing the next generation of metadata events for scale](demo/Designing_the_next_generation_of_metadata_events_for_scale.pdf) by [Chris Lee](https://www.linkedin.com/in/chrisleecmu/) (LinkedIn) - 15 mins
+- [Data Discoverability at SpotHero](https://github.com/acryldata/static-assets-test/raw/master/imgs/demo/Data_Discoverability_at_SpotHero.pdf) by [Maggie Hays](https://www.linkedin.com/in/maggie-hays/) (SpotHero) - 20 mins
+- [Designing the next generation of metadata events for scale](https://github.com/acryldata/static-assets-test/raw/master/imgs/demo/Designing_the_next_generation_of_metadata_events_for_scale.pdf) by [Chris Lee](https://www.linkedin.com/in/chrisleecmu/) (LinkedIn) - 15 mins
- General Q&A from sign up sheet, slack, and participants - 15 mins
- Closing remarks - 5 mins
diff --git a/docs/ui-ingestion.md b/docs/ui-ingestion.md
index 4435f66e514f33..2ecb1e634c79f1 100644
--- a/docs/ui-ingestion.md
+++ b/docs/ui-ingestion.md
@@ -14,11 +14,19 @@ This document will describe the steps required to configure, schedule, and execu
To view & manage UI-based metadata ingestion, you must have the `Manage Metadata Ingestion` & `Manage Secrets`
privileges assigned to your account. These can be granted by a [Platform Policy](authorization/policies.md).
-![](./imgs/ingestion-privileges.png)
+
+
+
+
+
Once you have these privileges, you can begin to manage ingestion by navigating to the 'Ingestion' tab in DataHub.
-![](./imgs/ingestion-tab.png)
+
+
+
+
+
On this page, you'll see a list of active **Ingestion Sources**. An Ingestion Source is a unique source of metadata ingested
into DataHub from an external source like Snowflake, Redshift, or BigQuery.
@@ -33,7 +41,11 @@ your first **Ingestion Source**.
Before ingesting any metadata, you need to create a new Ingestion Source. Start by clicking **+ Create new source**.
-![](./imgs/create-new-ingestion-source-button.png)
+
+
+
+
+
#### Step 1: Select a Platform Template
@@ -41,7 +53,11 @@ In the first step, select a **Recipe Template** corresponding to the source type
a variety of natively supported integrations, from Snowflake to Postgres to Kafka.
Select `Custom` to construct an ingestion recipe from scratch.
-![](./imgs/select-platform-template.png)
+
+
+
+
+
Next, you'll configure an ingestion **Recipe**, which defines _how_ and _what_ to extract from the source system.
@@ -68,7 +84,11 @@ used by DataHub to extract metadata from a 3rd party system. It most often consi
A sample of a full recipe configured to ingest metadata from MySQL can be found in the image below.
-![](./imgs/example-mysql-recipe.png)
+
+
+
+
+
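+
+For reference, a MySQL recipe along those lines might look like the following sketch; the host, database, credentials, and server address are placeholders:
+
+```yml
+source:
+  type: mysql
+  config:
+    host_port: "localhost:3306"
+    database: my_database
+    username: datahub_reader
+    password: example_password
+sink:
+  type: datahub-rest
+  config:
+    server: "http://localhost:8080"
+```
+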
Detailed configuration examples & documentation for each source type can be found on the [DataHub Docs](https://datahubproject.io/docs/metadata-ingestion/) website.
@@ -80,7 +100,11 @@ that are encrypted and stored within DataHub's storage layer.
To create a secret, first navigate to the 'Secrets' tab. Then click `+ Create new secret`.
-![](./imgs/create-secret.png)
+
+
+
+
+
_Creating a Secret to store the username for a MySQL database_
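+
+Once created, secrets can be referenced by name from a recipe. A hypothetical sketch, assuming secrets named `MYSQL_USERNAME` and `MYSQL_PASSWORD` have been created:
+
+```yml
+source:
+  type: mysql
+  config:
+    host_port: "localhost:3306"
+    username: "${MYSQL_USERNAME}"
+    password: "${MYSQL_PASSWORD}"
+```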
@@ -123,7 +147,11 @@ Secret values are not persisted to disk beyond execution time, and are never tra
Next, you can optionally configure a schedule on which to execute your new Ingestion Source. This enables you to schedule metadata extraction on a monthly, weekly, daily, or hourly cadence depending on the needs of your organization.
Schedules are defined using CRON format.
-![](./imgs/schedule-ingestion.png)
+
+
+
+
+
_An Ingestion Source that is executed at 9:15am every day, Los Angeles time_
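+
+For reference, the schedule in the caption corresponds to a cron expression like the one below. The YAML form is purely illustrative; in the UI the same values are entered through the schedule form:
+
+```yml
+# minute hour day-of-month month day-of-week
+schedule:
+  cron: "15 9 * * *"            # 9:15am every day
+  timezone: "America/Los_Angeles"
+```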
@@ -136,7 +164,11 @@ you can always come back and change this.
Finally, give your Ingestion Source a name.
-![](./imgs/name-ingestion-source.png)
+
+
+
+
+
Once you're happy with your configurations, click 'Done' to save your changes.
@@ -149,7 +181,11 @@ with the server. However, you can override the default package version using the
To do so, simply click 'Advanced', then change the 'CLI Version' text box to contain the exact version
of the DataHub CLI you'd like to use.
-![](./imgs/custom-ingestion-cli-version.png)
+
+
+
+
+
_Pinning the CLI version to version `0.8.23.2`_
Once you're happy with your changes, simply click 'Done' to save.
@@ -200,11 +236,19 @@ Once you've created your Ingestion Source, you can run it by clicking 'Execute'.
you should see the 'Last Status' column of the ingestion source change from `N/A` to `Running`. This
means that the request to execute ingestion has been successfully picked up by the DataHub ingestion executor.
-![](./imgs/running-ingestion.png)
+
+
+
+
+
If ingestion has executed successfully, you should see its state shown in green as `Succeeded`.
-![](./imgs/successful-ingestion.png)
+
+
+
+
+
### Cancelling an Ingestion Run
@@ -212,14 +256,22 @@ If ingestion has executed successfully, you should see it's state shown in green
If your ingestion run is hanging, there may be a bug in the ingestion source, or another persistent issue like exponential timeouts. In these situations,
you can cancel ingestion by clicking **Cancel** on the problematic run.
-![](./imgs/cancelled-ingestion.png)
+
+
+
+
+
Once cancelled, you can view the output of the ingestion run by clicking **Details**.
### Debugging a Failed Ingestion Run
-![](./imgs/failed-ingestion.png)
+
+
+
+
+
A variety of things can cause an ingestion run to fail. Common reasons for failure include:
@@ -235,12 +287,20 @@ A variety of things can cause an ingestion run to fail. Common reasons for failu
4. **Authentication**: If you've enabled [Metadata Service Authentication](authentication/introducing-metadata-service-authentication.md), you'll need to provide a Personal Access Token
in your Recipe Configuration. To do so, set the 'token' field of the sink configuration to contain a Personal Access Token:
- ![](./imgs/ingestion-with-token.png)
+
+
+
+
+
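+
+A sketch of what that sink configuration might look like (the token value is a placeholder):
+
+```yml
+sink:
+  type: datahub-rest
+  config:
+    server: "http://localhost:8080"
+    token: <your-personal-access-token>
+```
+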
The output of each run is captured and available to view in the UI for easier debugging. To view output logs, click **DETAILS**
on the corresponding ingestion run.
-![](./imgs/ingestion-logs.png)
+
+
+
+
+
## FAQ
@@ -250,7 +310,11 @@ If not due to one of the reasons outlined above, this may be because the executo
to reach DataHub's backend using the default configurations. Try changing your ingestion recipe to make the `sink.config.server` variable point to the Docker
DNS name for the `datahub-gms` pod:
-![](./imgs/quickstart-ingestion-config.png)
+
+
+
+
+
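+
+A sketch of the relevant recipe change; the port assumes the quickstart default for `datahub-gms`:
+
+```yml
+sink:
+  type: datahub-rest
+  config:
+    # Docker DNS name of the GMS container instead of localhost
+    server: "http://datahub-gms:8080"
+```
+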
### I see 'N/A' when I try to run ingestion. What do I do?
diff --git a/docs/what/gms.md b/docs/what/gms.md
index 9e1cea1b9540e8..a39450d28ae83e 100644
--- a/docs/what/gms.md
+++ b/docs/what/gms.md
@@ -2,6 +2,4 @@
Metadata for [entities](entity.md) [onboarded](../modeling/metadata-model.md) to [GMA](gma.md) is served through microservices known as Generalized Metadata Service (GMS). GMS typically provides a [Rest.li](http://rest.li) API and must access the metadata using [GMA DAOs](../architecture/metadata-serving.md).
-While a GMS is completely free to define its public APIs, we do provide a list of [resource base classes](https://github.com/datahub-project/datahub-gma/tree/master/restli-resources/src/main/java/com/linkedin/metadata/restli) to leverage for common patterns.
-
-GMA is designed to support a distributed fleet of GMS, each serving a subset of the [GMA graph](graph.md). However, for simplicity we include a single centralized GMS ([datahub-gms](../../gms)) that serves all entities.
+GMA is designed to support a distributed fleet of GMS, each serving a subset of the [GMA graph](graph.md). However, for simplicity we include a single centralized GMS that serves all entities.
diff --git a/docs/what/mxe.md b/docs/what/mxe.md
index 8af96360858a33..25294e04ea3d92 100644
--- a/docs/what/mxe.md
+++ b/docs/what/mxe.md
@@ -266,7 +266,7 @@ A Metadata Change Event represents a request to change multiple aspects for the
It leverages a deprecated concept of `Snapshot`, which is a strongly-typed list of aspects for the same
entity.
-A MCE is a "proposal" for a set of metadata changes, as opposed to [MAE](#metadata-audit-event), which is conveying a committed change.
+An MCE is a "proposal" for a set of metadata changes, as opposed to an [MAE](#metadata-audit-event-mae), which conveys a committed change.
Consequently, only successfully accepted and processed MCEs will lead to the emission of a corresponding MAE / MCLs.
### Emission
diff --git a/docs/what/relationship.md b/docs/what/relationship.md
index 1908bbd6ce75f0..d5348dc04b3c01 100644
--- a/docs/what/relationship.md
+++ b/docs/what/relationship.md
@@ -2,7 +2,11 @@
A relationship is a named association between exactly two [entities](entity.md), a source and a destination.
-![metadata-modeling](../imgs/metadata-modeling.png)
+
+
+
+
+
From the above graph, a `Group` entity can be linked to a `User` entity via a `HasMember` relationship.
Note that the name of the relationship reflects the direction, i.e. pointing from `Group` to `User`.
@@ -98,9 +102,6 @@ For one, the actual direction doesn’t really impact the execution of graph que
That being said, generally there's a more "natural way" to specify the direction of a relationship, which closely relates to how the metadata is stored. For example, the membership information for an LDAP group is generally stored as a list in the group's metadata. As a result, it's more natural to model a `HasMember` relationship that points from a group to a member, instead of an `IsMemberOf` relationship pointing from member to group.
-Since all relationships are explicitly declared, it’s fairly easy for a user to discover what relationships are available and their directionality by inspecting
-the [relationships directory](../../metadata-models/src/main/pegasus/com/linkedin/metadata/relationship). It’s also possible to provide a UI for the catalog of entities and relationships for analysts who are interested in building complex graph queries to gain insights into the metadata.
-
## High Cardinality Relationships
See [this doc](../advanced/high-cardinality.md) for suggestions on how to best model relationships with high cardinality.
diff --git a/docs/what/search-document.md b/docs/what/search-document.md
index 81359a55d0caec..bd27656e512c3a 100644
--- a/docs/what/search-document.md
+++ b/docs/what/search-document.md
@@ -13,7 +13,6 @@ As a result, one may be tempted to add as many attributes as needed. This is acc
Below shows an example schema for the `User` search document. Note that:
1. Each search document is required to have a type-specific `urn` field, which generally maps to an entity in the [graph](graph.md).
2. Similar to `Entity`, each document has an optional `removed` field for "soft deletion".
-This is captured in [BaseDocument](../../metadata-models/src/main/pegasus/com/linkedin/metadata/search/BaseDocument.pdl), which is expected to be included by all documents.
3. Similar to `Entity`, all remaining fields are made `optional` to support partial updates.
4. `management` shows an example of a string array field.
5. `ownedDataset` shows an example on how a field can be derived from metadata [aspects](aspect.md) associated with other types of entity (in this case, `Dataset`).
diff --git a/entity-registry/build.gradle b/entity-registry/build.gradle
index af742d240d1e6b..3da0bf5bb4fb81 100644
--- a/entity-registry/build.gradle
+++ b/entity-registry/build.gradle
@@ -1,16 +1,17 @@
apply plugin: 'pegasus'
+apply plugin: 'java-library'
dependencies {
- compile spec.product.pegasus.data
- compile spec.product.pegasus.generator
- compile project(path: ':metadata-models')
+ implementation spec.product.pegasus.data
+ implementation spec.product.pegasus.generator
+ api project(path: ':metadata-models')
implementation externalDependency.slf4jApi
compileOnly externalDependency.lombok
- compile externalDependency.guava
- compile externalDependency.jacksonDataBind
- compile externalDependency.jacksonDataFormatYaml
- compile externalDependency.reflections
- compile externalDependency.jsonPatch
+ implementation externalDependency.guava
+ implementation externalDependency.jacksonDataBind
+ implementation externalDependency.jacksonDataFormatYaml
+ implementation externalDependency.reflections
+ implementation externalDependency.jsonPatch
constraints {
implementation(externalDependency.snakeYaml) {
because("previous versions are vulnerable to CVE-2022-25857")
@@ -19,12 +20,13 @@ dependencies {
dataModel project(':li-utils')
annotationProcessor externalDependency.lombok
- compile externalDependency.mavenArtifact
+ api externalDependency.mavenArtifact
- testCompile project(':test-models')
- testCompile externalDependency.testng
- testCompile externalDependency.mockito
- testCompile externalDependency.mockitoInline
+ testImplementation project(':test-models')
+ testImplementation project(path: ':test-models', configuration: 'testDataTemplate')
+ testImplementation externalDependency.testng
+ testImplementation externalDependency.mockito
+ testImplementation externalDependency.mockitoInline
}
compileTestJava.dependsOn tasks.getByPath(':entity-registry:custom-test-model:modelDeploy')
diff --git a/entity-registry/custom-test-model/build.gradle b/entity-registry/custom-test-model/build.gradle
index 90f50fe1f29929..778e2e42b95c44 100644
--- a/entity-registry/custom-test-model/build.gradle
+++ b/entity-registry/custom-test-model/build.gradle
@@ -23,11 +23,11 @@ if (project.hasProperty('projVersion')) {
dependencies {
- compile spec.product.pegasus.data
+ implementation spec.product.pegasus.data
// Uncomment these if you want to depend on models defined in core datahub
- //compile project(':li-utils')
+ //implementation project(':li-utils')
//dataModel project(':li-utils')
- //compile project(':metadata-models')
+ //implementation project(':metadata-models')
//dataModel project(':metadata-models')
}
diff --git a/entity-registry/src/main/java/com/linkedin/metadata/models/SearchableFieldSpecExtractor.java b/entity-registry/src/main/java/com/linkedin/metadata/models/SearchableFieldSpecExtractor.java
index 2ffd9283ed4569..8f2f42cd69caee 100644
--- a/entity-registry/src/main/java/com/linkedin/metadata/models/SearchableFieldSpecExtractor.java
+++ b/entity-registry/src/main/java/com/linkedin/metadata/models/SearchableFieldSpecExtractor.java
@@ -155,7 +155,8 @@ private void extractSearchableAnnotation(final Object annotationObj, final DataS
annotation.getBoostScore(),
annotation.getHasValuesFieldName(),
annotation.getNumValuesFieldName(),
- annotation.getWeightsPerFieldValue());
+ annotation.getWeightsPerFieldValue(),
+ annotation.getFieldNameAliases());
}
}
log.debug("Searchable annotation for field: {} : {}", schemaPathSpec, annotation);
diff --git a/entity-registry/src/main/java/com/linkedin/metadata/models/annotation/SearchableAnnotation.java b/entity-registry/src/main/java/com/linkedin/metadata/models/annotation/SearchableAnnotation.java
index 3d3fbcf3ccaa6f..d5e5044f95c238 100644
--- a/entity-registry/src/main/java/com/linkedin/metadata/models/annotation/SearchableAnnotation.java
+++ b/entity-registry/src/main/java/com/linkedin/metadata/models/annotation/SearchableAnnotation.java
@@ -4,7 +4,10 @@
import com.google.common.collect.ImmutableSet;
import com.linkedin.data.schema.DataSchema;
import com.linkedin.metadata.models.ModelValidationException;
+
+import java.util.ArrayList;
import java.util.Arrays;
+import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
@@ -19,6 +22,7 @@
@Value
public class SearchableAnnotation {
+ public static final String FIELD_NAME_ALIASES = "fieldNameAliases";
public static final String ANNOTATION_NAME = "Searchable";
private static final Set<FieldType> DEFAULT_QUERY_FIELD_TYPES =
ImmutableSet.of(FieldType.TEXT, FieldType.TEXT_PARTIAL, FieldType.WORD_GRAM, FieldType.URN, FieldType.URN_PARTIAL);
@@ -47,6 +51,8 @@ public class SearchableAnnotation {
Optional<String> numValuesFieldName;
// (Optional) Weights to apply to score for a given value
Map<Object, Double> weightsPerFieldValue;
+ // (Optional) Aliases for this given field that can be used for sorting etc.
+ List<String> fieldNameAliases;
public enum FieldType {
KEYWORD,
@@ -94,6 +100,7 @@ public static SearchableAnnotation fromPegasusAnnotationObject(@Nonnull final Ob
final Optional<String> numValuesFieldName = AnnotationUtils.getField(map, "numValuesFieldName", String.class);
final Optional<Map> weightsPerFieldValueMap =
AnnotationUtils.getField(map, "weightsPerFieldValue", Map.class).map(m -> (Map<Object, Double>) m);
+ final List<String> fieldNameAliases = getFieldNameAliases(map);
final FieldType resolvedFieldType = getFieldType(fieldType, schemaDataType);
return new SearchableAnnotation(
@@ -108,7 +115,8 @@ public static SearchableAnnotation fromPegasusAnnotationObject(@Nonnull final Ob
boostScore.orElse(1.0),
hasValuesFieldName,
numValuesFieldName,
- weightsPerFieldValueMap.orElse(ImmutableMap.of()));
+ weightsPerFieldValueMap.orElse(ImmutableMap.of()),
+ fieldNameAliases);
}
private static FieldType getFieldType(Optional maybeFieldType, DataSchema.Type schemaDataType) {
@@ -156,4 +164,15 @@ private static String capitalizeFirstLetter(String str) {
return str.substring(0, 1).toUpperCase() + str.substring(1);
}
}
+
+ private static List<String> getFieldNameAliases(Map map) {
+ final List<String> aliases = new ArrayList<>();
+ final Optional<List> fieldNameAliases = AnnotationUtils.getField(map, FIELD_NAME_ALIASES, List.class);
+ if (fieldNameAliases.isPresent()) {
+ for (Object alias : fieldNameAliases.get()) {
+ aliases.add((String) alias);
+ }
+ }
+ return aliases;
+ }
}
diff --git a/entity-registry/src/main/java/com/linkedin/metadata/models/registry/config/Entity.java b/entity-registry/src/main/java/com/linkedin/metadata/models/registry/config/Entity.java
index c446f63b653211..f32aa1aa8bd478 100644
--- a/entity-registry/src/main/java/com/linkedin/metadata/models/registry/config/Entity.java
+++ b/entity-registry/src/main/java/com/linkedin/metadata/models/registry/config/Entity.java
@@ -1,12 +1,14 @@
package com.linkedin.metadata.models.registry.config;
import java.util.List;
+
import lombok.AccessLevel;
import lombok.AllArgsConstructor;
import lombok.NoArgsConstructor;
import lombok.Value;
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
+import javax.annotation.Nullable;
@Value
@@ -18,4 +20,7 @@ public class Entity {
String doc;
String keyAspect;
List<String> aspects;
+
+ @Nullable
+ String category;
}
diff --git a/gradle/wrapper/gradle-wrapper.jar b/gradle/wrapper/gradle-wrapper.jar
index e708b1c023ec8b..afba109285af78 100644
Binary files a/gradle/wrapper/gradle-wrapper.jar and b/gradle/wrapper/gradle-wrapper.jar differ
diff --git a/gradle/wrapper/gradle-wrapper.properties b/gradle/wrapper/gradle-wrapper.properties
index ec991f9aa12cb8..4e86b9270786fb 100644
--- a/gradle/wrapper/gradle-wrapper.properties
+++ b/gradle/wrapper/gradle-wrapper.properties
@@ -1,5 +1,6 @@
distributionBase=GRADLE_USER_HOME
distributionPath=wrapper/dists
-distributionUrl=https\://services.gradle.org/distributions/gradle-6.9.2-bin.zip
+distributionUrl=https\://services.gradle.org/distributions/gradle-7.6.2-bin.zip
+networkTimeout=10000
zipStoreBase=GRADLE_USER_HOME
zipStorePath=wrapper/dists
diff --git a/gradlew b/gradlew
index 1b6c787337ffb7..65dcd68d65c82f 100755
--- a/gradlew
+++ b/gradlew
@@ -55,7 +55,7 @@
# Darwin, MinGW, and NonStop.
#
# (3) This script is generated from the Groovy template
-# https://github.com/gradle/gradle/blob/master/subprojects/plugins/src/main/resources/org/gradle/api/internal/plugins/unixStartScript.txt
+# https://github.com/gradle/gradle/blob/HEAD/subprojects/plugins/src/main/resources/org/gradle/api/internal/plugins/unixStartScript.txt
# within the Gradle project.
#
# You can find Gradle at https://github.com/gradle/gradle/.
@@ -80,10 +80,10 @@ do
esac
done
-APP_HOME=$( cd "${APP_HOME:-./}" && pwd -P ) || exit
-
-APP_NAME="Gradle"
+# This is normally unused
+# shellcheck disable=SC2034
APP_BASE_NAME=${0##*/}
+APP_HOME=$( cd "${APP_HOME:-./}" && pwd -P ) || exit
# Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to pass JVM options to this script.
DEFAULT_JVM_OPTS='"-Xmx64m" "-Xms64m"'
@@ -143,12 +143,16 @@ fi
if ! "$cygwin" && ! "$darwin" && ! "$nonstop" ; then
case $MAX_FD in #(
max*)
+ # In POSIX sh, ulimit -H is undefined. That's why the result is checked to see if it worked.
+ # shellcheck disable=SC3045
MAX_FD=$( ulimit -H -n ) ||
warn "Could not query maximum file descriptor limit"
esac
case $MAX_FD in #(
'' | soft) :;; #(
*)
+ # In POSIX sh, ulimit -n is undefined. That's why the result is checked to see if it worked.
+ # shellcheck disable=SC3045
ulimit -n "$MAX_FD" ||
warn "Could not set maximum file descriptor limit to $MAX_FD"
esac
@@ -205,6 +209,12 @@ set -- \
org.gradle.wrapper.GradleWrapperMain \
"$@"
+# Stop when "xargs" is not available.
+if ! command -v xargs >/dev/null 2>&1
+then
+ die "xargs is not available"
+fi
+
# Use "xargs" to parse quoted args.
#
# With -n1 it outputs one arg per line, with the quotes and backslashes removed.
diff --git a/gradlew.bat b/gradlew.bat
index ac1b06f93825db..6689b85beecde6 100644
--- a/gradlew.bat
+++ b/gradlew.bat
@@ -14,7 +14,7 @@
@rem limitations under the License.
@rem
-@if "%DEBUG%" == "" @echo off
+@if "%DEBUG%"=="" @echo off
@rem ##########################################################################
@rem
@rem Gradle startup script for Windows
@@ -25,7 +25,8 @@
if "%OS%"=="Windows_NT" setlocal
set DIRNAME=%~dp0
-if "%DIRNAME%" == "" set DIRNAME=.
+if "%DIRNAME%"=="" set DIRNAME=.
+@rem This is normally unused
set APP_BASE_NAME=%~n0
set APP_HOME=%DIRNAME%
@@ -40,7 +41,7 @@ if defined JAVA_HOME goto findJavaFromJavaHome
set JAVA_EXE=java.exe
%JAVA_EXE% -version >NUL 2>&1
-if "%ERRORLEVEL%" == "0" goto execute
+if %ERRORLEVEL% equ 0 goto execute
echo.
echo ERROR: JAVA_HOME is not set and no 'java' command could be found in your PATH.
@@ -75,13 +76,15 @@ set CLASSPATH=%APP_HOME%\gradle\wrapper\gradle-wrapper.jar
:end
@rem End local scope for the variables with windows NT shell
-if "%ERRORLEVEL%"=="0" goto mainEnd
+if %ERRORLEVEL% equ 0 goto mainEnd
:fail
rem Set variable GRADLE_EXIT_CONSOLE if you need the _script_ return code instead of
rem the _cmd.exe /c_ return code!
-if not "" == "%GRADLE_EXIT_CONSOLE%" exit 1
-exit /b 1
+set EXIT_CODE=%ERRORLEVEL%
+if %EXIT_CODE% equ 0 set EXIT_CODE=1
+if not ""=="%GRADLE_EXIT_CONSOLE%" exit %EXIT_CODE%
+exit /b %EXIT_CODE%
:mainEnd
if "%OS%"=="Windows_NT" endlocal
diff --git a/ingestion-scheduler/build.gradle b/ingestion-scheduler/build.gradle
index b15b5b8c52673e..dc9887406b8b4f 100644
--- a/ingestion-scheduler/build.gradle
+++ b/ingestion-scheduler/build.gradle
@@ -1,16 +1,17 @@
apply plugin: 'java'
dependencies {
- compile project(path: ':metadata-models')
- compile project(path: ':metadata-io')
- compile project(path: ':metadata-service:restli-client')
- compile project(':metadata-service:configuration')
+ implementation project(path: ':metadata-models')
+ implementation project(path: ':metadata-io')
+ implementation project(path: ':metadata-service:restli-client')
+ implementation project(':metadata-service:configuration')
+
implementation externalDependency.slf4jApi
compileOnly externalDependency.lombok
annotationProcessor externalDependency.lombok
- testCompile externalDependency.mockito
- testCompile externalDependency.testng
+ testImplementation externalDependency.mockito
+ testImplementation externalDependency.testng
constraints {
implementation(externalDependency.log4jCore) {
diff --git a/li-utils/build.gradle b/li-utils/build.gradle
index d11cd86659605c..8f526cffba094b 100644
--- a/li-utils/build.gradle
+++ b/li-utils/build.gradle
@@ -1,4 +1,4 @@
-apply plugin: 'java'
+apply plugin: 'java-library'
apply plugin: 'pegasus'
tasks.withType(JavaCompile).configureEach {
@@ -13,19 +13,21 @@ tasks.withType(Test).configureEach {
}
dependencies {
- compile spec.product.pegasus.data
- compile externalDependency.commonsLang
- compile(externalDependency.reflections) {
+ api spec.product.pegasus.data
+ implementation externalDependency.commonsLang
+ implementation(externalDependency.reflections) {
exclude group: 'com.google.guava', module: 'guava'
}
- compile externalDependency.guava
+ implementation externalDependency.guava
implementation externalDependency.slf4jApi
compileOnly externalDependency.lombok
annotationProcessor externalDependency.lombok
- testCompile externalDependency.assertJ
- testCompile project(':test-models')
+ testImplementation externalDependency.assertJ
+ testImplementation externalDependency.commonsIo
+ testImplementation project(':test-models')
+ testImplementation project(path: ':test-models', configuration: 'testDataTemplate')
}
idea {
@@ -34,5 +36,5 @@ idea {
}
}
-// Need to compile backing java definitions with the data template.
+// Need to compile backing java parameterDefinitions with the data template.
sourceSets.mainGeneratedDataTemplate.java.srcDirs('src/main/javaPegasus/')
\ No newline at end of file
diff --git a/metadata-auth/auth-api/build.gradle b/metadata-auth/auth-api/build.gradle
index f82f488b6f182a..7159aa5f15e61e 100644
--- a/metadata-auth/auth-api/build.gradle
+++ b/metadata-auth/auth-api/build.gradle
@@ -3,7 +3,7 @@ plugins {
}
apply plugin: 'com.github.johnrengelman.shadow'
-apply plugin: 'java'
+apply plugin: 'java-library'
apply plugin: 'signing'
apply plugin: 'maven-publish'
apply plugin: 'io.codearte.nexus-staging'
@@ -28,14 +28,14 @@ shadowJar {
dependencies() {
implementation spec.product.pegasus.data
implementation project(path: ':li-utils')
- implementation project(path: ':metadata-utils')
+ api project(path: ':metadata-utils')
- compile externalDependency.guava
- compile externalDependency.lombok
+ implementation externalDependency.guava
+ compileOnly externalDependency.lombok
annotationProcessor externalDependency.lombok
-
- testCompile externalDependency.testng
+
+ testImplementation externalDependency.testng
}
task sourcesJar(type: Jar) {
diff --git a/metadata-auth/auth-api/src/main/java/com/datahub/authorization/ResolvedResourceSpec.java b/metadata-auth/auth-api/src/main/java/com/datahub/authorization/ResolvedResourceSpec.java
index 0dae1bd386ccd6..53dd0be44f963d 100644
--- a/metadata-auth/auth-api/src/main/java/com/datahub/authorization/ResolvedResourceSpec.java
+++ b/metadata-auth/auth-api/src/main/java/com/datahub/authorization/ResolvedResourceSpec.java
@@ -3,7 +3,6 @@
import java.util.Collections;
import java.util.Map;
import java.util.Set;
-import javax.annotation.Nullable;
import lombok.Getter;
import lombok.RequiredArgsConstructor;
import lombok.ToString;
@@ -26,21 +25,6 @@ public Set getFieldValues(ResourceFieldType resourceFieldType) {
return fieldResolvers.get(resourceFieldType).getFieldValuesFuture().join().getValues();
}
- /**
- * Fetch the entity-registry type for a resource. ('dataset', 'dashboard', 'chart').
- * @return the entity type.
- */
- public String getType() {
- if (!fieldResolvers.containsKey(ResourceFieldType.RESOURCE_TYPE)) {
- throw new UnsupportedOperationException(
- "Failed to resolve resource type! No field resolver for RESOURCE_TYPE provided.");
- }
- Set resourceTypes =
- fieldResolvers.get(ResourceFieldType.RESOURCE_TYPE).getFieldValuesFuture().join().getValues();
- assert resourceTypes.size() == 1; // There should always be a single resource type.
- return resourceTypes.stream().findFirst().get();
- }
-
/**
* Fetch the owners for a resource.
* @return a set of owner urns, or empty set if none exist.
@@ -51,20 +35,4 @@ public Set getOwners() {
}
return fieldResolvers.get(ResourceFieldType.OWNER).getFieldValuesFuture().join().getValues();
}
-
- /**
- * Fetch the domain for a Resolved Resource Spec
- * @return a Domain or null if one does not exist.
- */
- @Nullable
- public String getDomain() {
- if (!fieldResolvers.containsKey(ResourceFieldType.DOMAIN)) {
- return null;
- }
- Set domains = fieldResolvers.get(ResourceFieldType.DOMAIN).getFieldValuesFuture().join().getValues();
- if (domains.size() > 0) {
- return domains.stream().findFirst().get();
- }
- return null;
- }
}
diff --git a/metadata-dao-impl/kafka-producer/build.gradle b/metadata-dao-impl/kafka-producer/build.gradle
index 6b08ac50a4c17d..393b10b0e9d246 100644
--- a/metadata-dao-impl/kafka-producer/build.gradle
+++ b/metadata-dao-impl/kafka-producer/build.gradle
@@ -1,20 +1,23 @@
apply plugin: 'java'
dependencies {
- compile project(':metadata-events:mxe-avro-1.7')
- compile project(':metadata-events:mxe-registration')
- compile project(':metadata-events:mxe-utils-avro-1.7')
- compile project(':entity-registry')
- compile project(':metadata-io')
+ implementation project(':metadata-events:mxe-avro-1.7')
+ implementation project(':metadata-events:mxe-registration')
+ implementation project(':metadata-events:mxe-utils-avro-1.7')
+ implementation project(':entity-registry')
+ implementation project(':metadata-io')
- compile externalDependency.kafkaClients
+ implementation externalDependency.kafkaClients
+ implementation externalDependency.springBeans
+ implementation externalDependency.springContext
+ implementation externalDependency.opentelemetryAnnotations
implementation externalDependency.slf4jApi
compileOnly externalDependency.lombok
annotationProcessor externalDependency.lombok
- testCompile externalDependency.mockito
+ testImplementation externalDependency.mockito
constraints {
implementation(externalDependency.log4jCore) {
diff --git a/metadata-dao-impl/kafka-producer/src/main/java/com/linkedin/metadata/dao/producer/KafkaEventProducer.java b/metadata-dao-impl/kafka-producer/src/main/java/com/linkedin/metadata/dao/producer/KafkaEventProducer.java
index 65bf250200d131..00b5bb75d901bd 100644
--- a/metadata-dao-impl/kafka-producer/src/main/java/com/linkedin/metadata/dao/producer/KafkaEventProducer.java
+++ b/metadata-dao-impl/kafka-producer/src/main/java/com/linkedin/metadata/dao/producer/KafkaEventProducer.java
@@ -1,25 +1,18 @@
package com.linkedin.metadata.dao.producer;
import com.datahub.util.exception.ModelConversionException;
-import com.google.common.annotations.VisibleForTesting;
import com.linkedin.common.urn.Urn;
import com.linkedin.metadata.EventUtils;
import com.linkedin.metadata.event.EventProducer;
import com.linkedin.metadata.models.AspectSpec;
-import com.linkedin.metadata.snapshot.Snapshot;
import com.linkedin.mxe.DataHubUpgradeHistoryEvent;
-import com.linkedin.mxe.MetadataAuditEvent;
-import com.linkedin.mxe.MetadataAuditOperation;
import com.linkedin.mxe.MetadataChangeLog;
import com.linkedin.mxe.MetadataChangeProposal;
import com.linkedin.mxe.PlatformEvent;
-import com.linkedin.mxe.SystemMetadata;
import com.linkedin.mxe.TopicConvention;
import com.linkedin.mxe.TopicConventionImpl;
-import com.linkedin.mxe.Topics;
import io.opentelemetry.extension.annotations.WithSpan;
import java.io.IOException;
-import java.util.Arrays;
import java.util.concurrent.Future;
import javax.annotation.Nonnull;
import javax.annotation.Nullable;
@@ -55,45 +48,6 @@ public KafkaEventProducer(@Nonnull final Producer produceMetadataChangeLog(@Nonnull final Urn urn, @Nonnull AspectSpec aspectSpec,
@@ -120,7 +74,7 @@ record = EventUtils.pegasusToAvroMCL(metadataChangeLog);
@Override
@WithSpan
public Future<?> produceMetadataChangeProposal(@Nonnull final Urn urn,
- @Nonnull final MetadataChangeProposal metadataChangeProposal) {
+ @Nonnull final MetadataChangeProposal metadataChangeProposal) {
GenericRecord record;
try {
@@ -171,9 +125,4 @@ record = EventUtils.pegasusToAvroDUHE(event);
_producer.send(new ProducerRecord(topic, event.getVersion(), record), _kafkaHealthChecker
.getKafkaCallBack("History Event", "Event Version: " + event.getVersion()));
}
-
- @VisibleForTesting
- static boolean isValidAspectSpecificTopic(@Nonnull String topic) {
- return Arrays.stream(Topics.class.getFields()).anyMatch(field -> field.getName().equals(topic));
- }
}
diff --git a/metadata-events/mxe-avro-1.7/build.gradle b/metadata-events/mxe-avro-1.7/build.gradle
index 6bde1511bf280a..8c0a26d22dc7d2 100644
--- a/metadata-events/mxe-avro-1.7/build.gradle
+++ b/metadata-events/mxe-avro-1.7/build.gradle
@@ -3,11 +3,11 @@ configurations {
}
apply plugin: 'io.acryl.gradle.plugin.avro'
-apply plugin: 'java'
+apply plugin: 'java-library'
dependencies {
- compile externalDependency.avro_1_7
- compile(externalDependency.avroCompiler_1_7) {
+ api externalDependency.avro_1_7
+ implementation(externalDependency.avroCompiler_1_7) {
exclude group: 'org.apache.velocity', module: 'velocity'
}
constraints {
@@ -43,4 +43,8 @@ jar {
dependsOn classes
from sourceSets.main.output
exclude('com/linkedin/events/**')
+}
+
+clean {
+ delete 'src'
}
\ No newline at end of file
diff --git a/metadata-events/mxe-registration/build.gradle b/metadata-events/mxe-registration/build.gradle
index aa5fad09f3fec2..60e0da59616d93 100644
--- a/metadata-events/mxe-registration/build.gradle
+++ b/metadata-events/mxe-registration/build.gradle
@@ -5,11 +5,12 @@ configurations {
}
dependencies {
- compile project(':metadata-events:mxe-avro-1.7')
- compile project(':metadata-models')
- compile spec.product.pegasus.dataAvro1_6
+ implementation project(':metadata-events:mxe-avro-1.7')
+ implementation project(':metadata-models')
+ implementation spec.product.pegasus.dataAvro1_6
- testCompile project(':test-models')
+ testImplementation project(':test-models')
+ testImplementation project(path: ':test-models', configuration: 'testDataTemplate')
avroOriginal project(path: ':metadata-models', configuration: 'avroSchema')
diff --git a/metadata-events/mxe-schemas/build.gradle b/metadata-events/mxe-schemas/build.gradle
index 0b3e621b8db150..fe46601fb68b79 100644
--- a/metadata-events/mxe-schemas/build.gradle
+++ b/metadata-events/mxe-schemas/build.gradle
@@ -11,6 +11,10 @@ task copyMetadataModels(type: Copy) {
}
generateAvroSchema.dependsOn copyMetadataModels
+validateSchemaAnnotation.dependsOn copyMetadataModels
+mainTranslateSchemas.dependsOn copyMetadataModels
+generateDataTemplate.dependsOn copyMetadataModels
+mainCopySchemas.dependsOn copyMetadataModels
pegasus.main.generationModes = [PegasusGenerationMode.PEGASUS, PegasusGenerationMode.AVRO]
task copyOriginalAvsc(type: Copy, dependsOn: generateAvroSchema) {
diff --git a/metadata-events/mxe-utils-avro-1.7/build.gradle b/metadata-events/mxe-utils-avro-1.7/build.gradle
index f8474e21daa0bd..82249d393578cb 100644
--- a/metadata-events/mxe-utils-avro-1.7/build.gradle
+++ b/metadata-events/mxe-utils-avro-1.7/build.gradle
@@ -1,11 +1,12 @@
-apply plugin: 'java'
+apply plugin: 'java-library'
dependencies {
- compile project(':metadata-events:mxe-avro-1.7')
- compile project(':metadata-models')
- compile spec.product.pegasus.dataAvro1_6
+ api project(':metadata-events:mxe-avro-1.7')
+ api project(':metadata-models')
+ api spec.product.pegasus.dataAvro1_6
- testCompile project(':test-models')
+ testImplementation project(':test-models')
+ testImplementation project(path: ':test-models', configuration: 'testDataTemplate')
constraints {
implementation(externalDependency.log4jCore) {
diff --git a/metadata-ingestion-modules/airflow-plugin/build.gradle b/metadata-ingestion-modules/airflow-plugin/build.gradle
index 336be8fc94d442..58a2bc9e670e34 100644
--- a/metadata-ingestion-modules/airflow-plugin/build.gradle
+++ b/metadata-ingestion-modules/airflow-plugin/build.gradle
@@ -7,6 +7,10 @@ ext {
venv_name = 'venv'
}
+if (!project.hasProperty("extra_pip_requirements")) {
+ ext.extra_pip_requirements = ""
+}
+
def pip_install_command = "${venv_name}/bin/pip install -e ../../metadata-ingestion"
task checkPythonVersion(type: Exec) {
@@ -14,30 +18,37 @@ task checkPythonVersion(type: Exec) {
}
task environmentSetup(type: Exec, dependsOn: checkPythonVersion) {
+ def sentinel_file = "${venv_name}/.venv_environment_sentinel"
inputs.file file('setup.py')
- outputs.dir("${venv_name}")
- commandLine 'bash', '-c', "${python_executable} -m venv ${venv_name} && ${venv_name}/bin/python -m pip install --upgrade pip wheel 'setuptools>=63.0.0'"
+ outputs.file(sentinel_file)
+ commandLine 'bash', '-c',
+ "${python_executable} -m venv ${venv_name} &&" +
+ "${venv_name}/bin/python -m pip install --upgrade pip wheel 'setuptools>=63.0.0' && " +
+ "touch ${sentinel_file}"
}
-task installPackage(type: Exec, dependsOn: environmentSetup) {
+task installPackage(type: Exec, dependsOn: [environmentSetup, ':metadata-ingestion:codegen']) {
+ def sentinel_file = "${venv_name}/.build_install_package_sentinel"
inputs.file file('setup.py')
- outputs.dir("${venv_name}")
+ outputs.file(sentinel_file)
// Workaround for https://github.com/yaml/pyyaml/issues/601.
// See https://github.com/yaml/pyyaml/issues/601#issuecomment-1638509577.
// and https://github.com/datahub-project/datahub/pull/8435.
commandLine 'bash', '-x', '-c',
"${pip_install_command} install 'Cython<3.0' 'PyYAML<6' --no-build-isolation && " +
- "${pip_install_command} -e ."
+ "${pip_install_command} -e . ${extra_pip_requirements} &&" +
+ "touch ${sentinel_file}"
}
task install(dependsOn: [installPackage])
task installDev(type: Exec, dependsOn: [install]) {
+ def sentinel_file = "${venv_name}/.build_install_dev_sentinel"
inputs.file file('setup.py')
- outputs.dir("${venv_name}")
- outputs.file("${venv_name}/.build_install_dev_sentinel")
+ outputs.file("${sentinel_file}")
commandLine 'bash', '-x', '-c',
- "${pip_install_command} -e .[dev] && touch ${venv_name}/.build_install_dev_sentinel"
+ "${pip_install_command} -e .[dev] ${extra_pip_requirements} && " +
+ "touch ${sentinel_file}"
}
task lint(type: Exec, dependsOn: installDev) {
@@ -45,9 +56,13 @@ task lint(type: Exec, dependsOn: installDev) {
The find/sed combo below is a temporary work-around for the following mypy issue with airflow 2.2.0:
"venv/lib/python3.8/site-packages/airflow/_vendor/connexion/spec.py:169: error: invalid syntax".
*/
- commandLine 'bash', '-x', '-c',
+ commandLine 'bash', '-c',
"find ${venv_name}/lib -path *airflow/_vendor/connexion/spec.py -exec sed -i.bak -e '169,169s/ # type: List\\[str\\]//g' {} \\; && " +
- "source ${venv_name}/bin/activate && black --check --diff src/ tests/ && isort --check --diff src/ tests/ && flake8 --count --statistics src/ tests/ && mypy src/ tests/"
+ "source ${venv_name}/bin/activate && set -x && " +
+ "black --check --diff src/ tests/ && " +
+ "isort --check --diff src/ tests/ && " +
+ "flake8 --count --statistics src/ tests/ && " +
+ "mypy --show-traceback --show-error-codes src/ tests/"
}
task lintFix(type: Exec, dependsOn: installDev) {
commandLine 'bash', '-x', '-c',
@@ -58,21 +73,13 @@ task lintFix(type: Exec, dependsOn: installDev) {
"mypy src/ tests/ "
}
-task testQuick(type: Exec, dependsOn: installDev) {
- // We can't enforce the coverage requirements if we run a subset of the tests.
- inputs.files(project.fileTree(dir: "src/", include: "**/*.py"))
- inputs.files(project.fileTree(dir: "tests/"))
- outputs.dir("${venv_name}")
- commandLine 'bash', '-x', '-c',
- "source ${venv_name}/bin/activate && pytest -vv --continue-on-collection-errors --junit-xml=junit.quick.xml"
-}
-
task installDevTest(type: Exec, dependsOn: [installDev]) {
+ def sentinel_file = "${venv_name}/.build_install_dev_test_sentinel"
inputs.file file('setup.py')
outputs.dir("${venv_name}")
- outputs.file("${venv_name}/.build_install_dev_test_sentinel")
+ outputs.file("${sentinel_file}")
commandLine 'bash', '-x', '-c',
- "${pip_install_command} -e .[dev,integration-tests] && touch ${venv_name}/.build_install_dev_test_sentinel"
+ "${pip_install_command} -e .[dev,integration-tests] && touch ${sentinel_file}"
}
def testFile = hasProperty('testFile') ? testFile : 'unknown'
@@ -89,18 +96,28 @@ task testSingle(dependsOn: [installDevTest]) {
}
}
+task testQuick(type: Exec, dependsOn: installDevTest) {
+ // We can't enforce the coverage requirements if we run a subset of the tests.
+ inputs.files(project.fileTree(dir: "src/", include: "**/*.py"))
+ inputs.files(project.fileTree(dir: "tests/"))
+ outputs.dir("${venv_name}")
+ commandLine 'bash', '-x', '-c',
+ "source ${venv_name}/bin/activate && pytest -vv --continue-on-collection-errors --junit-xml=junit.quick.xml"
+}
+
+
task testFull(type: Exec, dependsOn: [testQuick, installDevTest]) {
commandLine 'bash', '-x', '-c',
"source ${venv_name}/bin/activate && pytest -m 'not slow_integration' -vv --continue-on-collection-errors --junit-xml=junit.full.xml"
}
-task buildWheel(type: Exec, dependsOn: [install]) {
- commandLine 'bash', '-c', "source ${venv_name}/bin/activate && " + 'pip install build && RELEASE_VERSION="\${RELEASE_VERSION:-0.0.0.dev1}" RELEASE_SKIP_TEST=1 RELEASE_SKIP_UPLOAD=1 ./scripts/release.sh'
-}
task cleanPythonCache(type: Exec) {
commandLine 'bash', '-c',
"find src -type f -name '*.py[co]' -delete -o -type d -name __pycache__ -delete -o -type d -empty -delete"
}
+task buildWheel(type: Exec, dependsOn: [install, cleanPythonCache]) {
+ commandLine 'bash', '-c', "source ${venv_name}/bin/activate && " + 'pip install build && RELEASE_VERSION="\${RELEASE_VERSION:-0.0.0.dev1}" RELEASE_SKIP_TEST=1 RELEASE_SKIP_UPLOAD=1 ./scripts/release.sh'
+}
build.dependsOn install
check.dependsOn lint
diff --git a/metadata-ingestion-modules/airflow-plugin/pyproject.toml b/metadata-ingestion-modules/airflow-plugin/pyproject.toml
index 83b79e31461767..fba81486b9f677 100644
--- a/metadata-ingestion-modules/airflow-plugin/pyproject.toml
+++ b/metadata-ingestion-modules/airflow-plugin/pyproject.toml
@@ -9,7 +9,6 @@ extend-exclude = '''
^/tmp
'''
include = '\.pyi?$'
-target-version = ['py36', 'py37', 'py38']
[tool.isort]
indent = ' '
diff --git a/metadata-ingestion-modules/airflow-plugin/scripts/release.sh b/metadata-ingestion-modules/airflow-plugin/scripts/release.sh
index 7134187a458850..87157479f37d63 100755
--- a/metadata-ingestion-modules/airflow-plugin/scripts/release.sh
+++ b/metadata-ingestion-modules/airflow-plugin/scripts/release.sh
@@ -13,7 +13,7 @@ MODULE=datahub_airflow_plugin
python -c 'import setuptools; where="./src"; assert setuptools.find_packages(where) == setuptools.find_namespace_packages(where), "you seem to be missing or have extra __init__.py files"'
if [[ ${RELEASE_VERSION:-} ]]; then
# Replace version with RELEASE_VERSION env variable
- sed -i.bak "s/__version__ = \"0.0.0.dev0\"/__version__ = \"$RELEASE_VERSION\"/" src/${MODULE}/__init__.py
+ sed -i.bak "s/__version__ = \"1!0.0.0.dev0\"/__version__ = \"$RELEASE_VERSION\"/" src/${MODULE}/__init__.py
else
vim src/${MODULE}/__init__.py
fi
diff --git a/metadata-ingestion-modules/airflow-plugin/setup.cfg b/metadata-ingestion-modules/airflow-plugin/setup.cfg
index c9a2ba93e9933c..157bcce1c298d2 100644
--- a/metadata-ingestion-modules/airflow-plugin/setup.cfg
+++ b/metadata-ingestion-modules/airflow-plugin/setup.cfg
@@ -69,4 +69,6 @@ exclude_lines =
pragma: no cover
@abstract
if TYPE_CHECKING:
-#omit =
+omit =
+ # omit example dags
+ src/datahub_airflow_plugin/example_dags/*
diff --git a/metadata-ingestion-modules/airflow-plugin/setup.py b/metadata-ingestion-modules/airflow-plugin/setup.py
index c2571916ca5d0d..18e605ae76ebd5 100644
--- a/metadata-ingestion-modules/airflow-plugin/setup.py
+++ b/metadata-ingestion-modules/airflow-plugin/setup.py
@@ -13,16 +13,21 @@ def get_long_description():
return pathlib.Path(os.path.join(root, "README.md")).read_text()
+rest_common = {"requests", "requests_file"}
+
base_requirements = {
# Compatibility.
"dataclasses>=0.6; python_version < '3.7'",
- "typing_extensions>=3.10.0.2",
+ # Typing extension should be >=3.10.0.2 ideally but we can't restrict due to Airflow 2.0.2 dependency conflict
+ "typing_extensions>=3.7.4.3 ; python_version < '3.8'",
+ "typing_extensions>=3.10.0.2,<4.6.0 ; python_version >= '3.8'",
"mypy_extensions>=0.4.3",
# Actual dependencies.
"typing-inspect",
"pydantic>=1.5.1",
"apache-airflow >= 2.0.2",
- f"acryl-datahub[airflow] == {package_metadata['__version__']}",
+ *rest_common,
+ f"acryl-datahub == {package_metadata['__version__']}",
}
@@ -47,19 +52,18 @@ def get_long_description():
base_dev_requirements = {
*base_requirements,
*mypy_stubs,
- "black>=21.12b0",
+ "black==22.12.0",
"coverage>=5.1",
"flake8>=3.8.3",
"flake8-tidy-imports>=4.3.0",
"isort>=5.7.0",
- "mypy>=0.920",
+ "mypy>=1.4.0",
# pydantic 1.8.2 is incompatible with mypy 0.910.
# See https://github.com/samuelcolvin/pydantic/pull/3175#issuecomment-995382910.
- "pydantic>=1.9.0",
+ "pydantic>=1.10",
"pytest>=6.2.2",
"pytest-asyncio>=0.16.0",
"pytest-cov>=2.8.1",
- "pytest-docker>=0.10.3,<0.12",
"tox",
"deepdiff",
"requests-mock",
@@ -117,6 +121,9 @@ def get_long_description():
# Package info.
zip_safe=False,
python_requires=">=3.7",
+ package_data={
+ "datahub_airflow_plugin": ["py.typed"],
+ },
package_dir={"": "src"},
packages=setuptools.find_namespace_packages(where="./src"),
entry_points=entry_points,
@@ -127,5 +134,13 @@ def get_long_description():
"datahub-kafka": [
f"acryl-datahub[datahub-kafka] == {package_metadata['__version__']}"
],
+ "integration-tests": [
+ f"acryl-datahub[datahub-kafka] == {package_metadata['__version__']}",
+ # Extra requirements for Airflow.
+ "apache-airflow[snowflake]>=2.0.2", # snowflake is used in example dags
+ # Because of https://github.com/snowflakedb/snowflake-sqlalchemy/issues/350 we need to restrict SQLAlchemy's max version.
+ "SQLAlchemy<1.4.42",
+ "virtualenv", # needed by PythonVirtualenvOperator
+ ],
},
)
diff --git a/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/__init__.py b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/__init__.py
index ce98a0fc1fb609..b2c45d3a1e75d3 100644
--- a/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/__init__.py
+++ b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/__init__.py
@@ -1,6 +1,6 @@
# Published at https://pypi.org/project/acryl-datahub/.
__package_name__ = "acryl-datahub-airflow-plugin"
-__version__ = "0.0.0.dev0"
+__version__ = "1!0.0.0.dev0"
def is_dev_mode() -> bool:
diff --git a/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/_airflow_compat.py b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/_airflow_compat.py
new file mode 100644
index 00000000000000..67c3348ec987cf
--- /dev/null
+++ b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/_airflow_compat.py
@@ -0,0 +1,12 @@
+# This module must be imported before any Airflow imports in any of our files.
+# The AIRFLOW_PATCHED just helps avoid flake8 errors.
+
+from datahub.utilities._markupsafe_compat import MARKUPSAFE_PATCHED
+
+assert MARKUPSAFE_PATCHED
+
+AIRFLOW_PATCHED = True
+
+__all__ = [
+ "AIRFLOW_PATCHED",
+]
diff --git a/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/_airflow_shims.py b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/_airflow_shims.py
new file mode 100644
index 00000000000000..5ad20e1f72551c
--- /dev/null
+++ b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/_airflow_shims.py
@@ -0,0 +1,29 @@
+from airflow.models.baseoperator import BaseOperator
+
+from datahub_airflow_plugin._airflow_compat import AIRFLOW_PATCHED
+
+try:
+ from airflow.models.mappedoperator import MappedOperator
+ from airflow.models.operator import Operator
+ from airflow.operators.empty import EmptyOperator
+except ModuleNotFoundError:
+ # Operator isn't a real class, but rather a type alias defined
+ # as the union of BaseOperator and MappedOperator.
+ # Since older versions of Airflow don't have MappedOperator, we can just use BaseOperator.
+ Operator = BaseOperator # type: ignore
+ MappedOperator = None # type: ignore
+ from airflow.operators.dummy import DummyOperator as EmptyOperator # type: ignore
+
+try:
+ from airflow.sensors.external_task import ExternalTaskSensor
+except ImportError:
+ from airflow.sensors.external_task_sensor import ExternalTaskSensor # type: ignore
+
+assert AIRFLOW_PATCHED
+
+__all__ = [
+ "Operator",
+ "MappedOperator",
+ "EmptyOperator",
+ "ExternalTaskSensor",
+]
diff --git a/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/_lineage_core.py b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/_lineage_core.py
new file mode 100644
index 00000000000000..d91c039ffa718d
--- /dev/null
+++ b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/_lineage_core.py
@@ -0,0 +1,115 @@
+from datetime import datetime
+from typing import TYPE_CHECKING, Dict, List
+
+import datahub.emitter.mce_builder as builder
+from datahub.api.entities.dataprocess.dataprocess_instance import InstanceRunResult
+from datahub.configuration.common import ConfigModel
+from datahub.utilities.urns.dataset_urn import DatasetUrn
+
+from datahub_airflow_plugin.client.airflow_generator import AirflowGenerator
+from datahub_airflow_plugin.entities import _Entity
+
+if TYPE_CHECKING:
+ from airflow import DAG
+ from airflow.models.dagrun import DagRun
+ from airflow.models.taskinstance import TaskInstance
+
+ from datahub_airflow_plugin._airflow_shims import Operator
+ from datahub_airflow_plugin.hooks.datahub import DatahubGenericHook
+
+
+def _entities_to_urn_list(iolets: List[_Entity]) -> List[DatasetUrn]:
+ return [DatasetUrn.create_from_string(let.urn) for let in iolets]
+
+
+class DatahubBasicLineageConfig(ConfigModel):
+ enabled: bool = True
+
+ # DataHub hook connection ID.
+ datahub_conn_id: str
+
+ # Cluster to associate with the pipelines and tasks. Defaults to "prod".
+ cluster: str = builder.DEFAULT_FLOW_CLUSTER
+
+    # If true, the owners field of the DAG will be captured as a DataHub corpuser.
+ capture_ownership_info: bool = True
+
+ # If true, the tags field of the DAG will be captured as DataHub tags.
+ capture_tags_info: bool = True
+
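+    # If true, also emit DataProcessInstance run events (start/completion) for each task run.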
+ capture_executions: bool = False
+
+ def make_emitter_hook(self) -> "DatahubGenericHook":
+ # This is necessary to avoid issues with circular imports.
+ from datahub_airflow_plugin.hooks.datahub import DatahubGenericHook
+
+ return DatahubGenericHook(self.datahub_conn_id)
+
+
+def send_lineage_to_datahub(
+ config: DatahubBasicLineageConfig,
+ operator: "Operator",
+ inlets: List[_Entity],
+ outlets: List[_Entity],
+ context: Dict,
+) -> None:
+ if not config.enabled:
+ return
+
+ dag: "DAG" = context["dag"]
+ task: "Operator" = context["task"]
+ ti: "TaskInstance" = context["task_instance"]
+
+ hook = config.make_emitter_hook()
+ emitter = hook.make_emitter()
+
+ dataflow = AirflowGenerator.generate_dataflow(
+ cluster=config.cluster,
+ dag=dag,
+ capture_tags=config.capture_tags_info,
+ capture_owner=config.capture_ownership_info,
+ )
+ dataflow.emit(emitter)
+ operator.log.info(f"Emitted from Lineage: {dataflow}")
+
+ datajob = AirflowGenerator.generate_datajob(
+ cluster=config.cluster,
+ task=task,
+ dag=dag,
+ capture_tags=config.capture_tags_info,
+ capture_owner=config.capture_ownership_info,
+ )
+ datajob.inlets.extend(_entities_to_urn_list(inlets))
+ datajob.outlets.extend(_entities_to_urn_list(outlets))
+
+ datajob.emit(emitter)
+ operator.log.info(f"Emitted from Lineage: {datajob}")
+
+ if config.capture_executions:
+ dag_run: "DagRun" = context["dag_run"]
+
+ dpi = AirflowGenerator.run_datajob(
+ emitter=emitter,
+ cluster=config.cluster,
+ ti=ti,
+ dag=dag,
+ dag_run=dag_run,
+ datajob=datajob,
+ emit_templates=False,
+ )
+
+ operator.log.info(f"Emitted from Lineage: {dpi}")
+
+ dpi = AirflowGenerator.complete_datajob(
+ emitter=emitter,
+ cluster=config.cluster,
+ ti=ti,
+ dag=dag,
+ dag_run=dag_run,
+ datajob=datajob,
+ result=InstanceRunResult.SUCCESS,
+ end_timestamp_millis=int(datetime.utcnow().timestamp() * 1000),
+ )
+ operator.log.info(f"Emitted from Lineage: {dpi}")
+
+ emitter.flush()
diff --git a/metadata-ingestion/src/datahub/ingestion/source/azure/__init__.py b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/client/__init__.py
similarity index 100%
rename from metadata-ingestion/src/datahub/ingestion/source/azure/__init__.py
rename to metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/client/__init__.py
diff --git a/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/client/airflow_generator.py b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/client/airflow_generator.py
new file mode 100644
index 00000000000000..b5e86e14d85d0f
--- /dev/null
+++ b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/client/airflow_generator.py
@@ -0,0 +1,512 @@
+from typing import TYPE_CHECKING, Dict, List, Optional, Set, Union, cast
+
+from airflow.configuration import conf
+from datahub.api.entities.datajob import DataFlow, DataJob
+from datahub.api.entities.dataprocess.dataprocess_instance import (
+ DataProcessInstance,
+ InstanceRunResult,
+)
+from datahub.metadata.schema_classes import DataProcessTypeClass
+from datahub.utilities.urns.data_flow_urn import DataFlowUrn
+from datahub.utilities.urns.data_job_urn import DataJobUrn
+
+from datahub_airflow_plugin._airflow_compat import AIRFLOW_PATCHED
+
+assert AIRFLOW_PATCHED
+
+if TYPE_CHECKING:
+ from airflow import DAG
+ from airflow.models import DagRun, TaskInstance
+ from datahub.emitter.kafka_emitter import DatahubKafkaEmitter
+ from datahub.emitter.rest_emitter import DatahubRestEmitter
+
+ from datahub_airflow_plugin._airflow_shims import Operator
+
+
+def _task_downstream_task_ids(operator: "Operator") -> Set[str]:
+ if hasattr(operator, "downstream_task_ids"):
+ return operator.downstream_task_ids
+ return operator._downstream_task_id # type: ignore[attr-defined,union-attr]
+
+
+class AirflowGenerator:
+ @staticmethod
+ def _get_dependencies(
+ task: "Operator", dag: "DAG", flow_urn: DataFlowUrn
+ ) -> List[DataJobUrn]:
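+        # Collect upstream DataJob URNs from four sources: plain upstream tasks, leaf tasks of
+        # upstream subdags, parent-DAG tasks that trigger the enclosing subdag, and remote tasks
+        # referenced by ExternalTaskSensor operators.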
+ from datahub_airflow_plugin._airflow_shims import ExternalTaskSensor
+
+ # resolve URNs for upstream nodes in subdags upstream of the current task.
+ upstream_subdag_task_urns: List[DataJobUrn] = []
+
+ for upstream_task_id in task.upstream_task_ids:
+ upstream_task = dag.task_dict[upstream_task_id]
+
+ # if upstream task is not a subdag, then skip it
+ upstream_subdag = getattr(upstream_task, "subdag", None)
+ if upstream_subdag is None:
+ continue
+
+ # else, link the leaf tasks of the upstream subdag as upstream tasks
+ for upstream_subdag_task_id in upstream_subdag.task_dict:
+ upstream_subdag_task = upstream_subdag.task_dict[
+ upstream_subdag_task_id
+ ]
+
+ upstream_subdag_task_urn = DataJobUrn.create_from_ids(
+ job_id=upstream_subdag_task_id, data_flow_urn=str(flow_urn)
+ )
+
+ # if subdag task is a leaf task, then link it as an upstream task
+ if len(_task_downstream_task_ids(upstream_subdag_task)) == 0:
+ upstream_subdag_task_urns.append(upstream_subdag_task_urn)
+
+ # resolve URNs for upstream nodes that trigger the subdag containing the current task.
+ # (if it is in a subdag at all)
+ upstream_subdag_triggers: List[DataJobUrn] = []
+
+ # subdags are always named with 'parent.child' style or Airflow won't run them
+ # add connection from subdag trigger(s) if subdag task has no upstreams
+ if (
+ dag.is_subdag
+ and dag.parent_dag is not None
+ and len(task.upstream_task_ids) == 0
+ ):
+ # filter through the parent dag's tasks and find the subdag trigger(s)
+ subdags = [
+ x for x in dag.parent_dag.task_dict.values() if x.subdag is not None
+ ]
+ matched_subdags = [
+ x for x in subdags if x.subdag and x.subdag.dag_id == dag.dag_id
+ ]
+
+ # id of the task containing the subdag
+ subdag_task_id = matched_subdags[0].task_id
+
+ # iterate through the parent dag's tasks and find the ones that trigger the subdag
+ for upstream_task_id in dag.parent_dag.task_dict:
+ upstream_task = dag.parent_dag.task_dict[upstream_task_id]
+ upstream_task_urn = DataJobUrn.create_from_ids(
+ data_flow_urn=str(flow_urn), job_id=upstream_task_id
+ )
+
+ # if the task triggers the subdag, link it to this node in the subdag
+ if subdag_task_id in _task_downstream_task_ids(upstream_task):
+ upstream_subdag_triggers.append(upstream_task_urn)
+
+ # If the operator is an ExternalTaskSensor then we set the remote task as upstream.
+        # An external sensor can be tied to a whole DAG when external_task_id is omitted, but
+        # we currently cannot tie one jobflow to another jobflow.
+ external_task_upstreams = []
+ if task.task_type == "ExternalTaskSensor":
+ task = cast(ExternalTaskSensor, task)
+ if hasattr(task, "external_task_id") and task.external_task_id is not None:
+ external_task_upstreams = [
+ DataJobUrn.create_from_ids(
+ job_id=task.external_task_id,
+ data_flow_urn=str(
+ DataFlowUrn.create_from_ids(
+ orchestrator=flow_urn.get_orchestrator_name(),
+ flow_id=task.external_dag_id,
+ env=flow_urn.get_env(),
+ )
+ ),
+ )
+ ]
+ # exclude subdag operator tasks since these are not emitted, resulting in empty metadata
+ upstream_tasks = (
+ [
+ DataJobUrn.create_from_ids(job_id=task_id, data_flow_urn=str(flow_urn))
+ for task_id in task.upstream_task_ids
+ if getattr(dag.task_dict[task_id], "subdag", None) is None
+ ]
+ + upstream_subdag_task_urns
+ + upstream_subdag_triggers
+ + external_task_upstreams
+ )
+ return upstream_tasks
+
+ @staticmethod
+ def generate_dataflow(
+ cluster: str,
+ dag: "DAG",
+ capture_owner: bool = True,
+ capture_tags: bool = True,
+ ) -> DataFlow:
+ """
+ Generates a Dataflow object from an Airflow DAG
+ :param cluster: str - name of the cluster
+ :param dag: DAG -
+ :param capture_tags:
+ :param capture_owner:
+ :return: DataFlow - Data generated dataflow
+ """
+ id = dag.dag_id
+ orchestrator = "airflow"
+ description = f"{dag.description}\n\n{dag.doc_md or ''}"
+ data_flow = DataFlow(
+ env=cluster, id=id, orchestrator=orchestrator, description=description
+ )
+
+ flow_property_bag: Dict[str, str] = {}
+
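+        # Only a fixed allowlist of DAG attributes is captured as custom properties on the DataFlow.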
+ allowed_flow_keys = [
+ "_access_control",
+ "_concurrency",
+ "_default_view",
+ "catchup",
+ "fileloc",
+ "is_paused_upon_creation",
+ "start_date",
+ "tags",
+ "timezone",
+ ]
+
+ for key in allowed_flow_keys:
+ if hasattr(dag, key):
+ flow_property_bag[key] = repr(getattr(dag, key))
+
+ data_flow.properties = flow_property_bag
+ base_url = conf.get("webserver", "base_url")
+ data_flow.url = f"{base_url}/tree?dag_id={dag.dag_id}"
+
+ if capture_owner and dag.owner:
+ data_flow.owners.add(dag.owner)
+
+ if capture_tags and dag.tags:
+ data_flow.tags.update(dag.tags)
+
+ return data_flow
+
+ @staticmethod
+ def _get_description(task: "Operator") -> Optional[str]:
+ from airflow.models.baseoperator import BaseOperator
+
+ if not isinstance(task, BaseOperator):
+ # TODO: Get docs for mapped operators.
+ return None
+
+ if hasattr(task, "doc") and task.doc:
+ return task.doc
+ elif hasattr(task, "doc_md") and task.doc_md:
+ return task.doc_md
+ elif hasattr(task, "doc_json") and task.doc_json:
+ return task.doc_json
+ elif hasattr(task, "doc_yaml") and task.doc_yaml:
+ return task.doc_yaml
+ elif hasattr(task, "doc_rst") and task.doc_yaml:
+ return task.doc_yaml
+ return None
+
+ @staticmethod
+ def generate_datajob(
+ cluster: str,
+ task: "Operator",
+ dag: "DAG",
+ set_dependencies: bool = True,
+ capture_owner: bool = True,
+ capture_tags: bool = True,
+ ) -> DataJob:
+ """
+
+ :param cluster: str
+ :param task: TaskIntance
+ :param dag: DAG
+ :param set_dependencies: bool - whether to extract dependencies from airflow task
+ :param capture_owner: bool - whether to extract owner from airflow task
+ :param capture_tags: bool - whether to set tags automatically from airflow task
+ :return: DataJob - returns the generated DataJob object
+ """
+ dataflow_urn = DataFlowUrn.create_from_ids(
+ orchestrator="airflow", env=cluster, flow_id=dag.dag_id
+ )
+ datajob = DataJob(id=task.task_id, flow_urn=dataflow_urn)
+
+ # TODO add support for MappedOperator
+ datajob.description = AirflowGenerator._get_description(task)
+
+ job_property_bag: Dict[str, str] = {}
+
+ allowed_task_keys = [
+ "_downstream_task_ids",
+ "_inlets",
+ "_outlets",
+ "_task_type",
+ "_task_module",
+ "depends_on_past",
+ "email",
+ "label",
+ "execution_timeout",
+ "sla",
+ "sql",
+ "task_id",
+ "trigger_rule",
+ "wait_for_downstream",
+ # In Airflow 2.3, _downstream_task_ids was renamed to downstream_task_ids
+ "downstream_task_ids",
+ # In Airflow 2.4, _inlets and _outlets were removed in favor of non-private versions.
+ "inlets",
+ "outlets",
+ ]
+
+ for key in allowed_task_keys:
+ if hasattr(task, key):
+ job_property_bag[key] = repr(getattr(task, key))
+
+ datajob.properties = job_property_bag
+ base_url = conf.get("webserver", "base_url")
+ datajob.url = f"{base_url}/taskinstance/list/?flt1_dag_id_equals={datajob.flow_urn.get_flow_id()}&_flt_3_task_id={task.task_id}"
+
+ if capture_owner and dag.owner:
+ datajob.owners.add(dag.owner)
+
+ if capture_tags and dag.tags:
+ datajob.tags.update(dag.tags)
+
+ if set_dependencies:
+ datajob.upstream_urns.extend(
+ AirflowGenerator._get_dependencies(
+ task=task, dag=dag, flow_urn=datajob.flow_urn
+ )
+ )
+
+ return datajob
+
+ @staticmethod
+ def create_datajob_instance(
+ cluster: str,
+ task: "Operator",
+ dag: "DAG",
+ data_job: Optional[DataJob] = None,
+ ) -> DataProcessInstance:
+ if data_job is None:
+ data_job = AirflowGenerator.generate_datajob(cluster, task=task, dag=dag)
+ dpi = DataProcessInstance.from_datajob(
+ datajob=data_job, id=task.task_id, clone_inlets=True, clone_outlets=True
+ )
+ return dpi
+
+ @staticmethod
+ def run_dataflow(
+ emitter: Union["DatahubRestEmitter", "DatahubKafkaEmitter"],
+ cluster: str,
+ dag_run: "DagRun",
+ start_timestamp_millis: Optional[int] = None,
+ dataflow: Optional[DataFlow] = None,
+ ) -> None:
+ if dataflow is None:
+ assert dag_run.dag
+ dataflow = AirflowGenerator.generate_dataflow(cluster, dag_run.dag)
+
+ if start_timestamp_millis is None:
+ assert dag_run.execution_date
+ start_timestamp_millis = int(dag_run.execution_date.timestamp() * 1000)
+
+ assert dag_run.run_id
+ dpi = DataProcessInstance.from_dataflow(dataflow=dataflow, id=dag_run.run_id)
+
+ # This property only exists in Airflow2
+ if hasattr(dag_run, "run_type"):
+ from airflow.utils.types import DagRunType
+
+ if dag_run.run_type == DagRunType.SCHEDULED:
+ dpi.type = DataProcessTypeClass.BATCH_SCHEDULED
+ elif dag_run.run_type == DagRunType.MANUAL:
+ dpi.type = DataProcessTypeClass.BATCH_AD_HOC
+ else:
+ if dag_run.run_id.startswith("scheduled__"):
+ dpi.type = DataProcessTypeClass.BATCH_SCHEDULED
+ else:
+ dpi.type = DataProcessTypeClass.BATCH_AD_HOC
+
+ property_bag: Dict[str, str] = {}
+ property_bag["run_id"] = str(dag_run.run_id)
+ property_bag["execution_date"] = str(dag_run.execution_date)
+ property_bag["end_date"] = str(dag_run.end_date)
+ property_bag["start_date"] = str(dag_run.start_date)
+ property_bag["creating_job_id"] = str(dag_run.creating_job_id)
+        # These properties only exist in Airflow >= 2.2.0
+ if hasattr(dag_run, "data_interval_start") and hasattr(
+ dag_run, "data_interval_end"
+ ):
+ property_bag["data_interval_start"] = str(dag_run.data_interval_start)
+ property_bag["data_interval_end"] = str(dag_run.data_interval_end)
+ property_bag["external_trigger"] = str(dag_run.external_trigger)
+ dpi.properties.update(property_bag)
+
+ dpi.emit_process_start(
+ emitter=emitter, start_timestamp_millis=start_timestamp_millis
+ )
+
+ @staticmethod
+ def complete_dataflow(
+ emitter: Union["DatahubRestEmitter", "DatahubKafkaEmitter"],
+ cluster: str,
+ dag_run: "DagRun",
+ end_timestamp_millis: Optional[int] = None,
+ dataflow: Optional[DataFlow] = None,
+ ) -> None:
+ """
+
+ :param emitter: DatahubRestEmitter - the datahub rest emitter to emit the generated mcps
+ :param cluster: str - name of the cluster
+ :param dag_run: DagRun
+ :param end_timestamp_millis: Optional[int] - the completion time in milliseconds if not set the current time will be used.
+ :param dataflow: Optional[Dataflow]
+ """
+ if dataflow is None:
+ assert dag_run.dag
+ dataflow = AirflowGenerator.generate_dataflow(cluster, dag_run.dag)
+
+ assert dag_run.run_id
+ dpi = DataProcessInstance.from_dataflow(dataflow=dataflow, id=dag_run.run_id)
+ if end_timestamp_millis is None:
+ if dag_run.end_date is None:
+ raise Exception(
+ f"Dag {dag_run.dag_id}_{dag_run.run_id} is still running and unable to get end_date..."
+ )
+ end_timestamp_millis = int(dag_run.end_date.timestamp() * 1000)
+
+ # We should use DagRunState but it is not available in Airflow 1
+ if dag_run.state == "success":
+ result = InstanceRunResult.SUCCESS
+ elif dag_run.state == "failed":
+ result = InstanceRunResult.FAILURE
+ else:
+ raise Exception(
+ f"Result should be either success or failure and it was {dag_run.state}"
+ )
+
+ dpi.emit_process_end(
+ emitter=emitter,
+ end_timestamp_millis=end_timestamp_millis,
+ result=result,
+ result_type="airflow",
+ )
+
+ @staticmethod
+ def run_datajob(
+ emitter: Union["DatahubRestEmitter", "DatahubKafkaEmitter"],
+ cluster: str,
+ ti: "TaskInstance",
+ dag: "DAG",
+ dag_run: "DagRun",
+ start_timestamp_millis: Optional[int] = None,
+ datajob: Optional[DataJob] = None,
+ attempt: Optional[int] = None,
+ emit_templates: bool = True,
+ ) -> DataProcessInstance:
+ if datajob is None:
+ datajob = AirflowGenerator.generate_datajob(cluster, ti.task, dag)
+
+ assert dag_run.run_id
+ dpi = DataProcessInstance.from_datajob(
+ datajob=datajob,
+ id=f"{dag.dag_id}_{ti.task_id}_{dag_run.run_id}",
+ clone_inlets=True,
+ clone_outlets=True,
+ )
+ job_property_bag: Dict[str, str] = {}
+ job_property_bag["run_id"] = str(dag_run.run_id)
+ job_property_bag["duration"] = str(ti.duration)
+ job_property_bag["start_date"] = str(ti.start_date)
+ job_property_bag["end_date"] = str(ti.end_date)
+ job_property_bag["execution_date"] = str(ti.execution_date)
+ job_property_bag["try_number"] = str(ti.try_number - 1)
+ job_property_bag["hostname"] = str(ti.hostname)
+ job_property_bag["max_tries"] = str(ti.max_tries)
+ # Not compatible with Airflow 1
+ if hasattr(ti, "external_executor_id"):
+ job_property_bag["external_executor_id"] = str(ti.external_executor_id)
+ job_property_bag["pid"] = str(ti.pid)
+ job_property_bag["state"] = str(ti.state)
+ job_property_bag["operator"] = str(ti.operator)
+ job_property_bag["priority_weight"] = str(ti.priority_weight)
+ job_property_bag["unixname"] = str(ti.unixname)
+ job_property_bag["log_url"] = ti.log_url
+ dpi.properties.update(job_property_bag)
+ dpi.url = ti.log_url
+
+ # This property only exists in Airflow2
+ if hasattr(ti, "dag_run") and hasattr(ti.dag_run, "run_type"):
+ from airflow.utils.types import DagRunType
+
+ if ti.dag_run.run_type == DagRunType.SCHEDULED:
+ dpi.type = DataProcessTypeClass.BATCH_SCHEDULED
+ elif ti.dag_run.run_type == DagRunType.MANUAL:
+ dpi.type = DataProcessTypeClass.BATCH_AD_HOC
+ else:
+ if dag_run.run_id.startswith("scheduled__"):
+ dpi.type = DataProcessTypeClass.BATCH_SCHEDULED
+ else:
+ dpi.type = DataProcessTypeClass.BATCH_AD_HOC
+
+ if start_timestamp_millis is None:
+ assert ti.start_date
+ start_timestamp_millis = int(ti.start_date.timestamp() * 1000)
+
+ if attempt is None:
+ attempt = ti.try_number
+
+ dpi.emit_process_start(
+ emitter=emitter,
+ start_timestamp_millis=start_timestamp_millis,
+ attempt=attempt,
+ emit_template=emit_templates,
+ )
+ return dpi
+
+ @staticmethod
+ def complete_datajob(
+ emitter: Union["DatahubRestEmitter", "DatahubKafkaEmitter"],
+ cluster: str,
+ ti: "TaskInstance",
+ dag: "DAG",
+ dag_run: "DagRun",
+ end_timestamp_millis: Optional[int] = None,
+ result: Optional[InstanceRunResult] = None,
+ datajob: Optional[DataJob] = None,
+ ) -> DataProcessInstance:
+ """
+
+ :param emitter: DatahubRestEmitter
+ :param cluster: str
+ :param ti: TaskInstance
+ :param dag: DAG
+ :param dag_run: DagRun
+ :param end_timestamp_millis: Optional[int]
+ :param result: Optional[str] One of the result from datahub.metadata.schema_class.RunResultTypeClass
+ :param datajob: Optional[DataJob]
+ :return: DataProcessInstance
+ """
+ if datajob is None:
+ datajob = AirflowGenerator.generate_datajob(cluster, ti.task, dag)
+
+ if end_timestamp_millis is None:
+ assert ti.end_date
+ end_timestamp_millis = int(ti.end_date.timestamp() * 1000)
+
+ if result is None:
+ # We should use TaskInstanceState but it is not available in Airflow 1
+ if ti.state == "success":
+ result = InstanceRunResult.SUCCESS
+ elif ti.state == "failed":
+ result = InstanceRunResult.FAILURE
+ else:
+ raise Exception(
+ f"Result should be either success or failure and it was {ti.state}"
+ )
+
+ dpi = DataProcessInstance.from_datajob(
+ datajob=datajob,
+ id=f"{dag.dag_id}_{ti.task_id}_{dag_run.run_id}",
+ clone_inlets=True,
+ clone_outlets=True,
+ )
+ dpi.emit_process_end(
+ emitter=emitter,
+ end_timestamp_millis=end_timestamp_millis,
+ result=result,
+ result_type="airflow",
+ )
+ return dpi
diff --git a/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/datahub_plugin.py b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/datahub_plugin.py
index 226a7382f75954..d1cec9e5c1b54f 100644
--- a/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/datahub_plugin.py
+++ b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/datahub_plugin.py
@@ -1,4 +1,367 @@
-# This package serves as a shim, but the actual implementation lives in datahub_provider
-# from the acryl-datahub package. We leave this shim here to avoid breaking existing
-# Airflow installs.
-from datahub_provider._plugin import DatahubPlugin # noqa: F401
+import contextlib
+import logging
+import traceback
+from typing import Any, Callable, Iterable, List, Optional, Union
+
+from airflow.configuration import conf
+from airflow.lineage import PIPELINE_OUTLETS
+from airflow.models.baseoperator import BaseOperator
+from airflow.plugins_manager import AirflowPlugin
+from airflow.utils.module_loading import import_string
+from cattr import structure
+from datahub.api.entities.dataprocess.dataprocess_instance import InstanceRunResult
+
+from datahub_airflow_plugin._airflow_compat import AIRFLOW_PATCHED
+from datahub_airflow_plugin._airflow_shims import MappedOperator, Operator
+from datahub_airflow_plugin.client.airflow_generator import AirflowGenerator
+from datahub_airflow_plugin.hooks.datahub import DatahubGenericHook
+from datahub_airflow_plugin.lineage.datahub import DatahubLineageConfig
+
+assert AIRFLOW_PATCHED
+logger = logging.getLogger(__name__)
+
+TASK_ON_FAILURE_CALLBACK = "on_failure_callback"
+TASK_ON_SUCCESS_CALLBACK = "on_success_callback"
+
+
+def get_lineage_config() -> DatahubLineageConfig:
+ """Load the lineage config from airflow.cfg."""
+
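+    # These values come from the [datahub] section of airflow.cfg (or the matching
+    # AIRFLOW__DATAHUB__* environment variables).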
+ enabled = conf.get("datahub", "enabled", fallback=True)
+ datahub_conn_id = conf.get("datahub", "conn_id", fallback="datahub_rest_default")
+ cluster = conf.get("datahub", "cluster", fallback="prod")
+ graceful_exceptions = conf.get("datahub", "graceful_exceptions", fallback=True)
+ capture_tags_info = conf.get("datahub", "capture_tags_info", fallback=True)
+ capture_ownership_info = conf.get(
+ "datahub", "capture_ownership_info", fallback=True
+ )
+ capture_executions = conf.get("datahub", "capture_executions", fallback=True)
+ return DatahubLineageConfig(
+ enabled=enabled,
+ datahub_conn_id=datahub_conn_id,
+ cluster=cluster,
+ graceful_exceptions=graceful_exceptions,
+ capture_ownership_info=capture_ownership_info,
+ capture_tags_info=capture_tags_info,
+ capture_executions=capture_executions,
+ )
+
+
+def _task_inlets(operator: "Operator") -> List:
+    # Airflow 2.4 dropped _inlets in favor of inlets; on older versions we have to read the private _inlets attribute.
+ if hasattr(operator, "_inlets"):
+ return operator._inlets # type: ignore[attr-defined, union-attr]
+ return operator.inlets
+
+
+def _task_outlets(operator: "Operator") -> List:
+    # Airflow 2.4 dropped _outlets in favor of outlets; on older versions we have to read the
+    # private _outlets attribute, since outlets is empty in Airflow < 2.4.0.
+ if hasattr(operator, "_outlets"):
+ return operator._outlets # type: ignore[attr-defined, union-attr]
+ return operator.outlets
+
+
+def get_inlets_from_task(task: BaseOperator, context: Any) -> Iterable[Any]:
+ # TODO: Fix for https://github.com/apache/airflow/commit/1b1f3fabc5909a447a6277cafef3a0d4ef1f01ae
+ # in Airflow 2.4.
+ # TODO: ignore/handle airflow's dataset type in our lineage
+
+ inlets: List[Any] = []
+ task_inlets = _task_inlets(task)
+    # From Airflow 2.3 this should be AbstractOperator, but for compatibility reasons we use BaseOperator.
+ if isinstance(task_inlets, (str, BaseOperator)):
+ inlets = [
+ task_inlets,
+ ]
+
+ if task_inlets and isinstance(task_inlets, list):
+ inlets = []
+ task_ids = (
+ {o for o in task_inlets if isinstance(o, str)}
+ .union(op.task_id for op in task_inlets if isinstance(op, BaseOperator))
+ .intersection(task.get_flat_relative_ids(upstream=True))
+ )
+
+ from airflow.lineage import AUTO
+
+ # pick up unique direct upstream task_ids if AUTO is specified
+ if AUTO.upper() in task_inlets or AUTO.lower() in task_inlets:
+ print("Picking up unique direct upstream task_ids as AUTO is specified")
+ task_ids = task_ids.union(
+ task_ids.symmetric_difference(task.upstream_task_ids)
+ )
+
+ inlets = task.xcom_pull(
+ context, task_ids=list(task_ids), dag_id=task.dag_id, key=PIPELINE_OUTLETS
+ )
+
+ # re-instantiate the obtained inlets
+ inlets = [
+ structure(item["data"], import_string(item["type_name"]))
+ # _get_instance(structure(item, Metadata))
+ for sublist in inlets
+ if sublist
+ for item in sublist
+ ]
+
+ for inlet in task_inlets:
+ if not isinstance(inlet, str):
+ inlets.append(inlet)
+
+ return inlets
+
+
+def _make_emit_callback(
+ logger: logging.Logger,
+) -> Callable[[Optional[Exception], str], None]:
+ def emit_callback(err: Optional[Exception], msg: str) -> None:
+ if err:
+ logger.error(f"Error sending metadata to datahub: {msg}", exc_info=err)
+
+ return emit_callback
+
+
+def datahub_task_status_callback(context, status):
+ ti = context["ti"]
+ task: "BaseOperator" = ti.task
+ dag = context["dag"]
+
+ # This code is from the original airflow lineage code ->
+ # https://github.com/apache/airflow/blob/main/airflow/lineage/__init__.py
+ inlets = get_inlets_from_task(task, context)
+
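+    # Build an emitter (REST or Kafka) from the DataHub connection configured in airflow.cfg.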
+ emitter = (
+ DatahubGenericHook(context["_datahub_config"].datahub_conn_id)
+ .get_underlying_hook()
+ .make_emitter()
+ )
+
+ dataflow = AirflowGenerator.generate_dataflow(
+ cluster=context["_datahub_config"].cluster,
+ dag=dag,
+ capture_tags=context["_datahub_config"].capture_tags_info,
+ capture_owner=context["_datahub_config"].capture_ownership_info,
+ )
+ task.log.info(f"Emitting Datahub Dataflow: {dataflow}")
+ dataflow.emit(emitter, callback=_make_emit_callback(task.log))
+
+ datajob = AirflowGenerator.generate_datajob(
+ cluster=context["_datahub_config"].cluster,
+ task=task,
+ dag=dag,
+ capture_tags=context["_datahub_config"].capture_tags_info,
+ capture_owner=context["_datahub_config"].capture_ownership_info,
+ )
+
+ for inlet in inlets:
+ datajob.inlets.append(inlet.urn)
+
+ task_outlets = _task_outlets(task)
+ for outlet in task_outlets:
+ datajob.outlets.append(outlet.urn)
+
+ task.log.info(f"Emitting Datahub Datajob: {datajob}")
+ datajob.emit(emitter, callback=_make_emit_callback(task.log))
+
+ if context["_datahub_config"].capture_executions:
+ dpi = AirflowGenerator.run_datajob(
+ emitter=emitter,
+ cluster=context["_datahub_config"].cluster,
+ ti=context["ti"],
+ dag=dag,
+ dag_run=context["dag_run"],
+ datajob=datajob,
+ start_timestamp_millis=int(ti.start_date.timestamp() * 1000),
+ )
+
+ task.log.info(f"Emitted Start Datahub Dataprocess Instance: {dpi}")
+
+ dpi = AirflowGenerator.complete_datajob(
+ emitter=emitter,
+ cluster=context["_datahub_config"].cluster,
+ ti=context["ti"],
+ dag_run=context["dag_run"],
+ result=status,
+ dag=dag,
+ datajob=datajob,
+ end_timestamp_millis=int(ti.end_date.timestamp() * 1000),
+ )
+ task.log.info(f"Emitted Completed Data Process Instance: {dpi}")
+
+ emitter.flush()
+
+
+def datahub_pre_execution(context):
+ ti = context["ti"]
+ task: "BaseOperator" = ti.task
+ dag = context["dag"]
+
+ task.log.info("Running Datahub pre_execute method")
+
+ emitter = (
+ DatahubGenericHook(context["_datahub_config"].datahub_conn_id)
+ .get_underlying_hook()
+ .make_emitter()
+ )
+
+ # This code is from the original airflow lineage code ->
+ # https://github.com/apache/airflow/blob/main/airflow/lineage/__init__.py
+ inlets = get_inlets_from_task(task, context)
+
+ datajob = AirflowGenerator.generate_datajob(
+ cluster=context["_datahub_config"].cluster,
+ task=context["ti"].task,
+ dag=dag,
+ capture_tags=context["_datahub_config"].capture_tags_info,
+ capture_owner=context["_datahub_config"].capture_ownership_info,
+ )
+
+ for inlet in inlets:
+ datajob.inlets.append(inlet.urn)
+
+ task_outlets = _task_outlets(task)
+
+ for outlet in task_outlets:
+ datajob.outlets.append(outlet.urn)
+
+ task.log.info(f"Emitting Datahub dataJob {datajob}")
+ datajob.emit(emitter, callback=_make_emit_callback(task.log))
+
+ if context["_datahub_config"].capture_executions:
+ dpi = AirflowGenerator.run_datajob(
+ emitter=emitter,
+ cluster=context["_datahub_config"].cluster,
+ ti=context["ti"],
+ dag=dag,
+ dag_run=context["dag_run"],
+ datajob=datajob,
+ start_timestamp_millis=int(ti.start_date.timestamp() * 1000),
+ )
+
+ task.log.info(f"Emitting Datahub Dataprocess Instance: {dpi}")
+
+ emitter.flush()
+
+
+def _wrap_pre_execution(pre_execution):
+ def custom_pre_execution(context):
+ config = get_lineage_config()
+ if config.enabled:
+ context["_datahub_config"] = config
+ datahub_pre_execution(context)
+
+ # Call original policy
+ if pre_execution:
+ pre_execution(context)
+
+ return custom_pre_execution
+
+
+def _wrap_on_failure_callback(on_failure_callback):
+ def custom_on_failure_callback(context):
+ config = get_lineage_config()
+ if config.enabled:
+ context["_datahub_config"] = config
+ try:
+ datahub_task_status_callback(context, status=InstanceRunResult.FAILURE)
+ except Exception as e:
+ if not config.graceful_exceptions:
+ raise e
+ else:
+ print(f"Exception: {traceback.format_exc()}")
+
+ # Call original policy
+ if on_failure_callback:
+ on_failure_callback(context)
+
+ return custom_on_failure_callback
+
+
+def _wrap_on_success_callback(on_success_callback):
+ def custom_on_success_callback(context):
+ config = get_lineage_config()
+ if config.enabled:
+ context["_datahub_config"] = config
+ try:
+ datahub_task_status_callback(context, status=InstanceRunResult.SUCCESS)
+ except Exception as e:
+ if not config.graceful_exceptions:
+ raise e
+ else:
+ print(f"Exception: {traceback.format_exc()}")
+
+ # Call original policy
+ if on_success_callback:
+ on_success_callback(context)
+
+ return custom_on_success_callback
+
+
+def task_policy(task: Union[BaseOperator, MappedOperator]) -> None:
+ task.log.debug(f"Setting task policy for Dag: {task.dag_id} Task: {task.task_id}")
+ # task.add_inlets(["auto"])
+ # task.pre_execute = _wrap_pre_execution(task.pre_execute)
+
+ # MappedOperator's callbacks don't have setters until Airflow 2.X.X
+ # https://github.com/apache/airflow/issues/24547
+ # We can bypass this by going through partial_kwargs for now
+ if MappedOperator and isinstance(task, MappedOperator): # type: ignore
+ on_failure_callback_prop: property = getattr(
+ MappedOperator, TASK_ON_FAILURE_CALLBACK
+ )
+ on_success_callback_prop: property = getattr(
+ MappedOperator, TASK_ON_SUCCESS_CALLBACK
+ )
+ if not on_failure_callback_prop.fset or not on_success_callback_prop.fset:
+ task.log.debug(
+ "Using MappedOperator's partial_kwargs instead of callback properties"
+ )
+ task.partial_kwargs[TASK_ON_FAILURE_CALLBACK] = _wrap_on_failure_callback(
+ task.on_failure_callback
+ )
+ task.partial_kwargs[TASK_ON_SUCCESS_CALLBACK] = _wrap_on_success_callback(
+ task.on_success_callback
+ )
+ return
+
+ task.on_failure_callback = _wrap_on_failure_callback(task.on_failure_callback) # type: ignore
+ task.on_success_callback = _wrap_on_success_callback(task.on_success_callback) # type: ignore
+ # task.pre_execute = _wrap_pre_execution(task.pre_execute)
+
+
+def _wrap_task_policy(policy):
+ if policy and hasattr(policy, "_task_policy_patched_by"):
+ return policy
+
+ def custom_task_policy(task):
+ policy(task)
+ task_policy(task)
+
+ # Add a flag to the policy to indicate that we've patched it.
+ custom_task_policy._task_policy_patched_by = "datahub_plugin" # type: ignore[attr-defined]
+ return custom_task_policy
+
+
+def _patch_policy(settings):
+ if hasattr(settings, "task_policy"):
+ datahub_task_policy = _wrap_task_policy(settings.task_policy)
+ settings.task_policy = datahub_task_policy
+
+
+def _patch_datahub_policy():
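+    # Wrap Airflow's cluster policy (task_policy) so every task gets the DataHub success/failure
+    # callbacks attached, both when a custom airflow_local_settings module is present and via
+    # Airflow's global settings.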
+ with contextlib.suppress(ImportError):
+ import airflow_local_settings
+
+ _patch_policy(airflow_local_settings)
+
+ from airflow.models.dagbag import settings
+
+ _patch_policy(settings)
+
+
+_patch_datahub_policy()
+
+
+class DatahubPlugin(AirflowPlugin):
+ name = "datahub_plugin"
diff --git a/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/entities.py b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/entities.py
new file mode 100644
index 00000000000000..69f667cad3241d
--- /dev/null
+++ b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/entities.py
@@ -0,0 +1,47 @@
+from abc import abstractmethod
+from typing import Optional
+
+import attr
+import datahub.emitter.mce_builder as builder
+from datahub.utilities.urns.urn import guess_entity_type
+
+
+class _Entity:
+ @property
+ @abstractmethod
+ def urn(self) -> str:
+ pass
+
+
+@attr.s(auto_attribs=True, str=True)
+class Dataset(_Entity):
+ platform: str
+ name: str
+ env: str = builder.DEFAULT_ENV
+ platform_instance: Optional[str] = None
+
+ @property
+ def urn(self):
+ return builder.make_dataset_urn_with_platform_instance(
+ platform=self.platform,
+ name=self.name,
+ platform_instance=self.platform_instance,
+ env=self.env,
+ )
+
+
+@attr.s(str=True)
+class Urn(_Entity):
+ _urn: str = attr.ib()
+
+ @_urn.validator
+ def _validate_urn(self, attribute, value):
+ if not value.startswith("urn:"):
+ raise ValueError("invalid urn provided: urns must start with 'urn:'")
+ if guess_entity_type(value) != "dataset":
+ # This is because DataJobs only support Dataset lineage.
+ raise ValueError("Airflow lineage currently only supports datasets")
+
+ @property
+ def urn(self):
+ return self._urn
diff --git a/metadata-ingestion/src/datahub_provider/example_dags/.airflowignore b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/.airflowignore
similarity index 100%
rename from metadata-ingestion/src/datahub_provider/example_dags/.airflowignore
rename to metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/.airflowignore
diff --git a/metadata-ingestion/src/datahub/ingestion/source_report/sql/__init__.py b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/__init__.py
similarity index 100%
rename from metadata-ingestion/src/datahub/ingestion/source_report/sql/__init__.py
rename to metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/__init__.py
diff --git a/metadata-ingestion/src/datahub_provider/example_dags/generic_recipe_sample_dag.py b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/generic_recipe_sample_dag.py
similarity index 98%
rename from metadata-ingestion/src/datahub_provider/example_dags/generic_recipe_sample_dag.py
rename to metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/generic_recipe_sample_dag.py
index d0e4aa944e8401..ff8dba457066fd 100644
--- a/metadata-ingestion/src/datahub_provider/example_dags/generic_recipe_sample_dag.py
+++ b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/generic_recipe_sample_dag.py
@@ -9,7 +9,6 @@
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
-
from datahub.configuration.config_loader import load_config_file
from datahub.ingestion.run.pipeline import Pipeline
@@ -41,6 +40,7 @@ def datahub_recipe():
schedule_interval=timedelta(days=1),
start_date=days_ago(2),
catchup=False,
+ default_view="tree",
) as dag:
ingest_task = PythonOperator(
task_id="ingest_using_recipe",
diff --git a/metadata-ingestion/src/datahub_provider/example_dags/lineage_backend_demo.py b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/lineage_backend_demo.py
similarity index 94%
rename from metadata-ingestion/src/datahub_provider/example_dags/lineage_backend_demo.py
rename to metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/lineage_backend_demo.py
index 95b594e4052a54..3caea093b932d4 100644
--- a/metadata-ingestion/src/datahub_provider/example_dags/lineage_backend_demo.py
+++ b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/lineage_backend_demo.py
@@ -9,7 +9,7 @@
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago
-from datahub_provider.entities import Dataset, Urn
+from datahub_airflow_plugin.entities import Dataset, Urn
default_args = {
"owner": "airflow",
@@ -28,6 +28,7 @@
start_date=days_ago(2),
tags=["example_tag"],
catchup=False,
+ default_view="tree",
) as dag:
task1 = BashOperator(
task_id="run_data_task",
diff --git a/metadata-ingestion/src/datahub_provider/example_dags/lineage_backend_taskflow_demo.py b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/lineage_backend_taskflow_demo.py
similarity index 94%
rename from metadata-ingestion/src/datahub_provider/example_dags/lineage_backend_taskflow_demo.py
rename to metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/lineage_backend_taskflow_demo.py
index 1fe321eb5c80a4..ceb0f452b540a0 100644
--- a/metadata-ingestion/src/datahub_provider/example_dags/lineage_backend_taskflow_demo.py
+++ b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/lineage_backend_taskflow_demo.py
@@ -8,7 +8,7 @@
from airflow.decorators import dag, task
from airflow.utils.dates import days_ago
-from datahub_provider.entities import Dataset, Urn
+from datahub_airflow_plugin.entities import Dataset, Urn
default_args = {
"owner": "airflow",
@@ -26,6 +26,7 @@
start_date=days_ago(2),
tags=["example_tag"],
catchup=False,
+ default_view="tree",
)
def datahub_lineage_backend_taskflow_demo():
@task(
diff --git a/metadata-ingestion/src/datahub_provider/example_dags/lineage_emission_dag.py b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/lineage_emission_dag.py
similarity index 96%
rename from metadata-ingestion/src/datahub_provider/example_dags/lineage_emission_dag.py
rename to metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/lineage_emission_dag.py
index 153464246cef77..f40295c6bb883a 100644
--- a/metadata-ingestion/src/datahub_provider/example_dags/lineage_emission_dag.py
+++ b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/lineage_emission_dag.py
@@ -5,12 +5,12 @@
from datetime import timedelta
+import datahub.emitter.mce_builder as builder
from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator
from airflow.utils.dates import days_ago
-import datahub.emitter.mce_builder as builder
-from datahub_provider.operators.datahub import DatahubEmitterOperator
+from datahub_airflow_plugin.operators.datahub import DatahubEmitterOperator
default_args = {
"owner": "airflow",
@@ -31,6 +31,7 @@
schedule_interval=timedelta(days=1),
start_date=days_ago(2),
catchup=False,
+ default_view="tree",
) as dag:
# This example shows a SnowflakeOperator followed by a lineage emission. However, the
# same DatahubEmitterOperator can be used to emit lineage in any context.
diff --git a/metadata-ingestion/src/datahub_provider/example_dags/mysql_sample_dag.py b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/mysql_sample_dag.py
similarity index 98%
rename from metadata-ingestion/src/datahub_provider/example_dags/mysql_sample_dag.py
rename to metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/mysql_sample_dag.py
index 2c833e14256342..77b29711d76882 100644
--- a/metadata-ingestion/src/datahub_provider/example_dags/mysql_sample_dag.py
+++ b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/mysql_sample_dag.py
@@ -47,6 +47,7 @@ def ingest_from_mysql():
start_date=datetime(2022, 1, 1),
schedule_interval=timedelta(days=1),
catchup=False,
+ default_view="tree",
) as dag:
# While it is also possible to use the PythonOperator, we recommend using
# the PythonVirtualenvOperator to ensure that there are no dependency
diff --git a/metadata-ingestion/src/datahub_provider/example_dags/snowflake_sample_dag.py b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/snowflake_sample_dag.py
similarity index 99%
rename from metadata-ingestion/src/datahub_provider/example_dags/snowflake_sample_dag.py
rename to metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/snowflake_sample_dag.py
index c107bb479262cd..30e63b68e459fd 100644
--- a/metadata-ingestion/src/datahub_provider/example_dags/snowflake_sample_dag.py
+++ b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/snowflake_sample_dag.py
@@ -57,6 +57,7 @@ def ingest_from_snowflake(snowflake_credentials, datahub_gms_server):
start_date=datetime(2022, 1, 1),
schedule_interval=timedelta(days=1),
catchup=False,
+ default_view="tree",
) as dag:
# This example pulls credentials from Airflow's connection store.
# For this to work, you must have previously configured these connections in Airflow.
diff --git a/metadata-ingestion/src/datahub/ingestion/source_report/usage/__init__.py b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/hooks/__init__.py
similarity index 100%
rename from metadata-ingestion/src/datahub/ingestion/source_report/usage/__init__.py
rename to metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/hooks/__init__.py
diff --git a/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/hooks/datahub.py b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/hooks/datahub.py
new file mode 100644
index 00000000000000..aed858c6c4df0b
--- /dev/null
+++ b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/hooks/datahub.py
@@ -0,0 +1,214 @@
+from typing import TYPE_CHECKING, Any, Dict, List, Optional, Tuple, Union
+
+from airflow.exceptions import AirflowException
+from airflow.hooks.base import BaseHook
+from datahub.metadata.com.linkedin.pegasus2avro.mxe import (
+ MetadataChangeEvent,
+ MetadataChangeProposal,
+)
+
+if TYPE_CHECKING:
+ from airflow.models.connection import Connection
+ from datahub.emitter.kafka_emitter import DatahubKafkaEmitter
+ from datahub.emitter.rest_emitter import DatahubRestEmitter
+ from datahub.ingestion.sink.datahub_kafka import KafkaSinkConfig
+
+
+class DatahubRestHook(BaseHook):
+ """
+ Creates a DataHub Rest API connection used to send metadata to DataHub.
+    Takes the endpoint for your DataHub Rest API in the Server Endpoint (host) field.
+
+ URI example: ::
+
+ AIRFLOW_CONN_DATAHUB_REST_DEFAULT='datahub-rest://rest-endpoint'
+
+ :param datahub_rest_conn_id: Reference to the DataHub Rest connection.
+ :type datahub_rest_conn_id: str
+ """
+
+ conn_name_attr = "datahub_rest_conn_id"
+ default_conn_name = "datahub_rest_default"
+ conn_type = "datahub_rest"
+ hook_name = "DataHub REST Server"
+
+ def __init__(self, datahub_rest_conn_id: str = default_conn_name) -> None:
+ super().__init__()
+ self.datahub_rest_conn_id = datahub_rest_conn_id
+
+ @staticmethod
+ def get_connection_form_widgets() -> Dict[str, Any]:
+ return {}
+
+ @staticmethod
+ def get_ui_field_behaviour() -> Dict:
+ """Returns custom field behavior"""
+ return {
+ "hidden_fields": ["port", "schema", "login"],
+ "relabeling": {
+ "host": "Server Endpoint",
+ },
+ }
+
+ def _get_config(self) -> Tuple[str, Optional[str], Optional[int]]:
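+        # Returns (host, token, timeout_sec); the connection's password field is used as the
+        # DataHub access token.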
+ conn: "Connection" = self.get_connection(self.datahub_rest_conn_id)
+
+ host = conn.host
+ if not host:
+ raise AirflowException("host parameter is required")
+ if conn.port:
+ if ":" in host:
+ raise AirflowException(
+ "host parameter should not contain a port number if the port is specified separately"
+ )
+ host = f"{host}:{conn.port}"
+ password = conn.password
+ timeout_sec = conn.extra_dejson.get("timeout_sec")
+ return (host, password, timeout_sec)
+
+ def make_emitter(self) -> "DatahubRestEmitter":
+ import datahub.emitter.rest_emitter
+
+ return datahub.emitter.rest_emitter.DatahubRestEmitter(*self._get_config())
+
+ def emit_mces(self, mces: List[MetadataChangeEvent]) -> None:
+ emitter = self.make_emitter()
+
+ for mce in mces:
+ emitter.emit_mce(mce)
+
+ def emit_mcps(self, mcps: List[MetadataChangeProposal]) -> None:
+ emitter = self.make_emitter()
+
+        for mcp in mcps:
+            emitter.emit_mcp(mcp)
+
+
+class DatahubKafkaHook(BaseHook):
+ """
+ Creates a DataHub Kafka connection used to send metadata to DataHub.
+    Takes your Kafka broker in the Kafka Broker (host) field.
+
+ URI example: ::
+
+ AIRFLOW_CONN_DATAHUB_KAFKA_DEFAULT='datahub-kafka://kafka-broker'
+
+ :param datahub_kafka_conn_id: Reference to the DataHub Kafka connection.
+ :type datahub_kafka_conn_id: str
+ """
+
+ conn_name_attr = "datahub_kafka_conn_id"
+ default_conn_name = "datahub_kafka_default"
+ conn_type = "datahub_kafka"
+ hook_name = "DataHub Kafka Sink"
+
+ def __init__(self, datahub_kafka_conn_id: str = default_conn_name) -> None:
+ super().__init__()
+ self.datahub_kafka_conn_id = datahub_kafka_conn_id
+
+ @staticmethod
+ def get_connection_form_widgets() -> Dict[str, Any]:
+ return {}
+
+ @staticmethod
+ def get_ui_field_behaviour() -> Dict:
+ """Returns custom field behavior"""
+ return {
+ "hidden_fields": ["port", "schema", "login", "password"],
+ "relabeling": {
+ "host": "Kafka Broker",
+ },
+ }
+
+ def _get_config(self) -> "KafkaSinkConfig":
+ import datahub.ingestion.sink.datahub_kafka
+
+ conn = self.get_connection(self.datahub_kafka_conn_id)
+ obj = conn.extra_dejson
+ obj.setdefault("connection", {})
+ if conn.host is not None:
+ if "bootstrap" in obj["connection"]:
+ raise AirflowException(
+ "Kafka broker specified twice (present in host and extra)"
+ )
+ obj["connection"]["bootstrap"] = ":".join(
+ map(str, filter(None, [conn.host, conn.port]))
+ )
+ config = datahub.ingestion.sink.datahub_kafka.KafkaSinkConfig.parse_obj(obj)
+ return config
+
+ def make_emitter(self) -> "DatahubKafkaEmitter":
+ import datahub.emitter.kafka_emitter
+
+ sink_config = self._get_config()
+ return datahub.emitter.kafka_emitter.DatahubKafkaEmitter(sink_config)
+
+ def emit_mces(self, mces: List[MetadataChangeEvent]) -> None:
+ emitter = self.make_emitter()
+ errors = []
+
+ def callback(exc, msg):
+ if exc:
+ errors.append(exc)
+
+ for mce in mces:
+ emitter.emit_mce_async(mce, callback)
+
+ emitter.flush()
+
+ if errors:
+ raise AirflowException(f"failed to push some MCEs: {errors}")
+
+ def emit_mcps(self, mcps: List[MetadataChangeProposal]) -> None:
+ emitter = self.make_emitter()
+ errors = []
+
+ def callback(exc, msg):
+ if exc:
+ errors.append(exc)
+
+ for mcp in mcps:
+ emitter.emit_mcp_async(mcp, callback)
+
+ emitter.flush()
+
+ if errors:
+ raise AirflowException(f"failed to push some MCPs: {errors}")
+
+
+class DatahubGenericHook(BaseHook):
+ """
+ Emits Metadata Change Events using either the DatahubRestHook or the
+ DatahubKafkaHook. Set up a DataHub Rest or Kafka connection to use.
+
+ :param datahub_conn_id: Reference to the DataHub connection.
+ :type datahub_conn_id: str
+ """
+
+ def __init__(self, datahub_conn_id: str) -> None:
+ super().__init__()
+ self.datahub_conn_id = datahub_conn_id
+
+ def get_underlying_hook(self) -> Union[DatahubRestHook, DatahubKafkaHook]:
+ conn = self.get_connection(self.datahub_conn_id)
+
+ # We need to figure out the underlying hook type. First check the
+ # conn_type. If that fails, attempt to guess using the conn id name.
+ if conn.conn_type == DatahubRestHook.conn_type:
+ return DatahubRestHook(self.datahub_conn_id)
+ elif conn.conn_type == DatahubKafkaHook.conn_type:
+ return DatahubKafkaHook(self.datahub_conn_id)
+ elif "rest" in self.datahub_conn_id:
+ return DatahubRestHook(self.datahub_conn_id)
+ elif "kafka" in self.datahub_conn_id:
+ return DatahubKafkaHook(self.datahub_conn_id)
+ else:
+ raise AirflowException(
+ f"DataHub cannot handle conn_type {conn.conn_type} in {conn}"
+ )
+
+ def make_emitter(self) -> Union["DatahubRestEmitter", "DatahubKafkaEmitter"]:
+ return self.get_underlying_hook().make_emitter()
+
+ def emit_mces(self, mces: List[MetadataChangeEvent]) -> None:
+ return self.get_underlying_hook().emit_mces(mces)
diff --git a/metadata-ingestion/src/datahub_provider/example_dags/__init__.py b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/lineage/__init__.py
similarity index 100%
rename from metadata-ingestion/src/datahub_provider/example_dags/__init__.py
rename to metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/lineage/__init__.py
diff --git a/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/lineage/datahub.py b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/lineage/datahub.py
new file mode 100644
index 00000000000000..c41bb2b2a1e371
--- /dev/null
+++ b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/lineage/datahub.py
@@ -0,0 +1,91 @@
+import json
+from typing import TYPE_CHECKING, Dict, List, Optional
+
+from airflow.configuration import conf
+from airflow.lineage.backend import LineageBackend
+
+from datahub_airflow_plugin._lineage_core import (
+ DatahubBasicLineageConfig,
+ send_lineage_to_datahub,
+)
+
+if TYPE_CHECKING:
+ from airflow.models.baseoperator import BaseOperator
+
+
+class DatahubLineageConfig(DatahubBasicLineageConfig):
+ # If set to true, most runtime errors in the lineage backend will be
+ # suppressed and will not cause the overall task to fail. Note that
+ # configuration issues will still throw exceptions.
+ graceful_exceptions: bool = True
+
+
+def get_lineage_config() -> DatahubLineageConfig:
+ """Load the lineage config from airflow.cfg."""
+
+ # The kwargs pattern is also used for secret backends.
+ kwargs_str = conf.get("lineage", "datahub_kwargs", fallback="{}")
+ kwargs = json.loads(kwargs_str)
+
+ # Continue to support top-level datahub_conn_id config.
+ datahub_conn_id = conf.get("lineage", "datahub_conn_id", fallback=None)
+ if datahub_conn_id:
+ kwargs["datahub_conn_id"] = datahub_conn_id
+
+ return DatahubLineageConfig.parse_obj(kwargs)
+
+
+class DatahubLineageBackend(LineageBackend):
+ """
+ Sends lineage data from tasks to DataHub.
+
+ Configurable via ``airflow.cfg`` as follows: ::
+
+ # For REST-based:
+ airflow connections add --conn-type 'datahub_rest' 'datahub_rest_default' --conn-host 'http://localhost:8080'
+ # For Kafka-based (standard Kafka sink config can be passed via extras):
+ airflow connections add --conn-type 'datahub_kafka' 'datahub_kafka_default' --conn-host 'broker:9092' --conn-extra '{}'
+
+ [lineage]
+        backend = datahub_airflow_plugin.lineage.datahub.DatahubLineageBackend
+ datahub_kwargs = {
+ "datahub_conn_id": "datahub_rest_default",
+ "capture_ownership_info": true,
+ "capture_tags_info": true,
+ "graceful_exceptions": true }
+ # The above indentation is important!
+ """
+
+ def __init__(self) -> None:
+ super().__init__()
+
+ # By attempting to get and parse the config, we can detect configuration errors
+ # ahead of time. The init method is only called in Airflow 2.x.
+ _ = get_lineage_config()
+
+    # With Airflow 2.0, this can be an instance method. However, with Airflow 1.10.x, this
+    # method is used statically, even though LineageBackend declares it as an instance method.
+ @staticmethod
+ def send_lineage(
+ operator: "BaseOperator",
+ inlets: Optional[List] = None, # unused
+ outlets: Optional[List] = None, # unused
+ context: Optional[Dict] = None,
+ ) -> None:
+ config = get_lineage_config()
+ if not config.enabled:
+ return
+
+ try:
+ context = context or {} # ensure not None to satisfy mypy
+ send_lineage_to_datahub(
+ config, operator, operator.inlets, operator.outlets, context
+ )
+ except Exception as e:
+ if config.graceful_exceptions:
+ operator.log.error(e)
+ operator.log.info(
+ "Suppressing error because graceful_exceptions is set"
+ )
+ else:
+ raise
diff --git a/.github/workflows/docker-ingestion-base.yml b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/operators/__init__.py
similarity index 100%
rename from .github/workflows/docker-ingestion-base.yml
rename to metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/operators/__init__.py
diff --git a/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/operators/datahub.py b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/operators/datahub.py
new file mode 100644
index 00000000000000..109e7ddfe4dfa2
--- /dev/null
+++ b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/operators/datahub.py
@@ -0,0 +1,63 @@
+from typing import List, Union
+
+from airflow.models import BaseOperator
+from airflow.utils.decorators import apply_defaults
+from datahub.metadata.com.linkedin.pegasus2avro.mxe import MetadataChangeEvent
+
+from datahub_airflow_plugin.hooks.datahub import (
+ DatahubGenericHook,
+ DatahubKafkaHook,
+ DatahubRestHook,
+)
+
+
+class DatahubBaseOperator(BaseOperator):
+ """
+    The DatahubBaseOperator is used as a base operator for all DataHub operators.
+ """
+
+ ui_color = "#4398c8"
+
+ hook: Union[DatahubRestHook, DatahubKafkaHook]
+
+ # mypy is not a fan of this. Newer versions of Airflow support proper typing for the decorator
+ # using PEP 612. However, there is not yet a good way to inherit the types of the kwargs from
+ # the superclass.
+ @apply_defaults # type: ignore[misc]
+ def __init__( # type: ignore[no-untyped-def]
+ self,
+ *,
+ datahub_conn_id: str,
+ **kwargs,
+ ):
+ super().__init__(**kwargs)
+
+ self.datahub_conn_id = datahub_conn_id
+ self.generic_hook = DatahubGenericHook(datahub_conn_id)
+
+
+class DatahubEmitterOperator(DatahubBaseOperator):
+ """
+ Emits a Metadata Change Event to DataHub using either a DataHub
+ Rest or Kafka connection.
+
+ :param datahub_conn_id: Reference to the DataHub Rest or Kafka Connection.
+ :type datahub_conn_id: str
+ """
+
+ # See above for why these mypy type issues are ignored here.
+ @apply_defaults # type: ignore[misc]
+ def __init__( # type: ignore[no-untyped-def]
+ self,
+ mces: List[MetadataChangeEvent],
+ datahub_conn_id: str,
+ **kwargs,
+ ):
+ super().__init__(
+ datahub_conn_id=datahub_conn_id,
+ **kwargs,
+ )
+ self.mces = mces
+
+ def execute(self, context):
+ self.generic_hook.get_underlying_hook().emit_mces(self.mces)
diff --git a/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/operators/datahub_assertion_operator.py b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/operators/datahub_assertion_operator.py
new file mode 100644
index 00000000000000..6f93c09a9e2872
--- /dev/null
+++ b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/operators/datahub_assertion_operator.py
@@ -0,0 +1,78 @@
+import datetime
+from typing import Any, List, Optional, Sequence, Union
+
+from airflow.models import BaseOperator
+from datahub.api.circuit_breaker import (
+ AssertionCircuitBreaker,
+ AssertionCircuitBreakerConfig,
+)
+
+from datahub_airflow_plugin.hooks.datahub import DatahubRestHook
+
+
+class DataHubAssertionOperator(BaseOperator):
+ r"""
+ DataHub Assertion Circuit Breaker Operator.
+
+ :param urn: The DataHub dataset unique identifier. (templated)
+    :param datahub_rest_conn_id: The DataHub REST connection id used to communicate with DataHub,
+        set up as an Airflow connection.
+    :param check_last_assertion_time: If True (the default), only assertions run after the last
+        operation recorded on the dataset are checked.
+    :param time_delta: If check_last_assertion_time is False, assertions within this time delta are checked. Defaults to one day.
+ """
+
+ template_fields: Sequence[str] = ("urn",)
+ circuit_breaker: AssertionCircuitBreaker
+ urn: Union[List[str], str]
+
+ def __init__( # type: ignore[no-untyped-def]
+ self,
+ *,
+ urn: Union[List[str], str],
+ datahub_rest_conn_id: Optional[str] = None,
+ check_last_assertion_time: bool = True,
+ time_delta: Optional[datetime.timedelta] = None,
+ **kwargs,
+ ) -> None:
+ super().__init__(**kwargs)
+ hook: DatahubRestHook
+ if datahub_rest_conn_id is not None:
+ hook = DatahubRestHook(datahub_rest_conn_id=datahub_rest_conn_id)
+ else:
+ hook = DatahubRestHook()
+
+ host, password, timeout_sec = hook._get_config()
+ self.urn = urn
+ config: AssertionCircuitBreakerConfig = AssertionCircuitBreakerConfig(
+ datahub_host=host,
+ datahub_token=password,
+ timeout=timeout_sec,
+ verify_after_last_update=check_last_assertion_time,
+ time_delta=time_delta if time_delta else datetime.timedelta(days=1),
+ )
+
+ self.circuit_breaker = AssertionCircuitBreaker(config=config)
+
+ def execute(self, context: Any) -> bool:
+ if "datahub_silence_circuit_breakers" in context["dag_run"].conf:
+ self.log.info(
+ "Circuit breaker is silenced because datahub_silence_circuit_breakers config is set"
+ )
+ return True
+
+ self.log.info(f"Checking if dataset {self.urn} is ready to be consumed")
+ if isinstance(self.urn, str):
+ urns = [self.urn]
+ elif isinstance(self.urn, list):
+ urns = self.urn
+ else:
+ raise Exception(f"urn parameter has invalid type {type(self.urn)}")
+
+ for urn in urns:
+ self.log.info(f"Checking if dataset {self.urn} is ready to be consumed")
+ ret = self.circuit_breaker.is_circuit_breaker_active(urn=urn)
+ if ret:
+ raise Exception(f"Dataset {self.urn} is not in consumable state")
+
+ return True
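
A hedged usage sketch of the assertion circuit breaker as a hard gate: the task raises, failing the run, when the dataset's assertions are not in a consumable state. It assumes the module path `datahub_airflow_plugin.operators.datahub_assertion_operator`, a `datahub_rest_default` connection, Airflow 2.3+ (for `EmptyOperator`), and an illustrative URN.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator

from datahub_airflow_plugin.operators.datahub_assertion_operator import DataHubAssertionOperator

with DAG(
    dag_id="assertion_circuit_breaker_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
) as dag:
    # Fails the DAG run if assertions on the dataset are not passing.
    assertion_gate = DataHubAssertionOperator(
        task_id="check_dataset_assertions",
        urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table,PROD)",
        datahub_rest_conn_id="datahub_rest_default",
        check_last_assertion_time=True,
        time_delta=timedelta(days=1),
    )

    consume = EmptyOperator(task_id="consume_dataset")

    # Downstream consumption only runs when the gate passes.
    assertion_gate >> consume
```
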
diff --git a/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/operators/datahub_assertion_sensor.py b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/operators/datahub_assertion_sensor.py
new file mode 100644
index 00000000000000..16e5d1cbe8b1f4
--- /dev/null
+++ b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/operators/datahub_assertion_sensor.py
@@ -0,0 +1,78 @@
+import datetime
+from typing import Any, List, Optional, Sequence, Union
+
+from airflow.sensors.base import BaseSensorOperator
+from datahub.api.circuit_breaker import (
+ AssertionCircuitBreaker,
+ AssertionCircuitBreakerConfig,
+)
+
+from datahub_airflow_plugin.hooks.datahub import DatahubRestHook
+
+
+class DataHubAssertionSensor(BaseSensorOperator):
+ r"""
+ DataHub Assertion Circuit Breaker Sensor.
+
+ :param urn: The DataHub dataset unique identifier. (templated)
+ :param datahub_rest_conn_id: The DataHub REST connection id (an Airflow connection) used to
+ communicate with DataHub.
+ :param check_last_assertion_time: If set, it checks for assertions created after the last operation
+ was reported on the dataset. By default it is True.
+ :param time_delta: If check_last_assertion_time is False, it checks for assertions within this time delta.
+ Defaults to one day.
+ """
+
+ template_fields: Sequence[str] = ("urn",)
+ circuit_breaker: AssertionCircuitBreaker
+ urn: Union[List[str], str]
+
+ def __init__( # type: ignore[no-untyped-def]
+ self,
+ *,
+ urn: Union[List[str], str],
+ datahub_rest_conn_id: Optional[str] = None,
+ check_last_assertion_time: bool = True,
+ time_delta: datetime.timedelta = datetime.timedelta(days=1),
+ **kwargs,
+ ) -> None:
+ super().__init__(**kwargs)
+ hook: DatahubRestHook
+ if datahub_rest_conn_id is not None:
+ hook = DatahubRestHook(datahub_rest_conn_id=datahub_rest_conn_id)
+ else:
+ hook = DatahubRestHook()
+
+ host, password, timeout_sec = hook._get_config()
+ self.urn = urn
+ config: AssertionCircuitBreakerConfig = AssertionCircuitBreakerConfig(
+ datahub_host=host,
+ datahub_token=password,
+ timeout=timeout_sec,
+ verify_after_last_update=check_last_assertion_time,
+ time_delta=time_delta,
+ )
+ self.circuit_breaker = AssertionCircuitBreaker(config=config)
+
+ def poke(self, context: Any) -> bool:
+ if "datahub_silence_circuit_breakers" in context["dag_run"].conf:
+ self.log.info(
+ "Circuit breaker is silenced because datahub_silence_circuit_breakers config is set"
+ )
+ return True
+
+ self.log.info(f"Checking if dataset {self.urn} is ready to be consumed")
+ if isinstance(self.urn, str):
+ urns = [self.urn]
+ elif isinstance(self.urn, list):
+ urns = self.urn
+ else:
+ raise Exception(f"urn parameter has invalid type {type(self.urn)}")
+
+ for urn in urns:
+ self.log.info(f"Checking if dataset {self.urn} is ready to be consumed")
+ ret = self.circuit_breaker.is_circuit_breaker_active(urn=urn)
+ if ret:
+ self.log.info(f"Dataset {self.urn} is not in consumable state")
+ return False
+
+ return True
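
The sensor variant waits instead of failing: `poke` returns False while the circuit breaker is active, so the task is re-checked on the sensor schedule. A minimal sketch under the same module-path and connection-id assumptions as above, with an illustrative URN; `poke_interval`, `timeout`, and `mode` are standard `BaseSensorOperator` parameters.

```python
from datetime import datetime

from airflow import DAG

from datahub_airflow_plugin.operators.datahub_assertion_sensor import DataHubAssertionSensor

with DAG(
    dag_id="assertion_sensor_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
) as dag:
    # Re-check every 10 minutes for up to 6 hours, releasing the worker slot between pokes.
    wait_for_assertions = DataHubAssertionSensor(
        task_id="wait_for_dataset_assertions",
        urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table,PROD)",
        datahub_rest_conn_id="datahub_rest_default",
        poke_interval=600,
        timeout=6 * 60 * 60,
        mode="reschedule",
    )
```
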
diff --git a/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/operators/datahub_operation_operator.py b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/operators/datahub_operation_operator.py
new file mode 100644
index 00000000000000..94e105309537b6
--- /dev/null
+++ b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/operators/datahub_operation_operator.py
@@ -0,0 +1,97 @@
+import datetime
+from typing import Any, List, Optional, Sequence, Union
+
+from airflow.sensors.base import BaseSensorOperator
+from datahub.api.circuit_breaker import (
+ OperationCircuitBreaker,
+ OperationCircuitBreakerConfig,
+)
+
+from datahub_airflow_plugin.hooks.datahub import DatahubRestHook
+
+
+class DataHubOperationCircuitBreakerOperator(BaseSensorOperator):
+ r"""
+ DataHub Operation Circuit Breaker Operator.
+
+ :param urn: The DataHub dataset unique identifier. (templated)
+ :param datahub_rest_conn_id: The DataHub REST connection id (an Airflow connection) used to
+ communicate with DataHub.
+ :param time_delta: How far back to look for operations. Defaults to one day.
+ :param partition: The partition to check the operation for.
+ :param source_type: The source type to filter on. If not set it will accept any source type.
+ See valid values at: https://datahubproject.io/docs/graphql/enums#operationsourcetype
+ :param operation_type: The operation type to filter on. If not set it will accept any operation type.
+ See valid values at: https://datahubproject.io/docs/graphql/enums/#operationtype
+
+ """
+
+ template_fields: Sequence[str] = (
+ "urn",
+ "partition",
+ "source_type",
+ "operation_type",
+ )
+ circuit_breaker: OperationCircuitBreaker
+ urn: Union[List[str], str]
+ partition: Optional[str]
+ source_type: Optional[str]
+ operation_type: Optional[str]
+
+ def __init__( # type: ignore[no-untyped-def]
+ self,
+ *,
+ urn: Union[List[str], str],
+ datahub_rest_conn_id: Optional[str] = None,
+ time_delta: Optional[datetime.timedelta] = datetime.timedelta(days=1),
+ partition: Optional[str] = None,
+ source_type: Optional[str] = None,
+ operation_type: Optional[str] = None,
+ **kwargs,
+ ) -> None:
+ super().__init__(**kwargs)
+ hook: DatahubRestHook
+ if datahub_rest_conn_id is not None:
+ hook = DatahubRestHook(datahub_rest_conn_id=datahub_rest_conn_id)
+ else:
+ hook = DatahubRestHook()
+
+ host, password, timeout_sec = hook._get_config()
+
+ self.urn = urn
+ self.partition = partition
+ self.operation_type = operation_type
+ self.source_type = source_type
+
+ config: OperationCircuitBreakerConfig = OperationCircuitBreakerConfig(
+ datahub_host=host,
+ datahub_token=password,
+ timeout=timeout_sec,
+ time_delta=time_delta,
+ )
+
+ self.circuit_breaker = OperationCircuitBreaker(config=config)
+
+ def execute(self, context: Any) -> bool:
+ if "datahub_silence_circuit_breakers" in context["dag_run"].conf:
+ self.log.info(
+ "Circuit breaker is silenced because datahub_silence_circuit_breakers config is set"
+ )
+ return True
+
+ self.log.info(f"Checking if dataset {self.urn} is ready to be consumed")
+ if isinstance(self.urn, str):
+ urns = [self.urn]
+ elif isinstance(self.urn, list):
+ urns = self.urn
+ else:
+ raise Exception(f"urn parameter has invalid type {type(self.urn)}")
+
+ for urn in urns:
+ self.log.info(f"Checking if dataset {self.urn} is ready to be consumed")
+ ret = self.circuit_breaker.is_circuit_breaker_active(
+ urn=urn,
+ partition=self.partition,
+ operation_type=self.operation_type,
+ source_type=self.source_type,
+ )
+ if ret:
+ raise Exception(f"Dataset {self.urn} is not in consumable state")
+
+ return True
diff --git a/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/operators/datahub_operation_sensor.py b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/operators/datahub_operation_sensor.py
new file mode 100644
index 00000000000000..434c60754064d0
--- /dev/null
+++ b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/operators/datahub_operation_sensor.py
@@ -0,0 +1,100 @@
+import datetime
+from typing import Any, List, Optional, Sequence, Union
+
+from airflow.sensors.base import BaseSensorOperator
+from datahub.api.circuit_breaker import (
+ OperationCircuitBreaker,
+ OperationCircuitBreakerConfig,
+)
+
+from datahub_airflow_plugin.hooks.datahub import DatahubRestHook
+
+
+class DataHubOperationCircuitBreakerSensor(BaseSensorOperator):
+ r"""
+ DataHub Operation Circuit Breaker Sensor.
+
+ :param urn: The DataHub dataset unique identifier. (templated)
+ :param datahub_rest_conn_id: The DataHub REST connection id (an Airflow connection) used to
+ communicate with DataHub.
+ :param time_delta: How far back to look for operations. Defaults to one day.
+ :param partition: The partition to check the operation for.
+ :param source_type: The source type to filter on. If not set it will accept any source type.
+ See valid values at: https://datahubproject.io/docs/graphql/enums#operationsourcetype
+ :param operation_type: The operation type to filter on. If not set it will accept any operation type.
+ See valid values at: https://datahubproject.io/docs/graphql/enums/#operationtype
+ """
+
+ template_fields: Sequence[str] = (
+ "urn",
+ "partition",
+ "source_type",
+ "operation_type",
+ )
+ circuit_breaker: OperationCircuitBreaker
+ urn: Union[List[str], str]
+ partition: Optional[str]
+ source_type: Optional[str]
+ operation_type: Optional[str]
+
+ def __init__( # type: ignore[no-untyped-def]
+ self,
+ *,
+ urn: Union[List[str], str],
+ datahub_rest_conn_id: Optional[str] = None,
+ time_delta: Optional[datetime.timedelta] = datetime.timedelta(days=1),
+ partition: Optional[str] = None,
+ source_type: Optional[str] = None,
+ operation_type: Optional[str] = None,
+ **kwargs,
+ ) -> None:
+ super().__init__(**kwargs)
+ hook: DatahubRestHook
+ if datahub_rest_conn_id is not None:
+ hook = DatahubRestHook(datahub_rest_conn_id=datahub_rest_conn_id)
+ else:
+ hook = DatahubRestHook()
+
+ host, password, timeout_sec = hook._get_config()
+
+ self.urn = urn
+ self.partition = partition
+ self.operation_type = operation_type
+ self.source_type = source_type
+
+ config: OperationCircuitBreakerConfig = OperationCircuitBreakerConfig(
+ datahub_host=host,
+ datahub_token=password,
+ timeout=timeout_sec,
+ time_delta=time_delta,
+ )
+
+ self.circuit_breaker = OperationCircuitBreaker(config=config)
+
+ def poke(self, context: Any) -> bool:
+ if "datahub_silence_circuit_breakers" in context["dag_run"].conf:
+ self.log.info(
+ "Circuit breaker is silenced because datahub_silence_circuit_breakers config is set"
+ )
+ return True
+
+ self.log.info(f"Checking if dataset {self.urn} is ready to be consumed")
+ if isinstance(self.urn, str):
+ urns = [self.urn]
+ elif isinstance(self.urn, list):
+ urns = self.urn
+ else:
+ raise Exception(f"urn parameter has invalid type {type(self.urn)}")
+
+ for urn in urns:
+ self.log.info(f"Checking if dataset {self.urn} is ready to be consumed")
+ ret = self.circuit_breaker.is_circuit_breaker_active(
+ urn=urn,
+ partition=self.partition,
+ operation_type=self.operation_type,
+ source_type=self.source_type,
+ )
+ if ret:
+ self.log.info(f"Dataset {self.urn} is not in consumable state")
+ return False
+
+ return True
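
A hedged sketch of the operation sensor with partition and type filters, under the same module-path and connection assumptions; the URN, partition, and the `DATA_PLATFORM`/`INSERT` enum values are illustrative only (see the enum docs linked in the docstring for valid values).

```python
from datetime import datetime, timedelta

from airflow import DAG

from datahub_airflow_plugin.operators.datahub_operation_sensor import (
    DataHubOperationCircuitBreakerSensor,
)

with DAG(
    dag_id="operation_sensor_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
) as dag:
    # Wait until an INSERT operation reported by the data platform exists for the
    # given partition within the last day.
    wait_for_operation = DataHubOperationCircuitBreakerSensor(
        task_id="wait_for_partition_update",
        urn="urn:li:dataset:(urn:li:dataPlatform:hive,db.table,PROD)",
        datahub_rest_conn_id="datahub_rest_default",
        time_delta=timedelta(days=1),
        partition="dt=2023-06-01",
        source_type="DATA_PLATFORM",
        operation_type="INSERT",
        poke_interval=300,
        mode="reschedule",
    )
```
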
diff --git a/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/py.typed b/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/py.typed
new file mode 100644
index 00000000000000..e69de29bb2d1d6
diff --git a/metadata-ingestion/tests/unit/test_airflow.py b/metadata-ingestion-modules/airflow-plugin/tests/unit/test_airflow.py
similarity index 97%
rename from metadata-ingestion/tests/unit/test_airflow.py
rename to metadata-ingestion-modules/airflow-plugin/tests/unit/test_airflow.py
index 980dc5550fafa8..9aa901171cfa65 100644
--- a/metadata-ingestion/tests/unit/test_airflow.py
+++ b/metadata-ingestion-modules/airflow-plugin/tests/unit/test_airflow.py
@@ -9,12 +9,11 @@
import airflow.configuration
import airflow.version
+import datahub.emitter.mce_builder as builder
import packaging.version
import pytest
from airflow.lineage import apply_lineage, prepare_lineage
from airflow.models import DAG, Connection, DagBag, DagRun, TaskInstance
-
-import datahub.emitter.mce_builder as builder
from datahub_provider import get_provider_info
from datahub_provider._airflow_shims import AIRFLOW_PATCHED, EmptyOperator
from datahub_provider.entities import Dataset, Urn
@@ -23,7 +22,7 @@
assert AIRFLOW_PATCHED
-pytestmark = pytest.mark.airflow
+# TODO: Remove the default_view="tree" arg. Figure out why default_view is being picked as "grid" and how to fix it.
# Approach suggested by https://stackoverflow.com/a/11887885/5004662.
AIRFLOW_VERSION = packaging.version.parse(airflow.version.version)
@@ -75,7 +74,7 @@ def test_airflow_provider_info():
@pytest.mark.filterwarnings("ignore:.*is deprecated.*")
def test_dags_load_with_no_errors(pytestconfig: pytest.Config) -> None:
airflow_examples_folder = (
- pytestconfig.rootpath / "src/datahub_provider/example_dags"
+ pytestconfig.rootpath / "src/datahub_airflow_plugin/example_dags"
)
# Note: the .airflowignore file skips the snowflake DAG.
@@ -233,7 +232,11 @@ def test_lineage_backend(mock_emit, inlets, outlets, capture_executions):
func = mock.Mock()
func.__name__ = "foo"
- dag = DAG(dag_id="test_lineage_is_sent_to_backend", start_date=DEFAULT_DATE)
+ dag = DAG(
+ dag_id="test_lineage_is_sent_to_backend",
+ start_date=DEFAULT_DATE,
+ default_view="tree",
+ )
with dag:
op1 = EmptyOperator(
@@ -252,6 +255,7 @@ def test_lineage_backend(mock_emit, inlets, outlets, capture_executions):
# versions do not require it, but will attempt to find the associated
# run_id in the database if execution_date is provided. As such, we
# must fake the run_id parameter for newer Airflow versions.
+ # We need to add a type: ignore in the else branch to suppress a mypy error on Airflow < 2.2.
if AIRFLOW_VERSION < packaging.version.parse("2.2.0"):
ti = TaskInstance(task=op2, execution_date=DEFAULT_DATE)
# Ignoring type here because DagRun state is just a string at Airflow 1
@@ -259,7 +263,7 @@ def test_lineage_backend(mock_emit, inlets, outlets, capture_executions):
else:
from airflow.utils.state import DagRunState
- ti = TaskInstance(task=op2, run_id=f"test_airflow-{DEFAULT_DATE}")
+ ti = TaskInstance(task=op2, run_id=f"test_airflow-{DEFAULT_DATE}") # type: ignore[call-arg]
dag_run = DagRun(
state=DagRunState.SUCCESS,
run_id=f"scheduled_{DEFAULT_DATE.isoformat()}",
diff --git a/metadata-ingestion/adding-source.md b/metadata-ingestion/adding-source.md
index 50e6a1cd5fcc6a..e4fc950a7cdbd0 100644
--- a/metadata-ingestion/adding-source.md
+++ b/metadata-ingestion/adding-source.md
@@ -44,7 +44,11 @@ class LookerAPIConfig(ConfigModel):
```
generates the following documentation:
-![Generated Config Documentation](./docs/images/generated_config_docs.png)
+
+
+
+
+
:::note
Inline markdown or code snippets are not yet supported for field level documentation.
diff --git a/metadata-ingestion/build.gradle b/metadata-ingestion/build.gradle
index f636cf25c67f72..c20d98cbcbb58b 100644
--- a/metadata-ingestion/build.gradle
+++ b/metadata-ingestion/build.gradle
@@ -21,11 +21,13 @@ task checkPythonVersion(type: Exec) {
}
task environmentSetup(type: Exec, dependsOn: checkPythonVersion) {
+ def sentinel_file = "${venv_name}/.venv_environment_sentinel"
inputs.file file('setup.py')
- outputs.dir("${venv_name}")
+ outputs.file(sentinel_file)
commandLine 'bash', '-c',
"${python_executable} -m venv ${venv_name} && " +
- "${venv_name}/bin/python -m pip install --upgrade pip wheel 'setuptools>=63.0.0'"
+ "${venv_name}/bin/python -m pip install --upgrade pip wheel 'setuptools>=63.0.0' && " +
+ "touch ${sentinel_file}"
}
task runPreFlightScript(type: Exec, dependsOn: environmentSetup) {
@@ -39,7 +41,6 @@ task runPreFlightScript(type: Exec, dependsOn: environmentSetup) {
task installPackageOnly(type: Exec, dependsOn: runPreFlightScript) {
def sentinel_file = "${venv_name}/.build_install_package_only_sentinel"
inputs.file file('setup.py')
- outputs.dir("${venv_name}")
outputs.file(sentinel_file)
commandLine 'bash', '-x', '-c',
"${venv_name}/bin/pip install -e . &&" +
@@ -47,9 +48,12 @@ task installPackageOnly(type: Exec, dependsOn: runPreFlightScript) {
}
task installPackage(type: Exec, dependsOn: installPackageOnly) {
+ def sentinel_file = "${venv_name}/.build_install_package_sentinel"
inputs.file file('setup.py')
- outputs.dir("${venv_name}")
- commandLine 'bash', '-x', '-c', "${venv_name}/bin/pip install -e . ${extra_pip_requirements}"
+ outputs.file(sentinel_file)
+ commandLine 'bash', '-x', '-c',
+ "${venv_name}/bin/pip install -e . ${extra_pip_requirements} && " +
+ "touch ${sentinel_file}"
}
task codegen(type: Exec, dependsOn: [environmentSetup, installPackage, ':metadata-events:mxe-schemas:build']) {
@@ -63,24 +67,20 @@ task install(dependsOn: [installPackage, codegen])
task installDev(type: Exec, dependsOn: [install]) {
def sentinel_file = "${venv_name}/.build_install_dev_sentinel"
inputs.file file('setup.py')
- outputs.dir("${venv_name}")
outputs.file(sentinel_file)
commandLine 'bash', '-c',
"source ${venv_name}/bin/activate && set -x && " +
"${venv_name}/bin/pip install -e .[dev] ${extra_pip_requirements} && " +
- "./scripts/install-sqlalchemy-stubs.sh && " +
"touch ${sentinel_file}"
}
task installAll(type: Exec, dependsOn: [install]) {
def sentinel_file = "${venv_name}/.build_install_all_sentinel"
inputs.file file('setup.py')
- outputs.dir("${venv_name}")
outputs.file(sentinel_file)
commandLine 'bash', '-c',
"source ${venv_name}/bin/activate && set -x && " +
"${venv_name}/bin/pip install -e .[all] ${extra_pip_requirements} && " +
- "./scripts/install-sqlalchemy-stubs.sh && " +
"touch ${sentinel_file}"
}
@@ -117,7 +117,6 @@ task lint(type: Exec, dependsOn: installDev) {
task lintFix(type: Exec, dependsOn: installDev) {
commandLine 'bash', '-c',
"source ${venv_name}/bin/activate && set -x && " +
- "./scripts/install-sqlalchemy-stubs.sh && " +
"black src/ tests/ examples/ && " +
"isort src/ tests/ examples/ && " +
"flake8 src/ tests/ examples/ && " +
@@ -186,9 +185,6 @@ task specGen(type: Exec, dependsOn: [codegen, installDevTest]) {
task docGen(type: Exec, dependsOn: [codegen, installDevTest, specGen]) {
commandLine 'bash', '-c', "source ${venv_name}/bin/activate && ./scripts/docgen.sh"
}
-task buildWheel(type: Exec, dependsOn: [install, codegen]) {
- commandLine 'bash', '-c', "source ${venv_name}/bin/activate && " + 'pip install build && RELEASE_VERSION="\${RELEASE_VERSION:-0.0.0.dev1}" RELEASE_SKIP_TEST=1 RELEASE_SKIP_UPLOAD=1 ./scripts/release.sh'
-}
@@ -196,6 +192,9 @@ task cleanPythonCache(type: Exec) {
commandLine 'bash', '-c',
"find src tests -type f -name '*.py[co]' -delete -o -type d -name __pycache__ -delete -o -type d -empty -delete"
}
+task buildWheel(type: Exec, dependsOn: [install, codegen, cleanPythonCache]) {
+ commandLine 'bash', '-c', "source ${venv_name}/bin/activate && " + 'pip install build && RELEASE_VERSION="\${RELEASE_VERSION:-0.0.0.dev1}" RELEASE_SKIP_TEST=1 RELEASE_SKIP_UPLOAD=1 ./scripts/release.sh'
+}
build.dependsOn install
check.dependsOn lint
diff --git a/metadata-ingestion/developing.md b/metadata-ingestion/developing.md
index 67041d23a21b13..f529590e2ab393 100644
--- a/metadata-ingestion/developing.md
+++ b/metadata-ingestion/developing.md
@@ -26,6 +26,16 @@ source venv/bin/activate
datahub version # should print "DataHub CLI version: unavailable (installed in develop mode)"
```
+### (Optional) Set up your Python environment for developing on Airflow Plugin
+
+From the repository root:
+
+```shell
+cd metadata-ingestion-modules/airflow-plugin
+../../gradlew :metadata-ingestion-modules:airflow-plugin:installDev
+source venv/bin/activate
+datahub version # should print "DataHub CLI version: unavailable (installed in develop mode)"
+```
### Common setup issues
Common issues (click to expand):
@@ -74,7 +84,9 @@ The syntax for installing plugins is slightly different in development. For exam
## Architecture
-![metadata ingestion framework layout](../docs/imgs/datahub-metadata-ingestion-framework.png)
+
+
+
The architecture of this metadata ingestion framework is heavily inspired by [Apache Gobblin](https://gobblin.apache.org/) (also originally a LinkedIn project!). We have a standardized format - the MetadataChangeEvent - and sources and sinks which respectively produce and consume these objects. The sources pull metadata from a variety of data systems, while the sinks are primarily for moving this metadata into DataHub.
@@ -181,7 +193,7 @@ pytest -m 'slow_integration'
../gradlew :metadata-ingestion:testFull
../gradlew :metadata-ingestion:check
# Run all tests in a single file
-../gradlew :metadata-ingestion:testSingle -PtestFile=tests/unit/test_airflow.py
+../gradlew :metadata-ingestion:testSingle -PtestFile=tests/unit/test_bigquery_source.py
# Run all tests under tests/unit
../gradlew :metadata-ingestion:testSingle -PtestFile=tests/unit
```
diff --git a/metadata-ingestion/docs/dev_guides/add_stateful_ingestion_to_source.md b/metadata-ingestion/docs/dev_guides/add_stateful_ingestion_to_source.md
index 0e3ead9a7adf81..9e39d24fb85782 100644
--- a/metadata-ingestion/docs/dev_guides/add_stateful_ingestion_to_source.md
+++ b/metadata-ingestion/docs/dev_guides/add_stateful_ingestion_to_source.md
@@ -60,16 +60,14 @@ class StaleEntityCheckpointStateBase(CheckpointStateBase, ABC, Generic[Derived])
```
Examples:
-1. [KafkaCheckpointState](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/state/kafka_state.py#L11).
-2. [DbtCheckpointState](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/state/dbt_state.py#L16)
-3. [BaseSQLAlchemyCheckpointState](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/state/sql_common_state.py#L17)
+* [BaseSQLAlchemyCheckpointState](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/state/sql_common_state.py#L17)
### 2. Modifying the SourceConfig
The source's config must inherit from `StatefulIngestionConfigBase`, and should declare a field named `stateful_ingestion` of type `Optional[StatefulStaleMetadataRemovalConfig]`.
Examples:
-1. The `KafkaSourceConfig`
+- The `KafkaSourceConfig`
```python
from typing import List, Optional
import pydantic
@@ -84,9 +82,6 @@ class KafkaSourceConfig(StatefulIngestionConfigBase):
stateful_ingestion: Optional[StatefulStaleMetadataRemovalConfig] = None
```
-2. The [DBTStatefulIngestionConfig](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/dbt.py#L131)
- and the [DBTConfig](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/dbt.py#L317).
-
### 3. Modifying the SourceReport
The report class of the source should inherit from `StaleEntityRemovalSourceReport` whose definition is shown below.
```python
@@ -102,7 +97,7 @@ class StaleEntityRemovalSourceReport(StatefulIngestionReport):
```
Examples:
-1. The `KafkaSourceReport`
+* The `KafkaSourceReport`
```python
from dataclasses import dataclass
from datahub.ingestion.source.state.stale_entity_removal_handler import StaleEntityRemovalSourceReport
@@ -110,7 +105,7 @@ from datahub.ingestion.source.state.stale_entity_removal_handler import StaleEnt
class KafkaSourceReport(StaleEntityRemovalSourceReport):
# Iterable[MetadataWorkUnit]:
# Skip a redundant run
if self.redundant_run_skip_handler.should_skip_this_run(
- cur_start_time_millis=datetime_to_ts_millis(self.config.start_time)
+ cur_start_time_millis=self.config.start_time
):
return
@@ -260,7 +255,7 @@ Example code:
#
# Update checkpoint state for this run.
self.redundant_run_skip_handler.update_state(
- start_time_millis=datetime_to_ts_millis(self.config.start_time),
- end_time_millis=datetime_to_ts_millis(self.config.end_time),
+ start_time_millis=self.config.start_time,
+ end_time_millis=self.config.end_time,
)
```
\ No newline at end of file
diff --git a/metadata-ingestion/docs/dev_guides/reporting_telemetry.md b/metadata-ingestion/docs/dev_guides/reporting_telemetry.md
index 1e770ab0257111..11aec2efe9714d 100644
--- a/metadata-ingestion/docs/dev_guides/reporting_telemetry.md
+++ b/metadata-ingestion/docs/dev_guides/reporting_telemetry.md
@@ -69,7 +69,7 @@ reporting:
An ingestion reporting state provider is responsible for saving and retrieving the ingestion telemetry
associated with the ingestion runs of various jobs inside the source connector of the ingestion pipeline.
The data model used for capturing the telemetry is [DatahubIngestionRunSummary](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/datajob/datahub/DatahubIngestionRunSummary.pdl).
-A reporting ingestion state provider needs to implement the [IngestionReportingProviderBase](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/api/ingestion_job_reporting_provider_base.py)
+A reporting ingestion state provider needs to implement the IngestionReportingProviderBase
interface and register itself with datahub by adding an entry under `datahub.ingestion.reporting_provider.plugins`
key of the entry_points section in [setup.py](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/setup.py)
with its type and implementation class as shown below.
diff --git a/metadata-ingestion/docs/dev_guides/stateful.md b/metadata-ingestion/docs/dev_guides/stateful.md
index eccacbb416714b..08ccf015c994c9 100644
--- a/metadata-ingestion/docs/dev_guides/stateful.md
+++ b/metadata-ingestion/docs/dev_guides/stateful.md
@@ -22,14 +22,14 @@ noCode: "true"
Note that a `.` is used to denote nested fields in the YAML recipe.
-| Field | Required | Default | Description |
-|--------------------------------------------------------------| -------- |------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| `source.config.stateful_ingestion.enabled` | | False | The type of the ingestion state provider registered with datahub. |
-| `source.config.stateful_ingestion.ignore_old_state` | | False | If set to True, ignores the previous checkpoint state. |
-| `source.config.stateful_ingestion.ignore_new_state` | | False | If set to True, ignores the current checkpoint state. |
-| `source.config.stateful_ingestion.max_checkpoint_state_size` | | 2^24 (16MB) | The maximum size of the checkpoint state in bytes. |
-| `source.config.stateful_ingestion.state_provider` | | The default [datahub ingestion state provider](#datahub-ingestion-state-provider) configuration. | The ingestion state provider configuration. |
-| `pipeline_name` | ✅ | | The name of the ingestion pipeline the checkpoint states of various source connector job runs are saved/retrieved against via the ingestion state provider. |
+| Field | Required | Default | Description |
+|--------------------------------------------------------------| -------- |-----------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `source.config.stateful_ingestion.enabled` | | False | Whether or not to enable stateful ingestion for the source. |
+| `source.config.stateful_ingestion.ignore_old_state` | | False | If set to True, ignores the previous checkpoint state. |
+| `source.config.stateful_ingestion.ignore_new_state` | | False | If set to True, ignores the current checkpoint state. |
+| `source.config.stateful_ingestion.max_checkpoint_state_size` | | 2^24 (16MB) | The maximum size of the checkpoint state in bytes. |
+| `source.config.stateful_ingestion.state_provider` | | The default datahub ingestion state provider configuration. | The ingestion state provider configuration. |
+| `pipeline_name` | ✅ | | The name of the ingestion pipeline the checkpoint states of various source connector job runs are saved/retrieved against via the ingestion state provider. |
NOTE: If either `dry-run` or `preview` mode are set, stateful ingestion will be turned off regardless of the rest of the configuration.
## Use-cases powered by stateful ingestion.
@@ -38,7 +38,9 @@ Following is the list of current use-cases powered by stateful ingestion in data
Stateful ingestion can be used to automatically soft-delete the tables and views that are seen in a previous run
but absent in the current run (they are either deleted or no longer desired).
-![Stale Metadata Deletion](./stale_metadata_deletion.png)
+
+
+
#### Supported sources
* All sql based sources.
diff --git a/metadata-ingestion/docs/sources/azure-ad/azure-ad.md b/metadata-ingestion/docs/sources/azure-ad/azure-ad.md
index 8b375fbee4f33c..d2677d7e4fc7a3 100644
--- a/metadata-ingestion/docs/sources/azure-ad/azure-ad.md
+++ b/metadata-ingestion/docs/sources/azure-ad/azure-ad.md
@@ -5,6 +5,15 @@ to read your organization's Users and Groups. The following permissions are requ
- `GroupMember.Read.All`
- `User.Read.All`
-You can add a permission by navigating to the permissions tab in your DataHub application on the Azure AD portal. ![Azure AD API Permissions](./azure_ad_api_permissions.png)
+You can add a permission by navigating to the permissions tab in your DataHub application on the Azure AD portal.
+
+
+
-You can view the necessary endpoints to configure by clicking on the Endpoints button in the Overview tab. ![Azure AD Endpoints](./azure_ad_endpoints.png)
+
+You can view the necessary endpoints to configure by clicking on the Endpoints button in the Overview tab.
+
+
+
+
+
diff --git a/metadata-ingestion/docs/sources/databricks/README.md b/metadata-ingestion/docs/sources/databricks/README.md
index 01aee3236e01c2..b380a892c22b9d 100644
--- a/metadata-ingestion/docs/sources/databricks/README.md
+++ b/metadata-ingestion/docs/sources/databricks/README.md
@@ -15,8 +15,11 @@ To complete the picture, we recommend adding push-based ingestion from your Spar
## Watch the DataHub Talk at the Data and AI Summit 2022
For a deeper look at how to think about DataHub within and across your Databricks ecosystem, watch the recording of our talk at the Data and AI Summit 2022.
-
-[![IMAGE_ALT](../../images/databricks/data_and_ai_summit_2022.png)](https://www.youtube.com/watch?v=SCP0PR3t7dc)
+
+
+
+
+
diff --git a/metadata-ingestion/docs/sources/datahub/README.md b/metadata-ingestion/docs/sources/datahub/README.md
new file mode 100644
index 00000000000000..45afc6e1668897
--- /dev/null
+++ b/metadata-ingestion/docs/sources/datahub/README.md
@@ -0,0 +1,4 @@
+Migrate data from one DataHub instance to another.
+
+Requires direct access to the database, kafka broker, and kafka schema registry
+of the source DataHub instance.
diff --git a/metadata-ingestion/docs/sources/datahub/datahub_pre.md b/metadata-ingestion/docs/sources/datahub/datahub_pre.md
new file mode 100644
index 00000000000000..c98cce70478360
--- /dev/null
+++ b/metadata-ingestion/docs/sources/datahub/datahub_pre.md
@@ -0,0 +1,66 @@
+### Overview
+
+This source pulls data from two locations:
+- The DataHub database, containing a single table holding all versioned aspects
+- The DataHub Kafka cluster, reading from the [MCL Log](../../../../docs/what/mxe.md#metadata-change-log-mcl)
+topic for timeseries aspects.
+
+All data is first read from the database, before timeseries data is ingested from kafka.
+To prevent this source from potentially running forever, it will not ingest data produced after the
+datahub_source ingestion job is started. This `stop_time` is reflected in the report.
+
+Data from the database and kafka are read in chronological order, specifically by the
+createdon timestamp in the database and by kafka offset per partition. In order to
+properly read from the database, please ensure that the `createdon` column is indexed.
+Newly created databases should have this index, named `timeIndex`, by default, but on older
+instances you may have to create it yourself with the following statement:
+
+```
+CREATE INDEX timeIndex ON metadata_aspect_v2 (createdon);
+```
+
+*If you do not have this index, the source may run incredibly slowly and produce
+significant database load.*
+
+#### Stateful Ingestion
+On first run, the source will read from the earliest data in the database and the earliest
+kafka offsets. Every `commit_state_interval` (default 1000) records, the source will store
+a checkpoint to remember its place, i.e. the last createdon timestamp and kafka offsets.
+This allows you to stop and restart the source without losing much progress, but note that
+you will re-ingest some data at the start of the new run.
+
+If any errors are encountered in the ingestion process, e.g. we are unable to emit an aspect
+due to network errors, the source will keep running, but will stop committing checkpoints,
+unless `commit_with_parse_errors` (default `false`) is set. Thus, if you re-run the ingestion,
+you can re-ingest the data that was missed, but note that it will also re-ingest all subsequent data.
+
+If you want to re-ingest all data, you can set a different `pipeline_name` in your recipe,
+or set `stateful_ingestion.ignore_old_state`:
+
+```yaml
+source:
+ config:
+ # ... connection config, etc.
+ stateful_ingestion:
+ enabled: true
+ ignore_old_state: true
+```
+
+#### Limitations
+- Can only pull timeseries aspects retained by Kafka, which are kept for 90 days by default.
+- Does not detect hard timeseries deletions, e.g. those made via a `datahub delete` CLI command.
+Therefore, if you deleted data in this way, it will still exist in the destination instance.
+- If you have a significant number of aspects with the exact same `createdon` timestamp,
+stateful ingestion will not be able to save a checkpoint partway through that timestamp.
+On a subsequent run, all aspects for that timestamp will be ingested.
+
+#### Performance
+On your destination DataHub instance, we suggest the following settings:
+- Enable [async ingestion](../../../../docs/deploy/environment-vars.md#ingestion)
+- Use standalone consumers
+([mae-consumer](../../../../metadata-jobs/mae-consumer-job/README.md)
+and [mce-consumer](../../../../metadata-jobs/mce-consumer-job/README.md))
+ * If you are migrating large amounts of data, consider scaling consumer replicas.
+- Increase the number of gms pods to add redundancy and increase resilience to node evictions
+ * If you are migrating large amounts of data, consider increasing elasticsearch's
+ thread count via the `ELASTICSEARCH_THREAD_COUNT` environment variable.
diff --git a/metadata-ingestion/docs/sources/datahub/datahub_recipe.yml b/metadata-ingestion/docs/sources/datahub/datahub_recipe.yml
new file mode 100644
index 00000000000000..cb7fc97a39b9fb
--- /dev/null
+++ b/metadata-ingestion/docs/sources/datahub/datahub_recipe.yml
@@ -0,0 +1,30 @@
+pipeline_name: datahub_source_1
+datahub_api:
+ server: "http://localhost:8080" # Migrate data from DataHub instance on localhost:8080
+ token: ""
+source:
+ type: datahub
+ config:
+ include_all_versions: false
+ database_connection:
+ scheme: "mysql+pymysql" # or "postgresql+psycopg2" for Postgres
+ host_port: ":"
+ username: ""
+ password: "