Updated
dat-a-man committed Mar 13, 2024
1 parent 40e199d commit 42d69e0
Showing 1 changed file with 25 additions and 38 deletions.
63 changes: 25 additions & 38 deletions docs/website/docs/general-usage/schema-evolution.md
@@ -21,11 +21,10 @@ As the structure of data changes, such as the addition of new columns, changing
## Inferring a schema from nested data
The first run of a pipeline will scan the data that goes through it and generate a schema. To convert nested data into relational format, dlt flattens dictionaries and unpacks nested lists into sub-tables.
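
The flattening described above can be sketched in plain Python. This is only a rough illustration of the idea, not `dlt`'s actual implementation: the `normalize` function, the `org` table name, and the sample values are assumptions made for this example, and the double-underscore separator is meant to mirror how `dlt` names flattened columns and sub-tables. Bookkeeping such as the keys that link child rows to their parent is omitted.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]
Tables = Dict[str, List[Row]]


def normalize(record: Row, table: str, tables: Optional[Tables] = None) -> Tables:
    """Flatten `record` into one row of `table` (hypothetical helper).

    Nested dictionaries become `parent__child` columns; nested lists
    become rows in a `<table>__<key>` sub-table.
    """
    tables = {} if tables is None else tables
    row: Row = {}
    # Register the parent row first; it is filled in place below.
    tables.setdefault(table, []).append(row)

    def flatten(value: Row, prefix: str) -> None:
        for key, item in value.items():
            # Simplified name normalisation: lower case, spaces to underscores.
            column = f"{prefix}{key}".lower().replace(" ", "_")
            if isinstance(item, dict):
                flatten(item, f"{column}__")
            elif isinstance(item, list):
                for child in item:
                    # Scalars in a list are wrapped so every child is a row.
                    child_row = child if isinstance(child, dict) else {"value": child}
                    normalize(child_row, f"{table}__{column}", tables)
            else:
                row[column] = item

    flatten(record, "")
    return tables


# Sample record; the values are illustrative assumptions.
record = {
    "organization": "Tech Innovations Inc.",
    "address": {"building": "r&d", "room": 7890},
    "Inventory": [
        {"name": "Plasma ray", "inventory nr": 2411},
        {"name": "Self-aware Roomba", "inventory nr": 268},
    ],
}

tables = normalize(record, "org")
for name, rows in tables.items():
    print(name, rows)
```

In this sketch the parent table keeps only scalar columns, and each element of `Inventory` becomes its own row in `org__inventory`.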

We’ll review some examples here and figure out how `dlt` creates initial schema and how normalisation works. Let’s start by running a simple pipeline with organizations and department details in the data resource. Here’s the resource:
We’ll review some examples here and figure out how `dlt` creates initial schema and how normalisation works. Let's begin by creating a pipeline that loads the following data:

```python
# Define a data resource using 'dlt.resource' with a schema contract
yield {
{
"organization": "Tech Innovations Inc.",
"address": {
'building': 'r&d',
@@ -44,7 +43,7 @@ The schema of data above is loaded to the destination as follows:

### What did the schema inference engine do?

As you can above the `dlt's` inference engine generates the structure of the data based on the source and provided hints . It normalizes the data, creates tables and columns and infers data types.
As you can see above, `dlt`'s inference engine generates the structure of the data based on the source and provided hints. It normalizes the data, creates tables and columns, and infers data types.

For more information, you can refer to the **[Schema](https://dlthub.com/docs/general-usage/schema)** and **[Adjust a Schema](https://dlthub.com/docs/walkthroughs/adjust-a-schema)** sections in the documentation.

@@ -59,30 +58,25 @@ Let’s add the following 4 cases:
- A column is removed: a field named “room” was commented out/removed.
- A column is renamed: a field “building” was renamed to “main_block”.

Here’s the resource:
Please update the pipeline for the cases discussed above.
```python
# Define a data resource using 'dlt.resource' with a schema contract set to 'evolve'

yield {
"organization": "Tech Innovations Inc.",
# Column added:
"CEO": "Alice Smith",
"address": {
# 'building' renamed to 'main_block'
'main_block': 'r&d',
# Removed room column
# "room": 7890,
},
"Inventory": [
# Type change: 'inventory_nr' changed to string from int
{"name": "Plasma ray", "inventory nr": "AR2411"},
{"name": "Self-aware Roomba", "inventory nr": "AR268"},
{"name": "Type-inferrer", "inventory nr": "AR3621"}
]
}


# Create and run the `dlt` pipeline
{
"organization": "Tech Innovations Inc.",
# Column added:
"CEO": "Alice Smith",
"address": {
# 'building' renamed to 'main_block'
'main_block': 'r&d',
# Removed room column
# "room": 7890,
},
"Inventory": [
# Type change: 'inventory_nr' changed to string from int
{"name": "Plasma ray", "inventory nr": "AR2411"},
{"name": "Self-aware Roomba", "inventory nr": "AR268"},
{"name": "Type-inferrer", "inventory nr": "AR3621"}
]
}
```

Let’s load the data and look at the tables:
@@ -137,7 +131,7 @@ This script sends Slack notifications for data schema updates using the 'send_sl

## How to control evolution

DLT allows schema evolution control via its schema and data contracts. Refer to our **[documentation](https://dlthub.com/docs/general-usage/schema-contracts)** for details.
`dlt` allows schema evolution control via its schema and data contracts. Refer to our **[documentation](https://dlthub.com/docs/general-usage/schema-contracts)** for details.

### How to test for removed columns - applying “not null” constraint

@@ -168,12 +162,11 @@ During pipeline execution a data validation error indicates that a removed colum

## Some schema changes in the data

The data schema mentioned above is modified. The changes include:
The data in the pipeline mentioned above is modified.

- Updated data resource now includes key 'specifications' within 'details', nested in 'Inventory'.
- Updated data pipeline now includes key 'specifications' within 'details', nested in 'Inventory'.
```python
## Define a data resource using dlt.resource with a schema contract set to evolve
yield {
{
"organization": "Tech Innovations Inc.",
"CEO": "Alice Smith",
"address": {'main_block': 'r&d'},
@@ -193,16 +186,10 @@ The data schema mentioned above is modified. The changes include:
}
]
}

# Create and run the `dlt` pipeline
```
The schema of the data above is loaded to the destination as follows:
<iframe width="560" height="315" src='https://dbdiagram.io/e/65e80b31cd45b569fba33169/65e81055cd45b569fba3aa20'> </iframe>

:::note
Please note how `dlt` infers deeply nested schema.
:::

## What did the schema evolution engine do?

The schema evolution engine in the `dlt` library is designed to handle changes in the structure of your data over time. For example: