Merge remote-tracking branch 'origin/devel' into docs/update_filesystem_docs
dat-a-man committed May 24, 2024
2 parents afe18bb + 7c07c67 commit 1762f09
Showing 15 changed files with 251 additions and 182 deletions.
24 changes: 24 additions & 0 deletions CONTRIBUTING.md
@@ -52,6 +52,29 @@ We use **master** branch for hot fixes (including documentation) that needs to b

On the release day, **devel** branch is merged into **master**. All releases of `dlt` happen only from the **master**.

### Branch naming rules

We want to make sure that our git history explains in a human-readable way what has been changed with which branch or PR. To this end, we use the following branch naming pattern (all lowercase with dashes, no underscores):

```sh
{category}/{ticket-id}-description-of-the-branch
# example:
feat/4922-add-avro-support
```

#### Branch categories

* **feat** - a new feature that is being implemented (ticket required)
* **fix** - a change that fixes a bug (ticket required)
* **exp** - an experiment where we are testing a new idea or want to demonstrate something to the team, might turn into a `feat` later (ticket encouraged)
* **test** - anything related to the tests (ticket encouraged)
* **blogs** - a new entry to our blog (ticket optional)
* **docs** - a change to our docs (ticket optional)

#### Ticket Numbers

We encourage you to attach your branches to a ticket; if none exists, create one and explain what you are doing. For `feat` and `fix` branches, tickets are mandatory; for `exp` and `test` branches, they are encouraged; for `blogs` and `docs` branches, they are optional.

### Submitting a hotfix
We'll fix critical bugs and release `dlt` outside of the regular schedule. Follow the regular procedure, but make your PR against the **master** branch. Please ping us on Slack if you do.

@@ -166,3 +189,4 @@ Once the version has been bumped, follow these steps to publish the new release
- [Poetry Documentation](https://python-poetry.org/docs/)

If you have any questions or need help, don't hesitate to reach out to us. We're here to help you succeed in contributing to `dlt`. Happy coding!
****
9 changes: 7 additions & 2 deletions dlt/sources/helpers/rest_client/detector.py
@@ -1,5 +1,6 @@
import re
from typing import List, Dict, Any, Tuple, Union, Optional, Callable, Iterable
from pathlib import PurePosixPath
from typing import List, Dict, Any, Tuple, Union, Callable, Iterable
from urllib.parse import urlparse

from requests import Response
@@ -25,6 +26,7 @@
"payload",
"content",
"objects",
"values",
]
)

@@ -46,7 +48,10 @@

def single_entity_path(path: str) -> bool:
"""Checks if path ends with path param indicating that single object is returned"""
return re.search(r"\{([a-zA-Z_][a-zA-Z0-9_]*)\}/?$", path) is not None
# get last path segment
name = PurePosixPath(path).name
# alphabet for a name taken from https://github.com/OAI/OpenAPI-Specification/blob/main/versions/3.0.3.md#fixed-fields-6
return re.search(r"\{([a-zA-Z0-9\.\-_]+)\}", name) is not None


def matches_any_pattern(key: str, patterns: Iterable[str]) -> bool:
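
For context, a hedged sketch of how the updated check behaves; `single_entity_path` is re-implemented standalone here for illustration, mirroring the new code above:

```py
import re
from pathlib import PurePosixPath


def single_entity_path(path: str) -> bool:
    """Return True if the last path segment contains a path parameter such as {id}."""
    name = PurePosixPath(path).name
    return re.search(r"\{([a-zA-Z0-9\.\-_]+)\}", name) is not None


assert single_entity_path("/posts/{post_id}")               # single object
assert single_entity_path("/posts/{post-id}/")              # trailing slash is fine
assert not single_entity_path("/posts/{post_id}/comments")  # param not in the last segment
```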
@@ -8,10 +8,10 @@
In this example, you'll find a Python script that demonstrates how to load to BigQuery with the custom destination.
We'll learn how to:
- Use [built-in credentials](../general-usage/credentials/config_specs#gcp-credentials)
- Use the [custom destination](../dlt-ecosystem/destinations/destination.md)
- Use pyarrow tables to create complex column types on BigQuery
- Use BigQuery `autodetect=True` for schema inference from parquet files
- Use [built-in credentials.](../general-usage/credentials/config_specs#gcp-credentials)
- Use the [custom destination.](../dlt-ecosystem/destinations/destination.md)
- Use pyarrow tables to create complex column types on BigQuery.
- Use BigQuery `autodetect=True` for schema inference from parquet files.
"""

2 changes: 1 addition & 1 deletion docs/website/docs/dlt-ecosystem/destinations/snowflake.md
@@ -9,7 +9,7 @@ keywords: [Snowflake, destination, data warehouse]
## Install `dlt` with Snowflake
**To install the `dlt` library with Snowflake dependencies, run:**
```sh
pip install dlt[snowflake]
pip install "dlt[snowflake]"
```

## Setup Guide
36 changes: 33 additions & 3 deletions docs/website/docs/dlt-ecosystem/verified-sources/rest_api.md
@@ -282,7 +282,7 @@ The fields in the endpoint configuration are:
- `json`: The JSON payload to be sent with the request (for POST and PUT requests).
- `paginator`: Pagination configuration for the endpoint. See the [pagination](#pagination) section for more details.
- `data_selector`: A JSONPath to select the data from the response. See the [data selection](#data-selection) section for more details.
- `response_actions`: A list of actions that define how to process the response data.
- `response_actions`: A list of actions that define how to process the response data. See the [response actions](#response-actions) section for more details.
- `incremental`: Configuration for [incremental loading](#incremental-loading).
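
For orientation, here is a hedged sketch of a single endpoint configuration combining several of these fields; the path, params, and selector values are made up for illustration, and the paginator and incremental settings are covered in their own sections below:

```py
{
    "path": "issues",
    "method": "GET",
    "params": {"state": "open", "per_page": 100},
    "data_selector": "results",
    "response_actions": [
        {"status_code": 404, "action": "ignore"},
    ],
}
```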

### Pagination
@@ -414,8 +414,8 @@ Available authentication types:
| Authentication class | String Alias (`type`) | Description |
| ------------------- | ----------- | ----------- |
| [BearTokenAuth](../../general-usage/http/rest-client.md#bearer-token-authentication) | `bearer` | Bearer token authentication. |
| [HTTPBasicAuth](../../general-usage/http/rest-client.md#http-basic-authentication) | `api_key` | Basic HTTP authentication. |
| [APIKeyAuth](../../general-usage/http/rest-client.md#api-key-authentication) | `http_basic` | API key authentication with key defined in the query parameters or in the headers. |
| [HTTPBasicAuth](../../general-usage/http/rest-client.md#http-basic-authentication) | `http_basic` | Basic HTTP authentication. |
| [APIKeyAuth](../../general-usage/http/rest-client.md#api-key-authentication) | `api_key` | API key authentication with key defined in the query parameters or in the headers. |

To specify the authentication configuration, use the `auth` field in the [client](#client) configuration:
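
The canonical example is collapsed in this diff; as a hedged sketch (assuming `dlt` is imported for the secret lookup, and with an illustrative secret path), it looks roughly like this:

```py
{
    "client": {
        "base_url": "https://api.example.com",
        "auth": {
            "type": "bearer",  # one of the string aliases from the table above
            "token": dlt.secrets["sources.my_api.access_token"],  # illustrative secret location
        },
    },
    # resource and endpoint configuration continues here
}
```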

Expand Down Expand Up @@ -586,3 +586,33 @@ See the [incremental loading](../../general-usage/incremental-loading.md#increme
- `root_key` (bool): Enables merging on all resources by propagating root foreign key to child tables. This option is most useful if you plan to change write disposition of a resource to disable/enable merge. Defaults to False.
- `schema_contract`: Schema contract settings that will be applied to this resource.
- `spec`: A specification of configuration and secret values required by the source.

### Response actions

The `response_actions` field in the endpoint configuration allows you to specify how to handle specific responses from the API based on status codes or content substrings. This is useful for handling edge cases like ignoring responses under specific conditions.

:::caution Experimental Feature
This is an experimental feature and may change in future releases.
:::

#### Example

```py
{
"path": "issues",
"response_actions": [
{"status_code": 404, "action": "ignore"},
{"content": "Not found", "action": "ignore"},
{"status_code": 200, "content": "some text", "action": "ignore"},
],
}
```

In this example, the source will ignore responses with a status code of 404, responses whose content contains "Not found", and responses with a status code of 200 _and_ content containing "some text".

**Fields:**

- `status_code` (int, optional): The HTTP status code to match.
- `content` (str, optional): A substring to search for in the response content.
- `action` (str): The action to take when the condition is met. Currently supported actions:
- `ignore`: Ignore the response.
55 changes: 18 additions & 37 deletions docs/website/docs/general-usage/destination-tables.md
@@ -74,7 +74,7 @@ pipeline = dlt.pipeline(
load_info = pipeline.run(users)
```

The result will be the same, but the table is implicitly named `users` based on the resource name.
The result will be the same; note that we do not explicitly pass `table_name="users"` to `pipeline.run`, and the table is implicitly named `users` based on the resource name (e.g., `users()` decorated with `@dlt.resource`).

:::note

@@ -117,9 +117,7 @@ pipeline = dlt.pipeline(
load_info = pipeline.run(data, table_name="users")
```

Running this pipeline will create two tables in the destination, `users` and `users__pets`. The
`users` table will contain the top level data, and the `users__pets` table will contain the child
data. Here is what the tables may look like:
Running this pipeline will create two tables in the destination, `users` and `users__pets`. The `users` table will contain the top-level data, and the `users__pets` table will contain the child data. Here is what the tables may look like:

**mydata.users**

@@ -141,21 +139,14 @@ creating and linking children and parent tables.

This is how it works:

1. Each row in all (top level and child) data tables created by `dlt` contains UNIQUE column named
`_dlt_id`.
1. Each child table contains FOREIGN KEY column `_dlt_parent_id` linking to a particular row
(`_dlt_id`) of a parent table.
1. Rows in child tables come from the lists: `dlt` stores the position of each item in the list in
`_dlt_list_idx`.
1. For tables that are loaded with the `merge` write disposition, we add a ROOT KEY column
`_dlt_root_id`, which links child table to a row in top level table.

1. Each row in all (top level and child) data tables created by `dlt` contains a `UNIQUE` column named `_dlt_id`.
1. Each child table contains a `FOREIGN KEY` column `_dlt_parent_id` linking to a particular row (`_dlt_id`) of a parent table.
1. Rows in child tables come from the lists: `dlt` stores the position of each item in the list in `_dlt_list_idx`.
1. For tables that are loaded with the `merge` write disposition, we add a root key column `_dlt_root_id`, which links the child table to a row in the top-level table.

:::note

If you define your own primary key in a child table, it will be used to link to parent table
and the `_dlt_parent_id` and `_dlt_list_idx` will not be added. `_dlt_id` is always added even in
case the primary key or other unique columns are defined.
If you define your own primary key in a child table, it will be used to link to the parent table, and the `_dlt_parent_id` and `_dlt_list_idx` will not be added. `_dlt_id` is always added even if the primary key or other unique columns are defined.

:::
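
As a hedged illustration of these linking columns (the identifier values are made up; real `_dlt_id` values are generated by `dlt`):

```py
# One input row loaded into the `users` table:
{"id": 1, "name": "Alice", "pets": [{"id": 1, "name": "Fluffy"}, {"id": 2, "name": "Spot"}]}

# Resulting rows, schematically:
# mydata.users:        id=1, name="Alice",  _dlt_id="wX3f9aQ..."
# mydata.users__pets:  id=1, name="Fluffy", _dlt_id="a1B2c3d...",
#                      _dlt_parent_id="wX3f9aQ...",  # the parent's _dlt_id
#                      _dlt_list_idx=0               # position in the `pets` list
#                      (and a second row for "Spot" with _dlt_list_idx=1)
```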

@@ -164,17 +155,15 @@
During a pipeline run, dlt [normalizes both table and column names](schema.md#naming-convention) to ensure compatibility with the destination database's accepted format. All names from your source data will be transformed into snake_case and will only include alphanumeric characters. Please be aware that the names in the destination database may differ somewhat from those in your original input.

### Variant columns
If your data has inconsistent types, `dlt` will dispatch the data to several **variant columns**. For example, if you have a resource (ie json file) with a filed with name **answer** and your data contains boolean values, you will get get a column with name **answer** of type **BOOLEAN** in your destination. If for some reason, on next load you get integer value and string value in **answer**, the inconsistent data will go to **answer__v_bigint** and **answer__v_text** columns respectively.
The general naming rule for variant columns is `<original name>__v_<type>` where `original_name` is the existing column name (with data type clash) and `type` is the name of data type stored in the variant.

If your data has inconsistent types, `dlt` will dispatch the data to several **variant columns**. For example, if you have a resource (i.e., JSON file) with a field with name `answer` and your data contains boolean values, you will get a column with name `answer` of type `BOOLEAN` in your destination. If for some reason, on the next load, you get integer and string values in `answer`, the inconsistent data will go to `answer__v_bigint` and `answer__v_text` columns respectively.
The general naming rule for variant columns is `<original name>__v_<type>` where `original_name` is the existing column name (with data type clash) and `type` is the name of the data type stored in the variant.
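
A hedged, minimal reproduction of this behavior (the destination, table, and values are arbitrary; the variant columns appear on the second run, when the types clash):

```py
import dlt

pipeline = dlt.pipeline(
    pipeline_name="variant_demo",
    destination="duckdb",
    dataset_name="mydata",
)

# First load: `answer` arrives as a boolean -> column `answer` of type BOOLEAN.
pipeline.run([{"id": 1, "answer": True}], table_name="survey")

# Second load: the same field arrives as an integer and as text -> the values land
# in the variant columns `answer__v_bigint` and `answer__v_text` respectively.
pipeline.run([{"id": 2, "answer": 42}, {"id": 3, "answer": "maybe"}], table_name="survey")
```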

## Load Packages and Load IDs

Each execution of the pipeline generates one or more load packages. A load package typically contains data retrieved from
all the [resources](glossary.md#resource) of a particular [source](glossary.md#source).
These packages are uniquely identified by a `load_id`. The `load_id` of a particular package is added to the top data tables
(referenced as `_dlt_load_id` column in the example above) and to the special `_dlt_loads` table with a status 0
(when the load process is fully completed).
(referenced as `_dlt_load_id` column in the example above) and to the special `_dlt_loads` table with a status of 0 (when the load process is fully completed).

To illustrate this, let's load more data into the same destination:

@@ -189,8 +178,7 @@ data = [
```

The rest of the pipeline definition remains the same. Running this pipeline will create a new load
package with a new `load_id` and add the data to the existing tables. The `users` table will now
look like this:
package with a new `load_id` and add the data to the existing tables. The `users` table will now look like this:

**mydata.users**

@@ -210,12 +198,12 @@ The `_dlt_loads` table will look like this:
| **1234563456.12345** | quick_start | 0 | 2023-09-12 16:46:03.10662+00 | aOEb...Qekd/58= |

The `_dlt_loads` table tracks complete loads and allows chaining transformations on top of them.
Many destinations do not support distributed and long-running transactions (e.g. Amazon Redshift).
Many destinations do not support distributed and long-running transactions (e.g., Amazon Redshift).
In that case, the user may see the partially loaded data. It is possible to filter such data out: any
row with a `load_id` that does not exist in `_dlt_loads` is not yet completed. The same procedure may be used to identify
and delete data for packages that never got completed.
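
A hedged sketch of such a filter, reusing the `pipeline` object from the examples above and assuming a SQL destination with the default `dlt` table and column names:

```py
# Keep only rows whose load package completed, i.e. whose `_dlt_load_id`
# appears in `_dlt_loads` with status 0.
with pipeline.sql_client() as client:
    completed_users = client.execute_sql(
        """
        SELECT u.*
        FROM users AS u
        JOIN _dlt_loads AS l
          ON l.load_id = u._dlt_load_id AND l.status = 0
        """
    )
```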

For each load, you can test and [alert](../running-in-production/alerting.md) on anomalies (e.g.
For each load, you can test and [alert](../running-in-production/alerting.md) on anomalies (e.g.,
no data, too much loaded to a table). There are also some useful load stats in the `Load info` tab
of the [Streamlit app](../dlt-ecosystem/visualizations/exploring-the-data.md#exploring-the-data)
mentioned above.
@@ -231,8 +219,7 @@ Data lineage can be super relevant for architectures like the
[data vault architecture](https://www.data-vault.co.uk/what-is-data-vault/) or when troubleshooting.
The data vault architecture is a data warehouse that large organizations use when representing the
same process across multiple systems, which adds data lineage requirements. Using the pipeline name
and `load_id` provided out of the box by `dlt`, you are able to identify the source and time of
data.
and `load_id` provided out of the box by `dlt`, you are able to identify the source and time of data.

You can [save](../running-in-production/running.md#inspect-and-save-the-load-info-and-trace)
complete lineage info for a particular `load_id` including a list of loaded files, error messages
@@ -242,11 +229,7 @@ problems.
## Staging dataset

So far we've been using the `append` write disposition in our example pipeline. This means that
each time we run the pipeline, the data is appended to the existing tables. When you use [the
merge write disposition](incremental-loading.md), dlt creates a staging database schema for
staging data. This schema is named `<dataset_name>_staging` and contains the same tables as the
destination schema. When you run the pipeline, the data from the staging tables is loaded into the
destination tables in a single atomic transaction.
each time we run the pipeline, the data is appended to the existing tables. When you use the [merge write disposition](incremental-loading.md), dlt creates a staging database schema for staging data. This schema is named `<dataset_name>_staging` and contains the same tables as the destination schema. When you run the pipeline, the data from the staging tables is loaded into the destination tables in a single atomic transaction.

Let's illustrate this with an example. We change our pipeline to use the `merge` write disposition:

@@ -270,8 +253,7 @@
```

Running this pipeline will create a schema in the destination database with the name `mydata_staging`.
If you inspect the tables in this schema, you will find `mydata_staging.users` table identical to the
`mydata.users` table in the previous example.
If you inspect the tables in this schema, you will find the `mydata_staging.users` table identical to the `mydata.users` table in the previous example.

Here is what the tables may look like after running the pipeline:

@@ -290,8 +272,7 @@
| 2 | Bob 2 | rX8ybgTeEmAmmA | 2345672350.98417 |
| 3 | Charlie | h8lehZEvT3fASQ | 1234563456.12345 |

Notice that the `mydata.users` table now contains the data from both the previous pipeline run and
the current one.
Notice that the `mydata.users` table now contains the data from both the previous pipeline run and the current one.

## Versioned datasets

@@ -322,4 +303,4 @@ load_info = pipeline.run(data, table_name="users")
Every time you run this pipeline, a new schema will be created in the destination database with a
datetime-based suffix. The data will be loaded into tables in this schema.
For example, the first time you run the pipeline, the schema will be named
`mydata_20230912064403`, the second time it will be named `mydata_20230912064407`, and so on.
`mydata_20230912064403`, the second time it will be named `mydata_20230912064407`, and so on.
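
For reference, a hedged sketch of a pipeline set up this way; it assumes the datetime-suffixed datasets are switched on with the `full_refresh` flag, and the destination is illustrative:

```py
import dlt

# Assumption: `full_refresh=True` makes every run load into a fresh,
# datetime-suffixed dataset such as `mydata_20230912064403` instead of `mydata`.
pipeline = dlt.pipeline(
    pipeline_name="quick_start",
    destination="duckdb",
    dataset_name="mydata",
    full_refresh=True,
)

load_info = pipeline.run(
    [{"id": 1, "name": "Alice"}],
    table_name="users",
)
```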
4 changes: 2 additions & 2 deletions docs/website/docs/general-usage/http/rest-client.md
@@ -585,7 +585,7 @@ from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.auth import BearerTokenAuth

client = RESTClient(base_url="https://api.example.com")
response = client.get("/posts", auth=BearerTokenAuth(token="your_access_token"))
response = client.get("/posts", auth=BearerTokenAuth(token="your_access_token")) # type: ignore

print(response.status_code)
print(response.headers)
@@ -632,7 +632,7 @@ def response_hook(response, **kwargs):

for page in client.paginate(
"/posts",
auth=BearerTokenAuth(token="your_access_token"),
auth=BearerTokenAuth(token="your_access_token"), # type: ignore
hooks={"response": [response_hook]}
):
print(page)
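
For context, a hedged sketch of what a `response_hook` callback like the one referenced above could do; the body shown in the collapsed docs lines may differ:

```py
def response_hook(response, **kwargs):
    # Illustrative only: log rate-limit headers and fail fast on HTTP errors.
    remaining = response.headers.get("X-RateLimit-Remaining")
    if remaining is not None:
        print(f"Rate limit remaining: {remaining}")
    response.raise_for_status()
```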