Merge remote-tracking branch 'origin/devel' into docs/update_filesystem_docs
dat-a-man committed May 24, 2024
2 parents afe18bb + 7c07c67 commit 1762f09
Showing 15 changed files with 251 additions and 182 deletions.
24 changes: 24 additions & 0 deletions CONTRIBUTING.md
@@ -52,6 +52,29 @@ We use **master** branch for hot fixes (including documentation) that needs to b

On the release day, **devel** branch is merged into **master**. All releases of `dlt` happen only from the **master**.

### Branch naming rules

We want to make sure that our git history explains in a human-readable way what has been changed with which branch or PR. To this end, we use the following branch naming pattern (all lowercase with dashes, no underscores):

```sh
{category}/{ticket-id}-description-of-the-branch
# example:
feat/4922-add-avro-support
```

#### Branch categories

* **feat** - a new feature that is being implemented (ticket required)
* **fix** - a change that fixes a bug (ticket required)
* **exp** - an experiment where we are testing a new idea or want to demonstrate something to the team, might turn into a `feat` later (ticket encouraged)
* **test** - anything related to the tests (ticket encouraged)
* **blogs** - a new entry to our blog (ticket optional)
* **docs** - a change to our docs (ticket optional)

#### Ticket Numbers

We encourage you to attach your branches to a ticket; if none exists, create one and explain what you are doing. For `feat` and `fix` branches, tickets are mandatory; for `exp` and `test` branches, they are encouraged; for `blogs` and `docs` branches, they are optional.

### Submitting a hotfix
We'll fix critical bugs and release `dlt` outside of the regular schedule. Follow the regular procedure, but make your PR against the **master** branch. Please ping us on Slack if you do.

@@ -166,3 +189,4 @@ Once the version has been bumped, follow these steps to publish the new release
- [Poetry Documentation](https://python-poetry.org/docs/)

If you have any questions or need help, don't hesitate to reach out to us. We're here to help you succeed in contributing to `dlt`. Happy coding!
****
9 changes: 7 additions & 2 deletions dlt/sources/helpers/rest_client/detector.py
@@ -1,5 +1,6 @@
import re
from typing import List, Dict, Any, Tuple, Union, Optional, Callable, Iterable
from pathlib import PurePosixPath
from typing import List, Dict, Any, Tuple, Union, Callable, Iterable
from urllib.parse import urlparse

from requests import Response
@@ -25,6 +26,7 @@
"payload",
"content",
"objects",
"values",
]
)

@@ -46,7 +48,10 @@

def single_entity_path(path: str) -> bool:
"""Checks if path ends with path param indicating that single object is returned"""
return re.search(r"\{([a-zA-Z_][a-zA-Z0-9_]*)\}/?$", path) is not None
# get last path segment
name = PurePosixPath(path).name
# alphabet for a name taken from https://github.com/OAI/OpenAPI-Specification/blob/main/versions/3.0.3.md#fixed-fields-6
return re.search(r"\{([a-zA-Z0-9\.\-_]+)\}", name) is not None


def matches_any_pattern(key: str, patterns: Iterable[str]) -> bool:
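
For context, a hedged sketch of how the updated check behaves; `single_entity_path` is re-implemented standalone here for illustration, mirroring the new code above:

```py
import re
from pathlib import PurePosixPath


def single_entity_path(path: str) -> bool:
    """Return True if the last path segment contains a path parameter such as {id}."""
    name = PurePosixPath(path).name
    return re.search(r"\{([a-zA-Z0-9\.\-_]+)\}", name) is not None


assert single_entity_path("/posts/{post_id}")               # single object
assert single_entity_path("/posts/{post-id}/")              # trailing slash is fine
assert not single_entity_path("/posts/{post_id}/comments")  # param not in the last segment
```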
@@ -8,10 +8,10 @@
In this example, you'll find a Python script that demonstrates how to load to BigQuery with the custom destination.
We'll learn how to:
- Use [built-in credentials](../general-usage/credentials/config_specs#gcp-credentials)
- Use the [custom destination](../dlt-ecosystem/destinations/destination.md)
- Use pyarrow tables to create complex column types on BigQuery
- Use BigQuery `autodetect=True` for schema inference from parquet files
- Use [built-in credentials.](../general-usage/credentials/config_specs#gcp-credentials)
- Use the [custom destination.](../dlt-ecosystem/destinations/destination.md)
- Use pyarrow tables to create complex column types on BigQuery.
- Use BigQuery `autodetect=True` for schema inference from parquet files.
"""

2 changes: 1 addition & 1 deletion docs/website/docs/dlt-ecosystem/destinations/snowflake.md
@@ -9,7 +9,7 @@ keywords: [Snowflake, destination, data warehouse]
## Install `dlt` with Snowflake
**To install the `dlt` library with Snowflake dependencies, run:**
```sh
pip install dlt[snowflake]
pip install "dlt[snowflake]"
```

## Setup Guide
36 changes: 33 additions & 3 deletions docs/website/docs/dlt-ecosystem/verified-sources/rest_api.md
@@ -282,7 +282,7 @@ The fields in the endpoint configuration are:
- `json`: The JSON payload to be sent with the request (for POST and PUT requests).
- `paginator`: Pagination configuration for the endpoint. See the [pagination](#pagination) section for more details.
- `data_selector`: A JSONPath to select the data from the response. See the [data selection](#data-selection) section for more details.
- `response_actions`: A list of actions that define how to process the response data.
- `response_actions`: A list of actions that define how to process the response data. See the [response actions](#response-actions) section for more details.
- `incremental`: Configuration for [incremental loading](#incremental-loading).
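
For orientation, here is a hedged sketch of a single endpoint configuration combining several of these fields; the path, params, and selector values are made up for illustration, and the paginator and incremental settings are covered in their own sections below:

```py
{
    "path": "issues",
    "method": "GET",
    "params": {"state": "open", "per_page": 100},
    "data_selector": "results",
    "response_actions": [
        {"status_code": 404, "action": "ignore"},
    ],
}
```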

### Pagination
@@ -414,8 +414,8 @@ Available authentication types:
| Authentication class | String Alias (`type`) | Description |
| ------------------- | ----------- | ----------- |
| [BearTokenAuth](../../general-usage/http/rest-client.md#bearer-token-authentication) | `bearer` | Bearer token authentication. |
| [HTTPBasicAuth](../../general-usage/http/rest-client.md#http-basic-authentication) | `api_key` | Basic HTTP authentication. |
| [APIKeyAuth](../../general-usage/http/rest-client.md#api-key-authentication) | `http_basic` | API key authentication with key defined in the query parameters or in the headers. |
| [HTTPBasicAuth](../../general-usage/http/rest-client.md#http-basic-authentication) | `http_basic` | Basic HTTP authentication. |
| [APIKeyAuth](../../general-usage/http/rest-client.md#api-key-authentication) | `api_key` | API key authentication with key defined in the query parameters or in the headers. |

To specify the authentication configuration, use the `auth` field in the [client](#client) configuration:
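
The canonical example is collapsed in this diff; as a hedged sketch (assuming `dlt` is imported for the secret lookup, and with an illustrative secret path), it looks roughly like this:

```py
{
    "client": {
        "base_url": "https://api.example.com",
        "auth": {
            "type": "bearer",  # one of the string aliases from the table above
            "token": dlt.secrets["sources.my_api.access_token"],  # illustrative secret location
        },
    },
    # resource and endpoint configuration continues here
}
```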

Expand Down Expand Up @@ -586,3 +586,33 @@ See the [incremental loading](../../general-usage/incremental-loading.md#increme
- `root_key` (bool): Enables merging on all resources by propagating root foreign key to child tables. This option is most useful if you plan to change write disposition of a resource to disable/enable merge. Defaults to False.
- `schema_contract`: Schema contract settings that will be applied to this resource.
- `spec`: A specification of configuration and secret values required by the source.

### Response actions

The `response_actions` field in the endpoint configuration allows you to specify how to handle specific responses from the API based on status codes or content substrings. This is useful for handling edge cases like ignoring responses under specific conditions.

:::caution Experimental Feature
This is an experimental feature and may change in future releases.
:::

#### Example

```py
{
"path": "issues",
"response_actions": [
{"status_code": 404, "action": "ignore"},
{"content": "Not found", "action": "ignore"},
{"status_code": 200, "content": "some text", "action": "ignore"},
],
}
```

In this example, the source will ignore responses with a status code of 404, responses whose content contains "Not found", and responses with a status code of 200 _and_ content containing "some text".

**Fields:**

- `status_code` (int, optional): The HTTP status code to match.
- `content` (str, optional): A substring to search for in the response content.
- `action` (str): The action to take when the condition is met. Currently supported actions:
- `ignore`: Ignore the response.
55 changes: 18 additions & 37 deletions docs/website/docs/general-usage/destination-tables.md
@@ -74,7 +74,7 @@ pipeline = dlt.pipeline(
load_info = pipeline.run(users)
```

The result will be the same, but the table is implicitly named `users` based on the resource name.
The result will be the same; note that we do not explicitly pass `table_name="users"` to `pipeline.run`, and the table is implicitly named `users` based on the resource name (e.g., `users()` decorated with `@dlt.resource`).

:::note

@@ -117,9 +117,7 @@ pipeline = dlt.pipeline(
load_info = pipeline.run(data, table_name="users")
```

Running this pipeline will create two tables in the destination, `users` and `users__pets`. The
`users` table will contain the top level data, and the `users__pets` table will contain the child
data. Here is what the tables may look like:
Running this pipeline will create two tables in the destination, `users` and `users__pets`. The `users` table will contain the top-level data, and the `users__pets` table will contain the child data. Here is what the tables may look like:

**mydata.users**

@@ -141,21 +139,14 @@ creating and linking children and parent tables.

This is how it works:

1. Each row in all (top level and child) data tables created by `dlt` contains UNIQUE column named
`_dlt_id`.
1. Each child table contains FOREIGN KEY column `_dlt_parent_id` linking to a particular row
(`_dlt_id`) of a parent table.
1. Rows in child tables come from the lists: `dlt` stores the position of each item in the list in
`_dlt_list_idx`.
1. For tables that are loaded with the `merge` write disposition, we add a ROOT KEY column
`_dlt_root_id`, which links child table to a row in top level table.

1. Each row in all (top level and child) data tables created by `dlt` contains a `UNIQUE` column named `_dlt_id`.
1. Each child table contains a `FOREIGN KEY` column `_dlt_parent_id` linking to a particular row (`_dlt_id`) of a parent table.
1. Rows in child tables come from the lists: `dlt` stores the position of each item in the list in `_dlt_list_idx`.
1. For tables that are loaded with the `merge` write disposition, we add a root key column `_dlt_root_id`, which links the child table to a row in the top-level table.

:::note

If you define your own primary key in a child table, it will be used to link to parent table
and the `_dlt_parent_id` and `_dlt_list_idx` will not be added. `_dlt_id` is always added even in
case the primary key or other unique columns are defined.
If you define your own primary key in a child table, it will be used to link to the parent table, and the `_dlt_parent_id` and `_dlt_list_idx` will not be added. `_dlt_id` is always added even if the primary key or other unique columns are defined.

:::
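
As a hedged illustration of these linking columns (the identifier values are made up; real `_dlt_id` values are generated by `dlt`):

```py
# One input row loaded into the `users` table:
{"id": 1, "name": "Alice", "pets": [{"id": 1, "name": "Fluffy"}, {"id": 2, "name": "Spot"}]}

# Resulting rows, schematically:
# mydata.users:        id=1, name="Alice",  _dlt_id="wX3f9aQ..."
# mydata.users__pets:  id=1, name="Fluffy", _dlt_id="a1B2c3d...",
#                      _dlt_parent_id="wX3f9aQ...",  # the parent's _dlt_id
#                      _dlt_list_idx=0               # position in the `pets` list
#                      (and a second row for "Spot" with _dlt_list_idx=1)
```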

@@ -164,17 +155,15 @@
During a pipeline run, dlt [normalizes both table and column names](schema.md#naming-convention) to ensure compatibility with the destination database's accepted format. All names from your source data will be transformed into snake_case and will only include alphanumeric characters. Please be aware that the names in the destination database may differ somewhat from those in your original input.

### Variant columns
If your data has inconsistent types, `dlt` will dispatch the data to several **variant columns**. For example, if you have a resource (ie json file) with a filed with name **answer** and your data contains boolean values, you will get get a column with name **answer** of type **BOOLEAN** in your destination. If for some reason, on next load you get integer value and string value in **answer**, the inconsistent data will go to **answer__v_bigint** and **answer__v_text** columns respectively.
The general naming rule for variant columns is `<original name>__v_<type>` where `original_name` is the existing column name (with data type clash) and `type` is the name of data type stored in the variant.

If your data has inconsistent types, `dlt` will dispatch the data to several **variant columns**. For example, if you have a resource (i.e., JSON file) with a field with name `answer` and your data contains boolean values, you will get a column with name `answer` of type `BOOLEAN` in your destination. If for some reason, on the next load, you get integer and string values in `answer`, the inconsistent data will go to `answer__v_bigint` and `answer__v_text` columns respectively.
The general naming rule for variant columns is `<original name>__v_<type>` where `original_name` is the existing column name (with data type clash) and `type` is the name of the data type stored in the variant.
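
A hedged, minimal reproduction of this behavior (the destination, table, and values are arbitrary; the variant columns appear on the second run, when the types clash):

```py
import dlt

pipeline = dlt.pipeline(
    pipeline_name="variant_demo",
    destination="duckdb",
    dataset_name="mydata",
)

# First load: `answer` arrives as a boolean -> column `answer` of type BOOLEAN.
pipeline.run([{"id": 1, "answer": True}], table_name="survey")

# Second load: the same field arrives as an integer and as text -> the values land
# in the variant columns `answer__v_bigint` and `answer__v_text` respectively.
pipeline.run([{"id": 2, "answer": 42}, {"id": 3, "answer": "maybe"}], table_name="survey")
```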

## Load Packages and Load IDs

Each execution of the pipeline generates one or more load packages. A load package typically contains data retrieved from
all the [resources](glossary.md#resource) of a particular [source](glossary.md#source).
These packages are uniquely identified by a `load_id`. The `load_id` of a particular package is added to the top data tables
(referenced as `_dlt_load_id` column in the example above) and to the special `_dlt_loads` table with a status 0
(when the load process is fully completed).
(referenced as `_dlt_load_id` column in the example above) and to the special `_dlt_loads` table with a status of 0 (when the load process is fully completed).

To illustrate this, let's load more data into the same destination:

@@ -189,8 +178,7 @@ data = [
```

The rest of the pipeline definition remains the same. Running this pipeline will create a new load
package with a new `load_id` and add the data to the existing tables. The `users` table will now
look like this:
package with a new `load_id` and add the data to the existing tables. The `users` table will now look like this:

**mydata.users**

@@ -210,12 +198,12 @@ The `_dlt_loads` table will look like this:
| **1234563456.12345** | quick_start | 0 | 2023-09-12 16:46:03.10662+00 | aOEb...Qekd/58= |

The `_dlt_loads` table tracks complete loads and allows chaining transformations on top of them.
Many destinations do not support distributed and long-running transactions (e.g. Amazon Redshift).
Many destinations do not support distributed and long-running transactions (e.g., Amazon Redshift).
In that case, the user may see the partially loaded data. It is possible to filter such data out: any
row with a `load_id` that does not exist in `_dlt_loads` is not yet completed. The same procedure may be used to identify
and delete data for packages that never got completed.
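
A hedged sketch of such a filter, reusing the `pipeline` object from the examples above and assuming a SQL destination with the default `dlt` table and column names:

```py
# Keep only rows whose load package completed, i.e. whose `_dlt_load_id`
# appears in `_dlt_loads` with status 0.
with pipeline.sql_client() as client:
    completed_users = client.execute_sql(
        """
        SELECT u.*
        FROM users AS u
        JOIN _dlt_loads AS l
          ON l.load_id = u._dlt_load_id AND l.status = 0
        """
    )
```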

For each load, you can test and [alert](../running-in-production/alerting.md) on anomalies (e.g.
For each load, you can test and [alert](../running-in-production/alerting.md) on anomalies (e.g.,
no data, too much loaded to a table). There are also some useful load stats in the `Load info` tab
of the [Streamlit app](../dlt-ecosystem/visualizations/exploring-the-data.md#exploring-the-data)
mentioned above.
@@ -231,8 +219,7 @@ Data lineage can be super relevant for architectures like the
[data vault architecture](https://www.data-vault.co.uk/what-is-data-vault/) or when troubleshooting.
The data vault architecture is a data warehouse that large organizations use when representing the
same process across multiple systems, which adds data lineage requirements. Using the pipeline name
and `load_id` provided out of the box by `dlt`, you are able to identify the source and time of
data.
and `load_id` provided out of the box by `dlt`, you are able to identify the source and time of data.

You can [save](../running-in-production/running.md#inspect-and-save-the-load-info-and-trace)
complete lineage info for a particular `load_id` including a list of loaded files, error messages
@@ -242,11 +229,7 @@ problems.
## Staging dataset

So far we've been using the `append` write disposition in our example pipeline. This means that
each time we run the pipeline, the data is appended to the existing tables. When you use [the
merge write disposition](incremental-loading.md), dlt creates a staging database schema for
staging data. This schema is named `<dataset_name>_staging` and contains the same tables as the
destination schema. When you run the pipeline, the data from the staging tables is loaded into the
destination tables in a single atomic transaction.
each time we run the pipeline, the data is appended to the existing tables. When you use the [merge write disposition](incremental-loading.md), dlt creates a staging database schema for staging data. This schema is named `<dataset_name>_staging` and contains the same tables as the destination schema. When you run the pipeline, the data from the staging tables is loaded into the destination tables in a single atomic transaction.

Let's illustrate this with an example. We change our pipeline to use the `merge` write disposition:

@@ -270,8 +253,7 @@
```

Running this pipeline will create a schema in the destination database with the name `mydata_staging`.
If you inspect the tables in this schema, you will find `mydata_staging.users` table identical to the
`mydata.users` table in the previous example.
If you inspect the tables in this schema, you will find the `mydata_staging.users` table identical to the `mydata.users` table in the previous example.

Here is what the tables may look like after running the pipeline:

@@ -290,8 +272,7 @@
| 2 | Bob 2 | rX8ybgTeEmAmmA | 2345672350.98417 |
| 3 | Charlie | h8lehZEvT3fASQ | 1234563456.12345 |

Notice that the `mydata.users` table now contains the data from both the previous pipeline run and
the current one.
Notice that the `mydata.users` table now contains the data from both the previous pipeline run and the current one.

## Versioned datasets

@@ -322,4 +303,4 @@ load_info = pipeline.run(data, table_name="users")
Every time you run this pipeline, a new schema will be created in the destination database with a
datetime-based suffix. The data will be loaded into tables in this schema.
For example, the first time you run the pipeline, the schema will be named
`mydata_20230912064403`, the second time it will be named `mydata_20230912064407`, and so on.
`mydata_20230912064403`, the second time it will be named `mydata_20230912064407`, and so on.
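
For reference, a hedged sketch of a pipeline set up this way; it assumes the datetime-suffixed datasets are switched on with the `full_refresh` flag, and the destination is illustrative:

```py
import dlt

# Assumption: `full_refresh=True` makes every run load into a fresh,
# datetime-suffixed dataset such as `mydata_20230912064403` instead of `mydata`.
pipeline = dlt.pipeline(
    pipeline_name="quick_start",
    destination="duckdb",
    dataset_name="mydata",
    full_refresh=True,
)

load_info = pipeline.run(
    [{"id": 1, "name": "Alice"}],
    table_name="users",
)
```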
4 changes: 2 additions & 2 deletions docs/website/docs/general-usage/http/rest-client.md
@@ -585,7 +585,7 @@ from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.auth import BearerTokenAuth

client = RESTClient(base_url="https://api.example.com")
response = client.get("/posts", auth=BearerTokenAuth(token="your_access_token"))
response = client.get("/posts", auth=BearerTokenAuth(token="your_access_token")) # type: ignore

print(response.status_code)
print(response.headers)
@@ -632,7 +632,7 @@ def response_hook(response, **kwargs):

for page in client.paginate(
"/posts",
auth=BearerTokenAuth(token="your_access_token"),
auth=BearerTokenAuth(token="your_access_token"), # type: ignore
hooks={"response": [response_hook]}
):
print(page)
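
For context, a hedged sketch of what a `response_hook` callback like the one referenced above could do; the body shown in the collapsed docs lines may differ:

```py
def response_hook(response, **kwargs):
    # Illustrative only: log rate-limit headers and fail fast on HTTP errors.
    remaining = response.headers.get("X-RateLimit-Remaining")
    if remaining is not None:
        print(f"Rate limit remaining: {remaining}")
    response.raise_for_status()
```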