From 6e50af16927d55a92474c28b642ae80c665f8f96 Mon Sep 17 00:00:00 2001 From: Dave Date: Tue, 17 Sep 2024 11:21:57 +0200 Subject: [PATCH] more fixes files (no 78) --- .../docs/dlt-ecosystem/destinations/athena.md | 43 +++-- .../dlt-ecosystem/destinations/bigquery.md | 76 ++++----- .../dlt-ecosystem/destinations/clickhouse.md | 62 +++---- .../dlt-ecosystem/destinations/databricks.md | 58 +++---- .../dlt-ecosystem/destinations/destination.md | 43 ++--- .../docs/dlt-ecosystem/destinations/dremio.md | 33 ++-- .../docs/dlt-ecosystem/destinations/duckdb.md | 30 ++-- .../dlt-ecosystem/destinations/filesystem.md | 151 +++++++++--------- .../docs/dlt-ecosystem/destinations/index.md | 1 + .../dlt-ecosystem/destinations/lancedb.md | 28 ++-- .../dlt-ecosystem/destinations/motherduck.md | 21 +-- .../docs/dlt-ecosystem/destinations/mssql.md | 14 +- .../dlt-ecosystem/destinations/postgres.md | 23 ++- .../docs/dlt-ecosystem/destinations/qdrant.md | 6 +- .../dlt-ecosystem/destinations/redshift.md | 28 ++-- .../dlt-ecosystem/destinations/snowflake.md | 76 +++++---- .../dlt-ecosystem/destinations/synapse.md | 23 +-- .../dlt-ecosystem/destinations/weaviate.md | 42 +++-- .../docs/dlt-ecosystem/file-formats/csv.md | 52 +++--- .../file-formats/insert-format.md | 6 +- .../docs/dlt-ecosystem/file-formats/jsonl.md | 8 +- .../dlt-ecosystem/file-formats/parquet.md | 40 +++-- docs/website/docs/dlt-ecosystem/staging.md | 12 +- .../docs/dlt-ecosystem/table-formats/delta.md | 5 +- .../dlt-ecosystem/table-formats/iceberg.md | 5 +- .../dlt-ecosystem/transformations/dbt/dbt.md | 22 ++- .../transformations/dbt/dbt_cloud.md | 13 +- .../dlt-ecosystem/transformations/pandas.md | 8 +- .../docs/dlt-ecosystem/transformations/sql.md | 13 +- .../verified-sources/_source-info-header.md | 3 +- .../verified-sources/amazon_kinesis.md | 36 ++--- .../verified-sources/arrow-pandas.md | 31 ++-- .../dlt-ecosystem/verified-sources/github.md | 53 +++--- .../verified-sources/google_analytics.md | 61 +++---- .../verified-sources/google_sheets.md | 142 +++++++--------- .../dlt-ecosystem/verified-sources/inbox.md | 45 +++--- .../dlt-ecosystem/verified-sources/jira.md | 99 +++++------- .../dlt-ecosystem/verified-sources/kafka.md | 5 +- .../dlt-ecosystem/verified-sources/notion.md | 52 +++--- .../verified-sources/openapi-generator.md | 13 +- .../verified-sources/personio.md | 35 ++-- .../verified-sources/pipedrive.md | 12 +- .../dlt-ecosystem/verified-sources/scrapy.md | 32 ++-- .../dlt-ecosystem/verified-sources/slack.md | 14 +- .../general-usage/credentials/advanced.md | 18 +-- .../credentials/complex_types.md | 49 +++--- .../docs/general-usage/credentials/index.md | 7 +- .../docs/general-usage/credentials/setup.md | 75 ++++----- .../pseudonymizing_columns.md | 14 +- .../customising-pipelines/removing_columns.md | 8 +- .../customising-pipelines/renaming_columns.md | 9 +- .../currency_conversion_data_enrichment.md | 66 +++----- .../url-parser-data-enrichment.md | 42 +++-- .../docs/general-usage/http/overview.md | 9 +- .../docs/general-usage/http/requests.md | 19 +-- .../docs/general-usage/http/rest-client.md | 67 ++++---- docs/website/docs/tutorial/filesystem.md | 37 ++--- .../docs/tutorial/load-data-from-an-api.md | 62 ++++--- docs/website/docs/tutorial/rest-api.md | 16 +- docs/website/docs/tutorial/sql-database.md | 24 +-- 60 files changed, 963 insertions(+), 1134 deletions(-) diff --git a/docs/website/docs/dlt-ecosystem/destinations/athena.md b/docs/website/docs/dlt-ecosystem/destinations/athena.md index e6f99adc48..2b9e68a6fc 100644 --- 
a/docs/website/docs/dlt-ecosystem/destinations/athena.md +++ b/docs/website/docs/dlt-ecosystem/destinations/athena.md @@ -6,7 +6,7 @@ keywords: [aws, athena, glue catalog] # AWS Athena / Glue Catalog -The Athena destination stores data as Parquet files in S3 buckets and creates [external tables in AWS Athena](https://docs.aws.amazon.com/athena/latest/ug/creating-tables.html). You can then query those tables with Athena SQL commands, which will scan the entire folder of Parquet files and return the results. This destination works very similarly to other SQL-based destinations, with the exception that the merge write disposition is not supported at this time. The `dlt` metadata will be stored in the same bucket as the Parquet files, but as iceberg tables. Athena also supports writing individual data tables as Iceberg tables, so they may be manipulated later. A common use case would be to strip GDPR data from them. +The Athena destination stores data as Parquet files in S3 buckets and creates [external tables in AWS Athena](https://docs.aws.amazon.com/athena/latest/ug/creating-tables.html). You can then query those tables with Athena SQL commands, which will scan the entire folder of Parquet files and return the results. This destination works very similarly to other SQL-based destinations, with the exception that the merge write disposition is not supported at this time. The `dlt` metadata will be stored in the same bucket as the Parquet files, but as Iceberg tables. Athena also supports writing individual data tables as Iceberg tables, so they may be manipulated later. A common use case would be to strip GDPR data from them. ## Install dlt with Athena **To install the dlt library with Athena dependencies:** @@ -14,7 +14,7 @@ The Athena destination stores data as Parquet files in S3 buckets and creates [e pip install "dlt[athena]" ``` -## Setup Guide +## Setup guide ### 1. Initialize the dlt project Let's start by initializing a new `dlt` project as follows: @@ -24,7 +24,7 @@ Let's start by initializing a new `dlt` project as follows: > 💡 This command will initialize your pipeline with chess as the source and AWS Athena as the destination using the filesystem staging destination. -### 2. Setup bucket storage and Athena credentials +### 2. Set up bucket storage and Athena credentials First, install dependencies by running: ```sh @@ -44,7 +44,7 @@ pip install pyathena so pip does not fail on backtracking. ::: -To edit the `dlt` credentials file with your secret info, open `.dlt/secrets.toml`. You will need to provide a `bucket_url`, which holds the uploaded parquet files, a `query_result_bucket`, which Athena uses to write query results to, and credentials that have write and read access to these two buckets as well as the full Athena access AWS role. +To edit the `dlt` credentials file with your secret info, open `.dlt/secrets.toml`. You will need to provide a `bucket_url`, which holds the uploaded Parquet files, a `query_result_bucket`, which Athena uses to write query results to, and credentials that have write and read access to these two buckets as well as the full Athena access AWS role. The toml file looks like this: @@ -65,7 +65,7 @@ aws_secret_access_key="please set me up!" # same as credentials for filesystem region_name="please set me up!" 
# set your AWS region, for example "eu-central-1" for Frankfurt ``` -If you have your credentials stored in `~/.aws/credentials`, just remove the **[destination.filesystem.credentials]** and **[destination.athena.credentials]** section above and `dlt` will fall back to your **default** profile in local credentials. If you want to switch the profile, pass the profile name as follows (here: `dlt-ci-user`): +If you have your credentials stored in `~/.aws/credentials`, just remove the **[destination.filesystem.credentials]** and **[destination.athena.credentials]** sections above and `dlt` will fall back to your **default** profile in local credentials. If you want to switch the profile, pass the profile name as follows (here: `dlt-ci-user`): ```toml [destination.filesystem.credentials] profile_name="dlt-ci-user" @@ -74,7 +74,7 @@ profile_name="dlt-ci-user" profile_name="dlt-ci-user" ``` -## Additional Destination Configuration +## Additional destination configuration You can provide an Athena workgroup like so: ```toml @@ -87,26 +87,26 @@ athena_work_group="my_workgroup" The `athena` destination handles the write dispositions as follows: - `append` - files belonging to such tables are added to the dataset folder. - `replace` - all files that belong to such tables are deleted from the dataset folder, and then the current set of files is added. -- `merge` - falls back to `append` (unless you're using [iceberg](#iceberg-data-tables) tables). +- `merge` - falls back to `append` (unless you're using [Iceberg](#iceberg-data-tables) tables). ## Data loading -Data loading happens by storing parquet files in an S3 bucket and defining a schema on Athena. If you query data via SQL queries on Athena, the returned data is read by scanning your bucket and reading all relevant parquet files in there. +Data loading happens by storing Parquet files in an S3 bucket and defining a schema on Athena. If you query data via SQL queries on Athena, the returned data is read by scanning your bucket and reading all relevant Parquet files in there. `dlt` internal tables are saved as Iceberg tables. ### Data types -Athena tables store timestamps with millisecond precision, and with that precision, we generate parquet files. Keep in mind that Iceberg tables have microsecond precision. +Athena tables store timestamps with millisecond precision, and with that precision, we generate Parquet files. Keep in mind that Iceberg tables have microsecond precision. Athena does not support JSON fields, so JSON is stored as a string. -> ❗**Athena does not support TIME columns in parquet files**. `dlt` will fail such jobs permanently. Convert `datetime.time` objects to `str` or `datetime.datetime` to load them. +> ❗**Athena does not support TIME columns in Parquet files**. `dlt` will fail such jobs permanently. Convert `datetime.time` objects to `str` or `datetime.datetime` to load them. ### Table and column identifiers -Athena uses case insensitive identifiers and **will lower case all the identifiers** that are stored in the INFORMATION SCHEMA. Do not use -[case sensitive naming conventions](../../general-usage/naming-convention.md#case-sensitive-and-insensitive-destinations). Letter casing will be removed anyway and you risk to generate identifier collisions, which are detected by `dlt` and will fail the load process. +Athena uses case-insensitive identifiers and **will lowercase all the identifiers** that are stored in the INFORMATION SCHEMA. 
Do not use +[case-sensitive naming conventions](../../general-usage/naming-convention.md#case-sensitive-and-insensitive-destinations). Letter casing will be removed anyway, and you risk generating identifier collisions, which are detected by `dlt` and will fail the load process. -Under the hood Athena uses different SQL engines for DDL (catalog) and DML/Queries: +Under the hood, Athena uses different SQL engines for DDL (catalog) and DML/Queries: * DDL uses HIVE escaping with `````` * Other queries use PRESTO and regular SQL escaping. @@ -119,11 +119,10 @@ If you decide to change the [filename layout](./filesystem#data-loading) from th - You need to provide the `{file_id}` placeholder, and it needs to be somewhere after the `{table_name}` placeholder. - `{table_name}` must be the first placeholder in the layout. - ## Additional destination options ### Iceberg data tables -You can save your tables as Iceberg tables to Athena. This will enable you, for example, to delete data from them later if you need to. To switch a resource to the iceberg table format, supply the table_format argument like this: +You can save your tables as Iceberg tables to Athena. This will enable you, for example, to delete data from them later if you need to. To switch a resource to the Iceberg table format, supply the table_format argument like this: ```py @dlt.resource(table_format="iceberg") @@ -131,18 +130,18 @@ def data() -> Iterable[TDataItem]: ... ``` -For every table created as an iceberg table, the Athena destination will create a regular Athena table in the staging dataset of both the filesystem and the Athena glue catalog, and then copy all data into the final iceberg table that lives with the non-iceberg tables in the same dataset on both the filesystem and the glue catalog. Switching from iceberg to regular table or vice versa is not supported. +For every table created as an Iceberg table, the Athena destination will create a regular Athena table in the staging dataset of both the filesystem and the Athena glue catalog, and then copy all data into the final Iceberg table that lives with the non-Iceberg tables in the same dataset on both the filesystem and the glue catalog. Switching from Iceberg to regular table or vice versa is not supported. #### `merge` support -The `merge` write disposition is supported for Athena when using iceberg tables. +The `merge` write disposition is supported for Athena when using Iceberg tables. > Note that: -> 1. there is a risk of tables ending up in inconsistent state in case a pipeline run fails mid flight, because Athena doesn't support transactions, and `dlt` uses multiple DELETE/UPDATE/INSERT statements to implement `merge`, +> 1. There is a risk of tables ending up in an inconsistent state in case a pipeline run fails mid-flight because Athena doesn't support transactions, and `dlt` uses multiple DELETE/UPDATE/INSERT statements to implement `merge`. > 2. `dlt` creates additional helper tables called `insert_` and `delete_
` in the staging schema to work around Athena's lack of temporary tables. ### dbt support -Athena is supported via `dbt-athena-community`. Credentials are passed into `aws_access_key_id` and `aws_secret_access_key` of the generated dbt profile. Iceberg tables are supported, but you need to make sure that you materialize your models as iceberg tables if your source table is iceberg. We encountered problems with materializing date time columns due to different precision on iceberg (nanosecond) and regular Athena tables (millisecond). +Athena is supported via `dbt-athena-community`. Credentials are passed into `aws_access_key_id` and `aws_secret_access_key` of the generated dbt profile. Iceberg tables are supported, but you need to make sure that you materialize your models as Iceberg tables if your source table is Iceberg. We encountered problems with materializing date-time columns due to different precision on Iceberg (nanosecond) and regular Athena tables (millisecond). The Athena adapter requires that you set up **region_name** in the Athena configuration below. You can also set up the table catalog name to change the default: **awsdatacatalog** ```toml [destination.athena] @@ -150,7 +149,7 @@ aws_data_catalog="awsdatacatalog" ``` ### Syncing of `dlt` state -- This destination fully supports [dlt state sync.](../../general-usage/state#syncing-state-with-destination). The state is saved in Athena iceberg tables in your S3 bucket. +- This destination fully supports [dlt state sync.](../../general-usage/state#syncing-state-with-destination). The state is saved in Athena Iceberg tables in your S3 bucket. ## Supported file formats @@ -170,8 +169,8 @@ Use the `athena_partition` helper to generate the partitioning hints for these f * `athena_partition.month(column_name: str)`: Partition by month of date/datetime column. * `athena_partition.day(column_name: str)`: Partition by day of date/datetime column. * `athena_partition.hour(column_name: str)`: Partition by hour of date/datetime column. -* `athena_partition.bucket(n: int, column_name: str)`: Partition by hashed value to `n` buckets -* `athena_partition.truncate(length: int, column_name: str)`: Partition by truncated value to `length` (or width for numbers) +* `athena_partition.bucket(n: int, column_name: str)`: Partition by hashed value to `n` buckets. +* `athena_partition.truncate(length: int, column_name: str)`: Partition by truncated value to `length` (or width for numbers). Here is an example of how to use the adapter to partition a table: diff --git a/docs/website/docs/dlt-ecosystem/destinations/bigquery.md b/docs/website/docs/dlt-ecosystem/destinations/bigquery.md index 3dd625212f..9da298f090 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/bigquery.md +++ b/docs/website/docs/dlt-ecosystem/destinations/bigquery.md @@ -14,7 +14,7 @@ keywords: [bigquery, destination, data warehouse] pip install "dlt[bigquery]" ``` -## Setup Guide +## Setup guide **1. Initialize a project with a pipeline that loads to BigQuery by running:** @@ -28,7 +28,7 @@ dlt init chess bigquery pip install -r requirements.txt ``` -This will install dlt with the `bigquery` extra, which contains all the dependencies required by the bigquery client. +This will install dlt with the `bigquery` extra, which contains all the dependencies required by the BigQuery client. **3. Log in to or create a Google Cloud account** @@ -83,12 +83,12 @@ private_key = "private_key" # please set me up! client_email = "client_email" # please set me up! 
``` -You can specify the location of the data i.e. `EU` instead of `US` which is the default. +You can specify the location of the data, i.e., `EU` instead of `US`, which is the default. -### OAuth 2.0 Authentication +### OAuth 2.0 authentication You can use OAuth 2.0 authentication. You'll need to generate a **refresh token** with the right scopes (we suggest asking our GPT-4 assistant for details). -Then you can fill the following information in `secrets.toml` +Then you can fill in the following information in `secrets.toml`: ```toml [destination.bigquery] @@ -101,9 +101,9 @@ client_secret = "client_secret" # please set me up! refresh_token = "refresh_token" # please set me up! ``` -### Using Default Credentials +### Using default credentials -Google provides several ways to get default credentials i.e. from the `GOOGLE_APPLICATION_CREDENTIALS` environment variable or metadata services. +Google provides several ways to get default credentials, i.e., from the `GOOGLE_APPLICATION_CREDENTIALS` environment variable or metadata services. VMs available on GCP (cloud functions, Composer runners, Colab notebooks) have associated service accounts or authenticated users. `dlt` will try to use default credentials if nothing is explicitly specified in the secrets. @@ -112,7 +112,7 @@ VMs available on GCP (cloud functions, Composer runners, Colab notebooks) have a location = "US" ``` -### Using Different `project_id` +### Using different `project_id` You can set the `project_id` in your configuration to be different from the one in your credentials, provided your account has access to it: ```toml @@ -124,20 +124,17 @@ project_id = "project_id_credentials" ``` In this scenario, `project_id_credentials` will be used for authentication, while `project_id_destination` will be used as the data destination. -## Write Disposition +## Write disposition All write dispositions are supported. -If you set the [`replace` strategy](../../general-usage/full-loading.md) to `staging-optimized`, the destination tables will be dropped and -recreated with a [clone command](https://cloud.google.com/bigquery/docs/table-clones-create) from the staging tables. +If you set the [`replace` strategy](../../general-usage/full-loading.md) to `staging-optimized`, the destination tables will be dropped and recreated with a [clone command](https://cloud.google.com/bigquery/docs/table-clones-create) from the staging tables. -## Data Loading +## Data loading -`dlt` uses `BigQuery` load jobs that send files from the local filesystem or GCS buckets. -The loader follows [Google recommendations](https://cloud.google.com/bigquery/docs/error-messages) when retrying and terminating jobs. -The Google BigQuery client implements an elaborate retry mechanism and timeouts for queries and file uploads, which may be configured in destination options. +`dlt` uses `BigQuery` load jobs that send files from the local filesystem or GCS buckets. The loader follows [Google recommendations](https://cloud.google.com/bigquery/docs/error-messages) when retrying and terminating jobs. The Google BigQuery client implements an elaborate retry mechanism and timeouts for queries and file uploads, which may be configured in destination options. -BigQuery destination also supports [streaming insert](https://cloud.google.com/bigquery/docs/streaming-data-into-bigquery). The mode provides better performance with small (<500 records) batches, but it buffers the data, preventing any update/delete operations on it. 
Due to this, streaming inserts are only available with `write_disposition="append"`, and the inserted data is blocked for editing for up to 90 min (reading, however, is available immediately). [See more](https://cloud.google.com/bigquery/quotas#streaming_inserts). +BigQuery destination also supports [streaming insert](https://cloud.google.com/bigquery/docs/streaming-data-into-bigquery). The mode provides better performance with small (<500 records) batches, but it buffers the data, preventing any update/delete operations on it. Due to this, streaming inserts are only available with `write_disposition="append"`, and the inserted data is blocked for editing for up to 90 minutes (reading, however, is available immediately). [See more](https://cloud.google.com/bigquery/quotas#streaming_inserts). To switch the resource into streaming insert mode, use hints: ```py @@ -149,14 +146,12 @@ streamed_resource.apply_hints(additional_table_hints={"x-insert-api": "streaming ``` ### Use BigQuery schema autodetect for nested fields -You can let BigQuery to infer schemas and create destination tables instead of `dlt`. As a consequence, nested fields (ie. `RECORD`), which `dlt` does not support at -this moment (they are stored as JSON), may be created. You select certain resources with [BigQuery Adapter](#bigquery-adapter) or all of them with the following config option: +You can let BigQuery infer schemas and create destination tables instead of `dlt`. As a consequence, nested fields (i.e., `RECORD`), which `dlt` does not support at this moment (they are stored as JSON), may be created. You can select certain resources with [BigQuery Adapter](#bigquery-adapter) or all of them with the following config option: ```toml [destination.bigquery] autodetect_schema=true ``` -We recommend to yield [arrow tables](../verified-sources/arrow-pandas.md) from your resources and `parquet` file format to load the data. In that case the schemas generated by `dlt` and BigQuery -will be identical. BigQuery will also preserve the column order from the generated parquet files. You can convert `json` data into arrow tables with [pyarrow or duckdb](../verified-sources/arrow-pandas.md#loading-json-documents). +We recommend yielding [arrow tables](../verified-sources/arrow-pandas.md) from your resources and using the `parquet` file format to load the data. In that case, the schemas generated by `dlt` and BigQuery will be identical. BigQuery will also preserve the column order from the generated parquet files. You can convert `json` data into arrow tables with [pyarrow or duckdb](../verified-sources/arrow-pandas.md#loading-json-documents). ```py import pyarrow.json as paj @@ -167,7 +162,7 @@ from dlt.destinations.adapters import bigquery_adapter @dlt.resource(name="cve") def load_cve(): with open("cve.json", 'rb') as f: - # autodetect arrow schema and yields arrow table + # autodetect arrow schema and yield arrow table yield paj.read_json(f) pipeline = dlt.pipeline("load_json_struct", destination="bigquery") @@ -175,9 +170,9 @@ pipeline.run( bigquery_adapter(load_cve(), autodetect_schema=True) ) ``` -Above, we use `pyarrow` library to convert `json` document into `arrow` table and use `biguery_adapter` to enable schema autodetect for **cve** resource. +Above, we use the `pyarrow` library to convert a `json` document into an `arrow` table and use `bigquery_adapter` to enable schema autodetect for the **cve** resource. -Yielding Python dicts/lists and loading them as `jsonl` works as well. 
In many cases, the resulting nested structure is simpler than those obtained via pyarrow/duckdb and parquet. However there are slight differences in inferred types from `dlt` (BigQuery coerces types more aggressively). BigQuery also does not try to preserve the column order in relation to the order of fields in JSON. +Yielding Python dicts/lists and loading them as `jsonl` works as well. In many cases, the resulting nested structure is simpler than those obtained via pyarrow/duckdb and parquet. However, there are slight differences in inferred types from `dlt` (BigQuery coerces types more aggressively). BigQuery also does not try to preserve the column order in relation to the order of fields in JSON. ```py import dlt @@ -193,14 +188,13 @@ pipeline.run( bigquery_adapter(load_cve(), autodetect_schema=True) ) ``` -In the example below we represent `json` data as tables up until nesting level 1. Above this nesting level, we let BigQuery to create nested fields. +In the example below, we represent `json` data as tables up until nesting level 1. Above this nesting level, we let BigQuery create nested fields. :::caution -If you yield data as Python objects (dicts) and load this data as `parquet`, the nested fields will be converted into strings. This is one of the consequences of -`dlt` not being able to infer nested fields. +If you yield data as Python objects (dicts) and load this data as `parquet`, the nested fields will be converted into strings. This is one of the consequences of `dlt` not being able to infer nested fields. ::: -## Supported File Formats +## Supported file formats You can configure the following file formats to load data to BigQuery: @@ -213,12 +207,13 @@ When staging is enabled: * [parquet](../file-formats/parquet.md) is supported. :::caution -**Bigquery cannot load JSON columns from `parquet` files**. `dlt` will fail such jobs permanently. Instead: +**BigQuery cannot load JSON columns from `parquet` files**. `dlt` will fail such jobs permanently. Instead: * Switch to `jsonl` to load and parse JSON properly. * Use schema [autodetect and nested fields](#use-bigquery-schema-autodetect-for-nested-fields) ::: -## Supported Column Hints + +## Supported column hints BigQuery supports the following [column hints](https://dlthub.com/docs/general-usage/schema#tables-and-columns): @@ -236,22 +231,21 @@ BigQuery supports the following [column hints](https://dlthub.com/docs/general-u * `cluster` - creates a cluster column(s). Many columns per table are supported and only when a new table is created. ### Table and column identifiers -BigQuery uses case sensitive identifiers by default and this is what `dlt` assumes. If the dataset you use has case insensitive identifiers (you have such option -when you create it) make sure that you use case insensitive [naming convention](../../general-usage/naming-convention.md#case-sensitive-and-insensitive-destinations) or you tell `dlt` about it so identifier collisions are properly detected. + +BigQuery uses case-sensitive identifiers by default, and this is what `dlt` assumes. If the dataset you use has case-insensitive identifiers (you have such an option when you create it), make sure that you use a case-insensitive [naming convention](../../general-usage/naming-convention.md#case-sensitive-and-insensitive-destinations) or you tell `dlt` about it so identifier collisions are properly detected. 
```toml [destination.bigquery] has_case_sensitive_identifiers=false ``` -You have an option to allow `dlt` to set the case sensitivity for newly created datasets. In that case it will follow the case sensitivity of current -naming convention (ie. the default **snake_case** will create dataset with case insensitive identifiers). +You have an option to allow `dlt` to set the case sensitivity for newly created datasets. In that case, it will follow the case sensitivity of the current naming convention (i.e., the default **snake_case** will create a dataset with case-insensitive identifiers). ```toml [destination.bigquery] should_set_case_sensitivity_on_new_dataset=true ``` The option above is off by default. -## Staging Support +## Staging support BigQuery supports GCS as a file staging destination. `dlt` will upload files in the parquet format to GCS and ask BigQuery to copy their data directly into the database. Please refer to the [Google Storage filesystem documentation](./filesystem.md#google-storage) to learn how to set up your GCS bucket with the bucket_url and credentials. @@ -259,7 +253,7 @@ If you use the same service account for GCS and your Redshift deployment, you do Alternatively to parquet files, you can specify jsonl as the staging file format. For this, set the `loader_file_format` argument of the `run` command of the pipeline to `jsonl`. -### BigQuery/GCS Staging Example +### BigQuery/GCS staging example ```py # Create a dlt pipeline that will load @@ -273,7 +267,7 @@ pipeline = dlt.pipeline( ) ``` -## Additional Destination Options +## Additional destination options You can configure the data location and various timeouts as shown below. This information is not a secret so it can be placed in `config.toml` as well: @@ -290,17 +284,17 @@ retry_deadline=60.0 * `file_upload_timeout` is a timeout for file upload when loading local files: the total time of the upload may not exceed this value (default: **30 minutes**, set in seconds) * `retry_deadline` is a deadline for a [DEFAULT_RETRY used by Google](https://cloud.google.com/python/docs/reference/storage/1.39.0/retry_timeout) -### dbt Support +### dbt support This destination [integrates with dbt](../transformations/dbt/dbt.md) via [dbt-bigquery](https://github.com/dbt-labs/dbt-bigquery). Credentials, if explicitly defined, are shared with `dbt` along with other settings like **location** and retries and timeouts. In the case of implicit credentials (i.e. available in a cloud function), `dlt` shares the `project_id` and delegates obtaining credentials to the `dbt` adapter. -### Syncing of `dlt` State +### Syncing of `dlt` state -This destination fully supports [dlt state sync](../../general-usage/state#syncing-state-with-destination) +This destination fully supports [dlt state sync](../../general-usage/state#syncing-state-with-destination). -## Bigquery Adapter +## Bigquery adapter You can use the `bigquery_adapter` to add BigQuery-specific hints to a resource. These hints influence how data is loaded into BigQuery tables, such as specifying partitioning, clustering, and numeric column rounding modes. @@ -308,7 +302,7 @@ Hints can be defined at both the column level and table level. The adapter updates the DltResource with metadata about the destination column and table DDL options. 
-### Use an Adapter to Apply Hints to a Resource +### Use an adapter to apply hints to a resource Here is an example of how to use the `bigquery_adapter` method to apply hints to a resource on both the column level and table level: diff --git a/docs/website/docs/dlt-ecosystem/destinations/clickhouse.md b/docs/website/docs/dlt-ecosystem/destinations/clickhouse.md index 8752c571b1..755fcef81b 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/clickhouse.md +++ b/docs/website/docs/dlt-ecosystem/destinations/clickhouse.md @@ -14,7 +14,7 @@ keywords: [ clickhouse, destination, data warehouse ] pip install "dlt[clickhouse]" ``` -## Setup Guide +## Setup guide ### 1. Initialize the dlt project @@ -26,8 +26,7 @@ dlt init chess clickhouse > 💡 This command will initialize your pipeline with chess as the source and ClickHouse as the destination. -The above command generates several files and directories, including `.dlt/secrets.toml` and a requirements file for ClickHouse. You can install the necessary dependencies specified in the -requirements file by executing it as follows: +The above command generates several files and directories, including `.dlt/secrets.toml` and a requirements file for ClickHouse. You can install the necessary dependencies specified in the requirements file by executing it as follows: ```sh pip install -r requirements.txt @@ -43,7 +42,7 @@ To load data into ClickHouse, you need to create a ClickHouse database. While we 2. To create a new database, connect to your ClickHouse server using the `clickhouse-client` command line tool or a SQL client of your choice. -3. Run the following SQL commands to create a new database, user and grant the necessary permissions: +3. Run the following SQL commands to create a new database, user, and grant the necessary permissions: ```sql CREATE DATABASE IF NOT EXISTS dlt; @@ -73,7 +72,7 @@ To load data into ClickHouse, you need to create a ClickHouse database. While we The default non-secure HTTP port for ClickHouse is `8123`. This is different from the default port `9000`, which is used for the native TCP protocol. - You must set `http_port` if you are not using external staging (i.e. you don't set the `staging` parameter in your pipeline). This is because dlt's built-in ClickHouse local storage staging uses the [clickhouse-connect](https://github.com/ClickHouse/clickhouse-connect) library, which communicates with ClickHouse over HTTP. + You must set `http_port` if you are not using external staging (i.e., you don't set the `staging` parameter in your pipeline). This is because dlt's built-in ClickHouse local storage staging uses the [clickhouse-connect](https://github.com/ClickHouse/clickhouse-connect) library, which communicates with ClickHouse over HTTP. Make sure your ClickHouse server is configured to accept HTTP connections on the port specified by `http_port`. For example: @@ -114,34 +113,24 @@ All [write dispositions](../../general-usage/incremental-loading#choosing-a-writ Data is loaded into ClickHouse using the most efficient method depending on the data source: - For local files, the `clickhouse-connect` library is used to directly load files into ClickHouse tables using the `INSERT` command. -- For files in remote storage like S3, Google Cloud Storage, or Azure Blob Storage, ClickHouse table functions like `s3`, `gcs` and `azureBlobStorage` are used to read the files and insert the data - into tables. 
+- For files in remote storage like S3, Google Cloud Storage, or Azure Blob Storage, ClickHouse table functions like `s3`, `gcs`, and `azureBlobStorage` are used to read the files and insert the data into tables. ## Datasets -`Clickhouse` does not support multiple datasets in one database, dlt relies on datasets to exist for multiple reasons. -To make `clickhouse` work with `dlt`, tables generated by `dlt` in your `clickhouse` database will have their name prefixed with the dataset name separated by -the configurable `dataset_table_separator`. -Additionally, a special sentinel table that doesn't contain any data will be created, so dlt knows which virtual datasets already exist in a -clickhouse -destination. +`Clickhouse` does not support multiple datasets in one database. dlt relies on datasets to exist for multiple reasons. To make `clickhouse` work with `dlt`, tables generated by `dlt` in your `clickhouse` database will have their names prefixed with the dataset name separated by the configurable `dataset_table_separator`. Additionally, a special sentinel table that doesn't contain any data will be created, so dlt knows which virtual datasets already exist in a `clickhouse` destination. ## Supported file formats - [jsonl](../file-formats/jsonl.md) is the preferred format for both direct loading and staging. - [parquet](../file-formats/parquet.md) is supported for both direct loading and staging. -The `clickhouse` destination has a few specific deviations from the default sql destinations: +The `clickhouse` destination has a few specific deviations from the default SQL destinations: -1. `Clickhouse` has an experimental `object` datatype, but we've found it to be a bit unpredictable, so the dlt clickhouse destination will load the `json` datatype to a `text` column. - If you need - this feature, get in touch with our Slack community, and we will consider adding it. +1. `Clickhouse` has an experimental `object` datatype, but we've found it to be a bit unpredictable, so the dlt clickhouse destination will load the `json` datatype to a `text` column. If you need this feature, get in touch with our Slack community, and we will consider adding it. 2. `Clickhouse` does not support the `time` datatype. Time will be loaded to a `text` column. -3. `Clickhouse` does not support the `binary` datatype. Binary will be loaded to a `text` column. When loading from `jsonl`, this will be a base64 string, when loading from parquet this will be - the `binary` object converted to `text`. +3. `Clickhouse` does not support the `binary` datatype. Binary will be loaded to a `text` column. When loading from `jsonl`, this will be a base64 string. When loading from parquet, this will be the `binary` object converted to `text`. 4. `Clickhouse` accepts adding columns to a populated table that aren’t null. -5. `Clickhouse` can produce rounding errors under certain conditions when using the float / double datatype. Make sure to use decimal if you can’t afford to have rounding errors. Loading the value - 12.7001 to a double column with the loader file format jsonl set will predictably produce a rounding error, for example. +5. `Clickhouse` can produce rounding errors under certain conditions when using the float/double datatype. Make sure to use decimal if you can’t afford to have rounding errors. Loading the value 12.7001 to a double column with the loader file format jsonl set will predictably produce a rounding error, for example. 
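If rounding errors are a concern, one way to avoid them is to pin the affected columns to `decimal` with a column hint on the resource. Below is a minimal sketch, not taken from the docs above: the resource, the `price` column, and the chosen precision/scale are illustrative assumptions.

```py
import dlt

# Sketch: declare "price" as decimal so ClickHouse stores an exact value
# instead of a float/double that may show the rounding behavior described above.
@dlt.resource(columns={"price": {"data_type": "decimal", "precision": 18, "scale": 4}})
def items():
    yield {"id": 1, "price": "12.7001"}

pipeline = dlt.pipeline(pipeline_name="items_to_clickhouse", destination="clickhouse")
# load_info = pipeline.run(items())
```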
## Supported column hints @@ -149,9 +138,9 @@ ClickHouse supports the following [column hints](../../general-usage/schema#tabl - `primary_key` - marks the column as part of the primary key. Multiple columns can have this hint to create a composite primary key. -## Choosing a Table Engine +## Choosing a table engine -dlt defaults to `MergeTree` table engine. You can specify an alternate table engine in two ways: +dlt defaults to the `MergeTree` table engine. You can specify an alternate table engine in two ways: ### Setting a default table engine in the configuration @@ -165,7 +154,7 @@ table_engine_type = "merge_tree" # The default table engi ### Setting the table engine for specific resources -You can also set the table engine for specific resources using the clickhouse_adapter, which will override the default engine set in `.dlt/secrets.toml`, for that resource: +You can also set the table engine for specific resources using the clickhouse_adapter, which will override the default engine set in `.dlt/secrets.toml` for that resource: ```py from dlt.destinations.adapters import clickhouse_adapter @@ -180,7 +169,7 @@ clickhouse_adapter(my_resource, table_engine_type="merge_tree") Supported values for `table_engine_type` are: - `merge_tree` (default) - creates tables using the `MergeTree` engine, suitable for most use cases. [Learn more about MergeTree](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree). -- `shared_merge_tree` - creates tables using the `SharedMergeTree` engine, optimized for cloud-native environments with shared storage. This table is **only** available on ClickHouse Cloud, and it the default selection if `merge_tree` is selected. [Learn more about SharedMergeTree](https://clickhouse.com/docs/en/cloud/reference/shared-merge-tree). +- `shared_merge_tree` - creates tables using the `SharedMergeTree` engine, optimized for cloud-native environments with shared storage. This table is **only** available on ClickHouse Cloud, and it is the default selection if `merge_tree` is selected. [Learn more about SharedMergeTree](https://clickhouse.com/docs/en/cloud/reference/shared-merge-tree). - `replicated_merge_tree` - creates tables using the `ReplicatedMergeTree` engine, which supports data replication across multiple nodes for high availability. [Learn more about ReplicatedMergeTree](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replication). This defaults to `shared_merge_tree` on ClickHouse Cloud. - Experimental support for the `Log` engine family with `stripe_log` and `tiny_log`. @@ -209,25 +198,19 @@ pipeline = dlt.pipeline( ) ``` -### Using Google Cloud or S3-Compatible Storage as a Staging Area +### Using Google Cloud or S3-compatible storage as a staging area -dlt supports using S3-compatible storage services, including Google Cloud Storage (GCS), as a staging area when loading data into ClickHouse. -This is handled automatically by -ClickHouse's [GCS table function](https://clickhouse.com/docs/en/sql-reference/table-functions/gcs), which dlt uses under the hood. +dlt supports using S3-compatible storage services, including Google Cloud Storage (GCS), as a staging area when loading data into ClickHouse. This is handled automatically by ClickHouse's [GCS table function](https://clickhouse.com/docs/en/sql-reference/table-functions/gcs), which dlt uses under the hood. 
-The ClickHouse GCS table function only supports authentication using Hash-based Message Authentication Code (HMAC) keys, which is compatible with the Amazon S3 API. -To enable this, GCS provides an S3 -compatibility mode that emulates the S3 API, allowing ClickHouse to access GCS buckets via its S3 integration. +The ClickHouse GCS table function only supports authentication using Hash-based Message Authentication Code (HMAC) keys, which is compatible with the Amazon S3 API. To enable this, GCS provides an S3 compatibility mode that emulates the S3 API, allowing ClickHouse to access GCS buckets via its S3 integration. -For detailed instructions on setting up S3-compatible storage with dlt, including AWS S3, MinIO, and Cloudflare R2, refer to -the [dlt documentation on filesystem destinations](https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem#using-s3-compatible-storage). +For detailed instructions on setting up S3-compatible storage with dlt, including AWS S3, MinIO, and Cloudflare R2, refer to the [dlt documentation on filesystem destinations](https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem#using-s3-compatible-storage). To set up GCS staging with HMAC authentication in dlt: 1. Create HMAC keys for your GCS service account by following the [Google Cloud guide](https://cloud.google.com/storage/docs/authentication/managing-hmackeys#create). -2. Configure the HMAC keys (`aws_access_key_id` and `aws_secret_access_key`) in your dlt project's ClickHouse destination settings in `config.toml`, similar to how you would configure AWS S3 - credentials: +2. Configure the HMAC keys (`aws_access_key_id` and `aws_secret_access_key`) in your dlt project's ClickHouse destination settings in `config.toml`, similar to how you would configure AWS S3 credentials: ```toml [destination.filesystem] @@ -241,9 +224,7 @@ endpoint_url = "https://storage.googleapis.com" ``` :::caution -When configuring the `bucket_url` for S3-compatible storage services like Google Cloud Storage (GCS) with ClickHouse in dlt, ensure that the URL is prepended with `s3://` instead of `gs://`. This is -because the ClickHouse GCS table function requires the use of HMAC credentials, which are compatible with the S3 API. Prepending with `s3://` allows the HMAC credentials to integrate properly with -dlt's staging mechanisms for ClickHouse. +When configuring the `bucket_url` for S3-compatible storage services like Google Cloud Storage (GCS) with ClickHouse in dlt, ensure that the URL is prepended with `s3://` instead of `gs://`. This is because the ClickHouse GCS table function requires the use of HMAC credentials, which are compatible with the S3 API. Prepending with `s3://` allows the HMAC credentials to integrate properly with dlt's staging mechanisms for ClickHouse. ::: ### dbt support @@ -254,4 +235,5 @@ Integration with [dbt](../transformations/dbt/dbt.md) is generally supported via This destination fully supports [dlt state sync](../../general-usage/state#syncing-state-with-destination). 
- \ No newline at end of file + + diff --git a/docs/website/docs/dlt-ecosystem/destinations/databricks.md b/docs/website/docs/dlt-ecosystem/destinations/databricks.md index 12b267c9d6..79810e029f 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/databricks.md +++ b/docs/website/docs/dlt-ecosystem/destinations/databricks.md @@ -22,7 +22,7 @@ To use the Databricks destination, you need: * A Databricks workspace with a Unity Catalog metastore connected * A Gen 2 Azure storage account and container -If you already have your Databricks workspace set up, you can skip to the [Loader setup Guide](#loader-setup-guide). +If you already have your Databricks workspace set up, you can skip to the [Loader setup guide](#loader-setup-guide). ### 1. Create a Databricks workspace in Azure @@ -33,7 +33,7 @@ If you already have your Databricks workspace set up, you can skip to the [Loade 2. Create an ADLS Gen 2 storage account Search for "Storage accounts" in the Azure Portal and create a new storage account. - Make sure it's a Data Lake Storage Gen 2 account, you do this by enabling "hierarchical namespace" when creating the account. Refer to the [Azure documentation](https://learn.microsoft.com/en-us/azure/storage/blobs/create-data-lake-storage-account) for further info. + Make sure it's a Data Lake Storage Gen 2 account. You do this by enabling "hierarchical namespace" when creating the account. Refer to the [Azure documentation](https://learn.microsoft.com/en-us/azure/storage/blobs/create-data-lake-storage-account) for further info. 3. Create a container in the storage account @@ -48,47 +48,46 @@ If you already have your Databricks workspace set up, you can skip to the [Loade Navigate to the storage container you created before and select "Access control (IAM)" in the left-hand menu. - Add a new role assignment and select "Storage Blob Data Contributor" as the role. Under "Members" select "Managed Identity" and add the Databricks Access Connector you created in the previous step. + Add a new role assignment and select "Storage Blob Data Contributor" as the role. Under "Members," select "Managed Identity" and add the Databricks Access Connector you created in the previous step. ### 2. Set up a metastore and Unity Catalog and get your access token 1. Now go to your Databricks workspace - To get there from the Azure Portal, search for "Databricks", select your Databricks, and click "Launch Workspace". + To get there from the Azure Portal, search for "Databricks," select your Databricks, and click "Launch Workspace." -2. In the top right corner, click on your email address and go to "Manage Account" +2. In the top right corner, click on your email address and go to "Manage Account." -3. Go to "Data" and click on "Create Metastore" +3. Go to "Data" and click on "Create Metastore." Name your metastore and select a region. - If you'd like to set up a storage container for the whole metastore, you can add your ADLS URL and Access Connector Id here. You can also do this on a granular level when creating the catalog. + If you'd like to set up a storage container for the whole metastore, you can add your ADLS URL and Access Connector ID here. You can also do this on a granular level when creating the catalog. In the next step, assign your metastore to your workspace. -4. Go back to your workspace and click on "Catalog" in the left-hand menu +4. Go back to your workspace and click on "Catalog" in the left-hand menu. -5. Click "+ Add" and select "Add Storage Credential" +5. 
Click "+ Add" and select "Add Storage Credential." Create a name and paste in the resource ID of the Databricks Access Connector from the Azure portal. It will look something like this: `/subscriptions//resourceGroups//providers/Microsoft.Databricks/accessConnectors/` +6. Click "+ Add" again and select "Add external location." -6. Click "+ Add" again and select "Add external location" - - Set the URL of our storage container. This should be in the form: `abfss://@.dfs.core.windows.net/` + Set the URL of your storage container. This should be in the form: `abfss://@.dfs.core.windows.net/` Once created, you can test the connection to make sure the container is accessible from Databricks. 7. Now you can create a catalog - Go to "Catalog" and click "Create Catalog". Name your catalog and select the storage location you created in the previous step. + Go to "Catalog" and click "Create Catalog." Name your catalog and select the storage location you created in the previous step. 8. Create your access token - Click your email in the top right corner and go to "User Settings". Go to "Developer" -> "Access Tokens". + Click your email in the top right corner and go to "User Settings." Go to "Developer" -> "Access Tokens." Generate a new token and save it. You will use it in your `dlt` configuration. -## Loader setup Guide +## Loader setup guide **1. Initialize a project with a pipeline that loads to Databricks by running** ```sh @@ -99,13 +98,13 @@ dlt init chess databricks ```sh pip install -r requirements.txt ``` -This will install dlt with **databricks** extra which contains Databricks Python dbapi client. +This will install dlt with **databricks** extra which contains the Databricks Python dbapi client. **4. Enter your credentials into `.dlt/secrets.toml`.** This should have your connection parameters and your personal access token. -You will find your server hostname and HTTP path in the Databricks workspace dashboard. Go to "SQL Warehouses", select your warehouse (default is called "Starter Warehouse") and go to "Connection details". +You will find your server hostname and HTTP path in the Databricks workspace dashboard. Go to "SQL Warehouses", select your warehouse (default is called "Starter Warehouse"), and go to "Connection details". Example: @@ -120,7 +119,7 @@ catalog = "my_catalog" See [staging support](#staging-support) for authentication options when `dlt` copies files from buckets. ## Write disposition -All write dispositions are supported +All write dispositions are supported. ## Data loading Data is loaded using `INSERT VALUES` statements by default. @@ -136,9 +135,8 @@ For more information on staging, see the [staging support](#staging-support) sec The `jsonl` format has some limitations when used with Databricks: 1. Compression must be disabled to load jsonl files in Databricks. Set `data_writer.disable_compression` to `true` in dlt config when using this format. -2. The following data types are not supported when using `jsonl` format with `databricks`: `decimal`, `json`, `date`, `binary`. Use `parquet` if your data contains these types. -3. `bigint` data type with precision is not supported with `jsonl` format - +2. The following data types are not supported when using the `jsonl` format with `databricks`: `decimal`, `json`, `date`, `binary`. Use `parquet` if your data contains these types. +3. `bigint` data type with precision is not supported with the `jsonl` format. 
## Staging support @@ -168,36 +166,38 @@ pipeline = dlt.pipeline( Refer to the [Azure Blob Storage filesystem documentation](./filesystem.md#azure-blob-storage) for details on connecting your Azure Blob Storage container with the bucket_url and credentials. -Databricks requires that you use ABFS urls in following format: +Databricks requires that you use ABFS URLs in the following format: **abfss://container_name@storage_account_name.dfs.core.windows.net/path** -`dlt` is able to adapt the other representation (ie **az://container-name/path**') still we recommend that you use the correct form. +`dlt` is able to adapt the other representation (i.e., **az://container-name/path**), still, we recommend that you use the correct form. Example to set up Databricks with Azure as a staging destination: ```py # Create a dlt pipeline that will load # chess player data to the Databricks destination -# via staging on Azure Blob Storage +``` + +# Via staging on Azure Blob Storage pipeline = dlt.pipeline( pipeline_name='chess_pipeline', destination='databricks', - staging=dlt.destinations.filesystem('abfss://dlt-ci-data@dltdata.dfs.core.windows.net'), # add this to activate the staging location + staging=dlt.destinations.filesystem('abfss://dlt-ci-data@dltdata.dfs.core.windows.net'), # Add this to activate the staging location dataset_name='player_data' ) ``` ### Use external locations and stored credentials -`dlt` forwards bucket credentials to `COPY INTO` SQL command by default. You may prefer to use [external locations or stored credentials instead](https://docs.databricks.com/en/sql/language-manual/sql-ref-external-locations.html#external-location) that are stored on the Databricks side. +`dlt` forwards bucket credentials to the `COPY INTO` SQL command by default. You may prefer to use [external locations or stored credentials instead](https://docs.databricks.com/en/sql/language-manual/sql-ref-external-locations.html#external-location) that are stored on the Databricks side. -If you set up external location for your staging path, you can tell `dlt` to use it: +If you set up an external location for your staging path, you can tell `dlt` to use it: ```toml [destination.databricks] is_staging_external_location=true ``` -If you set up Databricks credential named ie. **credential_x**, you can tell `dlt` to use it: +If you set up a Databricks credential named, e.g., **credential_x**, you can tell `dlt` to use it: ```toml [destination.databricks] staging_credentials_name="credential_x" @@ -211,7 +211,7 @@ bricks = dlt.destinations.databricks(staging_credentials_name="credential_x") ``` ### dbt support -This destination [integrates with dbt](../transformations/dbt/dbt.md) via [dbt-databricks](https://github.com/databricks/dbt-databricks) +This destination [integrates with dbt](../transformations/dbt/dbt.md) via [dbt-databricks](https://github.com/databricks/dbt-databricks). ### Syncing of `dlt` state This destination fully supports [dlt state sync](../../general-usage/state#syncing-state-with-destination). 
diff --git a/docs/website/docs/dlt-ecosystem/destinations/destination.md b/docs/website/docs/dlt-ecosystem/destinations/destination.md index bd26aa366b..b718bf5189 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/destination.md +++ b/docs/website/docs/dlt-ecosystem/destinations/destination.md @@ -21,7 +21,7 @@ pip install dlt ## Set up a destination function for your pipeline -The custom destination decorator differs from other destinations in that you do not need to provide connection credentials, but rather you provide a function which gets called for all items loaded during a pipeline run or load operation. With the `@dlt.destination`, you can convert any function that takes two arguments into a `dlt` destination. +The custom destination decorator differs from other destinations in that you do not need to provide connection credentials, but rather you provide a function that gets called for all items loaded during a pipeline run or load operation. With the `@dlt.destination`, you can convert any function that takes two arguments into a `dlt` destination. A very simple dlt pipeline that pushes a list of items into a destination function might look like this: @@ -64,17 +64,17 @@ def my_destination(items: TDataItems, table: TTableSchema) -> None: ``` ### Decorator arguments -* The `batch_size` parameter on the destination decorator defines how many items per function call are batched together and sent as an array. If you set a batch-size of `0`, instead of passing in actual data items, you will receive one call per load job with the path of the file as the items argument. You can then open and process that file in any way you like. +* The `batch_size` parameter on the destination decorator defines how many items per function call are batched together and sent as an array. If you set a batch size of `0`, instead of passing in actual data items, you will receive one call per load job with the path of the file as the items argument. You can then open and process that file in any way you like. * The `loader_file_format` parameter on the destination decorator defines in which format files are stored in the load package before being sent to the destination function. This can be `jsonl` or `parquet`. * The `name` parameter on the destination decorator defines the name of the destination that gets created by the destination decorator. * The `naming_convention` parameter on the destination decorator defines the name of the destination that gets created by the destination decorator. This controls how table and column names are normalized. The default is `direct`, which will keep all names the same. * The `max_nesting_level` parameter on the destination decorator defines how deep the normalizer will go to normalize nested fields on your data to create subtables. This overwrites any settings on your `source` and is set to zero to not create any nested tables by default. * The `skip_dlt_columns_and_tables` parameter on the destination decorator defines whether internal tables and columns will be fed into the custom destination function. This is set to `True` by default. -* The `max_parallel_load_jobs` parameter will define how many load jobs will run in parallel in threads, if you have a destination that only allows five connections at a time, you can set this value to 5 for example -* The `loader_parallelism_strategy` parameter will control how load jobs are parallelized. Set to `parallel`, the default, jobs will be parallelized no matter which table is being loaded to. 
`table-sequential` will parallelize loading but only ever have one load job per table at a time, `sequential` will run all load jobs sequentially on the main thread. +* The `max_parallel_load_jobs` parameter will define how many load jobs will run in parallel in threads. If you have a destination that only allows five connections at a time, you can set this value to 5, for example. +* The `loader_parallelism_strategy` parameter will control how load jobs are parallelized. Set to `parallel`, the default, jobs will be parallelized no matter which table is being loaded to. `table-sequential` will parallelize loading but only ever have one load job per table at a time. `sequential` will run all load jobs sequentially on the main thread. :::note -Settings above make sure that shape of the data you receive in the destination function is as close as possible to what you see in the data source. +Settings above make sure that the shape of the data you receive in the destination function is as close as possible to what you see in the data source. * The custom destination sets the `max_nesting_level` to 0 by default, which means no sub-tables will be generated during the normalization phase. * The custom destination also skips all internal tables and columns by default. If you need these, set `skip_dlt_columns_and_tables` to False. @@ -85,7 +85,7 @@ Settings above make sure that shape of the data you receive in the destination f * The `table` parameter contains the schema table the current call belongs to, including all table hints and columns. For example, the table name can be accessed with `table["name"]`. * You can also add config values and secrets to the function arguments, see below! -## Add configuration, credentials and other secret to the destination function +## Add configuration, credentials, and other secrets to the destination function The destination decorator supports settings and secrets variables. If you, for example, plan to connect to a service that requires an API secret or a login, you can do the following: ```py @@ -94,18 +94,18 @@ def my_destination(items: TDataItems, table: TTableSchema, api_key: dlt.secrets. ... ``` -You can then set a config variable in your `.dlt/secrets.toml`: like so: +You can then set a config variable in your `.dlt/secrets.toml` like so: ```toml [destination.my_destination] api_key="" ``` -Custom destinations follow the same configuration rules as [regular named destinations](../../general-usage/destination.md#configure-a-destination) +Custom destinations follow the same configuration rules as [regular named destinations](../../general-usage/destination.md#configure-a-destination). ## Use the custom destination in `dlt` pipeline -There are multiple ways to pass the custom destination function to `dlt` pipeline: +There are multiple ways to pass the custom destination function to the `dlt` pipeline: - Directly reference the destination function ```py @@ -118,7 +118,7 @@ There are multiple ways to pass the custom destination function to `dlt` pipelin ``` Like for [regular destinations](../../general-usage/destination.md#pass-explicit-credentials), you are allowed to pass configuration and credentials - explicitly to destination function. + explicitly to the destination function. 
```py @dlt.destination(batch_size=10, loader_file_format="jsonl", name="my_destination") def my_destination(items: TDataItems, table: TTableSchema, api_key: dlt.secrets.value) -> None: @@ -162,17 +162,17 @@ There are multiple ways to pass the custom destination function to `dlt` pipelin ## Adjust batch size and retry policy for atomic loads The destination keeps a local record of how many `DataItems` were processed, so if you, for example, use the custom destination to push `DataItems` to a remote API, and this -API becomes unavailable during the load resulting in a failed `dlt` pipeline run, you can repeat the run of your pipeline at a later moment and the custom destination will **restart from the whole batch that failed**. We are preventing any data from being lost, but you can still get duplicated data if you committed half of the batch ie. to a database and then failed. -**Keeping the batch atomicity is on you**. For this reason it makes sense to choose a batch size that you can process in one transaction (say one api request or one database transaction) so that if this request or transaction fail repeatedly you can repeat it at the next run without pushing duplicate data to your remote location. For systems that -are not transactional and do not tolerate duplicated data, you can use batch of size 1. +API becomes unavailable during the load resulting in a failed `dlt` pipeline run, you can repeat the run of your pipeline at a later moment and the custom destination will **restart from the whole batch that failed**. We are preventing any data from being lost, but you can still get duplicated data if you committed half of the batch, i.e., to a database and then failed. +**Keeping the batch atomicity is on you**. For this reason, it makes sense to choose a batch size that you can process in one transaction (say one API request or one database transaction) so that if this request or transaction fails repeatedly, you can repeat it at the next run without pushing duplicate data to your remote location. For systems that +are not transactional and do not tolerate duplicated data, you can use a batch of size 1. Destination functions that raise exceptions are retried 5 times before giving up (`load.raise_on_max_retries` config option). If you run the pipeline again, it will resume loading before extracting new data. If your exception derives from `DestinationTerminalException`, the whole load job will be marked as failed and not retried again. :::caution -If you wipe out the pipeline folder (where job files and destination state are saved) you will not be able to restart from the last failed batch. -However, it is fairly easy to backup and restore the pipeline directory, [see details below](#manage-pipeline-state-for-incremental-loading). +If you wipe out the pipeline folder (where job files and destination state are saved), you will not be able to restart from the last failed batch. +However, it is fairly easy to back up and restore the pipeline directory, [see details below](#manage-pipeline-state-for-incremental-loading). ::: ## Increase or decrease loading parallelism @@ -184,21 +184,22 @@ For performance reasons, we recommend keeping the multithreaded approach and mak ## Write disposition -`@dlt.destination` will forward all normalized `DataItems` encountered during a pipeline run to the custom destination function, so there is no notion of "write dispositions". 
+`@dlt.destination` will forward all normalized `DataItems` encountered during a pipeline run to the custom destination function, so there is no notion of "write dispositions." ## Staging support `@dlt.destination` does not support staging files in remote locations before being called at this time. If you need this feature, please let us know. ## Manage pipeline state for incremental loading -Custom destinations do not have a general mechanism to restore pipeline state. This will impact data sources that rely on the state being kept ie. all incremental resources. -If you wipe the pipeline directory (ie. by deleting a folder or running on AWS lambda / Github Actions where you get a clean runner) the progress of the incremental loading is lost. On the next run you will re-acquire the data from the beginning. +Custom destinations do not have a general mechanism to restore pipeline state. This will impact data sources that rely on the state being kept, i.e., all incremental resources. +If you wipe the pipeline directory (i.e., by deleting a folder or running on AWS Lambda / GitHub Actions where you get a clean runner), the progress of the incremental loading is lost. On the next run, you will re-acquire the data from the beginning. -While we are working on a pluggable state storage you can fix the problem above by: -1. Not wiping the pipeline directory. For example if you run your pipeline on an EC instance periodically, the state will be preserved. -2. By doing a restore/backup of the pipeline directory before/after it runs. This is way easier than it sounds and [here's a script you can reuse](https://gist.github.com/rudolfix/ee6e16d8671f26ac4b9ffc915ad24b6e). +While we are working on a pluggable state storage, you can fix the problem above by: +1. Not wiping the pipeline directory. For example, if you run your pipeline on an EC instance periodically, the state will be preserved. +2. Doing a restore/backup of the pipeline directory before/after it runs. This is way easier than it sounds and [here's a script you can reuse](https://gist.github.com/rudolfix/ee6e16d8671f26ac4b9ffc915ad24b6e). ## What's next * Check out our [Custom BigQuery Destination](../../examples/custom_destination_bigquery/) example. * Need help with building a custom destination? Ask your questions in our [Slack Community](https://dlthub.com/community) technical help channel. + diff --git a/docs/website/docs/dlt-ecosystem/destinations/dremio.md b/docs/website/docs/dlt-ecosystem/destinations/dremio.md index c087d5dc0a..253fa1fc3f 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/dremio.md +++ b/docs/website/docs/dlt-ecosystem/destinations/dremio.md @@ -12,19 +12,19 @@ keywords: [dremio, iceberg, aws, glue catalog] pip install "dlt[dremio,s3]" ``` -## Setup Guide +## Setup guide ### 1. Initialize the dlt project Let's start by initializing a new dlt project as follows: ```sh dlt init chess dremio ``` - > 💡 This command will initialise your pipeline with chess as the source and aws dremio as the destination using the filesystem staging destination + > 💡 This command will initialize your pipeline with chess as the source and aws dremio as the destination using the filesystem staging destination. -### 2. Setup bucket storage and dremio credentials +### 2. 
Setup bucket storage and Dremio credentials -First install dependencies by running: +First, install dependencies by running: ```sh pip install -r requirements.txt ``` @@ -36,7 +36,7 @@ The toml file looks like this: ```toml [destination.filesystem] -bucket_url = "s3://[your_bucket_name]" # replace with your bucket name, +bucket_url = "s3://[your_bucket_name]" # replace with your bucket name [destination.filesystem.credentials] aws_access_key_id = "please set me up!" # copy the access key here @@ -46,22 +46,22 @@ aws_secret_access_key = "please set me up!" # copy the secret access key here staging_data_source = "" # the name of the "Object Storage" data source in Dremio containing the s3 bucket [destination.dremio.credentials] -username = "" # the dremio username -password = "" # dremio password or PAT token +username = "" # the Dremio username +password = "" # Dremio password or PAT token database = "" # the name of the "data source" set up in Dremio where you want to load your data host = "localhost" # the Dremio hostname port = 32010 # the Dremio Arrow Flight grpc port drivername="grpc" # either 'grpc' or 'grpc+tls' ``` -You can also pass SqlAlchemy-like connection like below +You can also pass SqlAlchemy-like connection like below: ```toml [destination.dremio] staging_data_source="s3_staging" credentials="grpc://:@:/" ``` -if you have your credentials stored in `~/.aws/credentials` just remove the **[destination.filesystem.credentials]** and **[destination.dremio.credentials]** section above and `dlt` will fall back to your **default** profile in local credentials. If you want to switch the profile, pass the profile name as follows (here: `dlt-ci-user`): +If you have your credentials stored in `~/.aws/credentials`, just remove the **[destination.filesystem.credentials]** and **[destination.dremio.credentials]** sections above and `dlt` will fall back to your **default** profile in local credentials. If you want to switch the profile, pass the profile name as follows (here: `dlt-ci-user`): ```toml [destination.filesystem.credentials] profile_name="dlt-ci-user" @@ -69,20 +69,20 @@ profile_name="dlt-ci-user" ## Write disposition -`dremio` destination handles the write dispositions as follows: +The `dremio` destination handles the write dispositions as follows: - `append` - `replace` - `merge` -> The `merge` write disposition uses the default DELETE/UPDATE/INSERT strategy to merge data into the destination. Be aware that Dremio does not support transactions so a partial pipeline failure can result in the destination table being in an inconsistent state. The `merge` write disposition will eventually be implemented using [MERGE INTO](https://docs.dremio.com/current/reference/sql/commands/apache-iceberg-tables/apache-iceberg-merge/) to resolve this issue. +> The `merge` write disposition uses the default DELETE/UPDATE/INSERT strategy to merge data into the destination. Be aware that Dremio does not support transactions, so a partial pipeline failure can result in the destination table being in an inconsistent state. The `merge` write disposition will eventually be implemented using [MERGE INTO](https://docs.dremio.com/current/reference/sql/commands/apache-iceberg-tables/apache-iceberg-merge/) to resolve this issue. ## Data loading -Data loading happens by copying a staged parquet files from an object storage bucket to the destination table in Dremio using [COPY INTO](https://docs.dremio.com/cloud/reference/sql/commands/copy-into-table/) statements. 
The destination table format is specified by the storage format for the data source in Dremio. Typically, this will be Apache Iceberg. +Data loading happens by copying staged parquet files from an object storage bucket to the destination table in Dremio using [COPY INTO](https://docs.dremio.com/cloud/reference/sql/commands/copy-into-table/) statements. The destination table format is specified by the storage format for the data source in Dremio. Typically, this will be Apache Iceberg. > ❗ **Dremio cannot load `fixed_len_byte_array` columns from `parquet` files**. -## Dataset Creation +## Dataset creation Dremio does not support `CREATE SCHEMA` DDL statements. @@ -92,9 +92,9 @@ Therefore, "Metastore" data sources, such as Hive or Glue, require that the data ## Staging support -Using a staging destination is mandatory when using the dremio destination. If you do not set staging to `filesystem`, dlt will automatically do this for you. +Using a staging destination is mandatory when using the Dremio destination. If you do not set staging to `filesystem`, dlt will automatically do this for you. -## Table Partitioning and Local Sort +## Table partitioning and local sort Apache Iceberg table partitions and local sort properties can be configured as shown below: ```py import dlt @@ -118,4 +118,5 @@ This will result in `PARTITION BY ("foo","bar")` and `LOCALSORT BY ("baz")` clau ### Syncing of `dlt` state - This destination fully supports [dlt state sync](../../general-usage/state#syncing-state-with-destination). - \ No newline at end of file + + diff --git a/docs/website/docs/dlt-ecosystem/destinations/duckdb.md b/docs/website/docs/dlt-ecosystem/destinations/duckdb.md index 4b8ecec4ca..7bb38f6087 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/duckdb.md +++ b/docs/website/docs/dlt-ecosystem/destinations/duckdb.md @@ -12,7 +12,7 @@ keywords: [duckdb, destination, data warehouse] pip install "dlt[duckdb]" ``` -## Setup Guide +## Setup guide **1. Initialize a project with a pipeline that loads to DuckDB by running:** ```sh @@ -38,7 +38,7 @@ All write dispositions are supported. ### Data types `duckdb` supports various [timestamp types](https://duckdb.org/docs/sql/data_types/timestamp.html). These can be configured using the column flags `timezone` and `precision` in the `dlt.resource` decorator or the `pipeline.run` method. -- **Precision**: supported precision values are 0, 3, 6, and 9 for fractional seconds. Note that `timezone` and `precision` cannot be used together; attempting to combine them will result in an error. +- **Precision**: Supported precision values are 0, 3, 6, and 9 for fractional seconds. Note that `timezone` and `precision` cannot be used together; attempting to combine them will result in an error. - **Timezone**: - Setting `timezone=False` maps to `TIMESTAMP`. - Setting `timezone=True` (or omitting the flag, which defaults to `True`) maps to `TIMESTAMP WITH TIME ZONE` (`TIMESTAMPTZ`). @@ -73,7 +73,7 @@ pipeline.run(events()) ### Names normalization `dlt` uses the standard **snake_case** naming convention to keep identical table and column identifiers across all destinations. 
If you want to use the **duckdb** wide range of characters (i.e., emojis) for table and column names, you can switch to the **duck_case** naming convention, which accepts almost any string as an identifier:
-* `\n` `\r` and `"` are translated to `_`
+* `\n`, `\r`, and `"` are translated to `_`
* multiple `_` are translated to a single `_`

Switch the naming convention using `config.toml`:
@@ -102,9 +102,7 @@ You can configure the following file formats to load data to duckdb:
* [jsonl](../file-formats/jsonl.md)

:::tip
-`duckdb` has [timestamp types](https://duckdb.org/docs/sql/data_types/timestamp.html) with resolutions from milliseconds to nanoseconds. However
-only microseconds resolution (the most common used) is time zone aware. `dlt` generates timestamps with timezones by default so loading parquet files
-with default settings will fail (`duckdb` does not coerce tz-aware timestamps to naive timestamps).
+`duckdb` has [timestamp types](https://duckdb.org/docs/sql/data_types/timestamp.html) with resolutions from milliseconds to nanoseconds. However, only microseconds resolution (the most commonly used) is time zone aware. `dlt` generates timestamps with timezones by default, so loading parquet files with default settings will fail (`duckdb` does not coerce tz-aware timestamps to naive timestamps).
Disable the timezones by changing `dlt` [parquet writer settings](../file-formats/parquet.md#writer-settings) as follows:
```sh
DATA_WRITER__TIMESTAMP_TIMEZONE=""
@@ -116,7 +114,7 @@ to disable tz adjustments.

`duckdb` can create unique indexes for columns with `unique` hints. However, **this feature is disabled by default** as it can significantly slow down data loading.

-## Destination Configuration
+## Destination configuration

By default, a DuckDB database will be created in the current working directory with a name `.duckdb` (`chess.duckdb` in the example above). After loading, it is available in `read/write` mode via `with pipeline.sql_client() as con:`, which is a wrapper over `DuckDBPyConnection`. See [duckdb docs](https://duckdb.org/docs/api/python/overview#persistent-storage) for details.
@@ -152,7 +150,7 @@ p = dlt.pipeline(
    dev_mode=False,
)

-# Or if you would like to use in-memory duckdb instance
+# Or if you would like to use an in-memory duckdb instance
db = duckdb.connect(":memory:")
p = pipeline_one = dlt.pipeline(
    pipeline_name="in_memory_pipeline",
@@ -171,33 +169,35 @@ print(db.sql("DESCRIBE;"))
# │ memory │ chess_data │ _dlt_pipeline_state │ [version, engine_v… │ [BIGINT, BIGINT, VA… │ false │
# │ memory │ chess_data │ _dlt_version │ [version, engine_v… │ [BIGINT, BIGINT, TI… │ false │
# │ memory │ chess_data │ my_table │ [a, _dlt_load_id, … │ [BIGINT, VARCHAR, V… │ false │
# └──────────┴───────────────┴─────────────────────┴──────────────────────┴───────────────────────┴───────────┘
```

:::note
-Be careful! The in-memory instance of the database will be destroyed, once your Python script exits.
+Be careful! The in-memory instance of the database will be destroyed once your Python script exits.
:::

This destination accepts database connection strings in the format used by [duckdb-engine](https://github.com/Mause/duckdb_engine#configuration).
-You can configure a DuckDB destination with [secret / config values](../../general-usage/credentials) (e.g., using a `secrets.toml` file)
+You can configure a DuckDB destination with [secret / config values](../../general-usage/credentials) (e.g., using a `secrets.toml` file):
```toml
destination.duckdb.credentials="duckdb:///_storage/test_quack.duckdb"
```

The **duckdb://** URL above creates a **relative** path to `_storage/test_quack.duckdb`. To define an **absolute** path, you need to specify four slashes, i.e., `duckdb:////_storage/test_quack.duckdb`.

-Dlt supports a unique connection string that triggers specific behavior for duckdb destination:
+`dlt` supports a unique connection string that triggers specific behavior for the duckdb destination:
* **:pipeline:** creates the database in the working directory of the pipeline, naming it `quack.duckdb`.

-Please see the code snippets below showing how to use it
+Please see the code snippets below showing how to use it:

-1. Via `config.toml`
+1. Via `config.toml`:
```toml
destination.duckdb.credentials=":pipeline:"
```

-2. In Python code
+2. In Python code:
```py
p = pipeline_one = dlt.pipeline(
    pipeline_name="my_pipeline",
@@ -219,3 +219,5 @@ This destination [integrates with dbt](../transformations/dbt/dbt.md) via [dbt-d

This destination fully supports [dlt state sync](../../general-usage/state#syncing-state-with-destination).

diff --git a/docs/website/docs/dlt-ecosystem/destinations/filesystem.md b/docs/website/docs/dlt-ecosystem/destinations/filesystem.md
index cfeb03655c..9d31470b5b 100644
--- a/docs/website/docs/dlt-ecosystem/destinations/filesystem.md
+++ b/docs/website/docs/dlt-ecosystem/destinations/filesystem.md
@@ -1,5 +1,6 @@
# Cloud storage and filesystem
-The filesystem destination stores data in remote file systems and cloud storage services like **AWS S3**, **Google Cloud Storage**, or **Azure Blob Storage**. Underneath, it uses [fsspec](https://github.com/fsspec/filesystem_spec) to abstract file operations. Its primary role is to be used as a staging for other destinations, but you can also quickly build a data lake with it.
+
+The filesystem destination stores data in remote file systems and cloud storage services like **AWS S3**, **Google Cloud Storage**, or **Azure Blob Storage**. Underneath, it uses [fsspec](https://github.com/fsspec/filesystem_spec) to abstract file operations. Its primary role is to be used as a staging area for other destinations, but you can also quickly build a data lake with it.

:::tip
Please read the notes on the layout of the data files. Currently, we are getting feedback on it. Please join our Slack (icon at the top of the page) and help us find the optimal layout.
@@ -33,13 +34,13 @@ dlt init chess filesystem
```

:::note
-This command will initialize your pipeline with chess as the source and the AWS S3 as the destination.
+This command will initialize your pipeline with chess as the source and AWS S3 as the destination.
:::

## Set up the destination and credentials

### AWS S3
-The command above creates a sample `secrets.toml` and requirements file for AWS S3 bucket. You can install those dependencies by running:
+The command above creates a sample `secrets.toml` and requirements file for the AWS S3 bucket. You can install those dependencies by running:
```sh
pip install -r requirements.txt
```
@@ -72,14 +73,14 @@ region_name="eu-central-1"

You need to create an S3 bucket and a user who can access that bucket. dlt does not create buckets automatically.

1.
You can create the S3 bucket in the AWS console by clicking on "Create Bucket" in S3 and assigning the appropriate name and permissions to the bucket. -2. Once the bucket is created, you'll have the bucket URL. For example, If the bucket name is `dlt-ci-test-bucket`, then the bucket URL will be: +2. Once the bucket is created, you'll have the bucket URL. For example, if the bucket name is `dlt-ci-test-bucket`, then the bucket URL will be: ```text s3://dlt-ci-test-bucket ``` -3. To grant permissions to the user being used to access the S3 bucket, go to the IAM > Users, and click on “Add Permissions”. -4. Below you can find a sample policy that gives a minimum permission required by dlt to a bucket we created above. The policy contains permissions to list files in a bucket, get, put, and delete objects. **Remember to place your bucket name in the Resource section of the policy!** +3. To grant permissions to the user being used to access the S3 bucket, go to IAM > Users, and click on “Add Permissions”. +4. Below you can find a sample policy that gives the minimum permission required by dlt to a bucket we created above. The policy contains permissions to list files in a bucket, get, put, and delete objects. **Remember to place your bucket name in the Resource section of the policy!** ```json { @@ -103,16 +104,16 @@ You need to create an S3 bucket and a user who can access that bucket. dlt does ] } ``` -5. To grab the access and secret key for the user. Go to IAM > Users and in the “Security Credentials”, click on “Create Access Key”, and preferably select “Command Line Interface” and create the access key. +5. To grab the access and secret key for the user, go to IAM > Users and in the “Security Credentials”, click on “Create Access Key”, and preferably select “Command Line Interface” and create the access key. 6. Grab the “Access Key” and “Secret Access Key” created that are to be used in "secrets.toml". #### Using S3 compatible storage -To use an S3 compatible storage other than AWS S3 like [MinIO](https://min.io/) or [Cloudflare R2](https://www.cloudflare.com/en-ca/developer-platform/r2/), you may supply an `endpoint_url` in the config. This should be set along with AWS credentials: +To use an S3 compatible storage other than AWS S3, like [MinIO](https://min.io/) or [Cloudflare R2](https://www.cloudflare.com/en-ca/developer-platform/r2/), you may supply an `endpoint_url` in the config. This should be set along with AWS credentials: ```toml [destination.filesystem] -bucket_url = "s3://[your_bucket_name]" # replace with your bucket name, +bucket_url = "s3://[your_bucket_name]" # replace with your bucket name [destination.filesystem.credentials] aws_access_key_id = "please set me up!" # copy the access key here @@ -120,7 +121,7 @@ aws_secret_access_key = "please set me up!" # copy the secret access key here endpoint_url = "https://.r2.cloudflarestorage.com" # copy your endpoint URL here ``` -#### Adding Additional Configuration +#### Adding additional configuration To pass any additional arguments to `fsspec`, you may supply `kwargs` and `client_kwargs` in the config as a **stringified dictionary**: @@ -131,25 +132,28 @@ client_kwargs = '{"verify": "public.crt"}' ``` ### Google Storage + Run `pip install "dlt[gs]"` which will install the `gcfs` package. To edit the `dlt` credentials file with your secret info, open `.dlt/secrets.toml`. You'll see AWS credentials by default. 
-Use Google cloud credentials that you may know from [BigQuery destination](bigquery.md) +Use Google Cloud credentials that you may know from [BigQuery destination](bigquery.md): + ```toml [destination.filesystem] -bucket_url = "gs://[your_bucket_name]" # replace with your bucket name, +bucket_url = "gs://[your_bucket_name]" # replace with your bucket name [destination.filesystem.credentials] project_id = "project_id" # please set me up! private_key = "private_key" # please set me up! client_email = "client_email" # please set me up! ``` + :::note -Note that you can share the same credentials with BigQuery, replace the `[destination.filesystem.credentials]` section with a less specific one: `[destination.credentials]` which applies to both destinations +Note that you can share the same credentials with BigQuery. Replace the `[destination.filesystem.credentials]` section with a less specific one: `[destination.credentials]` which applies to both destinations. ::: -if you have default google cloud credentials in your environment (i.e. on cloud function) remove the credentials sections above and `dlt` will fall back to the available default. +If you have default Google Cloud credentials in your environment (i.e., on Cloud Function), remove the credentials sections above and `dlt` will fall back to the available default. Use **Cloud Storage** admin to create a new bucket. Then assign the **Storage Object Admin** role to your service account. @@ -157,17 +161,17 @@ Use **Cloud Storage** admin to create a new bucket. Then assign the **Storage Ob Run `pip install "dlt[az]"` which will install the `adlfs` package to interface with Azure Blob Storage. -Edit the credentials in `.dlt/secrets.toml`, you'll see AWS credentials by default replace them with your Azure credentials. +Edit the credentials in `.dlt/secrets.toml`. You'll see AWS credentials by default; replace them with your Azure credentials. Two forms of Azure credentials are supported: #### SAS token credentials -Supply storage account name and either sas token or storage account key +Supply the storage account name and either the SAS token or storage account key: ```toml [destination.filesystem] -bucket_url = "az://[your_container name]" # replace with your container name +bucket_url = "az://[your_container_name]" # replace with your container name [destination.filesystem.credentials] # The storage account name is always required @@ -177,13 +181,13 @@ azure_storage_account_key = "account_key" # please set me up! azure_storage_sas_token = "sas_token" # please set me up! ``` -If you have the correct Azure credentials set up on your machine (e.g. via azure cli), +If you have the correct Azure credentials set up on your machine (e.g., via Azure CLI), you can omit both `azure_storage_account_key` and `azure_storage_sas_token` and `dlt` will fall back to the available default. Note that `azure_storage_account_name` is still required as it can't be inferred from the environment. #### Service principal credentials -Supply a client ID, client secret and a tenant ID for a service principal authorized to access your container +Supply a client ID, client secret, and a tenant ID for a service principal authorized to access your container. ```toml [destination.filesystem] @@ -197,7 +201,7 @@ azure_tenant_id = "tenant_id" # please set me up! :::caution **Concurrent blob uploads** -`dlt` limits the number of concurrent connections for a single uploaded blob to 1. 
By default `adlfs` that we use, splits blobs into 4 MB chunks and uploads them concurrently which leads to gigabytes of used memory and thousands of connections for a larger load packages. You can increase the maximum concurrency as follows: +`dlt` limits the number of concurrent connections for a single uploaded blob to 1. By default, `adlfs` that we use splits blobs into 4 MB chunks and uploads them concurrently, which leads to gigabytes of used memory and thousands of connections for larger load packages. You can increase the maximum concurrency as follows: ```toml [destination.filesystem.kwargs] max_concurrency=3 @@ -206,7 +210,7 @@ max_concurrency=3 ### Local file system -If for any reason you want to have those files in a local folder, set up the `bucket_url` as follows (you are free to use `config.toml` for that as there are no secrets required) +If for any reason you want to have those files in a local folder, set up the `bucket_url` as follows (you are free to use `config.toml` for that as there are no secrets required). ```toml [destination.filesystem] @@ -221,20 +225,20 @@ For handling deeply nested layouts, consider enabling automatic directory creati kwargs = '{"auto_mkdir": true}' ``` -Or by setting environment variable: +Or by setting the environment variable: ```sh export DESTINATION__FILESYSTEM__KWARGS = '{"auto_mkdir": true/false}' ``` ::: -`dlt` correctly handles the native local file paths. Indeed, using the `file://` schema may be not intuitive especially for Windows users. +`dlt` correctly handles the native local file paths. Indeed, using the `file://` schema may not be intuitive, especially for Windows users. ```toml [destination.unc_destination] bucket_url = 'C:\a\b\c' ``` -In the example above we specify `bucket_url` using **toml's literal strings** that do not require [escaping of backslashes](https://github.com/toml-lang/toml/blob/main/toml.md#string). +In the example above, we specify `bucket_url` using **toml's literal strings** that do not require [escaping of backslashes](https://github.com/toml-lang/toml/blob/main/toml.md#string). ```toml [destination.unc_destination] @@ -247,14 +251,12 @@ bucket_url = '/var/local/data' # absolute POSIX style path bucket_url = '_storage/data' # relative POSIX style path ``` -In the examples above we define a few named filesystem destinations: -* **unc_destination** demonstrates Windows UNC path in native form -* **posix_destination** demonstrates native POSIX (Linux/Mac) absolute path -* **relative_destination** demonstrates native POSIX (Linux/Mac) relative path. In this case `filesystem` destination will store files in `$cwd/_storage/data` path -where **$cwd** is your current working directory. +In the examples above, we define a few named filesystem destinations: +* **unc_destination** demonstrates Windows UNC path in native form. +* **posix_destination** demonstrates native POSIX (Linux/Mac) absolute path. +* **relative_destination** demonstrates native POSIX (Linux/Mac) relative path. In this case, the `filesystem` destination will store files in the `$cwd/_storage/data` path where **$cwd** is your current working directory. -`dlt` supports Windows [UNC paths with file:// scheme](https://en.wikipedia.org/wiki/File_URI_scheme). They can be specified using **host** or purely as **path** -component. +`dlt` supports Windows [UNC paths with file:// scheme](https://en.wikipedia.org/wiki/File_URI_scheme). They can be specified using **host** or purely as **path** component. 
```toml [destination.unc_with_host] @@ -265,9 +267,9 @@ bucket_url="file:////localhost/c$/a/b/c" ``` :::caution -Windows supports paths up to 255 characters. When you access a path longer than 255 characters you'll see `FileNotFound` exception. +Windows supports paths up to 255 characters. When you access a path longer than 255 characters, you'll see a `FileNotFound` exception. - To go over this limit you can use [extended paths](https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation?tabs=registry). `dlt` recognizes both regular and UNC extended paths +To go over this limit, you can use [extended paths](https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation?tabs=registry). `dlt` recognizes both regular and UNC extended paths. ```toml [destination.regular_extended] @@ -278,8 +280,10 @@ bucket_url='\\?\UNC\localhost\c$\a\b\c' ``` ::: + + ### SFTP -Run `pip install "dlt[sftp]` which will install the `paramiko` package alongside `dlt`, enabling secure SFTP transfers. +Run `pip install "dlt[sftp]"` which will install the `paramiko` package alongside `dlt`, enabling secure SFTP transfers. Configure your SFTP credentials by editing the `.dlt/secrets.toml` file. By default, the file contains placeholders for AWS credentials. You should replace these with your SFTP credentials. @@ -306,20 +310,19 @@ sftp_gss_trust_dns # Trust DNS for GSS-API, defaults to True ``` > For more information about credentials parameters: https://docs.paramiko.org/en/3.3/api/client.html#paramiko.client.SSHClient.connect -### Authentication Methods +### Authentication methods SFTP authentication is attempted in the following order of priority: 1. **Key-based authentication**: If you provide a `key_filename` containing the path to a private key or a corresponding OpenSSH public certificate (e.g., `id_rsa` and `id_rsa-cert.pub`), these will be used for authentication. If the private key requires a passphrase, you can specify it via `sftp_key_passphrase`. If your private key requires a passphrase to unlock, and you’ve provided one, it will be used to attempt to unlock the key. -2. **SSH Agent-based authentication**: If `allow_agent=True` (default), Paramiko will look for any SSH keys stored in your local SSH agent (such as `id_rsa`, `id_dsa`, or `id_ecdsa` keys stored in `~/.ssh/`). +2. **SSH agent-based authentication**: If `allow_agent=True` (default), Paramiko will look for any SSH keys stored in your local SSH agent (such as `id_rsa`, `id_dsa`, or `id_ecdsa` keys stored in `~/.ssh/`). 3. **Username/Password authentication**: If a password is provided (`sftp_password`), plain username/password authentication will be attempted. 4. **GSS-API authentication**: If GSS-API (Kerberos) is enabled (sftp_gss_auth=True), authentication will use the Kerberos protocol. GSS-API may also be used for key exchange (sftp_gss_kex=True) and credential delegation (sftp_gss_deleg_creds=True). This method is useful in environments where Kerberos is set up, often in enterprise networks. - -#### 1. **Key-based Authentication** +#### 1. **Key-based authentication** If you use an SSH key instead of a password, you can specify the path to your private key in the configuration. @@ -334,7 +337,7 @@ sftp_key_filename = "/path/to/id_rsa" # Replace with the path to your privat sftp_key_passphrase = "your_passphrase" # Optional: passphrase for your private key ``` -#### 2. **SSH Agent-based Authentication** +#### 2. 
**SSH agent-based authentication** If you have an SSH agent running with loaded keys, you can allow Paramiko to use these keys automatically. You can omit the password and key fields if you're relying on the SSH agent. @@ -349,7 +352,7 @@ sftp_key_passphrase = "your_passphrase" # Optional: passphrase for your privat ``` The loaded key must be one of the following types stored in ~/.ssh/: id_rsa, id_dsa, or id_ecdsa. -#### 3. **Username/Password Authentication** +#### 3. **Username/password authentication** This is the simplest form of authentication, where you supply a username and password directly. @@ -363,9 +366,8 @@ sftp_username = "foo" # Replace "foo" with your SFTP username sftp_password = "pass" # Replace "pass" with your SFTP password ``` - ### Notes: -- **Key-based Authentication**: Make sure your private key has the correct permissions (`chmod 600`), or SSH will refuse to use it. +- **Key-based authentication**: Make sure your private key has the correct permissions (`chmod 600`), or SSH will refuse to use it. - **Timeouts**: It's important to adjust timeout values based on your network conditions to avoid connection issues. This configuration allows flexible SFTP authentication, whether you're using passwords, keys, or agents, and ensures secure communication between your local environment and the SFTP server. @@ -396,10 +398,10 @@ def my_upsert_resource(): #### Known limitations - `hard_delete` hint not supported -- deleting records from nested tables not supported - - This means updates to json columns that involve element removals are not propagated. For example, if you first load `{"key": 1, "nested": [1, 2]}` and then load `{"key": 1, "nested": [1]}`, then the record for element `2` will not be deleted from the nested table. +- Deleting records from nested tables not supported + - This means updates to JSON columns that involve element removals are not propagated. For example, if you first load `{"key": 1, "nested": [1, 2]}` and then load `{"key": 1, "nested": [1]}`, then the record for element `2` will not be deleted from the nested table. -## File Compression +## File compression The filesystem destination in the dlt library uses `gzip` compression by default for efficiency, which may result in the files being stored in a compressed format. This format may not be easily readable as plain text or JSON Lines (`jsonl`) files. If you encounter files that seem unreadable, they may be compressed. @@ -420,10 +422,10 @@ For more details on managing file compression, please visit our documentation on All the files are stored in a single folder with the name of the dataset that you passed to the `run` or `load` methods of the `pipeline`. In our example chess pipeline, it is **chess_players_games_data**. :::note -Object storages are, in fact, key-blob storage so the folder structure is emulated by splitting file names into components by separator (`/`). +Object storages are, in fact, key-blob storage, so the folder structure is emulated by splitting file names into components by a separator (`/`). ::: -You can control files layout by specifying the desired configuration. There are several ways to do this. +You can control the files layout by specifying the desired configuration. There are several ways to do this. 
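Before going through the individual options, here is a short sketch of one way to do it: passing the layout directly to the filesystem destination factory in code. This assumes the factory accepts a `layout` argument in the same way as the advanced configuration shown further below; the bucket URL and the layout value are placeholders.

```py
import dlt
from dlt.destinations import filesystem

# A placeholder bucket and a simple one-folder-per-table layout
# (see the placeholder reference in the following sections)
pipeline = dlt.pipeline(
    pipeline_name="chess_pipeline",
    destination=filesystem(
        bucket_url="s3://[your_bucket_name]",
        layout="{table_name}/{load_id}.{file_id}.{ext}",
    ),
    dataset_name="chess_players_games_data",
)
```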
### Default layout
@@ -439,9 +441,9 @@ The default layout format has changed from `{schema_name}.{table_name}.{load_id}

* `schema_name` - the name of the [schema](../../general-usage/schema.md)
* `table_name` - table name
-* `load_id` - the id of the [load package](../../general-usage/destination-tables.md#load-packages-and-load-ids) from which the file comes from
-* `file_id` - the id of the file, is there are many files with data for a single table, they are copied with different file ids
-* `ext` - a format of the file i.e. `jsonl` or `parquet`
+* `load_id` - the id of the [load package](../../general-usage/destination-tables.md#load-packages-and-load-ids) from which the file comes
+* `file_id` - the id of the file; if there are many files with data for a single table, they are copied with different file ids
+* `ext` - the format of the file, i.e., `jsonl` or `parquet`

#### Date and time placeholders
:::tip
@@ -454,7 +456,7 @@ Keep in mind all values are lowercased.

* `load_package_timestamp_ms` - timestamp from [load package](../../general-usage/destination-tables.md#load-packages-and-load-ids) in Unix Timestamp format in milliseconds

:::note
-Both `timestamp_ms` and `load_package_timestamp_ms` are in milliseconds (e.g., 12334455233), not fractional seconds to make sure millisecond precision without decimals.
+Both `timestamp_ms` and `load_package_timestamp_ms` are in milliseconds (e.g., 12334455233), not fractional seconds to ensure millisecond precision without decimals.
:::

* Years
@@ -506,20 +508,22 @@ layout="{table_name}/{load_id}.{file_id}.{ext}" # current preconfigured naming s

# Custom placeholders
# extra_placeholders = { "owner" = "admin", "department" = "finance" }
# layout = "{table_name}/{owner}/{department}/{load_id}.{file_id}.{ext}"
```

A few things to know when specifying your filename layout:
- If you want a different base path that is common to all filenames, you can suffix your `bucket_url` rather than prefix your `layout` setting.
- If you do not provide the `{ext}` placeholder, it will automatically be added to your layout at the end with a dot as a separator.
-- It is the best practice to have a separator between each placeholder. Separators can be any character allowed as a filename character, but dots, dashes, and forward slashes are most common.
-- When you are using the `replace` disposition, `dlt` will have to be able to figure out the correct files to delete before loading the new data. For this to work, you have to
-  - include the `{table_name}` placeholder in your layout
-  - not have any other placeholders except for the `{schema_name}` placeholder before the table_name placeholder and
-  - have a separator after the table_name placeholder
+- It is best practice to have a separator between each placeholder. Separators can be any character allowed as a filename character, but dots, dashes, and forward slashes are most common.
+- When you are using the `replace` disposition, `dlt` will have to be able to figure out the correct files to delete before loading the new data. For this to work, you have to:
+  - include the `{table_name}` placeholder in your layout,
+  - not have any other placeholders except for the `{schema_name}` placeholder before the table_name placeholder, and
+  - have a separator after the table_name placeholder.

Please note:
-- `dlt` will mark complete loads by creating a json file in the `./_dlt_loads` folders that corresponds to the`_dlt_loads` table.
For example, if `chess__1685299832.jsonl` file is present in the loads folder, you can be sure that all files for the load package `1685299832` are completely loaded +- `dlt` will mark complete loads by creating a json file in the `./_dlt_loads` folders that corresponds to the `_dlt_loads` table. For example, if the `chess__1685299832.jsonl` file is present in the loads folder, you can be sure that all files for the load package `1685299832` are completely loaded. ### Advanced layout configuration @@ -564,10 +568,10 @@ pipeline = dlt.pipeline( ) ``` -Furthermore, it is possible to +Furthermore, it is possible to: 1. Customize the behavior with callbacks for extra placeholder functionality. Each callback must accept the following positional arguments and return a string. -2. Customize the `current_datetime`, which can also be a callback function and expected to return a `pendulum.DateTime` instance. +2. Customize the `current_datetime`, which can also be a callback function and is expected to return a `pendulum.DateTime` instance. ```py import pendulum @@ -603,20 +607,20 @@ layout="{table_name}/{load_id}.{file_id}.{ext}" ``` Adopting this layout offers several advantages: -1. **Efficiency:** it's fast and simple to process. -2. **Compatibility:** supports `replace` as the write disposition method. -3. **Flexibility:** compatible with various destinations, including Athena. -4. **Performance:** a deeply nested structure can slow down file navigation, whereas a simpler layout mitigates this issue. +1. **Efficiency:** It's fast and simple to process. +2. **Compatibility:** Supports `replace` as the write disposition method. +3. **Flexibility:** Compatible with various destinations, including Athena. +4. **Performance:** A deeply nested structure can slow down file navigation, whereas a simpler layout mitigates this issue. ## Supported file formats You can choose the following file formats: -* [jsonl](../file-formats/jsonl.md) is used by default -* [parquet](../file-formats/parquet.md) is supported -* [csv](../file-formats/csv.md) is supported +* [jsonl](../file-formats/jsonl.md) is used by default. +* [parquet](../file-formats/parquet.md) is supported. +* [csv](../file-formats/csv.md) is supported. ## Supported table formats You can choose the following table formats: -* [Delta](../table-formats/delta.md) is supported +* [Delta](../table-formats/delta.md) is supported. ### Delta table format You need the `deltalake` package to use this format: @@ -657,7 +661,6 @@ def my_delta_resource(): It is **not** possible to change partition columns after the Delta table has been created. Trying to do so causes an error stating that the partition columns don't match. ::: - #### Storage options You can pass storage options by configuring `destination.filesystem.deltalake_storage_options`: @@ -668,7 +671,7 @@ deltalake_storage_options = '{"AWS_S3_LOCKING_PROVIDER": "dynamodb", DELTA_DYNAM `dlt` passes these options to the `storage_options` argument of the `write_deltalake` method in the `deltalake` library. Look at their [documentation](https://delta-io.github.io/delta-rs/api/delta_writer/#deltalake.write_deltalake) to see which options can be used. -You don't need to specify credentials here. `dlt` merges the required credentials with the options you provided, before passing it as `storage_options`. +You don't need to specify credentials here. `dlt` merges the required credentials with the options you provided before passing it as `storage_options`. 
>❗When using `s3`, you need to specify storage options to [configure](https://delta-io.github.io/delta-rs/usage/writing/writing-to-s3-with-locking-provider/) locking behavior. @@ -692,11 +695,9 @@ delta_tables["another_delta_table"].optimize.z_order(["col_a", "col_b"]) ``` ## Syncing of `dlt` state -This destination fully supports [dlt state sync](../../general-usage/state#syncing-state-with-destination). To this end, special folders and files that will be created at your destination which hold information about your pipeline state, schemas and completed loads. These folders DO NOT respect your -settings in the layout section. When using filesystem as a staging destination, not all of these folders are created, as the state and schemas are -managed in the regular way by the final destination you have configured. +This destination fully supports [dlt state sync](../../general-usage/state#syncing-state-with-destination). To this end, special folders and files will be created at your destination, which hold information about your pipeline state, schemas, and completed loads. These folders DO NOT respect your settings in the layout section. When using the filesystem as a staging destination, not all of these folders are created, as the state and schemas are managed in the regular way by the final destination you have configured. + +You will also notice `init` files being present in the root folder and the special `dlt` folders. In the absence of the concepts of schemas and tables in blob storages and directories, `dlt` uses these special files to harmonize the behavior of the `filesystem` destination with the other implemented destinations. -You will also notice `init` files being present in the root folder and the special `dlt` folders. In the absence of the concepts of schemas and tables -in blob storages and directories, `dlt` uses these special files to harmonize the behavior of the `filesystem` destination with the other implemented destinations. + - \ No newline at end of file diff --git a/docs/website/docs/dlt-ecosystem/destinations/index.md b/docs/website/docs/dlt-ecosystem/destinations/index.md index fef79d4364..e1bc6bfd92 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/index.md +++ b/docs/website/docs/dlt-ecosystem/destinations/index.md @@ -14,3 +14,4 @@ Pick one of our high-quality destinations and load your data into a local databa Otherwise, pick a destination below: + diff --git a/docs/website/docs/dlt-ecosystem/destinations/lancedb.md b/docs/website/docs/dlt-ecosystem/destinations/lancedb.md index 0d726508e6..1151dd4323 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/lancedb.md +++ b/docs/website/docs/dlt-ecosystem/destinations/lancedb.md @@ -9,13 +9,12 @@ keywords: [ lancedb, vector database, destination, dlt ] [LanceDB](https://lancedb.com/) is an open-source, high-performance vector database. It allows you to store data objects and perform similarity searches over them. This destination helps you load data into LanceDB from [dlt resources](../../general-usage/resource.md). -## Setup Guide +## Setup guide -### Choosing a Model Provider +### Choosing a model provider First, you need to decide which embedding model provider to use. You can find all supported providers by visiting the official [LanceDB docs](https://lancedb.github.io/lancedb/embeddings/default_embedding_functions/). 
- ### Install dlt with LanceDB To use LanceDB as a destination, make sure `dlt` is installed with the `lancedb` extra: @@ -24,9 +23,9 @@ To use LanceDB as a destination, make sure `dlt` is installed with the `lancedb` pip install "dlt[lancedb]" ``` -the lancedb extra only installs `dlt` and `lancedb`. You will need to install your model provider's SDK. +The lancedb extra only installs `dlt` and `lancedb`. You will need to install your model provider's SDK. -You can find which libraries you need to also referring to the [LanceDB docs](https://lancedb.github.io/lancedb/embeddings/default_embedding_functions/). +You can find which libraries you need by referring to the [LanceDB docs](https://lancedb.github.io/lancedb/embeddings/default_embedding_functions/). ### Configure the destination @@ -43,14 +42,14 @@ embedding_model_provider_api_key = "embedding_model_provider_api_key" # Not need ``` - The `uri` specifies the location of your LanceDB instance. It defaults to a local, on-disk instance if not provided. -- The `api_key` is your api key for LanceDB Cloud connections. If you're using LanceDB OSS, you don't need to supply this key. +- The `api_key` is your API key for LanceDB Cloud connections. If you're using LanceDB OSS, you don't need to supply this key. - The `embedding_model_provider` specifies the embedding provider used for generating embeddings. The default is `cohere`. - The `embedding_model` specifies the model used by the embedding provider for generating embeddings. Check with the embedding provider which options are available. Reference https://lancedb.github.io/lancedb/embeddings/default_embedding_functions/. - The `embedding_model_provider_api_key` is the API key for the embedding model provider used to generate embeddings. If you're using a provider that doesn't need authentication, say ollama, you don't need to supply this key. -:::info Available Model Providers +:::info Available model providers - "gemini-text" - "bedrock-text" - "cohere" @@ -118,7 +117,7 @@ The data is now loaded into LanceDB. To use **vector search** after loading, you **must specify which fields LanceDB should generate embeddings for**. Do this by wrapping the data (or dlt resource) with the **`lancedb_adapter`** function. -## Using an Adapter to Specify Columns to Vectorise +## Using an adapter to specify columns to vectorize Out of the box, LanceDB will act as a normal database. To use LanceDB's embedding facilities, you'll need to specify which fields you'd like to embed in your dlt resource. @@ -130,7 +129,7 @@ lancedb_adapter(data, embed) It accepts the following arguments: -- `data`: a dlt resource object, or a Python data structure (e.g. a list of dictionaries). +- `data`: a dlt resource object, or a Python data structure (e.g., a list of dictionaries). - `embed`: a name of the field or a list of names to generate embeddings for. Returns: [dlt resource](../../general-usage/resource.md) object that you can pass to the `pipeline.run()`. @@ -198,14 +197,13 @@ pipeline.run( This is the default disposition. It will append the data to the existing data in the destination. -## Additional Destination Options +## Additional destination options - `dataset_separator`: The character used to separate the dataset name from table names. Defaults to "___". - `vector_field_name`: The name of the special field to store vector embeddings. Defaults to "vector". - `id_field_name`: The name of the special field used for deduplication and merging. Defaults to "id__". 
- `max_retries`: The maximum number of retries for embedding operations. Set to 0 to disable retries. Defaults to 3. - ## dbt support The LanceDB destination doesn't support dbt integration. @@ -214,9 +212,9 @@ The LanceDB destination doesn't support dbt integration. The LanceDB destination supports syncing of the `dlt` state. -## Current Limitations +## Current limitations -### In-Memory Tables +### In-memory tables Adding new fields to an existing LanceDB table requires loading the entire table data into memory as a PyArrow table. This is because PyArrow tables are immutable, so adding fields requires creating a new table with the updated schema. @@ -226,9 +224,9 @@ Keep these considerations in mind when working with large datasets and monitor m ### Null string handling for OpenAI embeddings -OpenAI embedding service doesn't accept empty string bodies. We deal with this by replacing empty strings with a placeholder that should be very semantically dissimilar to 99.9% of queries. +The OpenAI embedding service doesn't accept empty string bodies. We deal with this by replacing empty strings with a placeholder that should be very semantically dissimilar to 99.9% of queries. -If your source column (column which is embedded) has empty values, it is important to consider the impact of this. There might be a _slight_ change that semantic queries can hit these empty strings. +If your source column (the column which is embedded) has empty values, it is important to consider the impact of this. There might be a _slight_ chance that semantic queries can hit these empty strings. We reported this issue to LanceDB: https://github.com/lancedb/lancedb/issues/1577. diff --git a/docs/website/docs/dlt-ecosystem/destinations/motherduck.md b/docs/website/docs/dlt-ecosystem/destinations/motherduck.md index f75314bb44..a79138ccb7 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/motherduck.md +++ b/docs/website/docs/dlt-ecosystem/destinations/motherduck.md @@ -21,7 +21,7 @@ workers=3 or export the **LOAD__WORKERS=3** env variable. See more in [performance](../../reference/performance.md) ::: -## Setup Guide +## Setup guide **1. Initialize a project with a pipeline that loads to MotherDuck by running** ```sh @@ -50,7 +50,7 @@ motherduck.credentials="md:///dlt_data_3?token=" ``` :::tip -Motherduck now supports configurable **access tokens**. Please refer to the [documentation](https://motherduck.com/docs/key-tasks/authenticating-to-motherduck/#authentication-using-an-access-token) +MotherDuck now supports configurable **access tokens**. Please refer to the [documentation](https://motherduck.com/docs/key-tasks/authenticating-to-motherduck/#authentication-using-an-access-token) ::: **4. Run the pipeline** @@ -58,9 +58,8 @@ Motherduck now supports configurable **access tokens**. Please refer to the [doc python3 chess_pipeline.py ``` -### Motherduck connection identifier -We enable Motherduck to identify that the connection is created by `dlt`. Motherduck will use this identifier to better understand the usage patterns -associated with `dlt` integration. The connection identifier is `dltHub_dlt/DLT_VERSION(OS_NAME)`. +### MotherDuck connection identifier +We enable MotherDuck to identify that the connection is created by `dlt`. MotherDuck will use this identifier to better understand the usage patterns associated with `dlt` integration. The connection identifier is `dltHub_dlt/DLT_VERSION(OS_NAME)`. ## Write disposition All write dispositions are supported. 
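As an alternative to `secrets.toml`, the `md:` connection string from the setup section above can also be handed to the destination factory in code. This is only a sketch and assumes the factory accepts the connection string through its `credentials` argument; the token value is a placeholder.

```py
import dlt

pipeline = dlt.pipeline(
    pipeline_name="chess_pipeline",
    # The "md:" connection string carries the database name and the access token (placeholder below)
    destination=dlt.destinations.motherduck(credentials="md:///dlt_data_3?token=<my_access_token>"),
    dataset_name="chess_players_games_data",
)
```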
@@ -71,19 +70,19 @@ By default, Parquet files and the `COPY` command are used to move files to the r The **INSERT** format is also supported and will execute large INSERT queries directly into the remote database. This method is significantly slower and may exceed the maximum query size, so it is not advised. ## dbt support -This destination [integrates with dbt](../transformations/dbt/dbt.md) via [dbt-duckdb](https://github.com/jwills/dbt-duckdb), which is a community-supported package. `dbt` version >= 1.7 is required +This destination [integrates with dbt](../transformations/dbt/dbt.md) via [dbt-duckdb](https://github.com/jwills/dbt-duckdb), which is a community-supported package. `dbt` version >= 1.7 is required. ## Multi-statement transaction support -Motherduck supports multi-statement transactions. This change happened with `duckdb 0.10.2`. +MotherDuck supports multi-statement transactions. This change happened with `duckdb 0.10.2`. ## Syncing of `dlt` state This destination fully supports [dlt state sync](../../general-usage/state#syncing-state-with-destination). ## Troubleshooting -### My database is attached in read only mode -ie. `Error: Invalid Input Error: Cannot execute statement of type "CREATE" on database "dlt_data" which is attached in read-only mode!` -We encountered this problem for databases created with `duckdb 0.9.x` and then migrated to `0.10.x`. After switch to `1.0.x` on Motherduck, all our databases had permission "read-only" visible in UI. We could not figure out how to change it so we dropped and recreated our databases. +### My database is attached in read-only mode +i.e., `Error: Invalid Input Error: Cannot execute statement of type "CREATE" on database "dlt_data" which is attached in read-only mode!` +We encountered this problem for databases created with `duckdb 0.9.x` and then migrated to `0.10.x`. After switching to `1.0.x` on MotherDuck, all our databases had permission "read-only" visible in the UI. We could not figure out how to change it, so we dropped and recreated our databases. ### I see some exception with home_dir missing when opening `md:` connection. Some internal component (HTTPS) requires the **HOME** env variable to be present. Export such a variable to the command line. Here is what we do in our tests: @@ -94,3 +93,5 @@ before opening the connection. + + diff --git a/docs/website/docs/dlt-ecosystem/destinations/mssql.md b/docs/website/docs/dlt-ecosystem/destinations/mssql.md index 0512fd5fca..b5ef0248f4 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/mssql.md +++ b/docs/website/docs/dlt-ecosystem/destinations/mssql.md @@ -54,7 +54,7 @@ host = "loader.database.windows.net" port = 1433 connect_timeout = 15 [destination.mssql.credentials.query] -# trust self signed SSL certificates +# trust self-signed SSL certificates TrustServerCertificate="yes" # require SSL connection Encrypt="yes" @@ -76,17 +76,17 @@ You can place any ODBC-specific settings into the query string or **destination. 
destination.mssql.credentials="mssql://loader.database.windows.net/dlt_data?trusted_connection=yes" ``` -**To connect to a local sql server instance running without SSL** pass `encrypt=no` parameter: +**To connect to a local SQL server instance running without SSL** pass `encrypt=no` parameter: ```toml destination.mssql.credentials="mssql://loader:loader@localhost/dlt_data?encrypt=no" ``` -**To allow self signed SSL certificate** when you are getting `certificate verify failed:unable to get local issuer certificate`: +**To allow self-signed SSL certificate** when you are getting `certificate verify failed: unable to get local issuer certificate`: ```toml destination.mssql.credentials="mssql://loader:loader@localhost/dlt_data?TrustServerCertificate=yes" ``` -***To use long strings (>8k) and avoid collation errors**: +**To use long strings (>8k) and avoid collation errors**: ```toml destination.mssql.credentials="mssql://loader:loader@localhost/dlt_data?LongAsMax=yes" ``` @@ -111,13 +111,15 @@ Data is loaded via INSERT statements by default. MSSQL has a limit of 1000 rows ## Supported file formats * [insert-values](../file-formats/insert-format.md) is used by default + + ## Supported column hints **mssql** will create unique indexes for all columns with `unique` hints. This behavior **may be disabled**. ### Table and column identifiers -SQL Server **with the default collation** uses case insensitive identifiers but will preserve the casing of identifiers that are stored in the INFORMATION SCHEMA. You can use [case sensitive naming conventions](../../general-usage/naming-convention.md#case-sensitive-and-insensitive-destinations) to keep the identifier casing. Note that you risk to generate identifier collisions, which are detected by `dlt` and will fail the load process. +SQL Server **with the default collation** uses case-insensitive identifiers but will preserve the casing of identifiers that are stored in the INFORMATION SCHEMA. You can use [case-sensitive naming conventions](../../general-usage/naming-convention.md#case-sensitive-and-insensitive-destinations) to keep the identifier casing. Note that you risk generating identifier collisions, which are detected by `dlt` and will fail the load process. -If you change SQL Server server/database collation to case sensitive, this will also affect the identifiers. Configure your destination as below in order to use case sensitive naming conventions without collisions: +If you change SQL Server server/database collation to case-sensitive, this will also affect the identifiers. Configure your destination as below in order to use case-sensitive naming conventions without collisions: ```toml [destination.mssql] has_case_sensitive_identifiers=true diff --git a/docs/website/docs/dlt-ecosystem/destinations/postgres.md b/docs/website/docs/dlt-ecosystem/destinations/postgres.md index e506eb79fe..53da436853 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/postgres.md +++ b/docs/website/docs/dlt-ecosystem/destinations/postgres.md @@ -12,7 +12,7 @@ keywords: [postgres, destination, data warehouse] pip install "dlt[postgres]" ``` -## Setup Guide +## Setup guide **1. Initialize a project with a pipeline that loads to Postgres by running:** ```sh @@ -37,7 +37,7 @@ Add the `dlt_data` database to `.dlt/secrets.toml`. CREATE USER loader WITH PASSWORD ''; ``` -Add the `loader` user and `` password to `.dlt/secrets.toml`. +Add the `loader` user and `` password to `.dlt/secrets.toml. **5. 
Give the `loader` user owner permissions by running:** ```sql @@ -46,7 +46,7 @@ ALTER DATABASE dlt_data OWNER TO loader; You can set more restrictive permissions (e.g., give user access to a specific schema). -**6. Enter your credentials into `.dlt/secrets.toml`.** +**6. Enter your credentials into `.dlt/secrets.toml.** It should now look like this: ```toml [destination.postgres.credentials] @@ -104,24 +104,21 @@ pipeline.run(events()) ``` ### Fast loading with arrow tables and csv -You can use [arrow tables](../verified-sources/arrow-pandas.md) and [csv](../file-formats/csv.md) to quickly load tabular data. Pick the `csv` loader file format -like below +You can use [arrow tables](../verified-sources/arrow-pandas.md) and [csv](../file-formats/csv.md) to quickly load tabular data. Pick the `csv` loader file format like below: ```py info = pipeline.run(arrow_table, loader_file_format="csv") ``` -In the example above `arrow_table` will be converted to csv with **pyarrow** and then streamed into **postgres** with COPY command. This method skips the regular -`dlt` normalizer used for Python objects and is several times faster. +In the example above, `arrow_table` will be converted to csv with **pyarrow** and then streamed into **postgres** with the COPY command. This method skips the regular `dlt` normalizer used for Python objects and is several times faster. ## Supported file formats * [insert-values](../file-formats/insert-format.md) is used by default. -* [csv](../file-formats/csv.md) is supported +* [csv](../file-formats/csv.md) is supported. ## Supported column hints `postgres` will create unique indexes for all columns with `unique` hints. This behavior **may be disabled**. ### Table and column identifiers -Postgres supports both case sensitive and case insensitive identifiers. All unquoted and lowercase identifiers resolve case-insensitively in SQL statements. Case insensitive [naming conventions](../../general-usage/naming-convention.md#case-sensitive-and-insensitive-destinations) like the default **snake_case** will generate case insensitive identifiers. Case sensitive (like **sql_cs_v1**) will generate -case sensitive identifiers that must be quoted in SQL statements. +Postgres supports both case-sensitive and case-insensitive identifiers. All unquoted and lowercase identifiers resolve case-insensitively in SQL statements. Case-insensitive [naming conventions](../../general-usage/naming-convention.md#case-sensitive-and-insensitive-destinations) like the default **snake_case** will generate case-insensitive identifiers. Case-sensitive (like **sql_cs_v1**) will generate case-sensitive identifiers that must be quoted in SQL statements. ## Additional destination options The Postgres destination creates UNIQUE indexes by default on columns with the `unique` hint (i.e., `_dlt_id`). To disable this behavior: @@ -131,7 +128,7 @@ create_indexes=false ``` ### Setting up `csv` format -You can provide [non-default](../file-formats/csv.md#default-settings) csv settings via configuration file or explicitly. +You can provide [non-default](../file-formats/csv.md#default-settings) csv settings via a configuration file or explicitly. ```toml [destination.postgres.csv_format] delimiter="|" @@ -146,10 +143,10 @@ csv_format = CsvFormatConfiguration(delimiter="|", include_header=False) dest_ = postgres(csv_format=csv_format) ``` -Above we set `csv` file without header, with **|** as a separator. +Above, we set the `csv` file without a header, with **|** as a separator. 
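
As a usage sketch, the configured factory can then be passed to a pipeline in place of the plain `postgres` destination name (credentials still come from `.dlt/secrets.toml`; the pipeline and dataset names are illustrative):

```py
import dlt
from dlt.destinations import postgres
from dlt.common.data_writers.configuration import CsvFormatConfiguration

# same settings as above: "|" as delimiter, no header row
dest_ = postgres(csv_format=CsvFormatConfiguration(delimiter="|", include_header=False))

pipeline = dlt.pipeline(
    pipeline_name="postgres_csv_pipeline",
    destination=dest_,
    dataset_name="csv_data",
)
```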
:::tip -You'll need those setting when [importing external files](../../general-usage/resource.md#import-external-files) +You'll need those settings when [importing external files](../../general-usage/resource.md#import-external-files). ::: ### dbt support diff --git a/docs/website/docs/dlt-ecosystem/destinations/qdrant.md b/docs/website/docs/dlt-ecosystem/destinations/qdrant.md index 5fc8097440..7a9c7e43af 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/qdrant.md +++ b/docs/website/docs/dlt-ecosystem/destinations/qdrant.md @@ -9,7 +9,7 @@ keywords: [qdrant, vector database, destination, dlt] [Qdrant](https://qdrant.tech/) is an open-source, high-performance vector search engine/database. It deploys as an API service, providing a search for the nearest high-dimensional vectors. This destination helps you load data into Qdrant from [dlt resources](../../general-usage/resource.md). -## Setup Guide +## Setup guide 1. To use Qdrant as a destination, make sure `dlt` is installed with the `qdrant` extra: @@ -131,6 +131,8 @@ A more comprehensive pipeline would load data from some API or use one of dlt's A [write disposition](../../general-usage/incremental-loading.md#choosing-a-write-disposition) defines how the data should be written to the destination. All write dispositions are supported by the Qdrant destination. + + ### Replace The [replace](../../general-usage/full-loading.md) disposition replaces the data in the destination with the data from the resource. It deletes all the classes and objects and recreates the schema before loading the data. @@ -230,7 +232,7 @@ The `QdrantClientOptions` class provides options for configuring the Qdrant clie ### Run Qdrant locally -You can find the setup instructions to run Qdrant [here](https://qdrant.tech/documentation/quick-start/#download-and-run) +You can find the setup instructions to run Qdrant [here](https://qdrant.tech/documentation/quick-start/#download-and-run). ### Syncing of `dlt` state diff --git a/docs/website/docs/dlt-ecosystem/destinations/redshift.md b/docs/website/docs/dlt-ecosystem/destinations/redshift.md index 529424a198..20b157b112 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/redshift.md +++ b/docs/website/docs/dlt-ecosystem/destinations/redshift.md @@ -12,7 +12,7 @@ keywords: [redshift, destination, data warehouse] pip install "dlt[redshift]" ``` -## Setup Guide +## Setup guide ### 1. Initialize the dlt project Let's start by initializing a new dlt project as follows: @@ -94,16 +94,15 @@ Amazon Redshift supports the following column hints: - `sort` - This hint creates a SORTKEY to order rows on disk physically. It is used to improve query and join speed in Redshift. Please read the [sort key docs](https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html) to learn more. ### Table and column identifiers -Redshift **by default** uses case insensitive identifiers and **will lower case all the identifiers** that are stored in the INFORMATION SCHEMA. Do not use -[case sensitive naming conventions](../../general-usage/naming-convention.md#case-sensitive-and-insensitive-destinations). Letter casing will be removed anyway and you risk to generate identifier collisions, which are detected by `dlt` and will fail the load process. +Redshift **by default** uses case-insensitive identifiers and **will lowercase all the identifiers** that are stored in the INFORMATION SCHEMA. 
Do not use
+[case-sensitive naming conventions](../../general-usage/naming-convention.md#case-sensitive-and-insensitive-destinations). Letter casing will be removed anyway and you risk generating identifier collisions, which are detected by `dlt` and will fail the load process.
 
-You can [put Redshift in case sensitive mode](https://docs.aws.amazon.com/redshift/latest/dg/r_enable_case_sensitive_identifier.html). Configure your destination as below in order to use case sensitive naming conventions:
+You can [put Redshift in case-sensitive mode](https://docs.aws.amazon.com/redshift/latest/dg/r_enable_case_sensitive_identifier.html). Configure your destination as below in order to use case-sensitive naming conventions:
 ```toml
 [destination.redshift]
 has_case_sensitive_identifiers=true
 ```
 
-
 ## Staging support
 
 Redshift supports s3 as a file staging destination. dlt will upload files in the parquet format to s3 and ask Redshift to copy their data directly into the db. Please refer to the [S3 documentation](./filesystem.md#aws-s3) to learn how to set up your s3 bucket with the bucket_url and credentials. The `dlt` Redshift loader will use the AWS credentials provided for s3 to access the s3 bucket if not specified otherwise (see config options below). Alternatively to parquet files, you can also specify jsonl as the staging file format. For this, set the `loader_file_format` argument of the `run` command of the pipeline to `jsonl`.
 
@@ -111,8 +110,8 @@ Redshift supports s3 as a file staging destination. dlt will upload files in the
 ## Identifier names and case sensitivity
 * Up to 127 characters
 * Case insensitive
-* Stores identifiers in lower case
-* Has case sensitive mode, if enabled you must [enable case sensitivity in destination factory](../../general-usage/destination.md#control-how-dlt-creates-table-column-and-other-identifiers)
+* Stores identifiers in lowercase
+* Has case-sensitive mode, if enabled you must [enable case sensitivity in destination factory](../../general-usage/destination.md#control-how-dlt-creates-table-column-and-other-identifiers)
 
 ### Authentication IAM Role
 
@@ -128,22 +127,25 @@ staging_iam_role="arn:aws:iam::..."
 ```py
 # Create a dlt pipeline that will load
 # chess player data to the redshift destination
-# via staging on s3
+# via staging on S3
 pipeline = dlt.pipeline(
     pipeline_name='chess_pipeline',
     destination='redshift',
-    staging='filesystem', # add this to activate the staging location
+    staging='filesystem', # Add this to activate the staging location
     dataset_name='player_data'
 )
 ```
 
 ## Additional destination options
 
 ### dbt support
 
 - This destination [integrates with dbt](../transformations/dbt) via [dbt-redshift](https://github.com/dbt-labs/dbt-redshift). Credentials and timeout settings are shared automatically with `dbt`.
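
For instance, a minimal sketch of running a dbt package against the Redshift destination with dlt's dbt runner; the package location below is a placeholder:

```py
import dlt

pipeline = dlt.pipeline(
    pipeline_name="chess_pipeline",
    destination="redshift",
    dataset_name="player_data",
)

# "chess_dbt_package" is a placeholder path to your dbt project
dbt = dlt.dbt.package(pipeline, "chess_dbt_package")
models = dbt.run_all()
for m in models:
    print(f"Model {m.model_name} materialized in {m.time} with status {m.status}")
```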
-### Syncing of `dlt` state -- This destination fully supports [dlt state sync.](../../general-usage/state#syncing-state-with-destination) +### Syncing of `dlt` state +- This destination fully supports [dlt state sync.](../../general-usage/state#syncing-state-with-destination) ## Supported loader file formats diff --git a/docs/website/docs/dlt-ecosystem/destinations/snowflake.md b/docs/website/docs/dlt-ecosystem/destinations/snowflake.md index 74688ba7fa..2b2735be9a 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/snowflake.md +++ b/docs/website/docs/dlt-ecosystem/destinations/snowflake.md @@ -12,7 +12,7 @@ keywords: [Snowflake, destination, data warehouse] pip install "dlt[snowflake]" ``` -## Setup Guide +## Setup guide **1. Initialize a project with a pipeline that loads to Snowflake by running:** ```sh @@ -44,7 +44,7 @@ In the case of Snowflake, the **host** is your [Account Identifier](https://docs The **warehouse** and **role** are optional if you assign defaults to your user. In the example below, we do not do that, so we set them explicitly. -### Setup the database user and permissions +### Set up the database user and permissions The instructions below assume that you use the default account setup that you get after creating a Snowflake account. You should have a default warehouse named **COMPUTE_WH** and a Snowflake account. Below, we create a new database, user, and assign permissions. The permissions are very generous. A more experienced user can easily reduce `dlt` permissions to just one schema in the database. ```sql --create database with standard settings @@ -67,11 +67,10 @@ GRANT ALL PRIVILEGES ON FUTURE TABLES IN DATABASE dlt_data TO DLT_LOADER_ROLE; Now you can use the user named `LOADER` to access the database `DLT_DATA` and log in with the specified password. -You can also decrease the suspend time for your warehouse to 1 minute (**Admin**/**Warehouses** in Snowflake UI) +You can also decrease the suspend time for your warehouse to 1 minute (**Admin**/**Warehouses** in Snowflake UI). ### Authentication types Snowflake destination accepts three authentication types: -Snowflake destination accepts three authentication types: - password authentication - [key pair authentication](https://docs.snowflake.com/en/user-guide/key-pair-auth) - oauth authentication @@ -80,6 +79,8 @@ The **password authentication** is not any different from other databases like P You can also pass credentials as a database connection string. For example: ```toml + + # keep it at the top of your toml file! before any section starts destination.snowflake.credentials="snowflake://loader:@kgiotue-wn98412/dlt_data?warehouse=COMPUTE_WH&role=DLT_LOADER_ROLE" @@ -102,7 +103,7 @@ If you pass a passphrase in the connection string, please URL encode it. destination.snowflake.credentials="snowflake://loader:@kgiotue-wn98412/dlt_data?private_key=&private_key_passphrase=" ``` -In **oauth authentication**, you can use an OAuth provider like Snowflake, Okta or an external browser to authenticate. In case of Snowflake oauth, you pass your `authenticator` and refresh `token` as below: +In **oauth authentication**, you can use an OAuth provider like Snowflake, Okta, or an external browser to authenticate. In the case of Snowflake oauth, you pass your `authenticator` and refresh `token` as below: ```toml [destination.snowflake.credentials] database = "dlt_data" @@ -112,10 +113,10 @@ token="..." ``` or in the connection string as query parameters. 
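
Whichever authentication type you use, you can also build the destination in code and pass the connection string directly, as in this minimal sketch (the account identifier and password are placeholders):

```py
import dlt
from dlt.destinations import snowflake

dest = snowflake(
    credentials="snowflake://loader:<password>@<account_identifier>/dlt_data"
    "?warehouse=COMPUTE_WH&role=DLT_LOADER_ROLE"
)
pipeline = dlt.pipeline(
    pipeline_name="chess_pipeline",
    destination=dest,
    dataset_name="chess_data",
)
```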
-In case of external authentication, you need to find documentation for your OAuth provider. Refer to Snowflake [OAuth](https://docs.snowflake.com/en/user-guide/oauth-intro) for more details. +In the case of external authentication, you need to find documentation for your OAuth provider. Refer to Snowflake [OAuth](https://docs.snowflake.com/en/user-guide/oauth-intro) for more details. ### Additional connection options -We pass all query parameters to `connect` function of Snowflake Python Connector. For example: +We pass all query parameters to the `connect` function of the Snowflake Python Connector. For example: ```toml [destination.snowflake.credentials] database = "dlt_data" @@ -126,14 +127,13 @@ timezone="UTC" client_session_keep_alive=true ``` Will set the timezone and session keep alive. Mind that if you use `toml` your configuration is typed. The alternative: -`"snowflake://loader/dlt_data?authenticator=oauth&timezone=UTC&client_session_keep_alive=true"` -will pass `client_session_keep_alive` as string to the connect method (which we didn't verify if it works). +`snowflake://loader/dlt_data?authenticator=oauth&timezone=UTC&client_session_keep_alive=true` +will pass `client_session_keep_alive` as a string to the connect method (which we didn't verify if it works). ## Write disposition All write dispositions are supported. -If you set the [`replace` strategy](../../general-usage/full-loading.md) to `staging-optimized`, the destination tables will be dropped and -recreated with a [clone command](https://docs.snowflake.com/en/sql-reference/sql/create-clone) from the staging tables. +If you set the [`replace` strategy](../../general-usage/full-loading.md) to `staging-optimized`, the destination tables will be dropped and recreated with a [clone command](https://docs.snowflake.com/en/sql-reference/sql/create-clone) from the staging tables. ## Data loading The data is loaded using an internal Snowflake stage. We use the `PUT` command and per-table built-in stages by default. Stage files are kept by default, unless specified otherwise via the `keep_staged_files` parameter: @@ -180,16 +180,16 @@ When loading from `parquet`, Snowflake will store `json` types (JSON) in `VARIAN ::: ### Custom csv formats -By default we support csv format [produced by our writers](../file-formats/csv.md#default-settings) which is comma delimited, with header and optionally quoted. +By default, we support the csv format [produced by our writers](../file-formats/csv.md#default-settings), which is comma-delimited, with a header and optionally quoted. -You can configure your own formatting ie. when [importing](../../general-usage/resource.md#import-external-files) external `csv` files. +You can configure your own formatting, i.e., when [importing](../../general-usage/resource.md#import-external-files) external `csv` files. ```toml [destination.snowflake.csv_format] delimiter="|" include_header=false on_error_continue=true ``` -Which will read, `|` delimited file, without header and will continue on errors. +This will read a `|` delimited file without a header and will continue on errors. Note that we ignore missing columns `ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE` and we will insert NULL into them. @@ -198,11 +198,10 @@ Snowflake supports the following [column hints](https://dlthub.com/docs/general- * `cluster` - creates a cluster column(s). Many columns per table are supported and only when a new table is created. 
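
For example, a minimal sketch that sets the `cluster` hint on a column of a newly created table; the resource and column names are illustrative:

```py
import dlt

@dlt.resource(columns={"user_id": {"cluster": True}})
def events():
    # the hint above asks dlt to create the new table with a clustering key on user_id
    yield [{"user_id": 1, "action": "click"}, {"user_id": 2, "action": "view"}]
```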
## Table and column identifiers -Snowflake supports both case sensitive and case insensitive identifiers. All unquoted and uppercase identifiers resolve case-insensitively in SQL statements. Case insensitive [naming conventions](../../general-usage/naming-convention.md#case-sensitive-and-insensitive-destinations) like the default **snake_case** will generate case insensitive identifiers. Case sensitive (like **sql_cs_v1**) will generate -case sensitive identifiers that must be quoted in SQL statements. +Snowflake supports both case-sensitive and case-insensitive identifiers. All unquoted and uppercase identifiers resolve case-insensitively in SQL statements. Case-insensitive [naming conventions](../../general-usage/naming-convention.md#case-sensitive-and-insensitive-destinations) like the default **snake_case** will generate case-insensitive identifiers. Case-sensitive (like **sql_cs_v1**) will generate case-sensitive identifiers that must be quoted in SQL statements. :::note -Names of tables and columns in [schemas](../../general-usage/schema.md) are kept in lower case like for all other destinations. This is the pattern we observed in other tools, i.e., `dbt`. In the case of `dlt`, it is, however, trivial to define your own uppercase [naming convention](../../general-usage/schema.md#naming-convention) +Names of tables and columns in [schemas](../../general-usage/schema.md) are kept in lower case like for all other destinations. This is the pattern we observed in other tools, i.e., `dbt`. In the case of `dlt`, it is, however, trivial to define your own uppercase [naming convention](../../general-usage/schema.md#naming-convention). ::: ## Staging support @@ -213,13 +212,13 @@ Alternatively to parquet files, you can also specify jsonl as the staging file f ### Snowflake and Amazon S3 -Please refer to the [S3 documentation](./filesystem.md#aws-s3) to learn how to set up your bucket with the bucket_url and credentials. For S3, the `dlt` Redshift loader will use the AWS credentials provided for S3 to access the S3 bucket if not specified otherwise (see config options below). Alternatively, you can create a stage for your S3 Bucket by following the instructions provided in the [Snowflake S3 documentation](https://docs.snowflake.com/en/user-guide/data-load-s3-config-storage-integration). +Please refer to the [S3 documentation](./filesystem.md#aws-s3) to learn how to set up your bucket with the bucket_url and credentials. For S3, the `dlt` Redshift loader will use the AWS credentials provided for S3 to access the S3 bucket if not specified otherwise (see config options below). Alternatively, you can create a stage for your S3 bucket by following the instructions provided in the [Snowflake S3 documentation](https://docs.snowflake.com/en/user-guide/data-load-s3-config-storage-integration). The basic steps are as follows: -* Create a storage integration linked to GCS and the right bucket +* Create a storage integration linked to GCS and the right bucket. * Grant access to this storage integration to the Snowflake role you are using to load the data into Snowflake. * Create a stage from this storage integration in the PUBLIC namespace, or the namespace of the schema of your data. -* Also grant access to this stage for the role you are using to load data into Snowflake. +* Also, grant access to this stage for the role you are using to load data into Snowflake. 
* Provide the name of your stage (including the namespace) to `dlt` like so: To prevent `dlt` from forwarding the S3 bucket credentials on every command, and set your S3 stage, change these settings: @@ -245,12 +244,12 @@ pipeline = dlt.pipeline( ### Snowflake and Google Cloud Storage -Please refer to the [Google Storage filesystem documentation](./filesystem.md#google-storage) to learn how to set up your bucket with the bucket_url and credentials. For GCS, you can define a stage in Snowflake and provide the stage identifier in the configuration (see config options below.) Please consult the Snowflake Documentation on [how to create a stage for your GCS Bucket](https://docs.snowflake.com/en/user-guide/data-load-gcs-config). The basic steps are as follows: +Please refer to the [Google Storage filesystem documentation](./filesystem.md#google-storage) to learn how to set up your bucket with the bucket_url and credentials. For GCS, you can define a stage in Snowflake and provide the stage identifier in the configuration (see config options below). Please consult the Snowflake documentation on [how to create a stage for your GCS bucket](https://docs.snowflake.com/en/user-guide/data-load-gcs-config). The basic steps are as follows: -* Create a storage integration linked to GCS and the right bucket +* Create a storage integration linked to GCS and the right bucket. * Grant access to this storage integration to the Snowflake role you are using to load the data into Snowflake. * Create a stage from this storage integration in the PUBLIC namespace, or the namespace of the schema of your data. -* Also grant access to this stage for the role you are using to load data into Snowflake. +* Also, grant access to this stage for the role you are using to load data into Snowflake. * Provide the name of your stage (including the namespace) to `dlt` like so: ```toml @@ -272,16 +271,15 @@ pipeline = dlt.pipeline( ) ``` + ### Snowflake and Azure Blob Storage -Please refer to the [Azure Blob Storage filesystem documentation](./filesystem.md#azure-blob-storage) to learn how to set up your bucket with the bucket_url and credentials. For Azure, the Snowflake loader will use -the filesystem credentials for your Azure Blob Storage container if not specified otherwise (see config options below). Alternatively, you can define an external stage in Snowflake and provide the stage identifier. -Please consult the Snowflake Documentation on [how to create a stage for your Azure Blob Storage Container](https://docs.snowflake.com/en/user-guide/data-load-azure). The basic steps are as follows: +Please refer to the [Azure Blob Storage filesystem documentation](./filesystem.md#azure-blob-storage) to learn how to set up your bucket with the bucket_url and credentials. For Azure, the Snowflake loader will use the filesystem credentials for your Azure Blob Storage container if not specified otherwise (see config options below). Alternatively, you can define an external stage in Snowflake and provide the stage identifier. Please consult the Snowflake Documentation on [how to create a stage for your Azure Blob Storage Container](https://docs.snowflake.com/en/user-guide/data-load-azure). The basic steps are as follows: -* Create a storage integration linked to Azure Blob Storage and the right container +* Create a storage integration linked to Azure Blob Storage and the right container. * Grant access to this storage integration to the Snowflake role you are using to load the data into Snowflake. 
* Create a stage from this storage integration in the PUBLIC namespace, or the namespace of the schema of your data. -* Also grant access to this stage for the role you are using to load data into Snowflake. +* Also, grant access to this stage for the role you are using to load data into Snowflake. * Provide the name of your stage (including the namespace) to `dlt` like so: ```toml @@ -333,39 +331,39 @@ dest_ = snowflake(csv_format=csv_format) Above we set `csv` file without header, with **|** as a separator and we request to ignore lines with errors. :::tip -You'll need those setting when [importing external files](../../general-usage/resource.md#import-external-files) +You'll need those settings when [importing external files](../../general-usage/resource.md#import-external-files) ::: -### Query Tagging -`dlt` [tags sessions](https://docs.snowflake.com/en/sql-reference/parameters#query-tag) that execute loading jobs with following job properties: +### Query tagging +`dlt` [tags sessions](https://docs.snowflake.com/en/sql-reference/parameters#query-tag) that execute loading jobs with the following job properties: * **source** - name of the source (identical with the name of `dlt` schema) * **resource** - name of the resource (if known, else empty string) * **table** - name of the table loaded by the job * **load_id** - load id of the job * **pipeline_name** - name of the active pipeline (or empty string if not found) -You can define query tag by defining a query tag placeholder in snowflake credentials: +You can define a query tag by defining a query tag placeholder in Snowflake credentials: ```toml [destination.snowflake] query_tag='{{"source":"{source}", "resource":"{resource}", "table": "{table}", "load_id":"{load_id}", "pipeline_name":"{pipeline_name}"}}' ``` -which contains Python named formatters corresponding to tag names ie. `{source}` will assume the name of the dlt source. +which contains Python named formatters corresponding to tag names i.e. `{source}` will assume the name of the dlt source. :::note -1. query tagging is off by default. `query_tag` configuration field is `None` by default and must be set to enable tagging. -2. only sessions associated with a job are tagged. sessions that migrate schemas remain untagged -3. jobs processing table chains (ie. sql merge jobs) will use top level table as **table** +1. Query tagging is off by default. The `query_tag` configuration field is `None` by default and must be set to enable tagging. +2. Only sessions associated with a job are tagged. Sessions that migrate schemas remain untagged. +3. Jobs processing table chains (i.e. SQL merge jobs) will use the top-level table as **table**. ::: + ### dbt support This destination [integrates with dbt](../transformations/dbt/dbt.md) via [dbt-snowflake](https://github.com/dbt-labs/dbt-snowflake). Both password and key pair authentication are supported and shared with dbt runners. ### Syncing of `dlt` state -This destination fully supports [dlt state sync](../../general-usage/state#syncing-state-with-destination) +This destination fully supports [dlt state sync](../../general-usage/state#syncing-state-with-destination). ### Snowflake connection identifier -We enable Snowflake to identify that the connection is created by `dlt`. Snowflake will use this identifier to better understand the usage patterns -associated with `dlt` integration. The connection identifier is `dltHub_dlt`. +We enable Snowflake to identify that the connection is created by `dlt`. 
Snowflake will use this identifier to better understand the usage patterns associated with `dlt` integration. The connection identifier is `dltHub_dlt`.
 
diff --git a/docs/website/docs/dlt-ecosystem/destinations/synapse.md b/docs/website/docs/dlt-ecosystem/destinations/synapse.md
index 0d50924cdf..c49ee8a2c7 100644
--- a/docs/website/docs/dlt-ecosystem/destinations/synapse.md
+++ b/docs/website/docs/dlt-ecosystem/destinations/synapse.md
@@ -19,7 +19,7 @@ pip install "dlt[synapse]"
 * **Microsoft ODBC Driver for SQL Server**
 
     The _Microsoft ODBC Driver for SQL Server_ must be installed to use this destination.
-    This can't be included with `dlt`'s python dependencies, so you must install it separately on your system. You can find the official installation instructions [here](https://learn.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server?view=sql-server-ver16).
+    This can't be included with `dlt`'s Python dependencies, so you must install it separately on your system. You can find the official installation instructions [here](https://learn.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server?view=sql-server-ver16).
 
     Supported driver versions:
     * `ODBC Driver 18 for SQL Server`
@@ -81,11 +81,13 @@ host = "your_synapse_workspace_name.sql.azuresynapse.net"
 
 Equivalently, you can also pass a connection string as follows:
 ```toml
-# keep it at the top of your toml file! before any section starts
+# Keep it at the top of your toml file! Before any section starts
 destination.synapse.credentials = "synapse://loader:your_loader_password@your_synapse_workspace_name.azuresynapse.net/yourpool"
 ```
 
-To pass credentials directly you can use the `credentials` argument of `dlt.destinations.synapse(...)`:
+To pass credentials directly, you can use the `credentials` argument of `dlt.destinations.synapse(...)`:
 ```py
 pipeline = dlt.pipeline(
     pipeline_name='chess',
@@ -95,7 +97,7 @@ pipeline = dlt.pipeline(
     dataset_name='chess_data'
 )
 ```
-To use **Active Directory Principal**, you can use the `sqlalchemy.engine.URL.create` method to create the connection URL using your Active Directory Service Principal credentials. First create the connection string as:
+To use **Active Directory Principal**, you can use the `sqlalchemy.engine.URL.create` method to create the connection URL using your Active Directory Service Principal credentials. First, create the connection string as:
 ```py
 conn_str = (
     f"DRIVER={{ODBC Driver 18 for SQL Server}};"
@@ -134,14 +136,14 @@ All write dispositions are supported.
 
 ## Data loading
 Data is loaded via `INSERT` statements by default.
 
-> 💡 Multi-row `INSERT INTO ... VALUES` statements are **not** possible in Synapse, because it doesn't support the [Table Value Constructor](https://learn.microsoft.com/en-us/sql/t-sql/queries/table-value-constructor-transact-sql). `dlt` uses `INSERT INTO ... SELECT ... UNION` statements as described [here](https://stackoverflow.com/a/73579830) to work around this limitation.
+> 💡 Multi-row `INSERT INTO ... VALUES` statements are **not** possible in Synapse because it doesn't support the [Table Value Constructor](https://learn.microsoft.com/en-us/sql/t-sql/queries/table-value-constructor-transact-sql). `dlt` uses `INSERT INTO ... SELECT ... UNION` statements as described [here](https://stackoverflow.com/a/73579830) to work around this limitation.
## Supported file formats * [insert-values](../file-formats/insert-format.md) is used by default * [parquet](../file-formats/parquet.md) is used when [staging](#staging-support) is enabled ## Data type limitations -* **Synapse cannot load `TIME` columns from `parquet` files**. `dlt` will fail such jobs permanently. Use the `insert_values` file format instead, or convert `datetime.time` objects to `str` or `datetime.datetime`, to load `TIME` columns. +* **Synapse cannot load `TIME` columns from `parquet` files**. `dlt` will fail such jobs permanently. Use the `insert_values` file format instead, or convert `datetime.time` objects to `str` or `datetime.datetime` to load `TIME` columns. * **Synapse does not have a nested/JSON/struct data type**. The `dlt` `json` data type is mapped to the `nvarchar` type in Synapse. ## Table index type @@ -159,10 +161,9 @@ info = pipeline.run( ``` Possible values: -* `heap`: create [HEAP](https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-index#heap-tables) tables that do not have an index **(default)** +* `heap`: create [HEAP](https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-index#heap-ttables) tables that do not have an index **(default)** * `clustered_columnstore_index`: create [CLUSTERED COLUMNSTORE INDEX](https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-index#clustered-columnstore-indexes) tables - > ❗ Important: >* **Set `default_table_index_type` to `"clustered_columnstore_index"` if you want to change the default** (see [additional destination options](#additional-destination-options)). >* **CLUSTERED COLUMNSTORE INDEX tables do not support the `varchar(max)`, `nvarchar(max)`, and `varbinary(max)` data types.** If you don't specify the `precision` for columns that map to any of these types, `dlt` will use the maximum lengths `varchar(4000)`, `nvarchar(4000)`, and `varbinary(8000)`. @@ -181,9 +182,9 @@ Synapse supports the following [column hints](https://dlthub.com/docs/general-us > ❗ These hints are **disabled by default**. This is because the `PRIMARY KEY` and `UNIQUE` [constraints](https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-table-constraints) are tricky in Synapse: they are **not enforced** and can lead to inaccurate results if the user does not ensure all column values are unique. For the column hints to take effect, the `create_indexes` configuration needs to be set to `True`, see [additional destination options](#additional-destination-options). ## Staging support -Synapse supports Azure Blob Storage (both standard and [ADLS Gen2](https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction)) as a file staging destination. `dlt` first uploads Parquet files to the blob container, and then instructs Synapse to read the Parquet file and load its data into a Synapse table using the [COPY INTO](https://learn.microsoft.com/en-us/sql/t-sql/statements/copy-into-transact-sql) statement. +Synapse supports Azure Blob Storage (both standard and [ADLS Gen2](https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction)) as a file staging destination. 
`dlt` first uploads Parquet files to the blob container and then instructs Synapse to read the Parquet file and load its data into a Synapse table using the [COPY INTO](https://learn.microsoft.com/en-us/sql/t-sql/statements/copy-into-transact-sql) statement. -Please refer to the [Azure Blob Storage filesystem documentation](./filesystem.md#azure-blob-storage) to learn how to configure credentials for the staging destination. By default, `dlt` will use these credentials for both the write into the blob container, and the read from it to load into Synapse. Managed Identity authentication can be enabled through the `staging_use_msi` option (see [additional destination options](#additional-destination-options)). +Please refer to the [Azure Blob Storage filesystem documentation](./filesystem.md#azure-blob-storage) to learn how to configure credentials for the staging destination. By default, `dlt` will use these credentials for both the write into the blob container and the read from it to load into Synapse. Managed Identity authentication can be enabled through the `staging_use_msi` option (see [additional destination options](#additional-destination-options)). To run Synapse with staging on Azure Blob Storage: @@ -223,7 +224,7 @@ Descriptions: - `default_table_index_type` sets the [table index type](#table-index-type) that is used if no table index type is specified on the resource. - `create_indexes` determines if `primary_key` and `unique` [column hints](#supported-column-hints) are applied. - `staging_use_msi` determines if the Managed Identity of the Synapse workspace is used to authorize access to the [staging](#staging-support) Storage Account. Ensure the Managed Identity has the [Storage Blob Data Reader](https://learn.microsoft.com/en-us/azure/role-based-access-control/built-in-roles#storage-blob-data-reader) role (or a higher-privileged role) assigned on the blob container if you set this option to `"true"`. -- `port` used for the ODBC connection. +- `port` is used for the ODBC connection. - `connect_timeout` sets the timeout for the `pyodbc` connection attempt, in seconds. ### dbt support diff --git a/docs/website/docs/dlt-ecosystem/destinations/weaviate.md b/docs/website/docs/dlt-ecosystem/destinations/weaviate.md index 962239b7e6..8d478261c1 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/weaviate.md +++ b/docs/website/docs/dlt-ecosystem/destinations/weaviate.md @@ -9,7 +9,7 @@ keywords: [weaviate, vector database, destination, dlt] [Weaviate](https://weaviate.io/) is an open-source vector database. It allows you to store data objects and perform similarity searches over them. This destination helps you load data into Weaviate from [dlt resources](../../general-usage/resource.md). -## Setup Guide +## Setup guide 1. To use Weaviate as a destination, make sure dlt is installed with the 'weaviate' extra: @@ -38,7 +38,6 @@ X-OpenAI-Api-Key = "your-openai-api-key" ``` The `url` will default to **http://localhost:8080** and `api_key` is not defined - which are the defaults for the Weaviate container. - 3. Define the source of the data. For starters, let's load some data from a simple data structure: ```py @@ -92,7 +91,7 @@ The data is now loaded into Weaviate. Weaviate destination is different from other [dlt destinations](../destinations/). To use vector search after the data has been loaded, you must specify which fields Weaviate needs to include in the vector index. You do that by wrapping the data (or dlt resource) with the `weaviate_adapter` function. 
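
For example, here is a minimal sketch that embeds a single field of a small, illustrative dataset (it assumes Weaviate credentials are configured as shown above; the adapter is described in detail in the next section):

```py
import dlt
from dlt.destinations.adapters import weaviate_adapter

talks = [
    {"title": "Vector search 101", "speaker": "speaker_1"},
    {"title": "Loading data with dlt", "speaker": "speaker_2"},
]

pipeline = dlt.pipeline(pipeline_name="talks_pipeline", destination="weaviate")
# only "title" is sent to the vectorizer; other fields are stored as regular properties
info = pipeline.run(weaviate_adapter(talks, vectorize="title"), table_name="talks")
print(info)
```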
-## weaviate_adapter +## Weaviate_adapter The `weaviate_adapter` is a helper function that configures the resource for the Weaviate destination: @@ -177,7 +176,6 @@ info = pipeline.run( Internally, dlt will use `primary_key` (`document_id` in the example above) to generate a unique identifier ([UUID](https://weaviate.io/developers/weaviate/manage-data/create#id)) for each object in Weaviate. If the object with the same UUID already exists in Weaviate, it will be updated with the new data. Otherwise, a new object will be created. - :::caution If you are using the `merge` write disposition, you must set it from the first run of your pipeline; otherwise, the data will be duplicated in the database on subsequent loads. @@ -211,7 +209,7 @@ Data loaded into Weaviate from various sources might have different types. To en ### Dataset name -Weaviate uses classes to categorize and identify data. To avoid potential naming conflicts, especially when dealing with multiple datasets that might have overlapping table names, dlt includes the dataset name into the Weaviate class name. This ensures a unique identifier for every class. +Weaviate uses classes to categorize and identify data. To avoid potential naming conflicts, especially when dealing with multiple datasets that might have overlapping table names, dlt includes the dataset name in the Weaviate class name. This ensures a unique identifier for every class. For example, if you have a dataset named `movies_dataset` and a table named `actors`, the Weaviate class name would be `MoviesDataset_Actors` (the default separator is an underscore). @@ -244,8 +242,8 @@ Here's a summary of the naming normalization approach: #### Property names - Snake case and camel case remain unchanged: `snake_case_name` and `camelCaseName`. -- Names starting with a capital letter have it lowercased: `CamelCase` -> `camelCase` -- Names with multiple underscores, such as `Snake-______c__ase_``, are compacted to `snake_c_asex`. Except for the case when underscores are leading, in which case they are kept: `___snake_case_name` becomes `___snake_case_name`. +- Names starting with a capital letter have it lowercased: `CamelCase` -> `camelCase`. +- Names with multiple underscores, such as `Snake-______c__ase_`, are compacted to `snake_c_asex`. Except for the case when underscores are leading, in which case they are kept: `___snake_case_name` becomes `___snake_case_name`. - Names starting with a number are prefixed with a "p_". For example, `123snake_case_name` becomes `p_123snake_case_name`. #### Reserved property names @@ -253,11 +251,10 @@ Here's a summary of the naming normalization approach: Reserved property names like `id` or `additional` are prefixed with underscores for differentiation. Therefore, `id` becomes `__id` and `_id` is rendered as `___id`. ### Case insensitive naming convention -The default naming convention described above will preserve the casing of the properties (besides the first letter which is lowercased). This generates nice classes -in Weaviate but also requires that your input data does not have clashing property names when comparing case insensitive ie. (`caseName` == `casename`). In such case -Weaviate destination will fail to create classes and report a conflict. -You can configure an alternative naming convention which will lowercase all properties. The clashing properties will be merged and the classes created. 
Still, if you have a document where clashing properties like: +The default naming convention described above will preserve the casing of the properties (besides the first letter which is lowercased). This generates nice classes in Weaviate but also requires that your input data does not have clashing property names when comparing case insensitively, i.e., (`caseName` == `casename`). In such a case, the Weaviate destination will fail to create classes and report a conflict. + +You can configure an alternative naming convention which will lowercase all properties. The clashing properties will be merged and the classes created. Still, if you have a document with clashing properties like: ```json {"camelCase": 1, "CamelCase": 2} ``` @@ -274,23 +271,22 @@ naming="dlt.destinations.impl.weaviate.ci_naming" ## Additional destination options - `batch_size`: (int) the number of items in the batch insert request. The default is 100. -- `batch_workers`: (int) the maximal number of concurrent threads to run batch import. The default is 1. +- `batch_workers`: (int) the maximum number of concurrent threads to run batch import. The default is 1. - `batch_consistency`: (str) the number of replica nodes in the cluster that must acknowledge a write or read request before it's considered successful. The available consistency levels include: - `ONE`: Only one replica node needs to acknowledge. - - `QUORUM`: Majority of replica nodes (calculated as `replication_factor / 2 + 1`) must acknowledge. + - `QUORUM`: The majority of replica nodes (calculated as `replication_factor / 2 + 1`) must acknowledge. - `ALL`: All replica nodes in the cluster must send a successful response. The default is `ONE`. -- `batch_retries`: (int) number of retries to create a batch that failed with ReadTimeout. The default is 5. +- `batch_retries`: (int) the number of retries to create a batch that failed with ReadTimeout. The default is 5. - `dataset_separator`: (str) the separator to use when generating the class names in Weaviate. -- `conn_timeout` and `read_timeout`: (float) to set timeouts (in seconds) when connecting and reading from REST API. defaults to (10.0, 180.0) -- `startup_period` (int) - how long to wait for weaviate to start +- `conn_timeout` and `read_timeout`: (float) to set timeouts (in seconds) when connecting and reading from the REST API. Defaults to (10.0, 180.0). +- `startup_period` (int) - how long to wait for Weaviate to start. - `vectorizer`: (str) the name of [the vectorizer](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules) to use. The default is `text2vec-openai`. -- `moduleConfig`: (dict) configurations of various Weaviate modules +- `moduleConfig`: (dict) configurations of various Weaviate modules. ### Configure Weaviate modules -The default configuration for the Weaviate destination uses `text2vec-openai`. -To configure another vectorizer or a generative module, replace the default `module_config` value by updating `config.toml`: +The default configuration for the Weaviate destination uses `text2vec-openai`. To configure another vectorizer or a generative module, replace the default `module_config` value by updating `config.toml`: ```toml [destination.weaviate] @@ -307,17 +303,15 @@ Below is an example that configures the **contextionary** vectorizer. 
You can pu vectorizer="text2vec-contextionary" module_config={text2vec-contextionary = { vectorizeClassName = false, vectorizePropertyName = true}} ``` -You can find Docker Compose with the instructions to run [here](https://github.com/dlt-hub/dlt/tree/devel/dlt/destinations/weaviate/README.md) - +You can find Docker Compose with the instructions to run [here](https://github.com/dlt-hub/dlt/tree/devel/dlt/destinations/weaviate/README.md). ### dbt support -Currently, Weaviate destination does not support dbt. +Currently, the Weaviate destination does not support dbt. ### Syncing of `dlt` state -Weaviate destination supports syncing of the `dlt` state. - +The Weaviate destination supports syncing of the `dlt` state. diff --git a/docs/website/docs/dlt-ecosystem/file-formats/csv.md b/docs/website/docs/dlt-ecosystem/file-formats/csv.md index 05a0c2e50d..49ecd69800 100644 --- a/docs/website/docs/dlt-ecosystem/file-formats/csv.md +++ b/docs/website/docs/dlt-ecosystem/file-formats/csv.md @@ -7,29 +7,29 @@ import SetTheFormat from './_set_the_format.mdx'; # CSV file format -**csv** is the most basic file format to store tabular data, where all the values are strings and are separated by a delimiter (typically comma). -`dlt` uses it for specific use cases - mostly for the performance and compatibility reasons. +**csv** is the most basic file format to store tabular data, where all the values are strings and are separated by a delimiter (typically a comma). +`dlt` uses it for specific use cases - mostly for performance and compatibility reasons. -Internally we use two implementations: -- **pyarrow** csv writer - very fast, multithreaded writer for the [arrow tables](../verified-sources/arrow-pandas.md) +Internally, we use two implementations: +- **pyarrow** csv writer - a very fast, multithreaded writer for the [arrow tables](../verified-sources/arrow-pandas.md) - **python stdlib writer** - a csv writer included in the Python standard library for Python objects -## Supported Destinations +## Supported destinations -The `csv` format is supported by the following destinations: **Postgres**, **Filesystem**, **Snowflake** +The `csv` format is supported by the following destinations: **Postgres**, **Filesystem**, **Snowflake**. ## How to configure -## Default Settings -`dlt` attempts to make both writers to generate similarly looking files -* separators are commas -* quotes are **"** and are escaped as **""** -* `NULL` values both are empty strings and empty tokens as in the example below -* UNIX new lines are used -* dates are represented as ISO 8601 -* quoting style is "when needed" +## Default settings +`dlt` attempts to make both writers generate similarly looking files: +* Separators are commas. +* Quotes are **"** and are escaped as **""**. +* `NULL` values are both empty strings and empty tokens, as in the example below. +* UNIX new lines are used. +* Dates are represented as ISO 8601. +* Quoting style is "when needed." Example of NULLs: ```sh @@ -38,21 +38,20 @@ A,B,C A,,"" ``` -In the last row both `text2` and `text3` values are NULL. Python `csv` writer -is not able to write unquoted `None` values so we had to settle for `""` +In the last row, both `text2` and `text3` values are NULL. The Python `csv` writer +is not able to write unquoted `None` values, so we had to settle for `""`. -Note: all destinations capable of writing csvs must support it. +Note: All destinations capable of writing csvs must support it. 
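
For example, a minimal sketch that produces csv load files with a local filesystem destination; the path, table, and values are illustrative:

```py
import dlt
from dlt.destinations import filesystem

pipeline = dlt.pipeline(
    pipeline_name="csv_example",
    destination=filesystem(bucket_url="file:///tmp/csv_files"),
    dataset_name="events",
)

# rows with None values end up as empty or "" fields, as described above
info = pipeline.run(
    [{"id": 1, "name": "a"}, {"id": 2, "name": None}],
    table_name="events",
    loader_file_format="csv",
)
```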
### Change settings -You can change basic **csv** settings, this may be handy when working with **filesystem** destination. Other destinations are tested +You can change basic **csv** settings; this may be handy when working with the **filesystem** destination. Other destinations are tested with standard settings: * delimiter: change the delimiting character (default: ',') * include_header: include the header row (default: True) * quoting: **quote_all** - all values are quoted, **quote_needed** - quote only values that need quoting (default: `quote_needed`) -When **quote_needed** is selected: in case of Python csv writer all non-numeric values are quoted. In case of pyarrow csv writer, the exact behavior is not described in the documentation. We observed that in some cases, strings are not quoted as well. - +When **quote_needed** is selected: in the case of the Python csv writer, all non-numeric values are quoted. In the case of the pyarrow csv writer, the exact behavior is not described in the documentation. We observed that in some cases, strings are not quoted as well. ```toml [normalize.data_writer] @@ -75,16 +74,17 @@ A few additional settings are available when copying `csv` to destination tables * **encoding** - encoding of the `csv` file :::tip -You'll need those setting when [importing external files](../../general-usage/resource.md#import-external-files) +You'll need these settings when [importing external files](../../general-usage/resource.md#import-external-files). ::: ## Limitations **arrow writer** -* binary columns are supported only if they contain valid UTF-8 characters -* json (nested, struct) types are not supported +* Binary columns are supported only if they contain valid UTF-8 characters. +* JSON (nested, struct) types are not supported. **csv writer** -* binary columns are supported only if they contain valid UTF-8 characters (easy to add more encodings) -* json columns dumped with json.dumps -* **None** values are always quoted \ No newline at end of file +* Binary columns are supported only if they contain valid UTF-8 characters (easy to add more encodings). +* JSON columns are dumped with json.dumps. +* **None** values are always quoted. + diff --git a/docs/website/docs/dlt-ecosystem/file-formats/insert-format.md b/docs/website/docs/dlt-ecosystem/file-formats/insert-format.md index 3e58b5a25d..d7322ccb75 100644 --- a/docs/website/docs/dlt-ecosystem/file-formats/insert-format.md +++ b/docs/website/docs/dlt-ecosystem/file-formats/insert-format.md @@ -5,7 +5,7 @@ keywords: [insert values, file formats] --- import SetTheFormat from './_set_the_format.mdx'; -# SQL INSERT File Format +# SQL Insert file format This file format contains an INSERT...VALUES statement to be executed on the destination during the `load` stage. @@ -18,7 +18,7 @@ Additional data types are stored as follows: This file format is [compressed](../../reference/performance.md#disabling-and-enabling-file-compression) by default. 
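
If you need uncompressed files, the compression of load files can be switched off through the data writer configuration referenced above. A small sketch using the environment-variable form of that setting:

```py
import os

# equivalent to [normalize.data_writer] disable_compression=true in config.toml
os.environ["NORMALIZE__DATA_WRITER__DISABLE_COMPRESSION"] = "true"
```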
-## Supported Destinations +## Supported destinations This format is used by default by: **DuckDB**, **Postgres**, **Redshift**, **Synapse**, **MSSQL**, **Motherduck** @@ -27,3 +27,5 @@ It is also supported by: **Filesystem** if you'd like to store INSERT VALUES sta ## How to configure + + diff --git a/docs/website/docs/dlt-ecosystem/file-formats/jsonl.md b/docs/website/docs/dlt-ecosystem/file-formats/jsonl.md index 5957ccc8ad..2203676d52 100644 --- a/docs/website/docs/dlt-ecosystem/file-formats/jsonl.md +++ b/docs/website/docs/dlt-ecosystem/file-formats/jsonl.md @@ -5,9 +5,9 @@ keywords: [jsonl, file formats] --- import SetTheFormat from './_set_the_format.mdx'; -# jsonl - JSON Delimited +# jsonl - JSON delimited -JSON Delimited is a file format that stores several JSON documents in one file. The JSON +JSON delimited is a file format that stores several JSON documents in one file. The JSON documents are separated by a new line. Additional data types are stored as follows: @@ -21,10 +21,12 @@ Additional data types are stored as follows: This file format is [compressed](../../reference/performance.md#disabling-and-enabling-file-compression) by default. -## Supported Destinations +## Supported destinations This format is used by default by: **BigQuery**, **Snowflake**, **Filesystem**. ## How to configure + + diff --git a/docs/website/docs/dlt-ecosystem/file-formats/parquet.md b/docs/website/docs/dlt-ecosystem/file-formats/parquet.md index 30f7051386..cdb6cb23af 100644 --- a/docs/website/docs/dlt-ecosystem/file-formats/parquet.md +++ b/docs/website/docs/dlt-ecosystem/file-formats/parquet.md @@ -15,7 +15,7 @@ To use this format, you need a `pyarrow` package. You can get this package as a pip install "dlt[parquet]" ``` -## Supported Destinations +## Supported destinations Supported by: **BigQuery**, **DuckDB**, **Snowflake**, **Filesystem**, **Athena**, **Databricks**, **Synapse** @@ -32,17 +32,17 @@ Supported by: **BigQuery**, **DuckDB**, **Snowflake**, **Filesystem**, **Athena* Under the hood, `dlt` uses the [pyarrow parquet writer](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html) to create the files. The following options can be used to change the behavior of the writer: -- `flavor`: Sanitize schema or set other compatibility options to work with various target systems. Defaults to None which is **pyarrow** default. +- `flavor`: Sanitize schema or set other compatibility options to work with various target systems. Defaults to None, which is **pyarrow** default. - `version`: Determine which Parquet logical types are available for use, whether the reduced set from the Parquet 1.x.x format or the expanded logical types added in later format versions. Defaults to "2.6". -- `data_page_size`: Set a target threshold for the approximate encoded size of data pages within a column chunk (in bytes). Defaults to None which is **pyarrow** default. +- `data_page_size`: Set a target threshold for the approximate encoded size of data pages within a column chunk (in bytes). Defaults to None, which is **pyarrow** default. - `row_group_size`: Set the number of rows in a row group. [See here](#row-group-size) how this can optimize parallel processing of queries on your destination over the default setting of `pyarrow`. - `timestamp_timezone`: A string specifying timezone, default is UTC. 
-- `coerce_timestamps`: resolution to which coerce timestamps, choose from **s**, **ms**, **us**, **ns** -- `allow_truncated_timestamps` - will raise if precision is lost on truncated timestamp. +- `coerce_timestamps`: Resolution to which coerce timestamps, choose from **s**, **ms**, **us**, **ns**. +- `allow_truncated_timestamps`: Will raise if precision is lost on truncated timestamp. :::tip -Default parquet version used by `dlt` is 2.4. It coerces timestamps to microseconds and truncates nanoseconds silently. Such setting -provides best interoperability with database systems, including loading panda frames which have nanosecond resolution by default +The default parquet version used by `dlt` is 2.4. It coerces timestamps to microseconds and truncates nanoseconds silently. Such a setting +provides the best interoperability with database systems, including loading panda frames which have nanosecond resolution by default. ::: Read the [pyarrow parquet docs](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html) to learn more about these settings. @@ -68,28 +68,26 @@ NORMALIZE__DATA_WRITER__TIMESTAMP_TIMEZONE ``` ### Timestamps and timezones -`dlt` adds timezone (UTC adjustment) to all timestamps regardless of a precision (from seconds to nanoseconds). `dlt` will also create TZ aware timestamp columns in -the destinations. [duckdb is an exception here](../destinations/duckdb.md#supported-file-formats) +`dlt` adds timezone (UTC adjustment) to all timestamps regardless of precision (from seconds to nanoseconds). `dlt` will also create TZ-aware timestamp columns in +the destinations. [DuckDB is an exception here](../destinations/duckdb.md#supported-file-formats). -### Disable timezones / utc adjustment flags +### Disable timezones / UTC adjustment flags You can generate parquet files without timezone adjustment information in two ways: -1. Set the **flavor** to spark. All timestamps will be generated via deprecated `int96` physical data type, without the logical one -2. Set the **timestamp_timezone** to empty string (ie. `DATA_WRITER__TIMESTAMP_TIMEZONE=""`) to generate logical type without UTC adjustment. - -To our best knowledge, arrow will convert your timezone aware DateTime(s) to UTC and store them in parquet without timezone information. +1. Set the **flavor** to spark. All timestamps will be generated via the deprecated `int96` physical data type, without the logical one. +2. Set the **timestamp_timezone** to an empty string (i.e., `DATA_WRITER__TIMESTAMP_TIMEZONE=""`) to generate a logical type without UTC adjustment. +To our best knowledge, Arrow will convert your timezone-aware DateTime(s) to UTC and store them in parquet without timezone information. ### Row group size -The `pyarrow` parquet writer writes each item, i.e. table or record batch, in a separate row group. -This may lead to many small row groups which may not be optimal for certain query engines. For example, `duckdb` parallelizes on a row group. -`dlt` allows controlling the size of the row group by -[buffering and concatenating tables](../../reference/performance.md#controlling-in-memory-buffers) and batches before they are written. The concatenation is done as a zero-copy to save memory. -You can control the size of the row group by setting the maximum number of rows kept in the buffer. + +The `pyarrow` parquet writer writes each item, i.e., table or record batch, in a separate row group. This may lead to many small row groups, which may not be optimal for certain query engines. 
For example, `duckdb` parallelizes on a row group. `dlt` allows controlling the size of the row group by [buffering and concatenating tables](../../reference/performance.md#controlling-in-memory-buffers) and batches before they are written. The concatenation is done as a zero-copy to save memory. You can control the size of the row group by setting the maximum number of rows kept in the buffer. + ```toml [extract.data_writer] buffer_max_items=10e6 ``` + Mind that `dlt` holds the tables in memory. Thus, 1,000,000 rows in the example above may consume a significant amount of RAM. -`row_group_size` configuration setting has limited utility with `pyarrow` writer. It may be useful when you write single very large pyarrow tables -or when your in memory buffer is really large. \ No newline at end of file +The `row_group_size` configuration setting has limited utility with the `pyarrow` writer. It may be useful when you write single very large pyarrow tables or when your in-memory buffer is really large. + diff --git a/docs/website/docs/dlt-ecosystem/staging.md b/docs/website/docs/dlt-ecosystem/staging.md index 789189b7dd..5980283c87 100644 --- a/docs/website/docs/dlt-ecosystem/staging.md +++ b/docs/website/docs/dlt-ecosystem/staging.md @@ -18,7 +18,7 @@ staging_dataset_name_layout="staging_%s" ``` The entry above switches the pattern to `staging_` prefix and for example, for a dataset with the name **github_data**, `dlt` will create **staging_github_data**. -To configure a static staging dataset name, you can do the following (we use the destination factory) +To configure a static staging dataset name, you can do the following (we use the destination factory): ```py import dlt @@ -55,7 +55,7 @@ In essence, you need to set up two destinations and then pass them to `dlt.pipel Please follow our guide in the [filesystem destination documentation](destinations/filesystem.md). Test the staging as a standalone destination to make sure that files go where you want them. In your `secrets.toml`, you should now have a working `filesystem` configuration: ```toml [destination.filesystem] - bucket_url = "s3://[your_bucket_name]" # replace with your bucket name, + bucket_url = "s3://[your_bucket_name]" # replace with your bucket name [destination.filesystem.credentials] aws_access_key_id = "please set me up!" # copy the access key here @@ -103,8 +103,7 @@ Please note that `dlt` does not delete loaded files from the staging storage aft ### How to prevent staging files truncation -Before `dlt` loads data to the staging storage, it truncates previously loaded files. To prevent it and keep the whole history -of loaded files, you can use the following parameter: +Before `dlt` loads data to the staging storage, it truncates previously loaded files. To prevent it and keep the whole history of loaded files, you can use the following parameter: ```toml [destination.redshift] @@ -112,6 +111,7 @@ truncate_table_before_load_on_staging_destination=false ``` :::caution -The [Athena](destinations/athena#staging-support) destination only truncates not iceberg tables with `replace` merge_disposition. -Therefore, the parameter `truncate_table_before_load_on_staging_destination` only controls the truncation of corresponding files for these tables. +The [Athena](destinations/athena#staging-support) destination only truncates non-iceberg tables with `replace` merge_disposition. Therefore, the parameter `truncate_table_before_load_on_staging_destination` only controls the truncation of corresponding files for these tables. 
::: + + diff --git a/docs/website/docs/dlt-ecosystem/table-formats/delta.md b/docs/website/docs/dlt-ecosystem/table-formats/delta.md index 7840f40d11..d8dd87b750 100644 --- a/docs/website/docs/dlt-ecosystem/table-formats/delta.md +++ b/docs/website/docs/dlt-ecosystem/table-formats/delta.md @@ -6,8 +6,9 @@ keywords: [delta, table formats] # Delta table format -[Delta](https://delta.io/) is an open source table format. `dlt` can store data as Delta tables. +[Delta](https://delta.io/) is an open-source table format. `dlt` can store data as Delta tables. -## Supported Destinations +## Supported destinations Supported by: **Databricks**, **filesystem** + diff --git a/docs/website/docs/dlt-ecosystem/table-formats/iceberg.md b/docs/website/docs/dlt-ecosystem/table-formats/iceberg.md index a34bab9a0c..233ae0ce21 100644 --- a/docs/website/docs/dlt-ecosystem/table-formats/iceberg.md +++ b/docs/website/docs/dlt-ecosystem/table-formats/iceberg.md @@ -6,8 +6,9 @@ keywords: [iceberg, table formats] # Iceberg table format -[Iceberg](https://iceberg.apache.org/) is an open source table format. `dlt` can store data as Iceberg tables. +[Iceberg](https://iceberg.apache.org/) is an open-source table format. `dlt` can store data as Iceberg tables. -## Supported Destinations +## Supported destinations Supported by: **Athena** + diff --git a/docs/website/docs/dlt-ecosystem/transformations/dbt/dbt.md b/docs/website/docs/dlt-ecosystem/transformations/dbt/dbt.md index 526e62e44b..09c674ed45 100644 --- a/docs/website/docs/dlt-ecosystem/transformations/dbt/dbt.md +++ b/docs/website/docs/dlt-ecosystem/transformations/dbt/dbt.md @@ -6,8 +6,7 @@ keywords: [transform, dbt, runner] # Transform the data with dbt -[dbt](https://github.com/dbt-labs/dbt-core) is a framework that allows for the simple structuring of your transformations into DAGs. The benefits of -using dbt include: +[dbt](https://github.com/dbt-labs/dbt-core) is a framework that allows for the simple structuring of your transformations into DAGs. The benefits of using dbt include: - End-to-end cross-db compatibility for dlt→dbt pipelines. - Ease of use by SQL analysts, with a low learning curve. @@ -22,14 +21,11 @@ The dbt runner: - Can create a virtual env for dbt on the fly; - Can run a dbt package from online sources (e.g., GitHub) or from local files; -- Passes configuration and credentials to dbt, so you do not need to handle them separately from - `dlt`, enabling dbt to configure on the fly. +- Passes configuration and credentials to dbt, so you do not need to handle them separately from `dlt`, enabling dbt to configure on the fly. ## How to use the dbt runner -For an example of how to use the dbt runner, see the -[jaffle shop example](https://github.com/dlt-hub/dlt/blob/devel/docs/examples/archive/dbt_run_jaffle.py). -Included below is another example where we run a `dlt` pipeline and then a dbt package via `dlt`: +For an example of how to use the dbt runner, see the [jaffle shop example](https://github.com/dlt-hub/dlt/blob/devel/docs/examples/archive/dbt_run_jaffle.py). Included below is another example where we run a `dlt` pipeline and then a dbt package via `dlt`: > 💡 Docstrings are available to read in your IDE. @@ -81,12 +77,12 @@ for m in models: ``` ## How to run dbt runner without pipeline -You can use the dbt runner without a dlt pipeline. The example below will clone and run **jaffle shop** using a dbt profile that you supply. 
-It assumes that dbt is installed in the current Python environment and the `profile.yml` is in the same folder as the Python script. + +You can use the dbt runner without a dlt pipeline. The example below will clone and run **jaffle shop** using a dbt profile that you supply. It assumes that dbt is installed in the current Python environment and the `profile.yml` is in the same folder as the Python script. +Here's an example **duckdb** profile: -Here's an example **duckdb** profile ```yaml config: # do not track usage, do not create .user.yml @@ -103,13 +99,13 @@ duckdb_dlt_dbt_test: - httpfs - parquet ``` -You can run the example with dbt debug log: `RUNTIME__LOG_LEVEL=DEBUG python dbt_standalone.py` +You can run the example with dbt debug log: `RUNTIME__LOG_LEVEL=DEBUG python dbt_standalone.py` ## Other transforming tools -If you want to transform the data before loading, you can use Python. If you want to transform the -data after loading, you can use dbt or one of the following: +If you want to transform the data before loading, you can use Python. If you want to transform the data after loading, you can use dbt or one of the following: 1. [`dlt` SQL client.](../sql.md) 2. [Pandas.](../pandas.md) + diff --git a/docs/website/docs/dlt-ecosystem/transformations/dbt/dbt_cloud.md b/docs/website/docs/dlt-ecosystem/transformations/dbt/dbt_cloud.md index d15c4eb84c..2ce374a362 100644 --- a/docs/website/docs/dlt-ecosystem/transformations/dbt/dbt_cloud.md +++ b/docs/website/docs/dlt-ecosystem/transformations/dbt/dbt_cloud.md @@ -4,11 +4,11 @@ description: Transforming the data loaded by a dlt pipeline with dbt Cloud keywords: [transform, sql] --- -# DBT Cloud Client and Helper Functions +# Dbt Cloud client and helper functions -## API Client +## API client -The DBT Cloud Client is a Python class designed to interact with the dbt Cloud API (version 2). +The Dbt Cloud client is a Python class designed to interact with the dbt Cloud API (version 2). It provides methods to perform various operations on dbt Cloud, such as triggering job runs and retrieving job run statuses. ```py @@ -26,7 +26,7 @@ run_status = client.get_run_status(run_id=job_run_id) print(f"Job run status: {run_status['status_humanized']}") ``` -## Helper Functions +## Helper functions These Python functions provide an interface to interact with the dbt Cloud API. They simplify the process of triggering and monitoring job runs in dbt Cloud. @@ -65,7 +65,7 @@ from dlt.helpers.dbt_cloud import get_dbt_cloud_run_status status = get_dbt_cloud_run_status(run_id=1234, wait_for_outcome=True) ``` -## Set Credentials +## Set credentials ### secrets.toml @@ -86,7 +86,7 @@ job_id = "set me up!" # optional only for the run_dbt_cloud_job function (you ca run_id = "set me up!" # optional for the get_dbt_cloud_run_status function (you can pass this explicitly as an argument to the function) ``` -### Environment Variables +### Environment variables `dlt` supports reading credentials from the environment. @@ -103,3 +103,4 @@ DBT_CLOUD__JOB_ID ``` For more information, read the [Credentials](https://dlthub.com/docs/general-usage/credentials) documentation. 
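For instance, once the variables listed above are exported in your environment, the helper functions can be called without any explicit credential handling. This is only a sketch and assumes `DBT_CLOUD__JOB_ID` is also set (or that `job_id` is passed explicitly to the function):

```py
from dlt.helpers.dbt_cloud import run_dbt_cloud_job

# The API token, account id, and job id are read from the DBT_CLOUD__* environment
# variables (or from secrets.toml), so the configured job can be triggered without arguments.
status = run_dbt_cloud_job()
print(status)
```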
+ diff --git a/docs/website/docs/dlt-ecosystem/transformations/pandas.md b/docs/website/docs/dlt-ecosystem/transformations/pandas.md index 0e08666eaf..f021409fde 100644 --- a/docs/website/docs/dlt-ecosystem/transformations/pandas.md +++ b/docs/website/docs/dlt-ecosystem/transformations/pandas.md @@ -4,7 +4,7 @@ description: Transform the data loaded by a dlt pipeline with Pandas keywords: [transform, pandas] --- -# Transform the Data with Pandas +# Transform the data with Pandas You can fetch the results of any SQL query as a dataframe. If the destination supports that natively (i.e., BigQuery and DuckDB), `dlt` uses the native method. Thanks to this, reading @@ -22,7 +22,7 @@ with pipeline.sql_client() as client: with client.execute_query( 'SELECT "reactions__+1", "reactions__-1", reactions__laugh, reactions__hooray, reactions__rocket FROM issues' ) as table: - # calling `df` on a cursor, returns the data as a data frame + # Calling `df` on a cursor returns the data as a data frame reactions = table.df() counts = reactions.sum(0).sort_values(0, ascending=False) ``` @@ -32,10 +32,12 @@ chunks by passing the `chunk_size` argument to the `df` method. Once your data is in a Pandas dataframe, you can transform it as needed. -## Other Transforming Tools +## Other transforming tools If you want to transform the data before loading, you can use Python. If you want to transform the data after loading, you can use Pandas or one of the following: 1. [dbt.](dbt/dbt.md) (recommended) 2. [`dlt` SQL client.](sql.md) + + diff --git a/docs/website/docs/dlt-ecosystem/transformations/sql.md b/docs/website/docs/dlt-ecosystem/transformations/sql.md index b358e97b4c..146f912f03 100644 --- a/docs/website/docs/dlt-ecosystem/transformations/sql.md +++ b/docs/website/docs/dlt-ecosystem/transformations/sql.md @@ -6,11 +6,7 @@ keywords: [transform, sql] # Transform the data using the `dlt` SQL client -A simple alternative to dbt is to query the data using the `dlt` SQL client and then perform the -transformations using Python. The `execute_sql` method allows you to execute any SQL statement, -including statements that change the database schema or data in the tables. In the example below, we -insert a row into the `customers` table. Note that the syntax is the same as for any standard `dbapi` -connection. +A simple alternative to dbt is to query the data using the `dlt` SQL client and then perform the transformations using Python. The `execute_sql` method allows you to execute any SQL statement, including statements that change the database schema or data in the tables. In the example below, we insert a row into the `customers` table. Note that the syntax is the same as for any standard `dbapi` connection. ```py pipeline = dlt.pipeline(destination="bigquery", dataset_name="crm") @@ -26,8 +22,7 @@ except Exception: ... ``` -In the case of SELECT queries, the data is returned as a list of rows, with the elements of a row -corresponding to selected columns. +In the case of SELECT queries, the data is returned as a list of rows, with the elements of a row corresponding to selected columns. ```py try: @@ -44,8 +39,8 @@ except Exception: ## Other transforming tools -If you want to transform the data before loading, you can use Python. If you want to transform the -data after loading, you can use SQL or one of the following: +If you want to transform the data before loading, you can use Python. If you want to transform the data after loading, you can use SQL or one of the following: 1. [dbt](dbt/dbt.md) (recommended). 2. 
[Pandas.](pandas.md) + diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/_source-info-header.md b/docs/website/docs/dlt-ecosystem/verified-sources/_source-info-header.md index 112dcf06bf..17822e8dcd 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/_source-info-header.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/_source-info-header.md @@ -3,4 +3,5 @@ import Link from '../../_book-onboarding-call.md'; Join our Slack community or . - \ No newline at end of file + + diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/amazon_kinesis.md b/docs/website/docs/dlt-ecosystem/verified-sources/amazon_kinesis.md index 3e7dad9793..35315d0ab5 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/amazon_kinesis.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/amazon_kinesis.md @@ -29,7 +29,7 @@ You can check out our pipeline example [here](https://github.com/dlt-hub/verified-sources/blob/master/sources/kinesis_pipeline.py). ::: -## Setup Guide +## Setup guide ### Grab credentials @@ -132,7 +132,7 @@ For more information, read [Run a pipeline.](../../walkthroughs/run-a-pipeline) ### Resource `kinesis_stream` This resource reads a Kinesis stream and yields messages. It supports -[incremental loading](../../general-usage/incremental-loading) and parses messages as json by +[incremental loading](../../general-usage/incremental-loading) and parses messages as JSON by default. ```py @@ -177,7 +177,7 @@ def kinesis_stream( You create a resource `kinesis_stream` by passing the stream name and a few other options. The resource will have the same name as the stream. When you iterate this resource (or pass it to `pipeline.run` records), it will query Kinesis for all the shards in the requested stream. For each - shard, it will create an iterator to read messages: +shard, it will create an iterator to read messages: 1. If `initial_at_timestamp` is present, the resource will read all messages after this timestamp. 1. If `initial_at_timestamp` is 0, only the messages at the tip of the stream are read. @@ -202,13 +202,13 @@ if False, `data` is returned as bytes. ## Customization + + ### Create your own pipeline -If you wish to create your own pipelines, you can leverage source and resource methods from this -verified source. +If you wish to create your own pipelines, you can leverage source and resource methods from this verified source. -1. Configure the [pipeline](../../general-usage/pipeline) by specifying the pipeline name, - destination, and dataset as follows: +1. Configure the [pipeline](../../general-usage/pipeline) by specifying the pipeline name, destination, and dataset as follows: ```py pipeline = dlt.pipeline( @@ -221,9 +221,9 @@ verified source. 1. To load messages from a stream from the last one hour: ```py - # the resource below will take its name from the stream name, - # it can be used multiple times by default it assumes that Data is json and parses it, - # here we disable that to just get bytes in data elements of the message + # The resource below will take its name from the stream name. + # It can be used multiple times. By default, it assumes that data is JSON and parses it. + # Here we disable that to just get bytes in data elements of the message. kinesis_stream_data = kinesis_stream( "kinesis_source_name", parse_json=False, @@ -236,7 +236,7 @@ verified source. 1. For incremental Kinesis streams, to fetch only new messages: ```py - #running pipeline will get only new messages + # Running the pipeline will get only new messages. 
info = pipeline.run(kinesis_stream_data) message_counts = pipeline.last_trace.last_normalize_info.row_counts if "kinesis_source_name" not in message_counts: @@ -245,7 +245,7 @@ verified source. print(pipeline.last_trace.last_normalize_info) ``` -1. To parse json with a simple decoder: +1. To parse JSON with a simple decoder: ```py def _maybe_parse_json(item: TDataItem) -> TDataItem: @@ -267,23 +267,23 @@ verified source. STATE_FILE = "kinesis_source_name.state.json" - # load the state if it exists + # Load the state if it exists. if os.path.exists(STATE_FILE): with open(STATE_FILE, "rb") as f: state = json.typed_loadb(f.read()) else: - # provide new state + # Provide new state. state = {} with Container().injectable_context( StateInjectableContext(state=state) ) as managed_state: - # dlt resources/source is just an iterator + # dlt resources/source is just an iterator. for message in kinesis_stream_data: - # here you can send the message somewhere + # Here you can send the message somewhere. print(message) - # save state after each message to have full transaction load - # dynamodb is also OK + # Save state after each message to have full transaction load. + # DynamoDB is also OK. with open(STATE_FILE, "wb") as f: json.typed_dump(managed_state.state, f) print(managed_state.state) diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/arrow-pandas.md b/docs/website/docs/dlt-ecosystem/verified-sources/arrow-pandas.md index 4a5cdd2f71..ee0a938e5f 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/arrow-pandas.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/arrow-pandas.md @@ -13,12 +13,12 @@ You can load data directly from an Arrow table or Pandas dataframe. This is supported by all destinations, but recommended especially when using destinations that support the `parquet` file format natively (e.g. [Snowflake](../destinations/snowflake.md) and [Filesystem](../destinations/filesystem.md)). See the [destination support](#destination-support-and-fallback) section for more information. -When used with a `parquet` supported destination this is a more performant way to load structured data since `dlt` bypasses many processing steps normally involved in passing JSON objects through the pipeline. +When used with a `parquet` supported destination, this is a more performant way to load structured data since `dlt` bypasses many processing steps normally involved in passing JSON objects through the pipeline. `dlt` automatically translates the Arrow table's schema to the destination table's schema and writes the table to a parquet file which gets uploaded to the destination without any further processing. ## Usage -To write an Arrow source, pass any `pyarrow.Table`, `pyarrow.RecordBatch` or `pandas.DataFrame` object (or list of thereof) to the pipeline's `run` or `extract` method, or yield table(s)/dataframe(s) from a `@dlt.resource` decorated function. +To write an Arrow source, pass any `pyarrow.Table`, `pyarrow.RecordBatch` or `pandas.DataFrame` object (or list thereof) to the pipeline's `run` or `extract` method, or yield table(s)/dataframe(s) from a `@dlt.resource` decorated function. 
This example loads a Pandas dataframe to a Snowflake table: @@ -61,7 +61,7 @@ Destinations that support the `parquet` format natively will have the data files When the destination does not support `parquet`, the rows are extracted from the table and written in the destination's native format (usually `insert_values`) and this is generally much slower as it requires processing the table row by row and rewriting data to disk. -The output file format is chosen automatically based on the destination's capabilities, so you can load arrow or pandas frames to any destination but performance will vary. +The output file format is chosen automatically based on the destination's capabilities, so you can load Arrow or Pandas frames to any destination but performance will vary. ### Destinations that support parquet natively for direct loading * duckdb & motherduck @@ -89,13 +89,13 @@ add_dlt_id = true Keep in mind that enabling these incurs some performance overhead: -- `add_dlt_load_id` has minimal overhead since the column is added to arrow table in memory during `extract` stage, before parquet file is written to disk -- `add_dlt_id` adds the column during `normalize` stage after file has been extracted to disk. The file needs to be read back from disk in chunks, processed and rewritten with new columns +- `add_dlt_load_id` has minimal overhead since the column is added to the Arrow table in memory during the `extract` stage, before the parquet file is written to disk. +- `add_dlt_id` adds the column during the `normalize` stage after the file has been extracted to disk. The file needs to be read back from disk in chunks, processed, and rewritten with new columns. ## Incremental loading with Arrow tables You can use incremental loading with Arrow tables as well. -Usage is the same as without other dlt resources. Refer to the [incremental loading](/general-usage/incremental-loading.md) guide for more information. +Usage is the same as with other dlt resources. Refer to the [incremental loading](/general-usage/incremental-loading.md) guide for more information. Example: @@ -104,12 +104,12 @@ import dlt from dlt.common import pendulum import pandas as pd -# Create a resource using that yields a dataframe, using the `ordered_at` field as an incremental cursor +# Create a resource that yields a dataframe, using the `ordered_at` field as an incremental cursor @dlt.resource(primary_key="order_id") def orders(ordered_at = dlt.sources.incremental('ordered_at')): # Get dataframe/arrow table from somewhere # If your database supports it, you can use the last_value to filter data at the source. - # Otherwise it will be filtered automatically after loading the data. + # Otherwise, it will be filtered automatically after loading the data. df = get_orders(since=ordered_at.last_value) yield df @@ -124,7 +124,7 @@ Look at the [Connector X + Arrow Example](../../examples/connector_x_arrow/) to ::: ## Loading JSON documents -If you want to skip default `dlt` JSON normalizer, you can use any available method to convert JSON documents into tabular data. +If you want to skip the default `dlt` JSON normalizer, you can use any available method to convert JSON documents into tabular data. 
* **pandas** has `read_json` and `json_normalize` methods * **pyarrow** can infer table schema and convert JSON files into tables with `read_json` * **duckdb** can do the same with `read_json_auto` @@ -153,20 +153,21 @@ The Arrow data types are translated to dlt data types as follows: | `int` | `bigint` | Precision is determined by the bit width. | | `binary` | `binary` | | | `decimal` | `decimal` | Precision and scale are determined by the type properties. | -| `struct` | `json` | | +| `struct` | `json` | | | | | | ## Loading nested types -All struct types are represented as `json` and will be loaded as JSON (if destination permits) or a string. Currently we do not support **struct** types, -even if they are present in the destination (except **BigQuery** which can be [configured to handle them](../destinations/bigquery.md#use-bigquery-schema-autodetect-for-nested-fields)) +All struct types are represented as `json` and will be loaded as JSON (if the destination permits) or a string. Currently, we do not support **struct** types, +even if they are present in the destination (except **BigQuery** which can be [configured to handle them](../destinations/bigquery.md#use-bigquery-schema-autodetect-for-nested-fields)). -If you want to represent nested data as separated tables, you must yield panda frames and arrow tables as records. In the examples above: +If you want to represent nested data as separate tables, you must yield pandas frames and arrow tables as records. In the examples above: ```py -# yield panda frame as records +# yield pandas frame as records pipeline.run(df.to_dict(orient='records'), table_name="orders") # yield arrow table pipeline.run(table.to_pylist(), table_name="orders") ``` -Both Pandas and Arrow allow to stream records in batches. +Both Pandas and Arrow allow streaming records in batches. + diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/github.md b/docs/website/docs/dlt-ecosystem/verified-sources/github.md index 830f4035d8..239e47e51b 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/github.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/github.md @@ -16,14 +16,14 @@ Resources that can be loaded using this verified source are: | Name | Description | | ---------------- |----------------------------------------------------------------------------------| -| github_reactions | Retrieves all issues, pull requests, comments and reactions associated with them | +| github_reactions | Retrieves all issues, pull requests, comments, and reactions associated with them | | github_repo_events | Gets all the repo events associated with the repository | -## Setup Guide +## Setup guide ### Grab credentials -To get the API token, sign-in to your GitHub account and follow these steps: +To get the API token, sign in to your GitHub account and follow these steps: 1. Click on your profile picture in the top right corner. @@ -32,7 +32,7 @@ To get the API token, sign-in to your GitHub account and follow these steps: 1. Select "Developer settings" on the left panel. 1. Under "Personal access tokens", click on "Generate a personal access token (preferably under - Tokens(classic))". + Tokens (classic))". 1. Grant at least the following scopes to the token by checking them. 
@@ -42,7 +42,7 @@ To get the API token, sign-in to your GitHub account and follow these steps: | read:repo_hook | Grants read and ping access to hooks in public or private repositories | | read:org | Read-only access to organization membership, organization projects, and team membership | | read:user | Grants access to read a user's profile data | - | read:project | Grants read only access to user and organization projects | + | read:project | Grants read-only access to user and organization projects | | read:discussion | Allows read access for team discussions | 1. Finally, click "Generate token". @@ -52,11 +52,11 @@ To get the API token, sign-in to your GitHub account and follow these steps: > You can optionally add API access tokens to avoid making requests as an unauthorized user. > If you wish to load data using the github_reaction source, the access token is mandatory. -More information you can see in the +More information can be found in the [GitHub authentication](https://docs.github.com/en/rest/overview/authenticating-to-the-rest-api?apiVersion=2022-11-28#basic-authentication) and [GitHub API token scopes](https://docs.github.com/en/apps/oauth-apps/building-oauth-apps/scopes-for-oauth-apps) -documentations. +documentation. ### Initialize the verified source @@ -89,7 +89,7 @@ For more information, read the guide on [how to add a verified source](../../wal ```toml # Put your secret values and credentials here - # Github access token (must be classic for reactions source) + # GitHub access token (must be classic for reactions source) [sources.github] access_token="please set me up!" # use GitHub access token here ``` @@ -101,7 +101,7 @@ For more information, read the guide on [how to add a verified source](../../wal add credentials for your chosen destination, ensuring proper routing of your data to the final destination. -For more information, read the [General Usage: Credentials.](../../general-usage/credentials) +For more information, read the [General usage: Credentials.](../../general-usage/credentials) ## Run the pipeline @@ -119,7 +119,7 @@ For more information, read the [General Usage: Credentials.](../../general-usage ```sh dlt pipeline show ``` - For example, the `pipeline_name` for the above pipeline example is `github_reactions`, you may + For example, the `pipeline_name` for the above pipeline example is `github_reactions`. You may also use any custom name instead. For more information, read the guide on [how to run a pipeline](../../walkthroughs/run-a-pipeline). @@ -151,7 +151,7 @@ def github_reactions( `name`: Refers to the name of the repository. -`access_token`: Classic access token should be utilized and is stored in the `.dlt/secrets.toml` +`access_token`: A classic access token should be utilized and is stored in the `.dlt/secrets.toml` file. `items_per_page`: The number of issues/pull requests to retrieve in a single page. Defaults to 100. @@ -185,11 +185,9 @@ dlt.resource( ### Source `github_repo_events` -This `dlt.source` fetches repository events incrementally, dispatching them to separate tables based -on event type. It loads new events only and appends them to tables. +This `dlt.source` fetches repository events incrementally, dispatching them to separate tables based on event type. It loads new events only and appends them to tables. -> Note: Github allows retrieving up to 300 events for public repositories, so frequent updates are -> recommended for active repos. 
+> Note: Github allows retrieving up to 300 events for public repositories, so frequent updates are recommended for active repos. ```py @dlt.source(max_table_nesting=2) @@ -203,8 +201,7 @@ def github_repo_events( `name`: Denotes the name of the repository. -`access_token`: Optional classic or fine-grained access token. If not provided, calls are made -anonymously. +`access_token`: Optional classic or fine-grained access token. If not provided, calls are made anonymously. `max_table_nesting=2` sets the maximum nesting level to 2. @@ -212,8 +209,7 @@ Read more about [nesting levels](../../general-usage/source#reduce-the-nesting-l ### Resource `repo_events` -This `dlt.resource` function serves as the resource for the `github_repo_events` source. It yields -repository events as data items. +This `dlt.resource` function serves as the resource for the `github_repo_events` source. It yields repository events as data items. ```py dlt.resource(primary_key="id", table_name=lambda i: i["type"]) # type: ignore @@ -229,9 +225,7 @@ def repo_events( `table_name`: Routes data to appropriate tables based on the data type. -`last_created_at`: This parameter determines the initial value for "last_created_at" in -dlt.sources.incremental. If no value is given, the default "initial_value" is used. The function -"last_value_func" determines the most recent 'created_at' value. +`last_created_at`: This parameter determines the initial value for "last_created_at" in dlt.sources.incremental. If no value is given, the default "initial_value" is used. The function "last_value_func" determines the most recent 'created_at' value. Read more about [incremental loading](../../general-usage/incremental-loading#incremental_loading-with-last-value). @@ -239,8 +233,7 @@ Read more about [incremental loading](../../general-usage/incremental-loading#in ### Create your own pipeline -If you wish to create your own pipelines, you can leverage source and resource methods from this -verified source. +If you wish to create your own pipelines, you can leverage source and resource methods from this verified source. 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: @@ -252,18 +245,16 @@ verified source. ) ``` - To read more about pipeline configuration, please refer to our - [documentation](../../general-usage/pipeline). + To read more about pipeline configuration, please refer to our [documentation](../../general-usage/pipeline). -1. To load all the data from repo on issues, pull requests, their comments and reactions, you can do - the following: +1. To load all the data from the repo on issues, pull requests, their comments, and reactions, you can do the following: ```py load_data = github_reactions("duckdb", "duckdb") load_info = pipeline.run(load_data) print(load_info) ``` - here, "duckdb" is the owner of the repository and the name of the repository. + Here, "duckdb" is the owner of the repository and the name of the repository. 1. To load only the first 100 issues, you can do the following: @@ -273,8 +264,7 @@ verified source. print(load_info) ``` -1. You can use fetch and process repo events data incrementally. It loads all data during the first - run and incrementally in subsequent runs. +1. You can fetch and process repo events data incrementally. It loads all data during the first run and incrementally in subsequent runs. ```py load_data = github_repo_events( @@ -287,3 +277,4 @@ verified source. It is optional to use `access_token` or make anonymous API calls. 
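As a sketch of that last point, you could read a token from an environment variable of your choosing and fall back to anonymous calls when it is absent. The `GITHUB_ACCESS_TOKEN` variable name is arbitrary, and the `github` import assumes the source folder created by `dlt init github duckdb`:

```py
import os

import dlt
from github import github_repo_events

# If the variable is not set, access_token stays None and requests are made anonymously,
# which is typically subject to much lower rate limits.
access_token = os.environ.get("GITHUB_ACCESS_TOKEN")

pipeline = dlt.pipeline(
    pipeline_name="github_events",
    destination="duckdb",
    dataset_name="github_events_data",
)
load_info = pipeline.run(github_repo_events("duckdb", "duckdb", access_token=access_token))
print(load_info)
```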
+ diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/google_analytics.md b/docs/website/docs/dlt-ecosystem/verified-sources/google_analytics.md index 7b4c1b0d5e..a4edb728d4 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/google_analytics.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/google_analytics.md @@ -25,7 +25,7 @@ Sources and resources that can be loaded using this verified source are: | metrics_table | Assembles and presents data relevant to the report's metrics | | dimensions_table | Compiles and displays data related to the report's dimensions | -## Setup Guide +## Setup guide ### Grab credentials @@ -103,7 +103,7 @@ python google_analytics/setup_script_gcp_oauth.py Once you have executed the script and completed the authentication, you will receive a "refresh token" that can be used to set up the "secrets.toml". -### Share the Google Analytics Property with the API: +### Share the Google Analytics property with the API: > Note: For service account authentication, use the client_email. For OAuth authentication, use the > email associated with the app creation and refresh token generation. @@ -183,40 +183,32 @@ For more information, read the guide on [how to add a verified source](../../wal #### Pass `property_id` and `request parameters` -1. `property_id` is a unique number that identifies a particular property. You will need to - explicitly pass it to get data from the property that you're interested in. For example, if the - property that you want to get data from is “GA4-Google Merch Shop” then you will need to pass its - property id "213025502". +1. `property_id` is a unique number that identifies a particular property. You will need to explicitly pass it to get data from the property that you're interested in. For example, if the property that you want to get data from is “GA4-Google Merch Shop” then you will need to pass its property id "213025502". ![Property ID](./docs_images/GA4_Property_ID_size.png) -1. You can also specify the parameters of the API requests such as dimensions and metrics to get - your desired data. +1. You can also specify the parameters of the API requests, such as dimensions and metrics, to get your desired data. -1. An example of how you can pass all of this to `dlt` is to simply insert it in the - `.dlt/config.toml` file as below: +1. An example of how you can pass all of this to `dlt` is to simply insert it in the `.dlt/config.toml` file as below: ```toml [sources.google_analytics] - property_id = "213025502" # this is example property id, please use yours + property_id = "213025502" # this is an example property id, please use yours queries = [ {"resource_name"= "sample_analytics_data1", "dimensions"= ["browser", "city"], "metrics"= ["totalUsers", "transactions"]}, {"resource_name"= "sample_analytics_data2", "dimensions"= ["browser", "city", "dateHour"], "metrics"= ["totalUsers"]} ] ``` - > Include request parameters in a queries list. The data from each request fills a table, with - > resources named by resource name, with dimensions. See the above example for reference. + > Include request parameters in a queries list. The data from each request fills a table, with resources named by resource name, with dimensions. See the above example for reference. -1. To use queries from `.dlt/config.toml`, run the `simple_load_config()` function in - [pipeline example](https://github.com/dlt-hub/verified-sources/blob/master/sources/google_analytics_pipeline.py). +1. 
To use queries from `.dlt/config.toml`, run the `simple_load_config()` function in [pipeline example](https://github.com/dlt-hub/verified-sources/blob/master/sources/google_analytics_pipeline.py). For more information, read the [General Usage: Credentials.](../../general-usage/credentials) ## Run the pipeline -1. Before running the pipeline, ensure that you have installed all the necessary dependencies by - running the command: +1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: ```sh pip install -r requirements.txt ``` @@ -224,25 +216,21 @@ For more information, read the [General Usage: Credentials.](../../general-usage ```sh python google_analytics_pipeline.py ``` -1. Once the pipeline has finished running, you can verify that everything loaded correctly by using - the following command: +1. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: ```sh dlt pipeline show ``` - For example, the `pipeline_name` for the above pipeline example is - `dlt_google_analytics_pipeline`, you may also use any custom name instead. + For example, the `pipeline_name` for the above pipeline example is `dlt_google_analytics_pipeline`, you may also use any custom name instead. For more information, read the guide on [how to run a pipeline](../../walkthroughs/run-a-pipeline). ## Sources and resources -`dlt` works on the principle of [sources](../../general-usage/source) and -[resources](../../general-usage/resource). +`dlt` works on the principle of [sources](../../general-usage/source) and [resources](../../general-usage/resource). ### Source `simple_load` -This function returns a list of resources including metadata, metrics, and dimensions data from -the Google Analytics API. +This function returns a list of resources, including metadata, metrics, and dimensions data from the Google Analytics API. ```py @dlt.source(max_table_nesting=2) @@ -260,14 +248,11 @@ def google_analytics( `property_id`: This is a unique identifier for a Google Analytics property. -`queries`: This is a list of queries outlining the API request parameters like dimensions and -metrics. +`queries`: This is a list of queries outlining the API request parameters like dimensions and metrics. -`start_date`: This optional parameter determines the starting date for data loading. By default, -it's set to "2000-01-01". +`start_date`: This optional parameter determines the starting date for data loading. By default, it's set to "2000-01-01". -`rows_per_page`: This parameter specifies the number of rows to fetch per page. By default, it is -set to 1000. +`rows_per_page`: This parameter specifies the number of rows to fetch per page. By default, it is set to 1000. ### Resource `get_metadata` @@ -281,13 +266,11 @@ def get_metadata(client: Resource, property_id: int) -> Iterator[Metadata]: `client`: This is the Google Analytics client used to make requests. -`property_id`: This is a reference to the Google Analytics project. For more information, click -[here](https://developers.google.com/analytics/devguides/reporting/data/v1/property-id). +`property_id`: This is a reference to the Google Analytics project. For more information, click [here](https://developers.google.com/analytics/devguides/reporting/data/v1/property-id). ### Transformer `metrics_table` -This transformer function extracts data using metadata and populates a table called "metrics" with -the data from each metric. 
+This transformer function extracts data using metadata and populates a table called "metrics" with the data from each metric. ```py @dlt.transformer(data_from=get_metadata, write_disposition="replace", name="metrics") @@ -298,14 +281,12 @@ def metrics_table(metadata: Metadata) -> Iterator[TDataItem]: `metadata`: GA4 metadata is stored in this "Metadata" class object. -Similarly, there is a transformer function called `dimensions_table` that populates a table called -"dimensions" with the data from each dimension. +Similarly, there is a transformer function called `dimensions_table` that populates a table called "dimensions" with the data from each dimension. ## Customization ### Create your own pipeline -If you wish to create your own pipelines, you can leverage source and resource methods from this -verified source. +If you wish to create your own pipelines, you can leverage source and resource methods from this verified source. 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: @@ -328,7 +309,7 @@ verified source. print(load_info) ``` - > Loads all the data till date in the first run, and then + > Loads all the data to date in the first run, and then > [incrementally](https://dlthub.com/docs/general-usage/incremental-loading) in subsequent runs. 1. To load data from a specific start date: diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/google_sheets.md b/docs/website/docs/dlt-ecosystem/verified-sources/google_sheets.md index 9cd6ad8079..59cf961186 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/google_sheets.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/google_sheets.md @@ -14,7 +14,7 @@ offered by Google as part of its Google Workspace suite. This Google Sheets `dlt` verified source and [pipeline example](https://github.com/dlt-hub/verified-sources/blob/master/sources/google_sheets_pipeline.py) -loads data using “Google Sheets API” to the destination of your choice. +loads data using the “Google Sheets API” to the destination of your choice. Sources and resources that can be loaded using this verified source are: @@ -24,14 +24,14 @@ Sources and resources that can be loaded using this verified source are: | range_names | Processes the range and yields data from each range | | spreadsheet_info | Information about the spreadsheet and the ranges processed | -## Setup Guide +## Setup guide ### Grab credentials There are two methods to get authenticated for using this verified source: - OAuth credentials -- Service account credential +- Service account credentials Here we'll discuss how to set up both OAuth tokens and service account credentials. In general, OAuth tokens are preferred when user consent is required, while service account credentials are @@ -41,14 +41,14 @@ credentials. You can choose the method of authentication as per your requirement #### Google service account credentials You need to create a GCP service account to get API credentials if you don't have one. To create - one, follow these steps: +one, follow these steps: 1. Sign in to [console.cloud.google.com](http://console.cloud.google.com/). 1. [Create a service account](https://cloud.google.com/iam/docs/service-accounts-create#creating) if needed. -1. Enable "Google Sheets API", refer +1. Enable "Google Sheets API", refer to [Google documentation](https://developers.google.com/sheets/api/guides/concepts) for comprehensive instructions on this process. @@ -74,12 +74,12 @@ follow these steps: 1. 
Go to Credentials -> OAuth client ID -> Select Desktop App from the Application type and give an appropriate name. -1. Download the credentials and fill "client_id", "client_secret" and "project_id" in +1. Download the credentials and fill in "client_id", "client_secret" and "project_id" in "secrets.toml". 1. Go back to credentials and select the OAuth consent screen on the left. -1. Fill in the App name, user support email(your email), authorized domain (localhost.com), and dev +1. Fill in the App name, user support email (your email), authorized domain (localhost.com), and dev contact info (your email again). 1. Add the following scope: @@ -104,6 +104,8 @@ follow these steps: ### Prepare your data + + #### Share Google Sheet with the email > Note: For service account authentication, use the client_email. For OAuth authentication, use the @@ -137,12 +139,11 @@ or spreadsheet id (which is a part of the url) typically you pass it directly to the [google_spreadsheet function](#create-your-own-pipeline) or in [config.toml](#add-credentials) as defined here. - -You can provide specific ranges to `google_spreadsheet` pipeline, as detailed in following. +You can provide specific ranges to `google_spreadsheet` pipeline, as detailed in the following. #### Guidelines about headers -Make sure your data has headers and is in the form of well-structured table. +Make sure your data has headers and is in the form of a well-structured table. The first row of any extracted range should contain headers. Please make sure: @@ -154,27 +155,20 @@ The first row of any extracted range should contain headers. Please make sure: > log. Hence, we advise running your pipeline script manually/locally and fixing all the problems. 1. Columns without headers will be removed and not extracted. 1. Columns with headers that do not contain any data will be removed. -1. If there are any problems with reading headers (i.e. header is not string or is empty or not +1. If there are any problems with reading headers (i.e. header is not a string or is empty or not unique): the headers row will be extracted as data and automatic header names will be used. -1. Empty rows are ignored +1. Empty rows are ignored. 1. `dlt` will normalize range names and headers into table and column names - so they may be different in the database than in Google Sheets. Prefer small cap names without special characters. - #### Guidelines about named ranges -We recommend to use -[Named Ranges](https://support.google.com/docs/answer/63175?hl=en&co=GENIE.Platform%3DDesktop) to -indicate which data should be extracted from a particular spreadsheet, and this is how this source -will work by default - when called without setting any other options. All the named ranges will be -converted into tables, named after them and stored in the destination. +We recommend using [Named Ranges](https://support.google.com/docs/answer/63175?hl=en&co=GENIE.Platform%3DDesktop) to indicate which data should be extracted from a particular spreadsheet, and this is how this source will work by default - when called without setting any other options. All the named ranges will be converted into tables, named after them, and stored in the destination. -1. You can let the spreadsheet users add and remove tables by just adding/removing the ranges, - you do not need to configure the pipeline again. +1. You can let the spreadsheet users add and remove tables by just adding/removing the ranges; you do not need to configure the pipeline again. -1. 
You can indicate exactly the fragments of interest, and only this data will be retrieved, so it is - the fastest. +1. You can indicate exactly the fragments of interest, and only this data will be retrieved, so it is the fastest. 1. You can name database tables by changing the range names. @@ -194,16 +188,13 @@ converted into tables, named after them and stored in the destination. If you are not happy with the workflow above, you can: -1. Disable it by setting `get_named_ranges` option to `False`. +1. Disable it by setting the `get_named_ranges` option to `False`. -1. Enable retrieving all sheets/tabs with get_sheets option set to `True`. +1. Enable retrieving all sheets/tabs with the get_sheets option set to `True`. 1. Pass a list of ranges as supported by Google Sheets in range_names. - > Note: To retrieve all named ranges with "get_named_ranges" or all sheets with "get_sheets" - > methods, pass an empty `range_names` list as `range_names = []`. Even when you use a set - > "get_named_ranges" to false pass the range_names as an empty list to get all the sheets with - > "get_sheets" method. + > Note: To retrieve all named ranges with "get_named_ranges" or all sheets with "get_sheets" methods, pass an empty `range_names` list as `range_names = []`. Even when you use a set "get_named_ranges" to false, pass the range_names as an empty list to get all the sheets with the "get_sheets" method. ### Initialize the verified source @@ -215,16 +206,11 @@ To get started with your data pipeline, follow these steps: dlt init google_sheets duckdb ``` - [This command](../../reference/command-line-interface) will initialize - [the pipeline example](https://github.com/dlt-hub/verified-sources/blob/master/sources/google_sheets_pipeline.py) - with Google Sheets as the [source](../../general-usage/source) and - [duckdb](../destinations/duckdb.md) as the [destination](../destinations). + [This command](../../reference/command-line-interface) will initialize [the pipeline example](https://github.com/dlt-hub/verified-sources/blob/master/sources/google_sheets_pipeline.py) with Google Sheets as the [source](../../general-usage/source) and [duckdb](../destinations/duckdb.md) as the [destination](../destinations). -1. If you'd like to use a different destination, simply replace `duckdb` with the name of your - preferred [destination](../destinations). +1. If you'd like to use a different destination, simply replace `duckdb` with the name of your preferred [destination](../destinations). -1. After running this command, a new directory will be created with the necessary files and - configuration settings to get started. +1. After running this command, a new directory will be created with the necessary files and configuration settings to get started. For more information, read the guide on [how to add a verified source](../../walkthroughs/add-a-verified-source). @@ -260,7 +246,7 @@ For more information, read the guide on [how to add a verified source](../../wal 1. Finally, enter credentials for your chosen destination as per the [docs](../destinations/). -1. Next you need to configure ".dlt/config.toml", which looks like: +1. Next, you need to configure ".dlt/config.toml", which looks like: ```toml [sources.google_sheets] @@ -317,13 +303,9 @@ For more information, read the guide on [how to run a pipeline](../../walkthroug ## Data types -The `dlt` normalizer uses the first row of data to infer types and attempts to coerce subsequent rows, creating variant columns if unsuccessful. This is standard behavior. 

-If `dlt` did not correctly determine the data type in the column, or you want to change the data type for other reasons,
-then you can provide a type hint for the affected column in the resource.
-Also, since recently `dlt`'s no longer recognizing date and time types, so you have to designate it yourself as `timestamp`.
+The `dlt` normalizer uses the first row of data to infer types and attempts to coerce subsequent rows, creating variant columns if unsuccessful. This is standard behavior. If `dlt` did not correctly determine the data type in a column, or you want to change the data type for other reasons, then you can provide a type hint for the affected column in the resource. Also, as of recent `dlt` versions, date and time types are no longer recognized automatically, so you have to designate them yourself as `timestamp`.
 
-Use the `apply_hints` method on the resource to achieve this.
-Here's how you can do it:
+Use the `apply_hints` method on the resource to achieve this. Here's how you can do it:
 
 ```py
 for resource in resources:
@@ -332,11 +314,11 @@ for resource in resources:
         "date": {"data_type": "timestamp"},
     })
 ```
-In this example, the `total_amount` column is enforced to be of type double and `date` is enforced to be of type timestamp.
-This will ensure that all values in the `total_amount` column are treated as `double`, regardless of whether they are integers or decimals in the original Google Sheets data.
-And `date` column will be represented as dates, not integers.
+
+In this example, the `total_amount` column is enforced to be of type double and `date` is enforced to be of type timestamp. This will ensure that all values in the `total_amount` column are treated as `double`, regardless of whether they are integers or decimals in the original Google Sheets data. And the `date` column will be represented as dates, not integers.
 
 For a single resource (e.g. `Sheet1`), you can simply use:
+
 ```py
 source.Sheet1.apply_hints(columns={
     "total_amount": {"data_type": "double"},
@@ -345,28 +327,24 @@ source.Sheet1.apply_hints(columns={
 ```
 
 To get the name of resources, you can use:
+
 ```py
 print(source.resources.keys())
 ```
 
-To read more about tables, columns, and datatypes, please refer to [our documentation here.](../../general-usage/schema#tables-and-columns)
+To read more about tables, columns, and data types, please refer to [our documentation here.](../../general-usage/schema#tables-and-columns)
 
 :::caution
-`dlt` will **not modify** tables after they are created.
-So if you changed data types with hints,
-then you need to **delete the dataset**
-or set `dev_mode=True`.
+`dlt` will **not modify** tables after they are created. So if you changed data types with hints, then you need to **delete the dataset** or set `dev_mode=True`.
 :::
 
 ## Sources and resources
 
-`dlt` works on the principle of [sources](../../general-usage/source) and
-[resources](../../general-usage/resource).
+`dlt` works on the principle of [sources](../../general-usage/source) and [resources](../../general-usage/resource).
 
 ### Source `google_spreadsheet`
 
-This function loads data from a Google Spreadsheet. It retrieves data from all specified ranges,
-whether explicitly defined or named, and obtains metadata for the first two rows within each range.
+This function loads data from a Google Spreadsheet. It retrieves data from all specified ranges, whether explicitly defined or named, and obtains metadata for the first two rows within each range.
```py def google_spreadsheet( @@ -389,13 +367,11 @@ def google_spreadsheet( `get_sheets`: If True, imports all spreadsheet sheets into the database. -`get_named_ranges`: If True, imports either all named ranges or those -[specified](google_sheets.md#guidelines-about-named-ranges) into the database. +`get_named_ranges`: If True, imports either all named ranges or those [specified](google_sheets.md#guidelines-about-named-ranges) into the database. ### Resource `range_names` -This function processes each range name provided by the source function, loading its data into -separate tables in the destination. +This function processes each range name provided by the source function, loading its data into separate tables in the destination. ```py dlt.resource( @@ -405,14 +381,13 @@ dlt.resource( ) ``` -`process_range`: Function handles rows from a specified Google Spreadsheet range, taking data rows, -headers, and data types as arguments. +`process_range`: Function handles rows from a specified Google Spreadsheet range, taking data rows, headers, and data types as arguments. `name`: Specifies the table's name, derived from the spreadsheet range. `write_disposition`: Dictates how data is loaded to the destination. -> Please Note: +> Please note: > > 1. Empty rows are ignored. > 1. Empty cells are converted to None (and then to NULL by dlt). @@ -444,14 +419,15 @@ dlt.resource( [Read more](https://dlthub.com/docs/general-usage/incremental-loading#the-3-write-dispositions). `merge_key`: Parameter is used to specify the column used to identify records for merging. In this -case,"spreadsheet_id", means that the records will be merged based on the values in this column. +case, "spreadsheet_id" means that the records will be merged based on the values in this column. [Read more](https://dlthub.com/docs/general-usage/incremental-loading#merge-incremental_loading). ## Customization + + ### Create your own pipeline -If you wish to create your own pipelines, you can leverage source and resource methods from this -verified source. +If you wish to create your own pipelines, you can leverage source and resource methods from this verified source. 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: @@ -467,7 +443,7 @@ verified source. ```py load_data = google_spreadsheet( - "https://docs.google.com/spreadsheets/d/1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", #Spreadsheet URL + "https://docs.google.com/spreadsheets/d/1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", # Spreadsheet URL range_names=["range_name1", "range_name2"], # Range names get_sheets=False, get_named_ranges=False, @@ -479,11 +455,11 @@ verified source. > Note: You can pass the URL or spreadsheet ID and range names explicitly or in > ".dlt/config.toml". -1. To load all the range_names from spreadsheet: +1. To load all the range names from the spreadsheet: ```py load_data = google_spreadsheet( - "https://docs.google.com/spreadsheets/d/1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", #Spreadsheet URL + "https://docs.google.com/spreadsheets/d/1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", # Spreadsheet URL get_sheets=False, get_named_ranges=True, ) @@ -493,11 +469,11 @@ verified source. > Pass an empty list to range_names in ".dlt/config.toml" to retrieve all range names. -1. To load all the sheets from spreadsheet: +1. 
To load all the sheets from the spreadsheet: ```py load_data = google_spreadsheet( - "https://docs.google.com/spreadsheets/d/1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", #Spreadsheet URL + "https://docs.google.com/spreadsheets/d/1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", # Spreadsheet URL get_sheets=True, get_named_ranges=False, ) @@ -507,11 +483,11 @@ verified source. > Pass an empty list to range_names in ".dlt/config.toml" to retrieve all sheets. -1. To load all the sheets and range_names: +1. To load all the sheets and range names: ```py load_data = google_spreadsheet( - "https://docs.google.com/spreadsheets/d/1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", #Spreadsheet URL + "https://docs.google.com/spreadsheets/d/1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", # Spreadsheet URL get_sheets=True, get_named_ranges=True, ) @@ -525,17 +501,17 @@ verified source. ```py load_data1 = google_spreadsheet( - "https://docs.google.com/spreadsheets/d/43lkHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", #Spreadsheet URL + "https://docs.google.com/spreadsheets/d/43lkHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", # Spreadsheet URL range_names=["Sheet 1!A1:B10"], get_named_ranges=False, ) load_data2 = google_spreadsheet( - "https://docs.google.com/spreadsheets/d/3jo4HjqouQnnCIZAFa2rL6vT91YRN8aIhts22SKKO390/edit#gid=0", #Spreadsheet URL + "https://docs.google.com/spreadsheets/d/3jo4HjqouQnnCIZAFa2rL6vT91YRN8aIhts22SKKO390/edit#gid=0", # Spreadsheet URL range_names=["Sheet 1!B1:C10"], get_named_ranges=True, ) - load_info = pipeline.run([load_data1,load_data2]) + load_info = pipeline.run([load_data1, load_data2]) print(load_info) ``` @@ -543,9 +519,9 @@ verified source. ```py load_data = google_spreadsheet( - "https://docs.google.com/spreadsheets/d/43lkHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", #Spreadsheet URL - range_names=["Sheet 1!A1:B10"], - get_named_ranges=False, + "https://docs.google.com/spreadsheets/d/43lkHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", # Spreadsheet URL + range_names=["Sheet 1!A1:B10"], + get_named_ranges=False, ) data.resources["Sheet 1!A1:B10"].apply_hints(table_name="loaded_data_1") @@ -570,13 +546,13 @@ Consider the following when using Google Spreadsheets with Airflow: `Airflow Helper Caution` - Avoid using `scc decomposition` because it unnecessarily creates a new source instance for every specified data range. This is not efficient and can cause redundant tasks. -#### Recommended Airflow Deployment +#### Recommended Airflow deployment -Below is the correct way to set up an Airflow DAG for this purpose: +Below is the correct way to set up an Airflow DAG for this purpose: -- Define a DAG to run daily, starting from say February 1, 2023. It avoids catching up for missed runs and ensures only one instance runs at a time. +- Define a DAG to run daily, starting from, say, February 1, 2023. It avoids catching up for missed runs and ensures only one instance runs at a time. -- Data is imported from Google Spreadsheets and directed BigQuery. +- Data is imported from Google Spreadsheets and directed to BigQuery. - When adding the Google Spreadsheet task to the pipeline, avoid decomposing it; run it as a single task for efficiency. 
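
For orientation, here is a minimal sketch of a DAG that follows the bullets above. It assumes dlt's Airflow helper (`PipelineTasksGroup`) and the `google_spreadsheet` source from the verified-source folder created by `dlt init`; the pipeline name, dataset name, and spreadsheet identifier are placeholders, and the exact helper arguments may differ slightly between `dlt` versions.

```py
import dlt
from airflow.decorators import dag
from pendulum import datetime

from dlt.helpers.airflow_helper import PipelineTasksGroup
from google_sheets import google_spreadsheet  # verified-source folder created by `dlt init`


@dag(
    schedule_interval="@daily",
    start_date=datetime(2023, 2, 1),  # example start date
    catchup=False,                    # do not catch up for missed runs
    max_active_runs=1,                # only one instance runs at a time
)
def load_google_sheets() -> None:
    tasks = PipelineTasksGroup("gsheets_to_bigquery", use_data_folder=False, wipe_local_data=True)

    pipeline = dlt.pipeline(
        pipeline_name="gsheets_to_bigquery",  # placeholder names
        dataset_name="google_sheets_data",
        destination="bigquery",
    )
    data = google_spreadsheet("<spreadsheet URL or ID>")

    # Run the Google Spreadsheet source as a single Airflow task - no decomposition.
    tasks.add_run(pipeline, data, decompose="none", trigger_rule="all_done", retries=0, provide_context=True)


load_google_sheets()
```

In line with the caution above, `decompose="none"` keeps the whole source in a single task instead of creating one task (and one source instance) per range.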
@@ -605,3 +581,5 @@ def get_named_ranges(): ``` + + diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md b/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md index aac77b9b0a..bd0b480b0d 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md @@ -24,29 +24,29 @@ Sources and resources that can be loaded using this verified source are: | get_messages | resource-transformer | Retrieves emails from the mailbox using given UIDs | | get_attachments | resource-transformer | Downloads attachments from emails using given UIDs | -## Setup Guide +## Setup guide ### Grab credentials 1. For verified source configuration, you need: - "host": IMAP server hostname (e.g., Gmail: imap.gmail.com, Outlook: imap-mail.outlook.com). - - "email_account": Associated email account name (e.g. dlthub@dlthub.com). + - "email_account": Associated email account name (e.g., dlthub@dlthub.com). - "password": APP password (for third-party clients) from the email provider. 2. Host addresses and APP password procedures vary by provider and can be found via a quick Google search. For Google Mail's app password, read [here](https://support.google.com/mail/answer/185833?hl=en#:~:text=An%20app%20password%20is%20a,2%2DStep%20Verification%20turned%20on). 3. However, this guide covers Gmail inbox configuration; similar steps apply to other providers. -### Accessing Gmail Inbox +### Accessing Gmail inbox 1. SMTP server DNS: 'imap.gmail.com' for Gmail. 2. Port: 993 (for internet messaging access protocol over TLS/SSL). -### Grab App password for Gmail +### Grab app password for Gmail 1. An app password is a 16-digit code allowing less secure apps/devices to access your Google Account, available only with 2-Step Verification activated. -#### Steps to Create and Use App Passwords: +#### Steps to create and use app passwords: 1. Visit your Google Account > Security. 2. Under "How you sign in to Google", enable 2-Step Verification. @@ -84,9 +84,7 @@ For more information, read the ### Add credential -1. Inside the `.dlt` folder, you'll find a file called `secrets.toml`, which is where you can - securely store your access tokens and other sensitive information. It's important to handle this - file with care and keep it safe. Here's what the file looks like: +1. Inside the `.dlt` folder, you'll find a file called `secrets.toml`, which is where you can securely store your access tokens and other sensitive information. It's important to handle this file with care and keep it safe. Here's what the file looks like: ```toml # put your secret values and credentials here @@ -94,21 +92,17 @@ For more information, read the [sources.inbox] host = "Please set me up!" # The host address of the email service provider. email_account = "Please set me up!" # Email account associated with the service. - password = "Please set me up!" # # APP Password for the above email account. + password = "Please set me up!" # APP Password for the above email account. ``` -2. Replace the host, email, and password value with the [previously copied one](#grab-credentials) - to ensure secure access to your Inbox resources. +2. Replace the host, email, and password values with the [previously copied one](#grab-credentials) to ensure secure access to your Inbox resources. > When adding the App Password, remove any spaces. For instance, "abcd efgh ijkl mnop" should be "abcdefghijklmnop". -3. 
Next, follow the [destination documentation](../../dlt-ecosystem/destinations) instructions to - add credentials for your chosen destination, ensuring proper routing of your data to the final - destination. +3. Next, follow the [destination documentation](../../dlt-ecosystem/destinations) instructions to add credentials for your chosen destination, ensuring proper routing of your data to the final destination. ## Run the pipeline -1. Before running the pipeline, ensure that you have installed all the necessary dependencies by - running the command: +1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: ```sh pip install -r requirements.txt ``` @@ -123,20 +117,17 @@ For more information, read the For pdf parsing: - PyPDF2: `pip install PyPDF2` -2. Once the pipeline has finished running, you can verify that everything loaded correctly by using - the following command: +2. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: ```sh dlt pipeline show ``` - For example, the `pipeline_name` for the above pipeline example is `standard_inbox`, you may also - use any custom name instead. + For example, the `pipeline_name` for the above pipeline example is `standard_inbox`, you may also use any custom name instead. For more information, read the [Walkthrough: Run a pipeline.](../../walkthroughs/run-a-pipeline) ## Sources and resources -`dlt` works on the principle of [sources](../../general-usage/source) and -[resources](../../general-usage/resource). +`dlt` works on the principle of [sources](../../general-usage/source) and [resources](../../general-usage/resource). ### Source `inbox_source` @@ -158,11 +149,11 @@ def inbox_source( ... ``` -`host` : IMAP server hostname. Default: 'dlt.secrets.value'. +`host`: IMAP server hostname. Default: 'dlt.secrets.value'. `email_account`: Email login. Default: 'dlt.secrets.value'. -`password`: Email App password. Default: 'dlt.secrets.value'. +`password`: Email App password. Default: 'dlt.secrets.value'. `folder`: Mailbox folder for collecting emails. Default: 'INBOX'. @@ -170,7 +161,7 @@ def inbox_source( `start_date`: Start date to collect emails. Default: `/inbox/settings.py` 'DEFAULT_START_DATE'. -`filter_emails`:Email addresses for 'FROM' filtering. Default: `/inbox/settings.py` 'FILTER_EMAILS'. +`filter_emails`: Email addresses for 'FROM' filtering. Default: `/inbox/settings.py` 'FILTER_EMAILS'. `filter_by_mime_type`: MIME types for attachment filtering. Default: None. @@ -190,7 +181,7 @@ def get_messages_uids( ... ``` -`initial_message_num`: provides incremental loading on UID. +`initial_message_num`: Provides incremental loading on UID. ### Resource `get_messages` @@ -271,7 +262,7 @@ verified source. ``` 4. In `inbox_pipeline.py`, the `pdf_to_text` transformer extracts text from PDFs, treating each page as a separate data item. - Using the `pdf_to_text` function to load parsed pdfs from mail to the database: + Using the `pdf_to_text` function to load parsed PDFs from mail to the database: ```py filter_emails = ["mycreditcard@bank.com", "community@dlthub.com."] # Email senders diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/jira.md b/docs/website/docs/dlt-ecosystem/verified-sources/jira.md index b4e8bb76de..2c4a349270 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/jira.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/jira.md @@ -9,26 +9,22 @@ import Header from './_source-info-header.md';
-[Jira](https://www.atlassian.com/software/jira) by Atlassian helps teams manage projects and tasks -efficiently, prioritize work, and collaborate. +[Jira](https://www.atlassian.com/software/jira) by Atlassian helps teams manage projects and tasks efficiently, prioritize work, and collaborate. -This Jira `dlt` verified source and -[pipeline example](https://github.com/dlt-hub/verified-sources/blob/master/sources/jira_pipeline.py) -loads data using the Jira API to the destination of your choice. +This Jira `dlt` verified source and [pipeline example](https://github.com/dlt-hub/verified-sources/blob/master/sources/jira_pipeline.py) loads data using the Jira API to the destination of your choice. The endpoints that this verified source supports are: | Name | Description | | --------- | ---------------------------------------------------------------------------------------- | | issues | Individual pieces of work to be completed | -| users | Administrators of a given project | +| users | Administrators of a given project | | workflows | The key aspect of managing and tracking the progress of issues or tasks within a project | | projects | A collection of tasks that need to be completed to achieve a certain outcome | -To get a complete list of sub-endpoints that can be loaded, see -[jira/settings.py.](https://github.com/dlt-hub/verified-sources/blob/master/sources/jira/settings.py) +To get a complete list of sub-endpoints that can be loaded, see [jira/settings.py.](https://github.com/dlt-hub/verified-sources/blob/master/sources/jira/settings.py) -## Setup Guide +## Setup guide ### Grab credentials @@ -44,9 +40,7 @@ To get a complete list of sub-endpoints that can be loaded, see 1. Safely copy the newly generated access token. -> Note: The Jira UI, which is described here, might change. -The full guide is available at [this link.](https://support.atlassian.com/atlassian-account/docs/manage-api-tokens-for-your-atlassian-account/) - +> Note: The Jira UI, which is described here, might change. The full guide is available at [this link.](https://support.atlassian.com/atlassian-account/docs/manage-api-tokens-for-your-atlassian-account/) ### Initialize the verified source @@ -58,53 +52,41 @@ To get started with your data pipeline, follow these steps: dlt init jira duckdb ``` - [This command](../../reference/command-line-interface) will initialize - [the pipeline example](https://github.com/dlt-hub/verified-sources/blob/master/sources/jira_pipeline.py) - with Jira as the [source](../../general-usage/source) and [duckdb](../destinations/duckdb.md) as - the [destination](../destinations). + [This command](../../reference/command-line-interface) will initialize [the pipeline example](https://github.com/dlt-hub/verified-sources/blob/master/sources/jira_pipeline.py) with Jira as the [source](../../general-usage/source) and [duckdb](../destinations/duckdb.md) as the [destination](../destinations). -1. If you'd like to use a different destination, simply replace `duckdb` with the name of your - preferred [destination](../destinations). +1. If you'd like to use a different destination, simply replace `duckdb` with the name of your preferred [destination](../destinations). -1. After running this command, a new directory will be created with the necessary files and - configuration settings to get started. +1. After running this command, a new directory will be created with the necessary files and configuration settings to get started. 
For more information, read the guide on [how to add a verified source](../../walkthroughs/add-a-verified-source). ### Add credentials -1. Inside the `.dlt` folder, you'll find a file called `secrets.toml`, where you can securely store - your access tokens and other sensitive information. It's important to handle this file with care - and keep it safe. +1. Inside the `.dlt` folder, you'll find a file called `secrets.toml`, where you can securely store your access tokens and other sensitive information. It's important to handle this file with care and keep it safe. Here's what the file looks like: ```toml - # put your secret values and credentials here. Please do not share this file, and do not push it to GitHub + # Put your secret values and credentials here. Please do not share this file, and do not push it to GitHub. [sources.jira] - subdomain = "set me up!" # please set me up! - email = "set me up!" # please set me up! - api_token = "set me up!" # please set me up! + subdomain = "set me up!" # Please set me up! + email = "set me up!" # Please set me up! + api_token = "set me up!" # Please set me up! ``` -1. A subdomain in a URL identifies your Jira account. For example, in - "https://example.atlassian.net", "example" is the subdomain. +1. A subdomain in a URL identifies your Jira account. For example, in "https://example.atlassian.net", "example" is the subdomain. 1. Use the email address associated with your Jira account. -1. Replace the "access_token" value with the [previously copied one](jira.md#grab-credentials) to - ensure secure access to your Jira account. +1. Replace the "access_token" value with the [previously copied one](jira.md#grab-credentials) to ensure secure access to your Jira account. -1. Next, follow the [destination documentation](../../dlt-ecosystem/destinations) instructions to - add credentials for your chosen destination, ensuring proper routing of your data to the final - destination. +1. Next, follow the [destination documentation](../../dlt-ecosystem/destinations) instructions to add credentials for your chosen destination, ensuring proper routing of your data to the final destination. For more information, read [General Usage: Credentials.](../../general-usage/credentials) ## Run the pipeline -1. Before running the pipeline, ensure that you have installed all the necessary dependencies by - running the command: +1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: ```sh pip install -r requirements.txt ``` @@ -112,26 +94,21 @@ For more information, read [General Usage: Credentials.](../../general-usage/cre ```sh python jira_pipeline.py ``` -1. Once the pipeline has finished running, you can verify that everything loaded correctly by using - the following command: +1. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: ```sh dlt pipeline show ``` - For example, the `pipeline_name` for the above pipeline example is `jira_pipeline`. You may also - use any custom name instead. + For example, the `pipeline_name` for the above pipeline example is `jira_pipeline`. You may also use any custom name instead. For more information, read the guide on [how to run a pipeline](../../walkthroughs/run-a-pipeline). ## Sources and resources -`dlt` works on the principle of [sources](../../general-usage/source) and -[resources](../../general-usage/resource). 
+`dlt` works on the principle of [sources](../../general-usage/source) and [resources](../../general-usage/resource). ### Default endpoints -You can write your own pipelines to load data to a destination using this verified source. However, -it is important to note the complete list of the default endpoints given in -[jira/settings.py.](https://github.com/dlt-hub/verified-sources/blob/master/sources/jira/settings.py) +You can write your own pipelines to load data to a destination using this verified source. However, it is important to note the complete list of the default endpoints given in [jira/settings.py.](https://github.com/dlt-hub/verified-sources/blob/master/sources/jira/settings.py) ### Source `jira` @@ -153,8 +130,7 @@ def jira( ### Source `jira_search` -This function returns a resource for querying issues using JQL -[(Jira Query Language)](https://support.atlassian.com/jira-service-management-cloud/docs/use-advanced-search-with-jira-query-language-jql/). +This function returns a resource for querying issues using JQL [(Jira Query Language)](https://support.atlassian.com/jira-service-management-cloud/docs/use-advanced-search-with-jira-query-language-jql/). ```py @dlt.source @@ -166,8 +142,7 @@ def jira_search( ... ``` -The above function uses the same arguments `subdomain`, `email`, and `api_token` as described above -for the [jira source](jira.md#source-jira). +The above function uses the same arguments `subdomain`, `email`, and `api_token` as described above for the [jira source](jira.md#source-jira). ### Resource `issues` @@ -177,20 +152,19 @@ The resource function searches issues using JQL queries and then loads them to t @dlt.resource(write_disposition="replace") def issues(jql_queries: List[str]) -> Iterable[TDataItem]: api_path = "rest/api/3/search" - return {} # return the retrieved values here + return {} # Return the retrieved values here ``` `jql_queries`: Accepts a list of JQL queries. ## Customization + + ### Create your own pipeline -If you wish to create your own pipelines, you can leverage source and resource methods as discussed -above. +If you wish to create your own pipelines, you can leverage source and resource methods as discussed above. -1. Configure the pipeline by specifying the pipeline name, destination, and dataset. To read more - about pipeline configuration, please refer to our documentation - [here](https://dlthub.com/docs/general-usage/pipeline): +1. Configure the pipeline by specifying the pipeline name, destination, and dataset. To read more about pipeline configuration, please refer to our documentation [here](https://dlthub.com/docs/general-usage/pipeline): ```py pipeline = dlt.pipeline( @@ -203,23 +177,22 @@ above. 2. To load custom endpoints such as “issues” and “users” using the jira source function: ```py - #Run the pipeline - load_info = pipeline.run(jira().with_resources("issues","users")) + # Run the pipeline + load_info = pipeline.run(jira().with_resources("issues", "users")) print(f"Load Information: {load_info}") ``` -3. To load the custom issues using JQL queries, you can use custom queries. Here is an example - below: +3. To load the custom issues using JQL queries, you can use custom queries. 
Here is an example below: ```py # Define the JQL queries as follows queries = [ - "created >= -30d order by created DESC", - 'created >= -30d AND project = DEV AND issuetype = Epic AND status = "In Progress" order by created DESC', - ] + "created >= -30d order by created DESC", + 'created >= -30d AND project = DEV AND issuetype = Epic AND status = "In Progress" order by created DESC', + ] # Run the pipeline load_info = pipeline.run(jira_search().issues(jql_queries=queries)) - # Print Load information + # Print load information print(f"Load Information: {load_info}") ``` diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/kafka.md b/docs/website/docs/dlt-ecosystem/verified-sources/kafka.md index fe3c426819..368cc35aff 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/kafka.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/kafka.md @@ -20,7 +20,7 @@ The resource that can be loaded: | ----------------- |--------------------------------------------| | kafka_consumer | Extracts messages from Kafka topics | -## Setup Guide +## Setup guide ### Grab Kafka cluster credentials @@ -96,7 +96,7 @@ sasl_password="example_secret" For more information, read the [Walkthrough: Run a pipeline](../../walkthroughs/run-a-pipeline). -:::info If you created a topic and start reading from it immedately, the brokers may be not yet synchronized and offset from which `dlt` reads messages may become invalid. In this case the resource will return no messages. Pending messages will be received on next run (or when brokers synchronize) +:::info If you created a topic and start reading from it immediately, the brokers may not yet be synchronized and the offset from which `dlt` reads messages may become invalid. In this case, the resource will return no messages. Pending messages will be received on the next run (or when brokers synchronize). ## Sources and resources @@ -148,7 +148,6 @@ this offset. ### Create your own pipeline - 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: ```py diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/notion.md b/docs/website/docs/dlt-ecosystem/verified-sources/notion.md index 69e66ed2aa..3296da0611 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/notion.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/notion.md @@ -9,12 +9,9 @@ import Header from './_source-info-header.md';
-[Notion](https://www.notion.so/) is a flexible workspace tool for organizing personal and -professional tasks, offering customizable notes, documents, databases, and more. +[Notion](https://www.notion.so/) is a flexible workspace tool for organizing personal and professional tasks, offering customizable notes, documents, databases, and more. -This Notion `dlt` verified source and -[pipeline example](https://github.com/dlt-hub/verified-sources/blob/master/sources/notion_pipeline.py) -loads data using “Notion API” to the destination of your choice. +This Notion `dlt` verified source and [pipeline example](https://github.com/dlt-hub/verified-sources/blob/master/sources/notion_pipeline.py) loads data using the “Notion API” to the destination of your choice. Sources that can be loaded using this verified source are: @@ -22,30 +19,25 @@ Sources that can be loaded using this verified source are: |------------------|---------------------------------------| | notion_databases | Retrieves data from Notion databases. | -## Setup Guide +## Setup guide ### Grab credentials 1. If you don't already have a Notion account, please create one. -1. Access your Notion account and navigate to - [My Integrations](https://www.notion.so/my-integrations). +1. Access your Notion account and navigate to [My Integrations](https://www.notion.so/my-integrations). 1. Click "New Integration" on the left and name it appropriately. 1. Finally, click on "Submit" located at the bottom of the page. - ### Add a connection to the database 1. Open the database that you want to load to the destination. - 1. Click on the three dots located in the top right corner and choose "Add connections". ![Notion Database](./docs_images/Notion_Database_2.jpeg) 1. From the list of options, select the integration you previously created and click on "Confirm". -> Note: The Notion UI, which is described here, might change. -The full guide is available at [this link.](https://developers.notion.com/docs/authorization) - +> Note: The Notion UI, which is described here, might change. The full guide is available at [this link.](https://developers.notion.com/docs/authorization) ### Initialize the verified source @@ -57,24 +49,17 @@ To get started with your data pipeline, follow these steps: dlt init notion duckdb ``` - [This command](../../reference/command-line-interface) will initialize - [the pipeline example](https://github.com/dlt-hub/verified-sources/blob/master/sources/notion_pipeline.py) - with Notion as the [source](../../general-usage/source) and [duckdb](../destinations/duckdb.md) - as the [destination](../destinations). + [This command](../../reference/command-line-interface) will initialize [the pipeline example](https://github.com/dlt-hub/verified-sources/blob/master/sources/notion_pipeline.py) with Notion as the [source](../../general-usage/source) and [duckdb](../destinations/duckdb.md) as the [destination](../destinations). -1. If you'd like to use a different destination, simply replace `duckdb` with the name of your - preferred [destination](../destinations). +1. If you'd like to use a different destination, simply replace `duckdb` with the name of your preferred [destination](../destinations). -1. After running this command, a new directory will be created with the necessary files and - configuration settings to get started. +1. After running this command, a new directory will be created with the necessary files and configuration settings to get started. 
For more information, read the guide on [how to add a verified source.](../../walkthroughs/add-a-verified-source) ### Add credentials -1. In the `.dlt` folder, there's a file called `secrets.toml`. It's where you store sensitive - information securely, like access tokens. Keep this file safe. Here's its format for service - account authentication: +1. In the `.dlt` folder, there's a file called `secrets.toml`. It's where you store sensitive information securely, like access tokens. Keep this file safe. Here's its format for service account authentication: ```toml # Put your secret values and credentials here @@ -83,12 +68,9 @@ For more information, read the guide on [how to add a verified source.](../../wa api_key = "set me up!" # Notion API token (e.g. secret_XXX...) ``` -1. Replace the value of `api_key` with the one that [you copied above](notion.md#grab-credentials). - This will ensure that your data-verified source can access your Notion resources securely. +1. Replace the value of `api_key` with the one that [you copied above](notion.md#grab-credentials). This will ensure that your data-verified source can access your Notion resources securely. -1. Next, follow the instructions in [Destinations](../destinations/duckdb) to add credentials for - your chosen destination. This will ensure that your data is properly routed to its final - destination. +1. Next, follow the instructions in [Destinations](../destinations/duckdb) to add credentials for your chosen destination. This will ensure that your data is properly routed to its final destination. For more information, read the [General Usage: Credentials.](../../general-usage/credentials) @@ -108,7 +90,7 @@ For more information, read the [General Usage: Credentials.](../../general-usage ```sh dlt pipeline show ``` - For example, the `pipeline_name` for the above pipeline example is `notion`, you may also use any + For example, the `pipeline_name` for the above pipeline example is `notion`. You may also use any custom name instead. For more information, read the guide on [how to run a pipeline](../../walkthroughs/run-a-pipeline). @@ -120,7 +102,7 @@ For more information, read the guide on [how to run a pipeline](../../walkthroug ### Source `notion_databases` -This function loads notion databases from notion into the destination. +This function loads Notion databases from Notion into the destination. ```py @dlt.source @@ -131,7 +113,7 @@ def notion_databases( ... ``` -`database_ids`: A list of dictionaries each containing a database id and a name. +`database_ids`: A list of dictionaries each containing a database ID and a name. `api_key`: The Notion API secret key. @@ -141,8 +123,8 @@ def notion_databases( It is important to note that the data is loaded in “replace” mode where the existing data is completely replaced. - ## Customization + ### Create your own pipeline If you wish to create your own pipelines, you can leverage source and resource methods from this @@ -178,7 +160,7 @@ verified source. print(load_info) ``` - The Database ID can be retrieved from the URL. For example if the URL is: + The database ID can be retrieved from the URL. For example, if the URL is: ```sh https://www.notion.so/d8ee2d159ac34cfc85827ba5a0a8ae71?v=c714dec3742440cc91a8c38914f83b6b @@ -193,3 +175,5 @@ The database name ("use_name") is optional; if skipped, the pipeline will fetch automatically. 
+ + diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/openapi-generator.md b/docs/website/docs/dlt-ecosystem/verified-sources/openapi-generator.md index a987a55b15..f18f995d46 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/openapi-generator.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/openapi-generator.md @@ -32,7 +32,6 @@ You will need Python 3.9 or higher installed, as well as pip. You can run `pip i We will create a simple example pipeline from a [PokeAPI spec](https://pokeapi.co/) in our repo. You can point to any other OpenAPI Spec instead if you prefer. - 1. Run the generator with a URL: ```sh dlt-init-openapi pokemon --url https://raw.githubusercontent.com/dlt-hub/dlt-init-openapi/devel/tests/cases/e2e_specs/pokeapi.yml --global-limit 2 @@ -66,7 +65,7 @@ We will create a simple example pipeline from a [PokeAPI spec](https://pokeapi.c dlt pipeline pokemon_pipeline show ``` -9. You can go to our docs at https://dlthub.com/docs to learn how to modify the generated pipeline to load to many destinations, place schema contracts on your pipeline, and many other things. +9. You can go to our docs at https://dlthub.com/docs to learn how to modify the generated pipeline to load to many destinations, place schema contracts on your pipeline, and many other things. :::note We used the `--global-limit 2` CLI flag to limit the requests to the PokeAPI @@ -74,6 +73,7 @@ for this example. This way, the Pokemon collection endpoint only gets queried twice, resulting in 2 x 20 Pokemon details being rendered. ::: + ## What will be created? When you run the `dlt-init-openapi` command above, the following files will be generated: @@ -94,12 +94,12 @@ pokemon_pipeline/ ``` :::warning -If you re-generate your pipeline, you will be prompted to continue if this folder exists. If you select yes, all generated files will be overwritten. All other files you may have created will remain in this folder. In non-interactive mode you will not be asked, and the generated files will be overwritten. +If you re-generate your pipeline, you will be prompted to continue if this folder exists. If you select yes, all generated files will be overwritten. All other files you may have created will remain in this folder. In non-interactive mode, you will not be asked, and the generated files will be overwritten. ::: ## A closer look at your `rest_api` dictionary in `pokemon/__init__.py` -This file contains the [configuration dictionary](./rest_api#source-configuration) for the rest_api source which is the main result of running this generator. For our Pokemon example, we have used an OpenAPI 3 spec that works out of the box. The result of this dictionary depends on the quality of the spec you are using, whether the API you are querying actually adheres to this spec, and whether our heuristics manage to find the right values. +This file contains the [configuration dictionary](./rest_api#source-configuration) for the rest_api source, which is the main result of running this generator. For our Pokemon example, we have used an OpenAPI 3 spec that works out of the box. The result of this dictionary depends on the quality of the spec you are using, whether the API you are querying actually adheres to this spec, and whether our heuristics manage to find the right values. 
The generated dictionary will look something like this: @@ -185,7 +185,7 @@ _The only required options are either to supply a path or a URL to a spec_ ## Config options You can pass a path to a config file with the `--config PATH` argument. To see available config values, go to https://github.com/dlt-hub/dlt-init-openapi/blob/devel/dlt_init_openapi/config.py and read the information below each field on the `Config` class. -The config file can be supplied as JSON or YAML dictionary. For example, to change the package name, you can create a YAML file: +The config file can be supplied as a JSON or YAML dictionary. For example, to change the package name, you can create a YAML file: ```yaml # config.yml @@ -207,4 +207,5 @@ This project started as a fork of [openapi-python-client](https://github.com/ope ## Implementation notes * OAuth Authentication currently is not natively supported. You can supply your own. * Per endpoint authentication currently is not supported by the generator. Only the first globally set securityScheme will be applied. You can add your own per endpoint if you need to. -* Basic OpenAPI 2.0 support is implemented. We recommend updating your specs at https://editor.swagger.io before using `dlt-init-openapi`. \ No newline at end of file +* Basic OpenAPI 2.0 support is implemented. We recommend updating your specs at https://editor.swagger.io before using `dlt-init-openapi`. + diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/personio.md b/docs/website/docs/dlt-ecosystem/verified-sources/personio.md index 9829c94786..daa61d56e3 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/personio.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/personio.md @@ -9,11 +9,9 @@ import Header from './_source-info-header.md';
-Personio is a human resources management software that helps businesses streamline HR processes, -including recruitment, employee data management, and payroll, in one platform. +Personio is human resources management software that helps businesses streamline HR processes, including recruitment, employee data management, and payroll, in one platform. -Our [Personio verified](https://github.com/dlt-hub/verified-sources/blob/master/sources/personio) source loads data using Perosnio API to your preferred -[destination](../destinations). +Our [Personio verified](https://github.com/dlt-hub/verified-sources/blob/master/sources/personio) source loads data using the Personio API to your preferred [destination](../destinations). :::tip You can check out our pipeline example [here](https://github.com/dlt-hub/verified-sources/blob/master/sources/personio_pipeline.py). @@ -23,17 +21,17 @@ Resources that can be loaded using this verified source are: | Name | Description | Endpoint | |----------------------------|-----------------------------------------------------------------------------------|---------------------------------------------------| -| employees | Retrieves company employees details | /company/employees | +| employees | Retrieves company employees' details | /company/employees | | absences | Retrieves absence periods for absences tracked in days | /company/time-offs | -| absences_types | Retrieves list of various types of employee absences | /company/time-off-types | +| absences_types | Retrieves a list of various types of employee absences | /company/time-off-types | | attendances | Retrieves attendance records for each employee | /company/attendances | | projects | Retrieves a list of all company projects | /company/attendances/projects | | document_categories | Retrieves all document categories of the company | /company/document-categories | -| employees_absences_balance | The transformer, retrieves the absence balance for a specific employee | /company/employees/{employee_id}/absences/balance | +| employees_absences_balance | The transformer retrieves the absence balance for a specific employee | /company/employees/{employee_id}/absences/balance | | custom_reports_list | Retrieves metadata about existing custom reports (name, report type, report date) | /company/custom-reports/reports | | custom_reports | The transformer for custom reports | /company/custom-reports/reports/{report_id} | -## Setup Guide +## Setup guide ### Grab credentials @@ -42,12 +40,13 @@ To load data from Personio, you need to obtain API credentials, `client_id` and 1. Sign in to your Personio account, and ensure that your user account has API access rights. 1. Navigate to Settings > Integrations > API credentials. 1. Click on "Generate new credentials." -1. Assign necessary permissions to credentials, i.e. read access. +1. Assign necessary permissions to credentials, i.e., read access. :::info The Personio UI, which is described here, might change. The full guide is available at this [link.](https://developer.personio.de/docs#21-employee-attendance-and-absence-endpoints) ::: + ### Initialize the verified source To get started with your data pipeline, follow these steps: @@ -81,8 +80,8 @@ For more information, read [Add a verified source.](../../walkthroughs/add-a-ver # Put your secret values and credentials here # Note: Do not share this file and do not push it to GitHub! [sources.personio] - client_id = "papi-*****" # please set me up! - client_secret = "papi-*****" # please set me up! 
+ client_id = "papi-*****" # Please set me up! + client_secret = "papi-*****" # Please set me up! ``` 1. Replace the value of `client_id` and `client_secret` with the one that @@ -175,13 +174,12 @@ def employees( `allow_external_schedulers`: A boolean that, if True, permits [external schedulers](../../general-usage/incremental-loading#using-airflow-schedule-for-backfill-and-incremental-loading) to manage incremental loading. - Like the `employees` resource discussed above, other resources `absences` and `attendances` load data incrementally from the Personio API to your preferred destination. ### Resource `absence_types` -Simple resource, which retrieves a list of various types of employee absences. +A simple resource that retrieves a list of various types of employee absences. ```py @dlt.resource(primary_key="id", write_disposition="replace") def absence_types(items_per_page: int = items_per_page) -> Iterable[TDataItem]: @@ -195,16 +193,16 @@ It is important to note that the data is loaded in `replace` mode where the exis completely replaced. In addition to the mentioned resource, -there are three more resources `projects`, `custom_reports_list` and `document_categories` -with similar behaviour. +there are three more resources `projects`, `custom_reports_list`, and `document_categories` +with similar behavior. ### Resource-transformer `employees_absences_balance` -Besides of these source and resource functions, there are two transformer functions +Besides these source and resource functions, there are two transformer functions for endpoints like `/company/employees/{employee_id}/absences/balance` and `/company/custom-reports/reports/{report_id}`. The transformer functions transform or process data from resources. -The transformer function `employees_absences_balance` process data from the `employees` resource. +The transformer function `employees_absences_balance` processes data from the `employees` resource. It fetches and returns a list of the absence balances for each employee. ```py @@ -219,7 +217,7 @@ def employees_absences_balance(employees_item: TDataItem) -> Iterable[TDataItem] ``` `employees_item`: The data item from the 'employees' resource. -It uses `@dlt.defer` decorator to enable parallel run in thread pool. +It uses the `@dlt.defer` decorator to enable parallel runs in the thread pool. ## Customization @@ -253,3 +251,4 @@ verified source. ``` + diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/pipedrive.md b/docs/website/docs/dlt-ecosystem/verified-sources/pipedrive.md index d571e5d386..60445b43a2 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/pipedrive.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/pipedrive.md @@ -30,7 +30,7 @@ Sources and resources that can be loaded using this verified source are: | stage | Specific step in a sales process where a deal resides based on its progress | | user | Individual with a unique login credential who can access and use the platform | -## Setup Guide +## Setup guide ### Grab API token @@ -77,7 +77,7 @@ For more information, read the guide on [how to add a verified source.](../../wa ```toml [sources.pipedrive.credentials] # Note: Do not share this file and do not push it to GitHub! - pipedrive_api_key = "PIPEDRIVE_API_TOKEN" # please set me up ! + pipedrive_api_key = "PIPEDRIVE_API_TOKEN" # please set me up! ``` 1. Replace `PIPEDRIVE_API_TOKEN` with the API token you [copied above](#grab-api-token). @@ -132,8 +132,8 @@ Pipedrive API. 
### Source `pipedrive_source` -This function returns a list of resources including activities, deals, custom_fields_mapping and -other resources data from Pipedrive API. +This function returns a list of resources including activities, deals, custom_fields_mapping, and +other resources data from the Pipedrive API. ```py @dlt.source(name="pipedrive") @@ -199,7 +199,7 @@ def pipedrive_source(args): `write_disposition`: Sets the transformer to merge new data with existing data in the destination. Similar to the transformer function "deals_participants" is another transformer function named -"deals_flow" that gets the flow of deals from the Pipedrive API, and then yields the result for +"deals_flow" that gets the flow of deals from the Pipedrive API and then yields the result for further processing or loading. ### Resource `create_state` @@ -225,7 +225,7 @@ entity exists. This updated state is then saved for future pipeline runs. Similar to the above functions, there are the following: `custom_fields_mapping`: Transformer function that parses and yields custom fields' mapping in order -to be stored in destination by dlt. +to be stored in the destination by dlt. `leads`: Resource function that incrementally loads Pipedrive leads by update_time. diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/scrapy.md b/docs/website/docs/dlt-ecosystem/verified-sources/scrapy.md index 2e6b588c18..2432bf38b2 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/scrapy.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/scrapy.md @@ -9,7 +9,7 @@ keywords: [scraping, scraping verified source, scrapy] This verified source utilizes Scrapy, an open-source and collaborative framework for web scraping. Scrapy enables efficient extraction of required data from websites. -## Setup Guide +## Setup guide ### Initialize the verified source @@ -37,15 +37,15 @@ For more information, read the guide on ### Add credentials -1. The `config.toml`, looks like: +1. The `config.toml` looks like: ```toml # put your configuration values here [sources.scraping] start_urls = ["URL to be scraped"] # please set me up! start_urls_file = "/path/to/urls.txt" # please set me up! ``` - > When both `start_urls` and `start_urls_file` are provided they will be merged and deduplicated - > to ensure a Scrapy gets a unique set of start URLs. + > When both `start_urls` and `start_urls_file` are provided, they will be merged and deduplicated + > to ensure Scrapy gets a unique set of start URLs. 1. Inside the `.dlt` folder, you'll find a file called `secrets.toml`, which is where you can securely store your access tokens and other sensitive information. It's important to handle this @@ -85,13 +85,13 @@ scrape data from "https://quotes.toscrape.com/page/1/". ## Customization + + ### Create your own pipeline If you wish to create your data pipeline, follow these steps: -1. The first step requires creating a spider class that scrapes data - from the website. For example, class `Myspider` below scrapes data from - URL: "https://quotes.toscrape.com/page/1/". +1. The first step requires creating a spider class that scrapes data from the website. For example, class `MySpider` below scrapes data from URL: "https://quotes.toscrape.com/page/1/". ```py class MySpider(Spider): @@ -112,7 +112,6 @@ If you wish to create your data pipeline, follow these steps: }, } yield result - ``` > Define your own class tailored to the website you intend to scrape. 
@@ -127,10 +126,9 @@ If you wish to create your data pipeline, follow these steps: ) ``` - To read more about pipeline configuration, please refer to our - [documentation](../../general-usage/pipeline). + To read more about pipeline configuration, please refer to our [documentation](../../general-usage/pipeline). -1. To run the pipeline with customized scrapy settings: +1. To run the pipeline with customized Scrapy settings: ```py run_pipeline( @@ -151,13 +149,9 @@ If you wish to create your data pipeline, follow these steps: ) ``` - In the above example, scrapy settings are passed as a parameter. For more information about - scrapy settings, please refer to the - [Scrapy documentation.](https://docs.scrapy.org/en/latest/topics/settings.html). + In the above example, Scrapy settings are passed as a parameter. For more information about Scrapy settings, please refer to the [Scrapy documentation.](https://docs.scrapy.org/en/latest/topics/settings.html). -1. To limit the number of items processed, use the "on_before_start" function to set a limit on - the resources the pipeline processes. For instance, setting the resource limit to two allows - the pipeline to yield a maximum of two resources. +1. To limit the number of items processed, use the "on_before_start" function to set a limit on the resources the pipeline processes. For instance, setting the resource limit to two allows the pipeline to yield a maximum of two resources. ```py def on_before_start(res: DltResource) -> None: @@ -179,11 +173,11 @@ If you wish to create your data pipeline, follow these steps: ) ``` -1. To create a pipeline using Scrapy host, use `create_pipeline_runner` defined in - `helpers.py`. As follows: +1. To create a pipeline using Scrapy host, use `create_pipeline_runner` defined in `helpers.py` as follows: ```py scraping_host = create_pipeline_runner(pipeline, MySpider, batch_size=10) scraping_host.pipeline_runner.scraping_resource.add_limit(2) scraping_host.run(dataset_name="quotes", write_disposition="append") ``` + diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/slack.md b/docs/website/docs/dlt-ecosystem/verified-sources/slack.md index 38eda15c94..72947d2c3a 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/slack.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/slack.md @@ -25,7 +25,7 @@ Sources and resources that can be loaded using this verified source are: | get_messages_resource | Retrieves all the messages for a given channel | | access_logs | Retrieves the access logs | -## Setup Guide +## Setup guide ### Grab user OAuth token @@ -33,9 +33,9 @@ To set up the pipeline, create a Slack app in your workspace to obtain a user to 1. Navigate to your Slack workspace and click on the name at the top-left. 1. Select Tools > Customize Workspace. -1. From the top-left Menu, choose Configure apps. +1. From the top-left menu, choose Configure apps. 1. Click Build (top-right) > Create a New App. -1. Opt for "From scratch", set the "App Name", and pick your target workspace. +1. Opt for "From scratch," set the "App Name," and pick your target workspace. 1. Confirm with Create App. 1. Navigate to OAuth and Permissions under the Features section. 1. Assign the following scopes: @@ -121,7 +121,7 @@ For more information, read the [General Usage: Credentials.](../../general-usage dlt pipeline show ``` - For example, the `pipeline_name` for the above pipeline example is `slack`, you + For example, the `pipeline_name` for the above pipeline example is `slack`. 
You may also use any custom name instead. For more information, read the guide on [how to run a pipeline](../../walkthroughs/run-a-pipeline). @@ -133,7 +133,7 @@ For more information, read the [General Usage: Credentials.](../../general-usage ### Source `slack` -It retrieves data from Slack's API and fetches the Slack data such as channels, messages for selected channels, users, logs. +It retrieves data from Slack's API and fetches the Slack data such as channels, messages for selected channels, users, and logs. ```py @dlt.source(name="slack", max_table_nesting=2) @@ -151,7 +151,7 @@ def slack_source( `access_token`: OAuth token for authentication. -`start_date`: Range start. (default: January 1, 2000). +`start_date`: Range start (default: January 1, 2000). `end_date`: Range end. @@ -217,7 +217,7 @@ This method retrieves access logs from the Slack API. primary_key="user_id", write_disposition="append", ) -# it is not an incremental resource it just has a end_date filter +# It is not an incremental resource; it just has an end_date filter. def logs_resource() -> Iterable[TDataItem]: ... ``` diff --git a/docs/website/docs/general-usage/credentials/advanced.md b/docs/website/docs/general-usage/credentials/advanced.md index 793f5c2a55..30a5d06551 100644 --- a/docs/website/docs/general-usage/credentials/advanced.md +++ b/docs/website/docs/general-usage/credentials/advanced.md @@ -8,7 +8,7 @@ keywords: [credentials, secrets.toml, secrets, config, configuration, environmen ## Injection mechanism -`dlt` has a special treatment for functions decorated with `@dlt.source`, `@dlt.resource`, and `@dlt.destination`. When such a function is called, `dlt` takes the argument names in the signature and supplies (`injects`) the required values by looking for them in [various config providers](setup). +`dlt` has special treatment for functions decorated with `@dlt.source`, `@dlt.resource`, and `@dlt.destination`. When such a function is called, `dlt` takes the argument names in the signature and supplies (`injects`) the required values by looking for them in [various config providers](setup). ### Injection rules @@ -58,12 +58,12 @@ keywords: [credentials, secrets.toml, secrets, config, configuration, environmen We highly recommend adding types to your function signatures. The effort is very low, and it gives `dlt` much more -information on what source/resource expects. +information on what the source/resource expects. Doing so provides several benefits: -1. You'll never receive the invalid data types in your code. -1. `dlt` will automatically parse and coerce types for you, so you don't need to parse it yourself. +1. You'll never receive invalid data types in your code. +1. `dlt` will automatically parse and coerce types for you, so you don't need to parse them yourself. 1. `dlt` can generate sample config and secret files for your source automatically. 1. You can request [built-in and custom credentials](complex_types) (i.e., connection strings, AWS / GCP / Azure credentials). 1. You can specify a set of possible types via `Union`, i.e., OAuth or API Key authorization. @@ -89,12 +89,12 @@ Now, * `service.json` as a string or dictionary (in code and via config providers). * connection string (used in SQL Alchemy) (in code and via config providers). - * if nothing is passed, the default credentials are used (i.e., those present on Cloud Function runner) + * if nothing is passed, the default credentials are used (i.e., those present on Cloud Function runner). 
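
As a quick illustration of these options, a caller might supply the credentials for the typed `google_sheets` source from this section either explicitly in code or not at all. This is only a sketch: the argument order follows the example signature above, and the service-account fields and spreadsheet ID are placeholders; adjust both to your actual function.

```py
# 1) Pass service-account info directly in code, as a dict (a serialized
#    service.json string works the same way); dlt parses it into the
#    credentials type declared in the signature.
service_account_info = {
    "project_id": "my-project",  # placeholder values
    "private_key": "-----BEGIN PRIVATE KEY-----\n...",
    "client_email": "loader@my-project.iam.gserviceaccount.com",
}
data = google_sheets("<spreadsheet id>", ["tab1", "tab2"], credentials=service_account_info)

# 2) Pass nothing: dlt resolves the credentials from the config providers
#    (secrets.toml, environment variables, ...) or falls back to the default
#    credentials, e.g., those available on a Cloud Function runner.
data = google_sheets("<spreadsheet id>", ["tab1", "tab2"])
```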
## Toml files structure `dlt` arranges the sections of [toml files](setup/#secretstoml-and-configtoml) into a **default layout** that is expected by the [injection mechanism](#injection-mechanism). -This layout makes it easy to configure simple cases but also provides a room for more explicit sections and complex cases, i.e., having several sources with different credentials +This layout makes it easy to configure simple cases but also provides room for more explicit sections and complex cases, i.e., having several sources with different credentials or even hosting several pipelines in the same project sharing the same config and credentials. ```text @@ -123,7 +123,6 @@ pipeline_name |-normalize ``` - ## Read configs and secrets manually `dlt` handles credentials and configuration automatically, but also offers flexibility for manual processing. @@ -158,7 +157,7 @@ dlt.config["sheet_id"] = "23029402349032049" dlt.secrets["destination.postgres.credentials"] = BaseHook.get_connection('postgres_dsn').extra ``` -Will mock the `toml` provider to desired values. +This will mock the `toml` provider to the desired values. ## Example @@ -200,4 +199,5 @@ In the example above: :::tip `dlt.resource` behaves in the same way, so if you have a [standalone resource](../resource.md#declare-a-standalone-resource) (one that is not an inner function of a **source**) -::: \ No newline at end of file +::: + diff --git a/docs/website/docs/general-usage/credentials/complex_types.md b/docs/website/docs/general-usage/credentials/complex_types.md index 24915c1b2e..9d3bd35a94 100644 --- a/docs/website/docs/general-usage/credentials/complex_types.md +++ b/docs/website/docs/general-usage/credentials/complex_types.md @@ -125,10 +125,10 @@ credentials.add_scopes(["scope3", "scope4"]) `OAuth2Credentials` is a base class to implement actual OAuth; for example, it is a base class for [GcpOAuthCredentials](#gcpoauthcredentials). -### GCP Credentials +### GCP credentials #### Examples -* [Google Analytics verified source](https://github.com/dlt-hub/verified-sources/blob/master/sources/google_analytics/__init__.py): the example of how to use GCP Credentials. +* [Google Analytics verified source](https://github.com/dlt-hub/verified-sources/blob/master/sources/google_analytics/__init__.py): an example of how to use GCP credentials. * [Google Analytics example](https://github.com/dlt-hub/verified-sources/blob/master/sources/google_analytics/setup_script_gcp_oauth.py): how you can get the refresh token using `dlt.secrets.value`. #### Types @@ -197,6 +197,8 @@ oauth_credentials = GcpOAuthCredentials() # Accepts a native value, which can be either an instance of GoogleOAuth2Credentials # or serialized OAuth client secrets JSON. +``` + # Parses the native value and updates the credentials. native_value_oauth = {"client_secret": ...} oauth_credentials.parse_native_representation(native_value_oauth) @@ -239,11 +241,9 @@ property_id = "213025502" In order for the `auth()` method to succeed: -- You must provide valid `client_id`, `client_secret`, `refresh_token`, and `project_id` to get a current **access token** and authenticate with OAuth. Keep in mind that the `refresh_token` must contain all the scopes that is required for your access. +- You must provide valid `client_id`, `client_secret`, `refresh_token`, and `project_id` to get a current **access token** and authenticate with OAuth. Keep in mind that the `refresh_token` must contain all the scopes that are required for your access. 
- If the `refresh_token` is not provided, and you run the pipeline from a console or a notebook, `dlt` will use InstalledAppFlow to run the desktop authentication flow. - - #### Defaults If configuration values are missing, `dlt` will use the default Google credentials (from `default()`) if available. Read more about [Google defaults.](https://googleapis.dev/python/google-auth/latest/user-guide.html#application-default-credentials) @@ -383,26 +383,25 @@ This applies not only to credentials but to [all specs](#writing-custom-specs). ::: :::tip -Check out the [complete example](https://github.com/dlt-hub/dlt/blob/devel/tests/common/configuration/test_spec_union.py), to learn how to create unions +Check out the [complete example](https://github.com/dlt-hub/dlt/blob/devel/tests/common/configuration/test_spec_union.py) to learn how to create unions of credentials that derive from the common class, so you can handle it seamlessly in your code. ::: + ## Writing custom specs -**Custom specifications** let you take full control over the function arguments. You can +**Custom specifications** let you take full control over the function arguments. You can: -- Control which values should be injected, the types, default values. +- Control which values should be injected, the types, and default values. - Specify optional and final fields. - Form hierarchical configurations (specs in specs). -- Provide own handlers for `on_partial` (called before failing on missing config key) or `on_resolved`. -- Provide own native value parsers. -- Provide own default credentials logic. -- Utilise Python dataclass functionality. -- Utilise Python `dict` functionality (`specs` instances can be created from dicts and serialized - from dicts). +- Provide your own handlers for `on_partial` (called before failing on a missing config key) or `on_resolved`. +- Provide your own native value parsers. +- Provide your own default credentials logic. +- Utilize Python dataclass functionality. +- Utilize Python `dict` functionality (`specs` instances can be created from dicts and serialized from dicts). -In fact, `dlt` synthesizes a unique spec for each decorated function. For example, in the case of `google_sheets`, the following -class is created: +In fact, `dlt` synthesizes a unique spec for each decorated function. For example, in the case of `google_sheets`, the following class is created: ```py from dlt.sources.config import configspec, with_config @@ -415,26 +414,22 @@ class GoogleSheetsConfiguration(BaseConfiguration): ``` ### All specs derive from [BaseConfiguration](https://github.com/dlt-hub/dlt/blob/devel/dlt/common/configuration/specs/base_configuration.py#L170) + This class serves as a foundation for creating configuration objects with specific characteristics: -- It provides methods to parse and represent the configuration - in native form (`parse_native_representation` and `to_native_representation`). +- It provides methods to parse and represent the configuration in native form (`parse_native_representation` and `to_native_representation`). - It defines methods for accessing and manipulating configuration fields. -- It implements a dictionary-compatible interface on top of the dataclass. -This allows instances of this class to be treated like dictionaries. +- It implements a dictionary-compatible interface on top of the dataclass. This allows instances of this class to be treated like dictionaries. 
-- It defines helper functions for checking if a certain attribute is present, -if a field is valid, and for calling methods in the method resolution order (MRO). +- It defines helper functions for checking if a certain attribute is present, if a field is valid, and for calling methods in the method resolution order (MRO). More information about this class can be found in the class docstrings. ### All credentials derive from [CredentialsConfiguration](https://github.com/dlt-hub/dlt/blob/devel/dlt/common/configuration/specs/base_configuration.py#L307) -This class is a subclass of `BaseConfiguration` -and is meant to serve as a base class for handling various types of credentials. -It defines methods for initializing credentials, converting them to native representations, -and generating string representations while ensuring sensitive information is appropriately handled. +This class is a subclass of `BaseConfiguration` and is meant to serve as a base class for handling various types of credentials. It defines methods for initializing credentials, converting them to native representations, and generating string representations while ensuring sensitive information is appropriately handled. + +More information about this class can be found in the class docstrings. -More information about this class can be found in the class docstrings. \ No newline at end of file diff --git a/docs/website/docs/general-usage/credentials/index.md b/docs/website/docs/general-usage/credentials/index.md index c9cbe6707c..95e0ec36ac 100644 --- a/docs/website/docs/general-usage/credentials/index.md +++ b/docs/website/docs/general-usage/credentials/index.md @@ -9,10 +9,11 @@ import DocCardList from '@theme/DocCardList'; 1. Environment variables 2. Configuration files (`secrets.toml` and `config.toml`) -3. Key managers and Vaults +3. Key managers and vaults `dlt` automatically extracts configuration settings and secrets based on flexible [naming conventions](setup/#naming-convention). It then [injects](advanced/#injection-mechanism) these values where needed in code. -# Learn Details About +# Learn details about + + - \ No newline at end of file diff --git a/docs/website/docs/general-usage/credentials/setup.md b/docs/website/docs/general-usage/credentials/setup.md index 7933bab183..94256b3677 100644 --- a/docs/website/docs/general-usage/credentials/setup.md +++ b/docs/website/docs/general-usage/credentials/setup.md @@ -33,9 +33,11 @@ A custom config provider is helpful if you want to use your own configuration fi Please make sure your pipeline name contains no whitespace or any other punctuation characters except `"-"` and `"_"`. This way you will ensure your code is working with any configuration option. ::: + + ## Naming convention -`dlt` uses a specific naming hierarchy to search for the secrets and configs values. This makes configurations and secrets easy to manage. +`dlt` uses a specific naming hierarchy to search for the secrets and config values. This makes configurations and secrets easy to manage. To keep the naming convention flexible, `dlt` looks for a lot of possible combinations of key names, starting from the most specific possible path. Then, if the value is not found, it removes the right-most section and tries again. @@ -125,7 +127,7 @@ def deals(api_key: str = dlt.secrets.value): 1. `api_key` :::tip -You can use your pipeline name to have separate configurations for each pipeline in your project. All config values will be looked with the pipeline name first and then again without it. 
+You can use your pipeline name to have separate configurations for each pipeline in your project. All config values will be looked up with the pipeline name first and then again without it. ```toml [pipeline_name_1.sources.google_sheets.credentials] @@ -149,7 +151,7 @@ For example, to connect to a `sql_database` source, you can either set up a conn [sources.sql_database] credentials="snowflake://user:password@service-account/database?warehouse=warehouse_name&role=role" ``` -or set up all parameters of connection separately: +or set up all parameters of the connection separately: ```toml [sources.sql_database.credentials] @@ -239,7 +241,7 @@ The TOML provider also has the capability to read files from `~/.dlt/` (located `dlt` organizes sections in TOML files in a specific structure required by the [injection mechanism](advanced/#injection-mechanism). Understanding this structure gives you more flexibility in setting credentials. For more details, see [Toml files structure](advanced/#toml-files-structure). -## Custom Providers +## Custom providers You can use the `CustomLoaderDocProvider` classes to supply a custom dictionary to `dlt` for use as a supplier of `config` and `secret` values. The code below demonstrates how to use a config stored in `config.json`. @@ -255,14 +257,14 @@ def load_config(): config_dict = json.load(f) # create the custom provider -provider = CustomLoaderDocProvider("my_json_provider",load_config) +provider = CustomLoaderDocProvider("my_json_provider", load_config) -# register provider +# Register provider dlt.config.register_provider(provider) ``` :::tip -Check our an [example](../../examples/custom_config_provider) for a `yaml` based config provider that supports switchable profiles. +Check out an [example](../../examples/custom_config_provider) for a `yaml` based config provider that supports switchable profiles. ::: ## Examples @@ -406,8 +408,8 @@ export CREDENTIALS__PROJECT_ID="" ```py import os -# do not set up the secrets directly in the code! -# what you can do is reassign env variables +# Do not set up the secrets directly in the code! +# What you can do is reassign env variables os.environ["CREDENTIALS__CLIENT_EMAIL"] = os.environ.get("GOOGLE_CLIENT_EMAIL") os.environ["CREDENTIALS__PRIVATE_KEY"] = os.environ.get("GOOGLE_PRIVATE_KEY") os.environ["CREDENTIALS__PROJECT_ID"] = os.environ.get("GOOGLE_PROJECT_ID") @@ -431,17 +433,17 @@ os.environ["CREDENTIALS__PROJECT_ID"] = os.environ.get("GOOGLE_PROJECT_ID") ```toml -# google sheet credentials +# Google sheet credentials [sources.credentials] client_email = "" private_key = "" -project_id = "" +project_id = "" -# bigquery credentials +# Bigquery credentials [destination.credentials] client_email = "" private_key = "" -project_id = "" +project_id = "" ``` @@ -449,12 +451,12 @@ project_id = "" ```sh -# google sheet credentials +# Google sheet credentials export SOURCES__CREDENTIALS__CLIENT_EMAIL="" export SOURCES__CREDENTIALS__PRIVATE_KEY="" export SOURCES__CREDENTIALS__PROJECT_ID="" -# bigquery credentials +# Bigquery credentials export DESTINATION__CREDENTIALS__CLIENT_EMAIL="" export DESTINATION__CREDENTIALS__PRIVATE_KEY="" export DESTINATION__CREDENTIALS__PROJECT_ID="" @@ -468,13 +470,13 @@ export DESTINATION__CREDENTIALS__PROJECT_ID="" import dlt import os -# do not set up the secrets directly in the code! -# what you can do is reassign env variables +# Do not set up the secrets directly in the code! 
+# What you can do is reassign env variables os.environ["DESTINATION__CREDENTIALS__CLIENT_EMAIL"] = os.environ.get("BIGQUERY_CLIENT_EMAIL") os.environ["DESTINATION__CREDENTIALS__PRIVATE_KEY"] = os.environ.get("BIGQUERY_PRIVATE_KEY") os.environ["DESTINATION__CREDENTIALS__PROJECT_ID"] = os.environ.get("BIGQUERY_PROJECT_ID") -# or set them to the dlt.secrets +# Or set them to the dlt.secrets dlt.secrets["sources.credentials.client_email"] = os.environ.get("SHEETS_CLIENT_EMAIL") dlt.secrets["sources.credentials.private_key"] = os.environ.get("SHEETS_PRIVATE_KEY") dlt.secrets["sources.credentials.project_id"] = os.environ.get("SHEETS_PROJECT_ID") @@ -513,23 +515,23 @@ Let's assume we have several different Google sources and destinations. We can u ```toml -# google sheet credentials +# Google Sheets credentials [sources.google_sheets.credentials] client_email = "" private_key = "" -project_id = "" +project_id = "" -# google analytics credentials +# Google Analytics credentials [sources.google_analytics.credentials] client_email = "" private_key = "" -project_id = "" +project_id = "" -# bigquery credentials +# BigQuery credentials [destination.bigquery.credentials] client_email = "" private_key = "" -project_id = "" +project_id = "" ``` @@ -537,17 +539,17 @@ project_id = "" ```sh -# google sheet credentials +# Google Sheets credentials export SOURCES__GOOGLE_SHEETS__CREDENTIALS__CLIENT_EMAIL="" export SOURCES__GOOGLE_SHEETS__CREDENTIALS__PRIVATE_KEY="" export SOURCES__GOOGLE_SHEETS__CREDENTIALS__PROJECT_ID="" -# google analytics credentials +# Google Analytics credentials export SOURCES__GOOGLE_ANALYTICS__CREDENTIALS__CLIENT_EMAIL="" export SOURCES__GOOGLE_ANALYTICS__CREDENTIALS__PRIVATE_KEY="" export SOURCES__GOOGLE_ANALYTICS__CREDENTIALS__PROJECT_ID="" -# bigquery credentials +# BigQuery credentials export DESTINATION__BIGQUERY__CREDENTIALS__CLIENT_EMAIL="" export DESTINATION__BIGQUERY__CREDENTIALS__PRIVATE_KEY="" export DESTINATION__BIGQUERY__CREDENTIALS__PROJECT_ID="" @@ -561,8 +563,8 @@ export DESTINATION__BIGQUERY__CREDENTIALS__PROJECT_ID="" import os import dlt -# do not set up the secrets directly in the code! -# what you can do is reassign env variables +# Do not set up the secrets directly in the code! +# What you can do is reassign env variables os.environ["SOURCES__GOOGLE_ANALYTICS__CREDENTIALS__CLIENT_EMAIL"] = os.environ.get("SHEETS_CLIENT_EMAIL") os.environ["SOURCES__GOOGLE_ANALYTICS__CREDENTIALS__PRIVATE_KEY"] = os.environ.get("ANALYTICS_PRIVATE_KEY") os.environ["SOURCES__GOOGLE_ANALYTICS__CREDENTIALS__PROJECT_ID"] = os.environ.get("ANALYTICS_PROJECT_ID") @@ -571,7 +573,7 @@ os.environ["DESTINATION__CREDENTIALS__CLIENT_EMAIL"] = os.environ.get("BIGQUERY_ os.environ["DESTINATION__CREDENTIALS__PRIVATE_KEY"] = os.environ.get("BIGQUERY_PRIVATE_KEY") os.environ["DESTINATION__CREDENTIALS__PROJECT_ID"] = os.environ.get("BIGQUERY_PROJECT_ID") -# or set them to the dlt.secrets +# Or set them to the dlt.secrets dlt.secrets["sources.credentials.client_email"] = os.environ.get("SHEETS_CLIENT_EMAIL") dlt.secrets["sources.credentials.private_key"] = os.environ.get("SHEETS_PRIVATE_KEY") dlt.secrets["sources.credentials.project_id"] = os.environ.get("SHEETS_PROJECT_ID") @@ -583,7 +585,7 @@ dlt.secrets["sources.credentials.project_id"] = os.environ.get("SHEETS_PROJECT_I ### Credentials for several sources of the same type -Let's assume we have several sources of the same type, how can we separate them in the `secrets.toml`? 
The recommended solution is to use different pipeline names for each source: +Let's assume we have several sources of the same type. How can we separate them in the `secrets.toml`? The recommended solution is to use different pipeline names for each source: `data_enrichment_part_one` holds the enriched data from part one. It can also be directly used - > in part two as demonstrated in - > **[Colab Notebook](https://colab.research.google.com/drive/1ZKEkf1LRSld7CWQFS36fUXjhJKPAon7P?usp=sharing).** + > `data_enrichment_part_one` holds the enriched data from part one. It can also be directly used in part two as demonstrated in **[Colab Notebook](https://colab.research.google.com/drive/1ZKEkf1LRSld7CWQFS36fUXjhJKPAon7P?usp=sharing).** ### 2. Create `converted_amount` function -This function retrieves conversion rates for currency pairs that either haven't been fetched before -or were last updated more than 24 hours ago from the ExchangeRate-API, using information stored in -the `dlt` [state](../../general-usage/state.md). +This function retrieves conversion rates for currency pairs that either haven't been fetched before or were last updated more than 24 hours ago from the ExchangeRate-API, using information stored in the `dlt` [state](../../general-usage/state.md). -The first step is to register on [ExhangeRate-API](https://app.exchangerate-api.com/) and obtain the -API token. +The first step is to register on [ExchangeRate-API](https://app.exchangerate-api.com/) and obtain the API token. -1. In the `.dlt`folder, there's a file called `secrets.toml`. It's where you store sensitive - information securely, like access tokens. Keep this file safe. Here's its format for service - account authentication: +1. In the `.dlt` folder, there's a file called `secrets.toml`. It's where you store sensitive information securely, like access tokens. Keep this file safe. Here's its format for service account authentication: ```py [sources] - api_key= "Please set me up!" #ExchangeRate-API key + api_key= "Please set me up!" # ExchangeRate-API key ``` 1. Create the `converted_amount` function as follows: @@ -184,10 +166,7 @@ API token. "rate_last_updated": currency_pair_state["last_update"], } ``` -1. Next, follow the instructions in - [Destinations](../../dlt-ecosystem/destinations/duckdb.md) to add credentials for - your chosen destination. This will ensure that your data is properly routed to its final - destination. +1. Next, follow the instructions in [Destinations](../../dlt-ecosystem/destinations/duckdb.md) to add credentials for your chosen destination. This will ensure that your data is properly routed to its final destination. ### 3. Create your pipeline @@ -200,7 +179,7 @@ API token. processing. `Transformers` are a form of `dlt resource` that takes input from other resources - via `data_from` argument to enrich or transform the data. + via the `data_from` argument to enrich or transform the data. [Click here.](../../general-usage/resource.md#process-resources-with-dlttransformer) Conversely, `add_map` used to customize a resource applies transformations at an item level @@ -244,7 +223,7 @@ API token. ### Run the pipeline 1. Install necessary dependencies for the preferred - [destination](../../dlt-ecosystem/destinations/), For example, duckdb: + [destination](../../dlt-ecosystem/destinations/), for example, duckdb: ```sh pip install "dlt[duckdb]" @@ -264,3 +243,4 @@ API token. For example, the "pipeline_name" for the above pipeline example is `data_enrichment_two`; you can use any custom name instead. 
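To make the transformer-versus-`add_map` distinction discussed above concrete, here is a minimal, self-contained sketch. The sample resource, the conversion function, and the hard-coded rate are placeholders, not the enrichment pipeline built in this guide:

```py
import dlt

@dlt.resource
def tracked_data():
    yield {"id": 1, "amount": 100, "currency": "EUR"}
    yield {"id": 2, "amount": 250, "currency": "JPY"}

def converted_amount(record):
    # Placeholder conversion; the guide fetches real rates from ExchangeRate-API.
    record["amount_usd"] = record["amount"] * 1.1
    return record

# Option A: add_map applies the function to every item of the resource.
data_with_usd = tracked_data().add_map(converted_amount)

# Option B: a transformer receives items from another resource via data_from.
@dlt.transformer(data_from=tracked_data)
def enriched(record):
    yield converted_amount(dict(record))

pipeline = dlt.pipeline(pipeline_name="data_enrichment_two", destination="duckdb")
pipeline.run(data_with_usd, table_name="currency_details")
```

Both options produce the same enriched rows; `add_map` keeps the output inside the original resource, while the transformer yields it as a separate resource.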
+ diff --git a/docs/website/docs/general-usage/data-enrichments/url-parser-data-enrichment.md b/docs/website/docs/general-usage/data-enrichments/url-parser-data-enrichment.md index f2cd4a1065..8b58f93c8b 100644 --- a/docs/website/docs/general-usage/data-enrichments/url-parser-data-enrichment.md +++ b/docs/website/docs/general-usage/data-enrichments/url-parser-data-enrichment.md @@ -1,33 +1,30 @@ --- -title: URL-parser data enrichment -description: Enriching the url with various parameters. +title: URL-parser data enrichment +description: Enriching the URL with various parameters. keywords: [data enrichment, url parser, referer data enrichment] --- # Data enrichment part three: URL parser data enrichment -URL parser data enrichment is extracting various URL components to gain additional insights and -context about the URL. This extracted information can be used for data analysis, marketing, SEO, and -more. +URL parser data enrichment is extracting various URL components to gain additional insights and context about the URL. This extracted information can be used for data analysis, marketing, SEO, and more. ## URL parsing process -Here is step-by-step process for URL parser data enrichment : +Here is a step-by-step process for URL parser data enrichment: -1. Get the URL data that is needed to be parsed from a source or create one. +1. Get the URL data that needs to be parsed from a source or create one. 1. Send the URL data to an API like [URL Parser API](https://urlparse.com/). 1. Get the parsed URL data. 1. Include metadata like conversion rate, date, and time. 1. Save the updated dataset in a data warehouse or lake using a data pipeline. -We use **[URL Parse API](https://urlparse.com/)** to extract the information about the URL. However, -you can use any API you prefer. +We use **[URL Parse API](https://urlparse.com/)** to extract the information about the URL. However, you can use any API you prefer. :::tip -`URL Parse API` is free, with 1000 requests/hour limit, which can be increased on request. +`URL Parse API` is free, with a 1000 requests/hour limit, which can be increased on request. ::: -By default the URL Parse API will return a JSON response like: +By default, the URL Parse API will return a JSON response like: ```json { @@ -53,8 +50,7 @@ By default the URL Parse API will return a JSON response like: ## Creating data enrichment pipeline -You can either follow the example in the linked Colab notebook or follow this documentation to -create the URL-parser data enrichment pipeline. +You can either follow the example in the linked Colab notebook or follow this documentation to create the URL-parser data enrichment pipeline. ### A. Colab notebook @@ -64,14 +60,11 @@ This Colab notebook outlines a three-part data enrichment process for a sample d - Currency conversion data enrichment - URL-parser data enrichment -This document focuses on the URL-Parser Data Enrichment (Part Three). For a comprehensive -understanding, you may explore all three enrichments sequentially in the notebook: -[Colab Notebook](https://colab.research.google.com/drive/1ZKEkf1LRSld7CWQFS36fUXjhJKPAon7P?usp=sharing). +This document focuses on the URL-Parser Data Enrichment (Part Three). For a comprehensive understanding, you may explore all three enrichments sequentially in the notebook: [Colab Notebook](https://colab.research.google.com/drive/1ZKEkf1LRSld7CWQFS36fUXjhJKPAon7P?usp=sharing). ### B. 
Create a pipeline -Alternatively, to create a data enrichment pipeline, you can start by creating the following -directory structure: +Alternatively, to create a data enrichment pipeline, you can start by creating the following directory structure: ```text url_parser_enrichment/ @@ -91,7 +84,7 @@ different tracking services. Let's examine a synthetic dataset created for this article. It includes: -- `user_id`: Web trackers typically assign unique ID to users for tracking their journeys and +- `user_id`: Web trackers typically assign a unique ID to users for tracking their journeys and interactions over time. - `device_name`: User device information helps in understanding the user base's device. @@ -139,8 +132,8 @@ Here's the resource that yields the sample data as discussed above: ### 2. Create `url_parser` function -We use a free service called [URL Parse API](https://urlparse.com/), to parse the urls. You don’t -need to register to use this service neither get an API key. +We use a free service called [URL Parse API](https://urlparse.com/) to parse the URLs. You don’t +need to register to use this service or get an API key. 1. Create a `url_parser` function as follows: ```py @@ -185,7 +178,7 @@ need to register to use this service neither get an API key. processing. `Transformers` are a form of `dlt resource` that takes input from other resources - via `data_from` argument to enrich or transform the data. + via the `data_from` argument to enrich or transform the data. [Click here.](../../general-usage/resource.md#process-resources-with-dlttransformer) Conversely, `add_map` used to customize a resource applies transformations at an item level @@ -222,13 +215,13 @@ need to register to use this service neither get an API key. ) ``` - This will execute the `url_parser` function with the tracked data and return parsed URL. + This will execute the `url_parser` function with the tracked data and return the parsed URL. ::: ### Run the pipeline 1. Install necessary dependencies for the preferred - [destination](https://dlthub.com/docs/dlt-ecosystem/destinations/), For example, duckdb: + [destination](https://dlthub.com/docs/dlt-ecosystem/destinations/), for example, duckdb: ```sh pip install "dlt[duckdb]" @@ -248,3 +241,4 @@ need to register to use this service neither get an API key. For example, the "pipeline_name" for the above pipeline example is `data_enrichment_three`; you can use any custom name instead. + diff --git a/docs/website/docs/general-usage/http/overview.md b/docs/website/docs/general-usage/http/overview.md index 7358e577f4..939518fcb2 100644 --- a/docs/website/docs/general-usage/http/overview.md +++ b/docs/website/docs/general-usage/http/overview.md @@ -14,7 +14,7 @@ Additionally, dlt provides tools to simplify working with APIs: ## Quick example -Here's a simple pipeline that reads issues from the [dlt GitHub repository](https://github.com/dlt-hub/dlt/issues). The API endpoint is https://api.github.com/repos/dlt-hub/dlt/issues. The result is "paginated", meaning that the API returns a limited number of issues per page. The `paginate()` method iterates over all pages and yields the results which are then processed by the pipeline. +Here's a simple pipeline that reads issues from the [dlt GitHub repository](https://github.com/dlt-hub/dlt/issues). The API endpoint is https://api.github.com/repos/dlt-hub/dlt/issues. The result is "paginated," meaning that the API returns a limited number of issues per page. 
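If you want to peek at the loaded rows from Python instead of the Streamlit app, a small sketch like the one below works. The dataset and table names are illustrative; use whatever names your `pipeline.run()` call produced:

```py
import dlt

pipeline = dlt.pipeline(
    pipeline_name="data_enrichment_three",
    destination="duckdb",
    dataset_name="userdata",
)

# Works after pipeline.run(...) has completed at least once.
with pipeline.sql_client() as client:
    with client.execute_query("SELECT * FROM url_info LIMIT 5") as cursor:
        print(cursor.fetchall())
```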
The `paginate()` method iterates over all pages and yields the results, which are then processed by the pipeline. ```py import dlt @@ -46,7 +46,7 @@ print(load_info) Here's what the code does: 1. We create a `RESTClient` instance with the base URL of the API: in this case, the GitHub API (https://api.github.com). -2. Issues endpoint returns a list of issues. Since there could be hundreds of issues, the API "paginates" the results: it returns a limited number of issues in each response along with a link to the next batch of issues (or "page"). The `paginate()` method iterates over all pages and yields the batches of issues. +2. The issues endpoint returns a list of issues. Since there could be hundreds of issues, the API "paginates" the results: it returns a limited number of issues in each response along with a link to the next batch of issues (or "page"). The `paginate()` method iterates over all pages and yields the batches of issues. 3. Here we specify the address of the endpoint we want to read from: `/repos/dlt-hub/dlt/issues`. 4. We pass the parameters to the actual API call to control the data we get back. In this case, we ask for 100 issues per page (`"per_page": 100`), sorted by the last update date (`"sort": "updated"`) in descending order (`"direction": "desc"`). 5. We yield the page from the resource function to the pipeline. The `page` is an instance of the [`PageData`](#pagedata) and contains the data from the current page of the API response and some metadata. @@ -87,5 +87,6 @@ print(load_info) In the example above: 1. We create a `RESTClient` instance with the base URL of the API: in this case, the [PokéAPI](https://pokeapi.co/). We also specify the paginator to use explicitly: `JSONLinkPaginator` with the `next_url_path` set to `"next"`. This tells the paginator to look for the next page URL in the `next` key of the JSON response. -2. In `data_selector` we specify the JSON path to extract the data from the response. This is used to extract the data from the response JSON. -3. By default the number of items per page is limited to 20. We override this by specifying the `limit` parameter in the API call. +2. In `data_selector`, we specify the JSON path to extract the data from the response. This is used to extract the data from the response JSON. +3. By default, the number of items per page is limited to 20. We override this by specifying the `limit` parameter in the API call. + diff --git a/docs/website/docs/general-usage/http/requests.md b/docs/website/docs/general-usage/http/requests.md index a6da3079af..da96420c9f 100644 --- a/docs/website/docs/general-usage/http/requests.md +++ b/docs/website/docs/general-usage/http/requests.md @@ -10,7 +10,7 @@ We recommend using this to make API calls in your sources as it makes your pipel The dlt requests client will additionally set the default user-agent header to `dlt/{DLT_VERSION_NAME}`. -For most use cases this is a drop in replacement for `requests`, so in places where you would normally do: +For most use cases, this is a drop-in replacement for `requests`, so in places where you would normally do: ```py import requests @@ -35,21 +35,21 @@ data = response.json() ## Retry rules -By default failing requests are retried up to 5 times with an exponentially increasing delay. That means the first retry will wait 1 second and the fifth retry will wait 16 seconds. +By default, failing requests are retried up to 5 times with an exponentially increasing delay. 
That means the first retry will wait 1 second and the fifth retry will wait 16 seconds. -If all retry attempts fail the corresponding requests exception is raised. E.g. `requests.HTTPError` or `requests.ConnectionTimeout` +If all retry attempts fail, the corresponding requests exception is raised. E.g. `requests.HTTPError` or `requests.ConnectionTimeout`. All standard HTTP server errors trigger a retry. This includes: * Error status codes: All status codes in the `500` range and `429` (too many requests). - Commonly servers include a `Retry-After` header with `429` and `503` responses. - When detected this value supersedes the standard retry delay. + Commonly, servers include a `Retry-After` header with `429` and `503` responses. + When detected, this value supersedes the standard retry delay. -* Connection and timeout errors +* Connection and timeout errors: - When the remote server is unreachable, the connection is unexpectedly dropped or when the request takes longer than the configured `timeout`. + When the remote server is unreachable, the connection is unexpectedly dropped, or when the request takes longer than the configured `timeout`. ## Customizing retry settings @@ -63,7 +63,7 @@ request_timeout = 120 # Timeout in seconds request_max_retry_delay = 30 # Cap exponential delay to 30 seconds ``` -For more control you can create your own instance of `dlt.sources.requests.Client` and use that instead of the global client. +For more control, you can create your own instance of `dlt.sources.requests.Client` and use that instead of the global client. This lets you customize which status codes and exceptions to retry on: @@ -76,7 +76,7 @@ http_client = requests.Client( ) ``` -and you may even supply a custom retry condition in the form of a predicate. +And you may even supply a custom retry condition in the form of a predicate. This is sometimes needed when loading from non-standard APIs which don't use HTTP error codes. For example: @@ -98,3 +98,4 @@ http_client = Client( retry_condition=retry_if_error_key ) ``` + diff --git a/docs/website/docs/general-usage/http/rest-client.md b/docs/website/docs/general-usage/http/rest-client.md index 40c83f8c5b..ad400594f7 100644 --- a/docs/website/docs/general-usage/http/rest-client.md +++ b/docs/website/docs/general-usage/http/rest-client.md @@ -61,12 +61,9 @@ If `paginator` is not specified, the `paginate()` method will attempt to automat ### Selecting data from the response -When paginating through API responses, the `RESTClient` tries to automatically extract the data from the response. Sometimes though you may need to explicitly -specify how to extract the data from the response JSON. +When paginating through API responses, the `RESTClient` tries to automatically extract the data from the response. Sometimes, though, you may need to explicitly specify how to extract the data from the response JSON. -Use `data_selector` parameter of the `RESTClient` class or the `paginate()` method to tell the client how to extract the data. -`data_selector` is a [JSONPath](https://github.com/h2non/jsonpath-ng?tab=readme-ov-file#jsonpath-syntax) expression that points to the key in -the JSON that contains the data to be extracted. +Use the `data_selector` parameter of the `RESTClient` class or the `paginate()` method to tell the client how to extract the data. `data_selector` is a [JSONPath](https://github.com/h2non/jsonpath-ng?tab=readme-ov-file#jsonpath-syntax) expression that points to the key in the JSON that contains the data to be extracted. 
For example, if the API response looks like this: @@ -100,7 +97,7 @@ The `data_selector` needs to be set to `"results.posts"`. Read more about [JSONP ### PageData -Each `PageData` instance contains the data for a single page, along with context such as the original request and response objects, allowing for detailed inspection.. The `PageData` is a list-like object that contains the following attributes: +Each `PageData` instance contains the data for a single page, along with context such as the original request and response objects, allowing for detailed inspection. The `PageData` is a list-like object that contains the following attributes: - `request`: The original request object. - `response`: The response object. @@ -161,17 +158,15 @@ def get_data(): yield page ``` - #### HeaderLinkPaginator -This paginator handles pagination based on a link to the next page in the response headers (e.g., the `Link` header, as used by GitHub API). +This paginator handles pagination based on a link to the next page in the response headers (e.g., the `Link` header, as used by the GitHub API). **Parameters:** - `links_next_key`: The relation type (rel) to identify the next page link within the Link header. Defaults to "next". -Note: normally, you don't need to specify this paginator explicitly, as it is used automatically when the API returns a `Link` header. On rare occasions, you may -need to specify the paginator when the API uses a different relation type. +Note: Normally, you don't need to specify this paginator explicitly, as it is used automatically when the API returns a `Link` header. On rare occasions, you may need to specify the paginator when the API uses a different relation type. #### OffsetPaginator @@ -184,7 +179,7 @@ need to specify the paginator when the API uses a different relation type. - `offset_param`: The name of the query parameter used to specify the offset. Defaults to `"offset"`. - `limit_param`: The name of the query parameter used to specify the limit. Defaults to `"limit"`. - `total_path`: A JSONPath expression for the total number of items. If not provided, pagination is controlled by `maximum_offset` and `stop_after_empty_page`. -- `maximum_offset`: Optional maximum offset value. Limits pagination even without total count. +- `maximum_offset`: Optional maximum offset value. Limits pagination even without a total count. - `stop_after_empty_page`: Whether pagination should stop when a page contains no result items. Defaults to `True`. **Example:** @@ -224,7 +219,7 @@ client = RESTClient( ) ``` -Additionally, you can limit pagination with `maximum_offset`, for example during development. If `maximum_offset` is reached before the first empty page then pagination stops: +Additionally, you can limit pagination with `maximum_offset`, for example during development. If `maximum_offset` is reached before the first empty page, then pagination stops: ```py client = RESTClient( @@ -237,10 +232,9 @@ client = RESTClient( ) ``` -You can disable automatic stoppage of pagination by setting `stop_after_stop_after_empty_page = False`. In this case, you must provide either `total_path` or `maximum_offset` to guarantee that the paginator terminates. - +You can disable the automatic stoppage of pagination by setting `stop_after_empty_page = False`. In this case, you must provide either `total_path` or `maximum_offset` to guarantee that the paginator terminates. -#### PageNumberPaginator +#### Pagenumberpaginator `PageNumberPaginator` works by incrementing the page number for each request. 
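Below is a minimal sketch of how this paginator can be wired up. The endpoint and the `total_pages` field are assumptions made for the example, not something a particular API is guaranteed to provide:

```py
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import PageNumberPaginator

client = RESTClient(
    base_url="https://api.example.com",
    paginator=PageNumberPaginator(
        base_page=1,               # the number the API uses for its first page
        page_param="page",         # the query parameter that carries the page number
        total_path="total_pages",  # JSONPath to the total page count in the response
    ),
)

for page in client.paginate("/posts"):
    print(page)
```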
@@ -287,22 +281,21 @@ client = RESTClient( ) ``` -Additionally, you can limit pagination with `maximum_offset`, for example during development. If `maximum_page` is reached before the first empty page then pagination stops: +Additionally, you can limit pagination with `maximum_page`, for example during development. If `maximum_page` is reached before the first empty page, then pagination stops: ```py client = RESTClient( base_url="https://api.example.com", paginator=PageNumberPaginator( - maximum_page=2, # limits response to 2 pages + maximum_page=2, # Limits response to 2 pages total_path=None, ) ) ``` -You can disable automatic stoppage of pagination by setting `stop_after_stop_after_empty_page = False`. In this case, you must provide either `total_path` or `maximum_page` to guarantee that the paginator terminates. +You can disable the automatic stoppage of pagination by setting `stop_after_empty_page = False`. In this case, you must provide either `total_path` or `maximum_page` to guarantee that the paginator terminates. - -#### JSONResponseCursorPaginator +#### JSONResponseCursorPaginator `JSONResponseCursorPaginator` handles pagination based on a cursor in the JSON response. @@ -335,17 +328,17 @@ client = RESTClient( ### Implementing a custom paginator -When working with APIs that use non-standard pagination schemes, or when you need more control over the pagination process, you can implement a custom paginator by subclassing the `BasePaginator` class and implementing the methods `init_request`, `update_state` and `update_request`. +When working with APIs that use non-standard pagination schemes, or when you need more control over the pagination process, you can implement a custom paginator by subclassing the `BasePaginator` class and implementing the methods `init_request`, `update_state`, and `update_request`. - `init_request(request: Request) -> None`: This method is called before making the first API call in the `RESTClient.paginate` method. You can use this method to set up the initial request query parameters, headers, etc. For example, you can set the initial page number or cursor value. - `update_state(response: Response, data: Optional[List[Any]]) -> None`: This method updates the paginator's state based on the response of the API call. Typically, you extract pagination details (like the next page reference) from the response and store them in the paginator instance. -- `update_request(request: Request) -> None`: Before making the next API call in `RESTClient.paginate` method, `update_request` is used to modify the request with the necessary parameters to fetch the next page (based on the current state of the paginator). For example, you can add query parameters to the request, or modify the URL. +- `update_request(request: Request) -> None`: Before making the next API call in the `RESTClient.paginate` method, `update_request` is used to modify the request with the necessary parameters to fetch the next page (based on the current state of the paginator). For example, you can add query parameters to the request or modify the URL. -#### Example 1: creating a query parameter paginator +#### Example 1: Creating a query parameter paginator -Suppose an API uses query parameters for pagination, incrementing an page parameter for each subsequent page, without providing direct links to next pages in its responses. E.g. `https://api.example.com/posts?page=1`, `https://api.example.com/posts?page=2`, etc.
Here's how you could implement a paginator for this scheme: +Suppose an API uses query parameters for pagination, incrementing a page parameter for each subsequent page, without providing direct links to next pages in its responses. E.g. `https://api.example.com/posts?page=1`, `https://api.example.com/posts?page=2`, etc. Here's how you could implement a paginator for this scheme: ```py from typing import Any, List, Optional @@ -395,7 +388,7 @@ def get_data(): [`PageNumberPaginator`](#pagenumberpaginator) that ships with dlt does the same thing, but with more flexibility and error handling. This example is meant to demonstrate how to implement a custom paginator. For most use cases, you should use the [built-in paginators](#paginators). ::: -#### Example 2: creating a paginator for POST requests +#### Example 2: Creating a paginator for POST requests Some APIs use POST requests for pagination, where the next page is fetched by sending a POST request with a cursor or other parameters in the request body. This is frequently used in "search" API endpoints or other endpoints with big payloads. Here's how you could implement a paginator for a case like this: @@ -447,12 +440,12 @@ The available authentication methods are defined in the `dlt.sources.helpers.res - [OAuth2ClientCredentials](#oauth20-authorization) For specific use cases, you can [implement custom authentication](#implementing-custom-authentication) by subclassing the `AuthBase` class from the Requests library. -For specific flavors of OAuth 2.0 you can [implement custom OAuth 2.0](#oauth2-authorization) +For specific flavors of OAuth 2.0, you can [implement custom OAuth 2.0](#oauth2-authorization) by subclassing `OAuth2ClientCredentials`. ### Bearer token authentication -Bearer Token Authentication (`BearerTokenAuth`) is an auth method where the client sends a token in the request's Authorization header (e.g. `Authorization: Bearer `). The server validates this token and grants access if the token is valid. +Bearer Token Authentication (`BearerTokenAuth`) is an auth method where the client sends a token in the request's Authorization header (e.g., `Authorization: Bearer `). The server validates this token and grants access if the token is valid. **Parameters:** @@ -475,7 +468,7 @@ for page in client.paginate("/protected/resource"): ### API key authentication -API Key Authentication (`ApiKeyAuth`) is an auth method where the client sends an API key in a custom header (e.g. `X-API-Key: `, or as a query parameter). +API Key Authentication (`ApiKeyAuth`) is an auth method where the client sends an API key in a custom header (e.g., `X-API-Key: `, or as a query parameter). **Parameters:** @@ -523,13 +516,13 @@ response = client.get("/protected/resource") OAuth 2.0 is a common protocol for authorization. We have implemented two-legged authorization employed for server-to-server authorization because the end user (resource owner) does not need to grant approval. The REST client acts as the OAuth client which obtains a temporary access token from the authorization server. This access token is then sent to the resource server to access protected content. If the access token is expired, the OAuth client automatically refreshes it. -Unfortunately, most OAuth 2.0 implementations vary and thus you might need to subclass `OAuth2ClientCredentials` and implement `build_access_token_request()` to suite the requirements of the specific authorization server you want to interact with. 
+Unfortunately, most OAuth 2.0 implementations vary and thus you might need to subclass `OAuth2ClientCredentials` and implement `build_access_token_request()` to suit the requirements of the specific authorization server you want to interact with. **Parameters:** -- `access_token_url`: The url to obtain the temporary access token. +- `access_token_url`: The URL to obtain the temporary access token. - `client_id`: Client credential to obtain authorization. Usually issued via a developer portal. - `client_secret`: Client credential to obtain authorization. Usually issued via a developer portal. -- `access_token_request_data`: A dictionary with data required by the autorization server apart from the `client_id`, `client_secret`, and `"grant_type": "client_credentials"`. Defaults to `None`. +- `access_token_request_data`: A dictionary with data required by the authorization server apart from the `client_id`, `client_secret`, and `"grant_type": "client_credentials"`. Defaults to `None`. - `default_token_expiration`: The time in seconds after which the temporary access token expires. Defaults to 3600. **Example:** @@ -540,7 +533,7 @@ from dlt.sources.helpers.rest_client import RESTClient from dlt.sources.helpers.rest_client.auth import OAuth2ClientCredentials class OAuth2ClientCredentialsHTTPBasic(OAuth2ClientCredentials): - """Used e.g. by Zoom Zoom Video Communications, Inc.""" + """Used e.g. by Zoom Video Communications, Inc.""" def build_access_token_request(self) -> Dict[str, Any]: authentication: str = b64encode( f"{self.client_id}:{self.client_secret}".encode() @@ -567,8 +560,6 @@ client = RESTClient(base_url="https://api.zoom.us/v2", auth=auth) response = client.get("/users") ``` - - ### Implementing custom authentication You can implement custom authentication by subclassing the `AuthBase` class and implementing the `__call__` method: @@ -597,7 +588,7 @@ client = RESTClient( ## Advanced usage -`RESTClient.paginate()` allows to specify a [custom hook function](https://requests.readthedocs.io/en/latest/user/advanced/#event-hooks) that can be used to modify the response objects. For example, to handle specific HTTP status codes gracefully: +`RESTClient.paginate()` allows you to specify a [custom hook function](https://requests.readthedocs.io/en/latest/user/advanced/#event-hooks) that can be used to modify the response objects. For example, to handle specific HTTP status codes gracefully: ```py def custom_response_handler(response): @@ -608,7 +599,7 @@ def custom_response_handler(response): client.paginate("/posts", hooks={"response": [custom_response_handler]}) ``` -The handler function may raise `IgnoreResponseException` to exit the pagination loop early. This is useful for the enpoints that return a 404 status code when there are no items to paginate. +The handler function may raise `IgnoreResponseException` to exit the pagination loop early. This is useful for the endpoints that return a 404 status code when there are no items to paginate. ## Shortcut for paginating API responses @@ -621,7 +612,6 @@ for page in paginate("https://api.example.com/posts"): print(page) ``` - ## Retry You can customize how the RESTClient retries failed requests by editing your `config.toml`. @@ -659,7 +649,7 @@ print(response.content) ### `RESTClient.paginate()` -Debugging `paginate()` is trickier because it's a generator function that yields [`PageData`](#pagedata) objects. 
Here's several ways to debug the `paginate()` method: +Debugging `paginate()` is trickier because it's a generator function that yields [`PageData`](#pagedata) objects. Here are several ways to debug the `paginate()` method: 1. Enable [logging](../../running-in-production/running.md#set-the-log-level-and-format) to see detailed information about the HTTP requests: @@ -702,3 +692,4 @@ for page in client.paginate( ): print(page) ``` + diff --git a/docs/website/docs/tutorial/filesystem.md b/docs/website/docs/tutorial/filesystem.md index b748f794d5..b44b7b783a 100644 --- a/docs/website/docs/tutorial/filesystem.md +++ b/docs/website/docs/tutorial/filesystem.md @@ -4,7 +4,7 @@ description: Learn how to load data files like JSON, JSONL, CSV, and Parquet fro keywords: [dlt, tutorial, filesystem, cloud storage, file system, python, data pipeline, incremental loading, json, jsonl, csv, parquet, duckdb] --- -This tutorial is for you if you need to load data files like JSONL, CSV, and Parquet from either Cloud Storage (ex. AWS S3, Google Cloud Storage, Google Drive, Azure Blob Storage) or a local file system. +This tutorial is for you if you need to load data files like JSONL, CSV, and Parquet from either Cloud Storage (e.g., AWS S3, Google Cloud Storage, Google Drive, Azure Blob Storage) or a local file system. ## What you will learn @@ -48,24 +48,24 @@ Here’s what each file does: - `config.toml`: This file contains the configuration settings for your dlt project. :::note -When deploying your pipeline in a production environment, managing all configurations with files might not be convenient. In this case, we recommend you to use the environment variables to store secrets and configs instead. Read more about [configuration providers](../general-usage/credentials/setup#available-config-providers) available in dlt. +When deploying your pipeline in a production environment, managing all configurations with files might not be convenient. In this case, we recommend you use environment variables to store secrets and configs instead. Read more about [configuration providers](../general-usage/credentials/setup#available-config-providers) available in dlt. ::: + + ## 2. Creating the pipeline The filesystem source provides users with building blocks for loading data from any type of files. You can break down the data extraction into two steps: -1. Listing the files in the bucket / directory. +1. Listing the files in the bucket/directory. 2. Reading the files and yielding records. dlt's filesystem source includes several resources: -- the `filesystem` resource lists files in the directory or bucket -- several readers resources (`read_csv`, `read_parquet`, `read_jsonl`) read files and yield the records. These resources have a -special type, they called [transformers](../general-usage/resource#process-resources-with-dlttransformer). Transformers expect items from another resource. -In this particular case transformers expect `FileItem` object and transform it into multiple records. +- The `filesystem` resource lists files in the directory or bucket. +- Several reader resources (`read_csv`, `read_parquet`, `read_jsonl`) read files and yield the records. These resources have a special type; they are called [transformers](../general-usage/resource#process-resources-with-dlttransformer). Transformers expect items from another resource. In this particular case, transformers expect `FileItem` objects and transform them into multiple records. 
-Let's initialize a source and create a pipeline for loading CSV files from Google Cloud Storage to DuckDB. You can replace code from `filesystem_pipeline.py` with the following: +Let's initialize a source and create a pipeline for loading CSV files from Google Cloud Storage to DuckDB. You can replace the code in `filesystem_pipeline.py` with the following: ```py import dlt @@ -81,26 +81,25 @@ print(info) What's happening in the snippet above? -1. We import the `filesystem` resource and initialize it with a bucket URL (`gs://filesystem-tutorial`) and the `file_glob` parameter. dlt uses `file_glob` to filter files names in the bucket. `filesystem` returns a generator object. -2. We pipe the files names yielded by the filesystem resource to the transformer resource `read_csv` to read each file and iterate over records from the file. We name this transformer resource `"encounters"` using the `with_name()`. dlt will use the resource name `"encounters"` as a table name when loading the data. +1. We import the `filesystem` resource and initialize it with a bucket URL (`gs://filesystem-tutorial`) and the `file_glob` parameter. dlt uses `file_glob` to filter file names in the bucket. `filesystem` returns a generator object. +2. We pipe the file names yielded by the filesystem resource to the transformer resource `read_csv` to read each file and iterate over records from the file. We name this transformer resource `"encounters"` using the `with_name()`. dlt will use the resource name `"encounters"` as a table name when loading the data. :::note A [transformer](../general-usage/resource#process-resources-with-dlttransformer) in dlt is a special type of resource that processes each record from another resource. This lets you chain multiple resources together. ::: -3. We create the dlt pipeline configuring with the name `hospital_data_pipeline` and DuckDB destination. +3. We create the dlt pipeline, configuring it with the name `hospital_data_pipeline` and DuckDB destination. 4. We call `pipeline.run()`. This is where the underlying generators are iterated: - dlt retrieves remote data, - normalizes data, - creates or updates the table in the destination, - loads the extracted data into the destination. - 5. `print(info)` outputs pipeline running stats we get from `pipeline.run()` +5. `print(info)` outputs pipeline running stats we get from `pipeline.run()`. ## 3. Configuring the filesystem source :::note -In this tutorial we will work with publicly accessed dataset [Hospital Patient Records](https://mavenanalytics.io/data-playground?order=date_added%2Cdesc&search=Hospital%20Patient%20Records) -synthetic electronic health care records. You can use the exact credentials from this tutorial to load this dataset from GCP. +In this tutorial, we will work with the publicly accessed dataset [Hospital Patient Records](https://mavenanalytics.io/data-playground?order=date_added%2Cdesc&search=Hospital%20Patient%20Records), synthetic electronic health care records. You can use the exact credentials from this tutorial to load this dataset from GCP.
Citation Jason Walonoski, Mark Kramer, Joseph Nichols, Andre Quina, Chris Moesel, Dylan Hall, Carlton Duffett, Kudakwashe Dube, Thomas Gallagher, Scott McLachlan, Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record, Journal of the American Medical Informatics Association, Volume 25, Issue 3, March 2018, Pages 230–238, https://doi.org/10.1093/jamia/ocx079 @@ -128,6 +127,7 @@ Let's specify the bucket URL and credentials. We can do this using the following client_email = "public-access@dlthub-sandbox.iam.gserviceaccount.com" project_id = "dlthub-sandbox" private_key = "-----BEGIN PRIVATE KEY-----\nMIIEvAIBADANBgkqhkiG9w0BAQEFAASCBKYwggSiAgEAAoIBAQDGWsVHJRjliojx\nTo+j1qu+x8PzC5ZHZrMx6e8OD6tO8uxMyl65ByW/4FZkVXkS4SF/UYPigGN+rel4\nFmySTbP9orva4t3Pk1B9YSvQMB7V5IktmTIW9Wmdmn5Al8Owb1RehgIidm1EX/Z9\nLr09oLpO6+jUu9RIP2Lf2mVQ6tvkgl7UOdpdGACSNGzRiZgVZDOaDIgH0Tl4UWmK\n6iPxhwZy9YC2B1beLB/NU+F6DUykrEpBzCFQTqFoTUcuDAEvuvpU9JrU2iBMiOGw\nuP3TYSiudhBjmauEUWaMiqWAgFeX5ft1vc7/QWLdI//SAjaiTAu6pTer29Q0b6/5\niGh0jRXpAgMBAAECggEAL8G9C9MXunRvYkH6/YR7F1T7jbH1fb1xWYwsXWNSaJC+\nagKzabMZ2KfHxSJ7IxuHOCNFMKyex+pRcvNbMqJ4upGKzzmeFBMw5u8VYGulkPQU\nPyFKWRK/Wg3PZffkSr+TPargKrH+vt6n9x3gvEzNbqEIDugmRTrVsHXhvOi/BrYc\nWhppHSVQidWZi5KVwDEPJjDQiHEcYI/vfIy1WhZ8VuPAaE5nMZ1m7gTdeaWWKIAj\n/p2ZkLgRdCY8vNkfaNDAxDbvH+CMuTtOw55GydzsYYiofANS6xZ8CedGkYaGi82f\nqGdLghX61Sg3UAb5SI36T/9XbuCpTf3B/SMV20ew8QKBgQDm2yUxL71UqI/xK9LS\nHWnqfHpKmHZ+U9yLvp3v79tM8XueSRKBTJJ4H+UvVQrXlypT7cUEE+sGpTrCcDGL\nm8irtdUmMvdi7AnRBgmWdYKig/kgajLOUrjXqFt/BcFgqMyTfzqPt3xdp6F3rSEK\nHE6PQ8I3pJ0BJOSJRa6Iw2VH1QKBgQDb9WbVFjYwTIKJOV4J2plTK581H8PI9FSt\nUASXcoMTixybegk8beGdwfm2TkyF/UMzCvHfuaUhf+S0GS5Zk31Wkmh1YbmFU4Q9\nm9K/3eoaqF7CohpigB0wJw4HfqNh6Qt+nICOMCv++gw7+/UwfV72dCqr0lpzfX5F\nAsez8igTxQKBgDsq/axOnQr+rO3WGpGJwmS8BKfrzarxGXyjnV0qr51X4yQdfGWx\nV3T8T8RC2qWI8+tQ7IbwB/PLE3VURg6PHe6MixXgSDGNZ7KwBnMOqS23/3kEXwMs\nhn2Xg+PZeMeqW8yN9ldxYqmqViMTN32c5bGoXzXdtfPeHcjlGCerVOEFAoGADVPi\nRjkRUX3hTvVF6Gzxa2OyQuLI1y1O0C2QCakrngyI0Dblxl6WFBwDyHMYGepNnxMj\nsr2p7sy0C+GWuGDCcHNwluQz/Ish8SW28F8+5xyamUp/NMa0fg1vwS6AMdeQFbzf\n4T2z/MAj66KJqcV+8on5Z+3YAzVwaDgR56pdmU0CgYBo2KWcNWAhZ1Qa6sNrITLV\nGlxg6tWP3OredZrmKb1kj5Tk0V+EwVN+HnKzMalv6yyyK7SWq1Z6rvCye37vy27q\nD7xfuz0c0H+48uWJpdLcsxpTioopsRPayiVDKlHSe/Qa+MEjAG3ded5TJiC+5iSw\nxWJ51y0wpme0LWgzzoLbRw==\n-----END PRIVATE KEY-----\n" +``` # config.toml [sources.filesystem] @@ -155,7 +155,7 @@ from dlt.sources.filesystem import filesystem, read_csv files = filesystem( bucket_url="gs://filesystem-tutorial", - # please, do not specify sensitive information directly in the code, + # Please, do not specify sensitive information directly in the code, # instead, you can use env variables to get the credentials credentials=GcpClientCredentials( client_email="public-access@dlthub-sandbox.iam.gserviceaccount.com", @@ -217,9 +217,9 @@ If you try running the pipeline again with `python filesystem_pipeline.py`, you - `replace`: Replaces the data in the destination table with the new data. - `merge`: Merges the new data with the existing data in the destination table based on the primary key. -To specify the `write_disposition`, you can set it in the `pipeline.run` command. Let's change the write disposition to `merge`. In this case, dlt will deduplicate the data before loading them into the destination. +To specify the `write_disposition`, you can set it in the `pipeline.run` command. Let's change the write disposition to `merge`. 
In this case, dlt will deduplicate the data before loading it into the destination.
 
-To enable data deduplication, we also should specify a `primary_key` or `merge_key`, which will be used by dlt to define if two records are different. Both keys could consist of several columns. dlt will try to use `merge_key` and fallback to `primary_key` if it's not specified. To specify any hints about the data, including column types, primary keys, you can use the [`apply_hints`](../general-usage/resource#set-table-name-and-adjust-schema) method.
+To enable data deduplication, we should also specify a `primary_key` or `merge_key`, which dlt will use to determine whether two records are different. Both keys can consist of several columns. dlt will try to use `merge_key` and fall back to `primary_key` if it's not specified. To specify any hints about the data, including column types and primary keys, you can use the [`apply_hints`](../general-usage/resource#set-table-name-and-adjust-schema) method.
 
 ```py
 import dlt
@@ -234,7 +234,7 @@ info = pipeline.run(reader, write_disposition="merge")
 print(info)
 ```
 :::tip
-You may need to drop the previously loaded data if you loaded data several times with `append` write disposition to make sure the primary key column has unique values.
+You may need to drop the previously loaded data if you loaded data several times with the `append` write disposition to make sure the primary key column has unique values.
 :::
 
 You can learn more about `write_disposition` in the [write dispositions section](../general-usage/incremental-loading#the-3-write-dispositions) of the incremental loading page.
@@ -367,3 +367,4 @@ Interested in learning more about dlt? Here are some suggestions:
 - Learn more about the filesystem source configuration in [filesystem source](../dlt-ecosystem/verified-sources/filesystem)
 - Learn more about different credential types in [Built-in credentials](../general-usage/credentials/complex_types#built-in-credentials)
 - Learn how to [create a custom source](./load-data-from-an-api.md) in the advanced tutorial
+
diff --git a/docs/website/docs/tutorial/load-data-from-an-api.md b/docs/website/docs/tutorial/load-data-from-an-api.md
index 5b1d63373c..d96075729a 100644
--- a/docs/website/docs/tutorial/load-data-from-an-api.md
+++ b/docs/website/docs/tutorial/load-data-from-an-api.md
@@ -9,7 +9,7 @@ This tutorial introduces you to foundational dlt concepts, demonstrating how to
 ## What you will learn
 
 - Loading data from a list of Python dictionaries into DuckDB.
-- Low level API usage with built-in HTTP client.
+- Low-level API usage with a built-in HTTP client.
 - Understand and manage data loading behaviors.
 - Incrementally load new data and deduplicate existing data.
 - Dynamic resource creation and reducing code redundancy.
@@ -74,13 +74,13 @@ Load package 1692364844.460054 is LOADED and contains no failed jobs
 
 ### Explore the data
 
-To allow sneak peek and basic discovery you can take advantage of [built-in integration with Strealmit](../reference/command-line-interface#show-tables-and-data-in-the-destination):
+To allow a sneak peek and basic discovery, you can take advantage of [built-in integration with Streamlit](../reference/command-line-interface#show-tables-and-data-in-the-destination):
 
 ```sh
 dlt pipeline quick_start show
 ```
 
-**quick_start** is the name of the pipeline from the script above. If you do not have Streamlit installed yet do:
+**quick_start** is the name of the pipeline from the script above. 
If you do not have Streamlit installed yet, do: ```sh pip install streamlit @@ -94,22 +94,22 @@ Streamlit Explore data. Schema and data for a test pipeline “quick_start”. :::tip `dlt` works in Jupyter Notebook and Google Colab! See our [Quickstart Colab Demo.](https://colab.research.google.com/drive/1NfSB1DpwbbHX9_t5vlalBTf13utwpMGx?usp=sharing) -Looking for source code of all the snippets? You can find and run them [from this repository](https://github.com/dlt-hub/dlt/blob/devel/docs/website/docs/getting-started-snippets.py). +Looking for the source code of all the snippets? You can find and run them [from this repository](https://github.com/dlt-hub/dlt/blob/devel/docs/website/docs/getting-started-snippets.py). ::: -Now that you have a basic understanding of how to get started with dlt, you might be eager to dive deeper. For that we need to switch to a more advanced data source - the GitHub API. We will load issues from our [dlt-hub/dlt](https://github.com/dlt-hub/dlt) repository. +Now that you have a basic understanding of how to get started with dlt, you might be eager to dive deeper. For that, we need to switch to a more advanced data source - the GitHub API. We will load issues from our [dlt-hub/dlt](https://github.com/dlt-hub/dlt) repository. :::note -This tutorial uses GitHub REST API for demonstration purposes only. If you need to read data from a REST API, consider using the dlt's REST API source. Check out the [REST API source tutorial](./rest-api) for quick start or [REST API source reference](../dlt-ecosystem/verified-sources/rest_api) for more details. +This tutorial uses the GitHub REST API for demonstration purposes only. If you need to read data from a REST API, consider using dlt's REST API source. Check out the [REST API source tutorial](./rest-api) for a quick start or [REST API source reference](../dlt-ecosystem/verified-sources/rest_api) for more details. ::: + ## Create a pipeline First, we need to create a [pipeline](../general-usage/pipeline). Pipelines are the main building blocks of `dlt` and are used to load data from sources to destinations. Open your favorite text editor and create a file called `github_issues.py`. Add the following code to it: - Here's what the code above does: 1. It makes a request to the GitHub API endpoint and checks if the response is successful. 2. Then it creates a dlt pipeline with the name `github_issues` and specifies that the data should be loaded to the `duckdb` destination and the `github_data` dataset. Nothing gets loaded yet. @@ -161,7 +161,7 @@ load_info = pipeline.run( print(load_info) ``` -Run this script twice to see that **issues** table still contains only one copy of the data. +Run this script twice to see that the **issues** table still contains only one copy of the data. :::tip What if the API has changed and new fields get added to the response? @@ -182,7 +182,7 @@ You can pass a generator to the `run` method directly or use the `@dlt.resource` ### Load only new data (incremental loading) -Let's improve our GitHub API example and get only issues that were created since last load. +Let's improve our GitHub API example and get only issues that were created since the last load. Instead of using `replace` write disposition and downloading all issues each time the pipeline is run, we do the following: @@ -192,9 +192,9 @@ Let's take a closer look at the code above. We use the `@dlt.resource` decorator to declare the table name into which data will be loaded and specify the `append` write disposition. 
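+
+For orientation, a resource declared this way might look roughly like the following simplified sketch (the endpoint and parameters here are illustrative, and the actual snippet shown above also pages through all results):
+
+```py
+import dlt
+from dlt.sources.helpers import requests
+
+@dlt.resource(table_name="issues", write_disposition="append")
+def get_issues(
+    created_at=dlt.sources.incremental("created_at", initial_value="1970-01-01T00:00:00Z")
+):
+    # Fetch a page of issues ordered by creation date (newest first).
+    # dlt uses the `created_at` cursor declared above to skip records
+    # with values it has already seen in previous runs.
+    url = (
+        "https://api.github.com/repos/dlt-hub/dlt/issues"
+        "?per_page=100&sort=created&direction=desc&state=open"
+    )
+    response = requests.get(url)
+    response.raise_for_status()
+    yield response.json()
+```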

-We request issues for dlt-hub/dlt repository ordered by **created_at** field (descending) and yield them page by page in `get_issues` generator function.
+We request issues for the dlt-hub/dlt repository ordered by the **created_at** field (descending) and yield them page by page in the `get_issues` generator function.
 
-We also use `dlt.sources.incremental` to track `created_at` field present in each issue to filter in the newly created.
+We also use `dlt.sources.incremental` to track the `created_at` field present in each issue to filter in the newly created ones.
 
 Now run the script. It loads all the issues from our repo to `duckdb`. Run it again, and you can see that no issues got added (if no issues were created in the meantime).
 
@@ -202,7 +202,7 @@ Now you can run this script on a daily schedule and each day you’ll load only
 
 :::tip
 Between pipeline runs, `dlt` keeps the state in the same database it loaded data to.
-Peek into that state, the tables loaded and get other information with:
+Peek into that state, the tables loaded, and get other information with:
 
 ```sh
 dlt pipeline -v github_issues_incremental info
@@ -213,24 +213,24 @@ Learn more:
 
 - Declare your [resources](../general-usage/resource) and group them in [sources](../general-usage/source) using Python decorators.
 - [Set up "last value" incremental loading.](../general-usage/incremental-loading#incremental_loading-with-last-value)
-- [Inspect pipeline after loading.](../walkthroughs/run-a-pipeline#4-inspect-a-load-process)
+- [Inspect the pipeline after loading.](../walkthroughs/run-a-pipeline#4-inspect-a-load-process)
 - [`dlt` command line interface.](../reference/command-line-interface)
 
 ### Update and deduplicate your data
 
 The script above finds **new** issues and adds them to the database.
-It will ignore any updates to **existing** issue text, emoji reactions etc.
-To get always fresh content of all the issues you combine incremental load with `merge` write disposition,
+It will ignore any updates to **existing** issue text, emoji reactions, etc.
+To always get fresh content of all the issues, combine incremental loading with the `merge` write disposition,
 like in the script below.
 
-Above we add `primary_key` argument to the `dlt.resource()` that tells `dlt` how to identify the issues in the database to find duplicates which content it will merge.
+Above, we add the `primary_key` argument to `dlt.resource()` that tells `dlt` how to identify the issues in the database to find duplicates whose content it will merge.
 
 Note that we now track the `updated_at` field — so we filter in all issues **updated** since the last pipeline run
 (which also includes those newly created).
 
-Pay attention how we use **since** parameter from [GitHub API](https://docs.github.com/en/rest/issues/issues?apiVersion=2022-11-28#list-repository-issues)
+Pay attention to how we use the **since** parameter from the [GitHub API](https://docs.github.com/en/rest/issues/issues?apiVersion=2022-11-28#list-repository-issues)
 and `updated_at.last_value` to tell GitHub to return issues updated only **after** the date we pass.
 `updated_at.last_value` holds the last `updated_at` value from the previous run.
 
 [Learn more about merge write disposition](../general-usage/incremental-loading#merge-incremental_loading).
@@ -282,10 +282,10 @@ Let's zoom in on the changes:
 
 1. The `while` loop that handled pagination is replaced with reading pages from the `paginate()` generator.
 2. `paginate()` takes the URL of the API endpoint and optional parameters. 
In this case, we pass the `since` parameter to get only issues updated after the last pipeline run. -3. We're not explicitly setting up pagination, `paginate()` handles it for us. Magic! Under the hood, `paginate()` analyzes the response and detects the pagination method used by the API. Read more about pagination in the [REST client documentation](../general-usage/http/rest-client.md#paginating-api-responses). +3. We're not explicitly setting up pagination; `paginate()` handles it for us. Magic! Under the hood, `paginate()` analyzes the response and detects the pagination method used by the API. Read more about pagination in the [REST client documentation](../general-usage/http/rest-client.md#paginating-api-responses). If you want to take full advantage of the `dlt` library, then we strongly suggest that you build your sources out of existing building blocks: -To make most of `dlt`, consider the following: +To make the most of `dlt`, consider the following: ## Use source decorator @@ -310,7 +310,7 @@ def get_comments( yield page ``` -We can load this resource separately from the issues resource, however loading both issues and comments in one go is more efficient. To do that, we'll use the `@dlt.source` decorator on a function that returns a list of resources: +We can load this resource separately from the issues resource; however, loading both issues and comments in one go is more efficient. To do that, we'll use the `@dlt.source` decorator on a function that returns a list of resources: ```py @dlt.source @@ -380,7 +380,7 @@ print(load_info) ### Dynamic resources -You've noticed that there's a lot of code duplication in the `get_issues` and `get_comments` functions. We can reduce that by extracting the common fetching code into a separate function and use it in both resources. Even better, we can use `dlt.resource` as a function and pass it the `fetch_github_data()` generator function directly. Here's the refactored code: +You've noticed that there's a lot of code duplication in the `get_issues` and `get_comments` functions. We can reduce that by extracting the common fetching code into a separate function and using it in both resources. Even better, we can use `dlt.resource` as a function and pass it the `fetch_github_data()` generator function directly. Here's the refactored code: ```py import dlt @@ -414,7 +414,7 @@ row_counts = pipeline.last_trace.last_normalize_info ## Handle secrets -For the next step we'd want to get the [number of repository clones](https://docs.github.com/en/rest/metrics/traffic?apiVersion=2022-11-28#get-repository-clones) for our dlt repo from the GitHub API. However, the `traffic/clones` endpoint that returns the data requires [authentication](https://docs.github.com/en/rest/overview/authenticating-to-the-rest-api?apiVersion=2022-11-28). +For the next step, we'd want to get the [number of repository clones](https://docs.github.com/en/rest/metrics/traffic?apiVersion=2022-11-28#get-repository-clones) for our dlt repo from the GitHub API. However, the `traffic/clones` endpoint that returns the data requires [authentication](https://docs.github.com/en/rest/overview/authenticating-to-the-rest-api?apiVersion=2022-11-28). Let's handle this by changing our `fetch_github_data()` first: @@ -444,13 +444,13 @@ def github_source(access_token): ... 
``` -Here, we added `access_token` parameter and now we can use it to pass the access token to the request: +Here, we added the `access_token` parameter, and now we can use it to pass the access token to the request: ```py load_info = pipeline.run(github_source(access_token="ghp_XXXXX")) ``` -It's a good start. But we'd want to follow the best practices and not hardcode the token in the script. One option is to set the token as an environment variable, load it with `os.getenv()` and pass it around as a parameter. dlt offers a more convenient way to handle secrets and credentials: it lets you inject the arguments using a special `dlt.secrets.value` argument value. +It's a good start. But we'd want to follow the best practices and not hardcode the token in the script. One option is to set the token as an environment variable, load it with `os.getenv()`, and pass it around as a parameter. dlt offers a more convenient way to handle secrets and credentials: it lets you inject the arguments using a special `dlt.secrets.value` argument value. To use it, change the `github_source()` function to: @@ -467,7 +467,7 @@ When you add `dlt.secrets.value` as a default value for an argument, `dlt` will 1. Special environment variables. 2. `secrets.toml` file. -The `secret.toml` file is located in the `~/.dlt` folder (for global configuration) or in the `.dlt` folder in the project folder (for project-specific configuration). +The `secrets.toml` file is located in the `~/.dlt` folder (for global configuration) or in the `.dlt` folder in the project folder (for project-specific configuration). Let's add the token to the `~/.dlt/secrets.toml` file: @@ -476,7 +476,7 @@ Let's add the token to the `~/.dlt/secrets.toml` file: access_token = "ghp_A...3aRY" ``` -Now we can run the script and it will load the data from the `traffic/clones` endpoint: +Now we can run the script, and it will load the data from the `traffic/clones` endpoint: ```py ... @@ -564,18 +564,16 @@ Interested in learning more? Here are some suggestions: 1. You've been running your pipelines locally. Learn how to [deploy and run them in the cloud](../walkthroughs/deploy-a-pipeline/). 2. Dive deeper into how dlt works by reading the [Using dlt](../general-usage) section. Some highlights: - [Set up "last value" incremental loading](../general-usage/incremental-loading#incremental_loading-with-last-value). - - Learn about data loading strategies: [append, replace and merge](../general-usage/incremental-loading). + - Learn about data loading strategies: [append, replace, and merge](../general-usage/incremental-loading). - [Connect the transformers to the resources](../general-usage/resource#feeding-data-from-one-resource-into-another) to load additional data or enrich it. - [Customize your data schema—set primary and merge keys, define column nullability, and specify data types](../general-usage/resource#define-schema). - [Create your resources dynamically from data](../general-usage/source#create-resources-dynamically). - [Transform your data before loading](../general-usage/resource#customize-resources) and see some [examples of customizations like column renames and anonymization](../general-usage/customising-pipelines/renaming_columns). - Employ data transformations using [SQL](../dlt-ecosystem/transformations/sql) or [Pandas](../dlt-ecosystem/transformations/sql). - [Pass config and credentials into your sources and resources](../general-usage/credentials). 
- - [Run in production: inspecting, tracing, retry policies and cleaning up](../running-in-production/running). + - [Run in production: inspecting, tracing, retry policies, and cleaning up](../running-in-production/running). - [Run resources in parallel, optimize buffers and local storage](../reference/performance.md) - [Use REST API client helpers](../general-usage/http/rest-client.md) to simplify working with REST APIs. -3. Explore [destinations](../dlt-ecosystem/destinations/) and [sources](../dlt-ecosystem/verified-sources/) provided by us and community. -4. Explore the [Examples](../examples) section to see how dlt can be used in real-world scenarios - - +3. Explore [destinations](../dlt-ecosystem/destinations/) and [sources](../dlt-ecosystem/verified-sources/) provided by us and the community. +4. Explore the [Examples](../examples) section to see how dlt can be used in real-world scenarios. diff --git a/docs/website/docs/tutorial/rest-api.md b/docs/website/docs/tutorial/rest-api.md index 3e214e0b55..114a60fdd0 100644 --- a/docs/website/docs/tutorial/rest-api.md +++ b/docs/website/docs/tutorial/rest-api.md @@ -76,7 +76,7 @@ Let's verify that the pipeline is working as expected. Run the following command python rest_api_pipeline.py ``` -You should see the output of the pipeline execution in the terminal. The output will also diplay the location of the DuckDB database file where the data is stored: +You should see the output of the pipeline execution in the terminal. The output will also display the location of the DuckDB database file where the data is stored: ```sh Pipeline rest_api_pokemon load step completed in 1.08 seconds @@ -100,7 +100,7 @@ dlt pipeline rest_api_pokemon show ``` The command opens a new browser window with the data browser application. `rest_api_pokemon` is the name of the pipeline defined in the `rest_api_pipeline.py` file. -You can explore the loaded data, run queries and see some pipeline execution details: +You can explore the loaded data, run queries, and see some pipeline execution details: ![Explore rest_api data in Streamlit App](https://dlt-static.s3.eu-central-1.amazonaws.com/images/docs-rest-api-tutorial-streamlit-screenshot.png) @@ -142,7 +142,7 @@ def load_pokemon() -> None: print(load_info) ``` -Here what's happening in the code: +Here's what's happening in the code: 1. With `dlt.pipeline()` we define a new pipeline named `rest_api_pokemon` with DuckDB as the destination and `rest_api_data` as the dataset name. 2. The `rest_api_source()` function creates a new REST API source object. @@ -174,9 +174,10 @@ config: RESTAPIConfig = { You may have noticed that we didn't specify any pagination configuration in the `rest_api_source()` function. That's because for REST APIs that follow best practices, dlt can automatically detect and handle pagination. Read more about [configuring pagination](../dlt-ecosystem/verified-sources/rest_api/basic#pagination) in the REST API source documentation. ::: + ## Appending, replacing, and merging loaded data -Try running the pipeline again with `python rest_api_pipeline.py`. You will notice that all the tables have data duplicated. This happens because by default, dlt appends the data to the destination table. In dlt you can control how the data is loaded into the destination table by setting the `write_disposition` parameter in the resource configuration. The possible values are: +Try running the pipeline again with `python rest_api_pipeline.py`. You will notice that all the tables have duplicated data. 
This happens because, by default, dlt appends the data to the destination table. In dlt, you can control how the data is loaded into the destination table by setting the `write_disposition` parameter in the resource configuration. The possible values are:
 - `append`: Appends the data to the destination table. This is the default.
 - `replace`: Replaces the data in the destination table with the new data.
 - `merge`: Merges the new data with the existing data in the destination table based on the primary key.
@@ -234,7 +235,7 @@ pokemon_source = rest_api_source(
             },
         },
         # For the `berry` and `location` resources, we keep
-        # the`replace` write disposition
+        # the `replace` write disposition
         "write_disposition": "replace",
     },
     "resources": [
@@ -306,7 +307,7 @@ load_info = pipeline.run(github_source())
 print(load_info)
 ```
 
-In this configuration, the `since` parameter is defined as a special incremental parameter. The `cursor_path` field specifies the JSON path to the field that will be used to fetch the updated data and we use the `initial_value` for the initial value for the incremental parameter. This value will be used in the first request to fetch the data.
+In this configuration, the `since` parameter is defined as a special incremental parameter. The `cursor_path` field specifies the JSON path to the field that will be used to fetch the updated data, and `initial_value` sets the starting value for the incremental parameter. This value will be used in the first request to fetch the data.
 
 When the pipeline runs, dlt will automatically update the `since` parameter with the latest value from the response data. This way, you can fetch only the new or updated data from the API.
 
@@ -319,4 +320,5 @@ Congratulations on completing the tutorial! You've learned how to set up a REST
 Interested in learning more about dlt? Here are some suggestions:
 
 - Learn more about the REST API source configuration in [REST API source documentation](../dlt-ecosystem/verified-sources/rest_api/)
-- Learn how to [create a custom source](./load-data-from-an-api.md) in the advanced tutorial
\ No newline at end of file
+- Learn how to [create a custom source](./load-data-from-an-api.md) in the advanced tutorial
+
diff --git a/docs/website/docs/tutorial/sql-database.md b/docs/website/docs/tutorial/sql-database.md
index 1a7702b637..edcfeebd03 100644
--- a/docs/website/docs/tutorial/sql-database.md
+++ b/docs/website/docs/tutorial/sql-database.md
@@ -42,7 +42,7 @@ After running this command, your project will have the following structure:
 
 Here’s what each file does:
 
-- `sql_database_pipeline.py`: This is the main script where you'll define your data pipeline. It contains several different examples for how you can configure your SQL Database pipeline.
+- `sql_database_pipeline.py`: This is the main script where you'll define your data pipeline. It contains several different examples of how you can configure your SQL Database pipeline.
 - `requirements.txt`: This file lists all the Python dependencies required for your project.
 - `.dlt/`: This directory contains the [configuration files](../general-usage/credentials/) for your project:
   - `secrets.toml`: This file stores your credentials, API keys, tokens, and other sensitive information.
@@ -52,6 +52,8 @@ Here’s what each file does:
 When deploying your pipeline in a production environment, managing all configurations with the TOML files might not be convenient. 
In this case, we highly recommend using environment variables or other [configuration providers](../general-usage/credentials/setup#available-config-providers) available in dlt to store secrets and configs instead. ::: + + ## 2. Configure the pipeline script With the necessary files in place, we can now start writing our pipeline script. The existing file `sql_database_pipeline.py` already contains many pre-configured example functions that can help you get started with different data loading scenarios. However, for the purpose of this tutorial, we will be writing a new function from scratch. @@ -60,7 +62,6 @@ With the necessary files in place, we can now start writing our pipeline script. Running the script as it is will execute the function `load_standalone_table_resource()`, so remember to comment out the function call from inside the main block. ::: - The following function will load the tables `family` and `genome`. ```py @@ -92,11 +93,11 @@ Explanation: - `sql_database()` is a [dlt source function](../general-usage/source) that iteratively loads the tables (in this example, `"family"` and `"genome"`) passed inside the `with_resource()` method. - `sql_table()` is a [dlt resource function](../general-usage/resource) that loads standalone tables. For example, if we wanted to only load the table `"family"`, then we could have done it using `sql_table(table="family")`. - `dlt.pipeline()` creates a `dlt` pipeline with the name `"sql_to_duckdb_pipeline"` with the destination DuckDB. -- `pipeline.run()` method loads the data into the destination. +- The `pipeline.run()` method loads the data into the destination. ## 3. Add credentials -To sucessfully connect to your SQL database, you will need to pass credentials into your pipeline. dlt automatically looks for this information inside the generated TOML files. +To successfully connect to your SQL database, you will need to pass credentials into your pipeline. dlt automatically looks for this information inside the generated TOML files. Simply paste the [connection details](https://docs.rfam.org/en/latest/database.html) inside `secrets.toml` as follows: ```toml @@ -114,8 +115,7 @@ Alternatively, you can also paste the credentials as a connection string: sources.sql_database.credentials="mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam" ``` -For more details on the credentials format and other connection methods read the section on [configuring connection to the SQL Database](https://dlthub.com/docs/dlt-ecosystem/verified-sources/sql_database#credentials-format). - +For more details on the credentials format and other connection methods, read the section on [configuring connection to the SQL Database](https://dlthub.com/docs/dlt-ecosystem/verified-sources/sql_database#credentials-format). ## 4. Install dependencies @@ -138,7 +138,7 @@ After performing steps 1-4, you should now be able to successfully run the pipel ```sh python sql_database_pipeline.py ``` -This will create the file `sql_to_duckdb_pipeline.duckdb` in your dlt project directory which contains the loaded data. +This will create the file `sql_to_duckdb_pipeline.duckdb` in your dlt project directory, which contains the loaded data. ## 6. Explore the data @@ -154,14 +154,13 @@ Next, run the following command to launch the data browser app: dlt pipeline sql_to_duckdb_pipeline show ``` -You can explore the loaded data, run queries and see some pipeline execution details. +You can explore the loaded data, run queries, and see some pipeline execution details. 
![streamlit-screenshot](https://storage.googleapis.com/dlt-blog-images/docs-sql-database-tutorial-streamlit-screenshot.png)
 
 ## 7. Append, replace, or merge loaded data
 
-Try running the pipeline again with `python sql_database_pipeline.py`. You will notice that
-all the tables have the data duplicated. This happens as dlt, by default, appends data to the destination tables in every load. This behavior can be adjusted by setting the `write_disposition` parameter inside the `pipeline.run()` method. The possible settings are:
+Try running the pipeline again with `python sql_database_pipeline.py`. You will notice that all the tables have duplicated data. This happens because dlt, by default, appends data to the destination tables in every load. This behavior can be adjusted by setting the `write_disposition` parameter inside the `pipeline.run()` method. The possible settings are:
 
 - `append`: Appends the data to the destination table. This is the default.
 - `replace`: Replaces the data in the destination table with the new data.
@@ -197,7 +196,7 @@ Run the pipeline again with `sql_database_pipeline.py`. This time, the data will
 
 When you want to update the existing data as new data is loaded, you can use the `merge` write disposition. This requires specifying a primary key for the table. The primary key is used to match the new data with the existing data in the destination table.
 
-In the previous example, we set `write_disposition="replace"` inside `pipeline.run()` which caused all the tables to be loaded with `replace`. However, it's also possible to define the `write_disposition` strategy separately for each tables using the `apply_hints` method. In the example below, we use `apply_hints` on each table to specify different primary keys for merge:
+In the previous example, we set `write_disposition="replace"` inside `pipeline.run()`, which caused all the tables to be loaded with `replace`. However, it's also possible to define the `write_disposition` strategy separately for each table using the `apply_hints` method. In the example below, we use `apply_hints` on each table to specify different primary keys for merge:
 
 ```py
 def load_tables_family_and_genome():
@@ -224,7 +223,7 @@ if __name__ == '__main__':
 
 ## 8. Load data incrementally
 
-Often you don't want to load the whole data in each load, but rather only the new or modified data. dlt makes this easy with [incremental loading](../general-usage/incremental-loading).
+Often, you don't want to load all the data on every run, but rather only the new or modified data. dlt makes this easy with [incremental loading](../general-usage/incremental-loading).
 
 In the example below, we configure the table `"family"` to load incrementally based on the column `"updated"`:
 
@@ -262,3 +261,4 @@ Interested in learning more about dlt? Here are some suggestions:
 - Learn more about the SQL Database source configuration in [the SQL Database source reference](../dlt-ecosystem/verified-sources/sql_database)
 - Learn more about different credential types in [Built-in credentials](../general-usage/credentials/complex_types#built-in-credentials)
 - Learn how to [create a custom source](./load-data-from-an-api.md) in the advanced tutorial
+