diff --git a/docs/website/docs/general-usage/credentials/config_providers.md b/docs/website/docs/general-usage/credentials/config_providers.md index c0dc459da0..fbdbfe78f2 100644 --- a/docs/website/docs/general-usage/credentials/config_providers.md +++ b/docs/website/docs/general-usage/credentials/config_providers.md @@ -7,34 +7,21 @@ keywords: [credentials, secrets.toml, secrets, config, configuration, environmen # Configuration Providers - -Configuration Providers in the context of the `dlt` library -refer to different sources from which configuration values -and secrets can be retrieved for a data pipeline. -These providers form a hierarchy, with each having its own -priority in determining the values for function arguments. +Configuration Providers, in the context of the `dlt` library, refer to different sources from which configuration values and secrets can be retrieved for a data pipeline. These providers form a hierarchy, each having its own priority in determining the values for function arguments. ## The provider hierarchy -If function signature has arguments that may be injected, `dlt` looks for the argument values in -providers. +If a function signature has arguments that may be injected, `dlt` looks for the argument values in providers. ### Providers -1. **Environment Variables**: At the top of the hierarchy are environment variables. - If a value for a specific argument is found in an environment variable, - dlt will use it and will not proceed to search in lower-priority providers. +1. **Environment Variables**: At the top of the hierarchy are environment variables. If a value for a specific argument is found in an environment variable, dlt will use it and will not proceed to search in lower-priority providers. -2. **Vaults (Airflow/Google/AWS/Azure)**: These are specialized providers that come - after environment variables. They can provide configuration values and secrets. - However, they typically focus on handling sensitive information. +2. **Vaults (Airflow/Google/AWS/Azure)**: These are specialized providers that come after environment variables. They can provide configuration values and secrets. However, they typically focus on handling sensitive information. -3. **`secrets.toml` and `config.toml` Files**: These files are used for storing both - configuration values and secrets. `secrets.toml` is dedicated to sensitive information, - while `config.toml` contains non-sensitive configuration data. +3. **`secrets.toml` and `config.toml` Files**: These files are used for storing both configuration values and secrets. `secrets.toml` is dedicated to sensitive information, while `config.toml` contains non-sensitive configuration data. -4. **Default Argument Values**: These are the values specified in the function's signature. - They have the lowest priority in the provider hierarchy. +4. **Default Argument Values**: These are the values specified in the function's signature. They have the lowest priority in the provider hierarchy. ### Example @@ -49,34 +36,27 @@ def google_sheets( ... ``` -In case of `google_sheets()` it will look -for: `spreadsheet_id`, `tab_names`, `credentials` and `only_strings` +In the case of `google_sheets()`, it will look for: `spreadsheet_id`, `tab_names`, `credentials`, and `only_strings`. Each provider has its own key naming convention, and dlt is able to translate between them. **The argument name is a key in the lookup**. -At the top of the hierarchy are Environment Variables, then `secrets.toml` and -`config.toml` files. 
Providers like Airflow/Google/AWS/Azure Vaults will be inserted **after** the Environment -provider but **before** TOML providers. +At the top of the hierarchy are Environment Variables, then `secrets.toml` and `config.toml` files. Providers like Airflow/Google/AWS/Azure Vaults will be inserted **after** the Environment provider but **before** TOML providers. -For example, if `spreadsheet_id` is found in environment variable `SPREADSHEET_ID`, `dlt` will not look in TOML files -and below. +For example, if `spreadsheet_id` is found in the environment variable `SPREADSHEET_ID`, `dlt` will not look in TOML files and below. -The values passed in the code **explicitly** are the **highest** in provider hierarchy. The **default values** -of the arguments have the **lowest** priority in the provider hierarchy. +The values passed in the code **explicitly** are the **highest** in the provider hierarchy. The **default values** of the arguments have the **lowest** priority in the provider hierarchy. :::info Explicit Args **>** ENV Variables **>** Vaults: Airflow etc. **>** `secrets.toml` **>** `config.toml` **>** Default Arg Values ::: -Secrets are handled only by the providers supporting them. Some providers support only -secrets (to reduce the number of requests done by `dlt` when searching sections). +Secrets are handled only by the providers supporting them. Some providers support only secrets (to reduce the number of requests done by `dlt` when searching sections). 1. `secrets.toml` and environment may hold both config and secret values. -1. `config.toml` may hold only config values, no secrets. -1. Various vaults providers hold only secrets, `dlt` skips them when looking for values that are not - secrets. +2. `config.toml` may hold only config values, no secrets. +3. Various vaults providers hold only secrets, `dlt` skips them when looking for values that are not secrets. :::info Context-aware providers will activate in the right environments i.e. on Airflow or AWS/GCP VMachines. @@ -86,22 +66,19 @@ Context-aware providers will activate in the right environments i.e. on Airflow ### TOML vs. Environment Variables -Providers may use different formats for the keys. `dlt` will translate the standard format where -sections and key names are separated by "." into the provider-specific formats. +Providers may use different formats for the keys. `dlt` will translate the standard format where sections and key names are separated by "." into the provider-specific formats. 1. For TOML, names are case-sensitive and sections are separated with ".". -1. For Environment Variables, all names are capitalized and sections are separated with double - underscore "__". +2. For Environment Variables, all names are capitalized and sections are separated with double underscore "__". -Example: When `dlt` evaluates the request `dlt.secrets["my_section.gcp_credentials"]` it must find -the `private_key` for Google credentials. It will look +Example: When `dlt` evaluates the request `dlt.secrets["my_section.gcp_credentials"]`, it must find the `private_key` for Google credentials. It will look: -1. first in env variable `MY_SECTION__GCP_CREDENTIALS__PRIVATE_KEY` and if not found, -1. in `secrets.toml` with key `my_section.gcp_credentials.private_key`. +1. first in the env variable `MY_SECTION__GCP_CREDENTIALS__PRIVATE_KEY` and if not found, +2. in `secrets.toml` with the key `my_section.gcp_credentials.private_key`. ### Environment provider -Looks for the values in the environment variables. 
+This provider looks for the values in the environment variables. ### TOML provider @@ -110,12 +87,10 @@ The TOML provider in dlt utilizes two TOML files: - `secrets.toml `- This file is intended for storing sensitive information, often referred to as "secrets". - `config.toml `- This file is used for storing configuration values. -By default, the `.gitignore` file in the project prevents `secrets.toml` from being added to -version control and pushed. However, `config.toml` can be freely added to version control. +By default, the `.gitignore` file in the project prevents `secrets.toml` from being added to version control and pushed. However, `config.toml` can be freely added to version control. :::info -**TOML provider always loads those files from `.dlt` folder** which is looked **relative to the -current Working Directory**. +The TOML provider always loads these files from the `.dlt` folder, which is looked for relative to the current working directory. ::: Example: If your working directory is `my_dlt_project` and your project has the following structure: @@ -128,14 +103,11 @@ my_dlt_project: |---- google_sheets.py ``` -and you run `python pipelines/google_sheets.py` then `dlt` will look for `secrets.toml` in -`my_dlt_project/.dlt/secrets.toml` and ignore the existing -`my_dlt_project/pipelines/.dlt/secrets.toml`. +and you run `python pipelines/google_sheets.py`, then `dlt` will look for `secrets.toml` in `my_dlt_project/.dlt/secrets.toml` and ignore the existing `my_dlt_project/pipelines/.dlt/secrets.toml`. -If you change your working directory to `pipelines` and run `python google_sheets.py` it will look for -`my_dlt_project/pipelines/.dlt/secrets.toml` as (probably) expected. +If you change your working directory to `pipelines` and run `python google_sheets.py`, it will look for `my_dlt_project/pipelines/.dlt/secrets.toml` as (probably) expected. :::caution -It's worth mentioning that the TOML provider also has the capability to read files from `~/.dlt/` -(located in the user's home directory) in addition to the local project-specific `.dlt` folder. -::: \ No newline at end of file +It's worth mentioning that the TOML provider also has the capability to read files from `~/.dlt/` (located in the user's home directory) in addition to the local project-specific `.dlt` folder. +::: + diff --git a/docs/website/docs/general-usage/credentials/config_specs.md b/docs/website/docs/general-usage/credentials/config_specs.md index 07e56b3e14..976178d660 100644 --- a/docs/website/docs/general-usage/credentials/config_specs.md +++ b/docs/website/docs/general-usage/credentials/config_specs.md @@ -18,7 +18,7 @@ service account credentials, while `ConnectionStringCredentials` handles databas ### Example -As an example, let's use `ConnectionStringCredentials` which represents a database connection +As an example, let's use `ConnectionStringCredentials`, which represents a database connection string. ```python @@ -29,7 +29,7 @@ def query(sql: str, dsn: ConnectionStringCredentials = dlt.secrets.value): ... ``` -The source above executes the `sql` against database defined in `dsn`. `ConnectionStringCredentials` +The source above executes the `sql` against the database defined in `dsn`. `ConnectionStringCredentials` makes sure you get the correct values with correct types and understands the relevant native form of the credentials. @@ -51,7 +51,7 @@ Example 2. Use the **native** form. dsn="postgres://loader:loader@localhost:5432/dlt_data" ``` -Example 3. 
Use the **mixed** form: the password is missing in explicit dsn and will be taken from the +Example 3. Use the mixed form: the password is missing in the explicit DSN and will be taken from the `secrets.toml`. ```toml @@ -66,7 +66,7 @@ query("SELECT * FROM customers", "postgres://loader@localhost:5432/dlt_data") query("SELECT * FROM customers", {"database": "dlt_data", "username": "loader"...}) ``` -## Built in credentials +## Built-in credentials We have some ready-made credentials you can reuse: @@ -141,7 +141,7 @@ it is a base class for [GcpOAuthCredentials](#gcpoauthcredentials). - [GcpOAuthCredentials](#gcpoauthcredentials). [Google Analytics verified source](https://github.com/dlt-hub/verified-sources/blob/master/sources/google_analytics/__init__.py): -the example how to use GCP Credentials. +the example of how to use GCP Credentials. #### GcpServiceAccountCredentials @@ -150,7 +150,7 @@ This class provides methods to retrieve native credentials for Google clients. ##### Usage -- You may just pass the `service.json` as string or dictionary (in code and via config providers). +- You may just pass the `service.json` as a string or dictionary (in code and via config providers). - Or default credentials will be used. ```python @@ -249,12 +249,12 @@ and `config.toml`: property_id = "213025502" ``` -In order for `auth()` method to succeed: +In order for the `auth()` method to succeed: -- You must provide valid `client_id` and `client_secret`, +- You must provide a valid `client_id` and `client_secret`, `refresh_token` and `project_id` in order to get a current **access token** and authenticate with OAuth. - Mind that the `refresh_token` must contain all the scopes that you require for your access. + Keep in mind that the `refresh_token` must contain all the scopes that you require for your access. - If `refresh_token` is not provided, and you run the pipeline from a console or a notebook, `dlt` will use InstalledAppFlow to run the desktop authentication flow. @@ -429,7 +429,7 @@ of credentials that derive from the common class, so you can handle it seamlessl This is used a lot in the `dlt` core and may become useful for complicated sources. -In fact, for each decorated function a spec is synthesized. In case of `google_sheets` following +In fact, for each decorated function a spec is synthesized. In the case of `google_sheets`, the following class is created: ```python @@ -437,7 +437,7 @@ from dlt.sources.config import configspec, with_config @configspec class GoogleSheetsConfiguration(BaseConfiguration): - tab_names: List[str] = None # manadatory + tab_names: List[str] = None # mandatory credentials: GcpServiceAccountCredentials = None # mandatory secret only_strings: Optional[bool] = False ``` @@ -465,4 +465,5 @@ and is meant to serve as a base class for handling various types of credentials. It defines methods for initializing credentials, converting them to native representations, and generating string representations while ensuring sensitive information is appropriately handled. -More information about this class can be found in the class docstrings. \ No newline at end of file +More information about this class can be found in the class docstrings. 
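As a quick illustration of that interface, the sketch below parses the native form of `ConnectionStringCredentials` into typed fields and renders it back. The import path and the `parse_native_representation` / `to_native_representation` method names are assumptions based on the examples above, so double-check them against your `dlt` version.

```python
from dlt.sources.credentials import ConnectionStringCredentials

# Parse the native form (a DSN string) into typed fields.
dsn = ConnectionStringCredentials()
dsn.parse_native_representation("postgres://loader:loader@localhost:5432/dlt_data")

# The individual fields are available as typed attributes.
print(dsn.drivername, dsn.username, dsn.host, dsn.port, dsn.database)

# Render the credentials back into their native representation (the DSN string).
print(dsn.to_native_representation())
```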
+ diff --git a/docs/website/docs/general-usage/credentials/configuration.md b/docs/website/docs/general-usage/credentials/configuration.md index 9b2d392883..851a8d0a96 100644 --- a/docs/website/docs/general-usage/credentials/configuration.md +++ b/docs/website/docs/general-usage/credentials/configuration.md @@ -8,12 +8,12 @@ keywords: [credentials, secrets.toml, secrets, config, configuration, environmen # Secrets and Configs Use secret and config values to pass access credentials and configure or fine-tune your pipelines without the need to modify your code. -When done right you'll be able to run the same pipeline script during development and in production. +When done correctly, you'll be able to run the same pipeline script during development and in production. **Configs**: - Configs refer to non-sensitive configuration data. These are settings, parameters, or options that define the behavior of a data pipeline. - - They can include things like file paths, database hosts and timeouts, API urls, performance settings, or any other settings that affect the pipeline's behavior. + - They can include things like file paths, database hosts and timeouts, API URLs, performance settings, or any other settings that affect the pipeline's behavior. **Secrets**: @@ -23,7 +23,7 @@ When done right you'll be able to run the same pipeline script during developmen ## Configure dlt sources and resources In the example below, the `google_sheets` source function is used to read selected tabs from Google Sheets. -It takes several arguments that specify the spreadsheet, the tab names and the Google credentials to be used when extracting data. +It takes several arguments that specify the spreadsheet, the tab names, and the Google credentials to be used when extracting data. ```python @dlt.source @@ -46,9 +46,9 @@ def google_sheets( tabs.append(dlt.resource(data, name=tab_name)) return tabs ``` -`dlt.source` decorator makes all arguments in `google_sheets` function signature configurable. +The `dlt.source` decorator makes all arguments in the `google_sheets` function signature configurable. `dlt.secrets.value` and `dlt.config.value` are special argument defaults that tell `dlt` that this -argument is required and must be passed explicitly or must exist in the configuration. Additionally +argument is required and must be passed explicitly or must exist in the configuration. Additionally, `dlt.secrets.value` tells `dlt` that an argument is a secret. In the example above: @@ -65,22 +65,22 @@ of a **source**) ### Allow `dlt` to pass the config and secrets automatically You are free to call the function above as usual and pass all the arguments in the code. You'll hardcode google credentials and [we do not recommend that](#do-not-pass-hardcoded-secrets). -Instead let `dlt` to do the work and leave it to [injection mechanism](#injection-mechanism) that looks for function arguments in the config files or environment variables and adds them to your explicit arguments during a function call. Below are two most typical examples: +Instead, let `dlt` do the work and leave it to the [injection mechanism](#injection-mechanism) that looks for function arguments in the config files or environment variables and adds them to your explicit arguments during a function call. Below are two most typical examples: 1. 
Pass spreadsheet id and tab names in the code, inject credentials from the secrets: ```python data_source = google_sheets("23029402349032049", ["tab1", "tab2"]) ``` - `credentials` value will be injected by the `@source` decorator (e.g. from `secrets.toml`). + The `credentials` value will be injected by the `@source` decorator (e.g., from `secrets.toml`). `spreadsheet_id` and `tab_names` take values from the call arguments. 2. Inject all the arguments from config / secrets ```python data_source = google_sheets() ``` - `credentials` value will be injected by the `@source` decorator (e.g. from **secrets.toml**). + The `credentials` value will be injected by the `@source` decorator (e.g., from **secrets.toml**). - `spreadsheet_id` and `tab_names` will be also injected by the `@source` decorator (e.g. from **config.toml**). + `spreadsheet_id` and `tab_names` will also be injected by the `@source` decorator (e.g., from **config.toml**). Where do the configs and secrets come from? By default, `dlt` looks in two **config providers**: @@ -101,21 +101,21 @@ Where do the configs and secrets come from? By default, `dlt` looks in two **con private_key = project_id = ``` - Note that **credentials** will be evaluated as dictionary containing **client_email**, **private_key** and **project_id** as keys. It is standard TOML behavior. + Note that `credentials` will be evaluated as a dictionary containing `client_email`, `private_key`, and `project_id` as keys. It is standard TOML behavior. - [Environment Variables](config_providers#environment-provider): ```python CREDENTIALS= SPREADSHEET_ID=1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580 TAB_NAMES=tab1,tab2 ``` - We pass the JSON contents of `service.json` file to `CREDENTIALS` and we specify tab names as comma-delimited values. Environment variables are always in **upper case**. + We pass the JSON contents of the `service.json` file to `CREDENTIALS` and we specify tab names as comma-delimited values. Environment variables are always in **upper case**. :::tip -There are many ways you can organize your configs and secrets. The example above is the simplest default **layout** that `dlt` supports. In more complicated cases (i.e. a single configuration is shared by many pipelines with different sources and destinations) you may use more [explicit layouts](#secret-and-config-values-layout-and-name-lookup). +There are many ways you can organize your configs and secrets. The example above is the simplest default **layout** that `dlt` supports. In more complicated cases (i.e., a single configuration is shared by many pipelines with different sources and destinations) you may use more [explicit layouts](#secret-and-config-values-layout-and-name-lookup). ::: :::caution -**[TOML provider](config_providers#toml-provider) always loads `secrets.toml` and `config.toml` files from `.dlt` folder** which is looked relative to the +**[TOML provider](config_providers#toml-provider) always loads `secrets.toml` and `config.toml` files from the `.dlt` folder** which is looked relative to the **current [Working Directory](https://en.wikipedia.org/wiki/Working_directory)**. TOML provider also has the capability to read files from `~/.dlt/` (located in the user's [Home Directory](https://en.wikipedia.org/wiki/Home_directory)). ::: @@ -158,10 +158,10 @@ information on what source/resource expects. Doing so provides several benefits: 1. You'll never receive invalid data types in your code. -1. `dlt` will automatically parse and coerce types for you. 
In our example, you do not need to parse list of tabs or credentials dictionary yourself. +1. `dlt` will automatically parse and coerce types for you. In our example, you do not need to parse the list of tabs or credentials dictionary yourself. 1. We can generate nice sample config and secret files for your source. -1. You can request [built-in and custom credentials](config_specs.md) (i.e. connection strings, AWS / GCP / Azure credentials). -1. You can specify a set of possible types via `Union` i.e. OAuth or API Key authorization. +1. You can request [built-in and custom credentials](config_specs.md) (i.e., connection strings, AWS / GCP / Azure credentials). +1. You can specify a set of possible types via `Union` i.e., OAuth or API Key authorization. ```python @dlt.source @@ -180,11 +180,11 @@ Now: 1. You will get actual Google credentials (see [GCP Credential Configuration](config_specs#gcp-credentials)), and your users can pass them in many different forms. -In case of `GcpServiceAccountCredentials`: +In the case of `GcpServiceAccountCredentials`: -- You may just pass the `service.json` as string or dictionary (in code and via config providers). +- You may just pass the `service.json` as a string or dictionary (in code and via config providers). - You may pass a connection string (used in SQL Alchemy) (in code and via config providers). -- If you do not pass any credentials, the default credentials are used (i.e. those present on Cloud Function runner) +- If you do not pass any credentials, the default credentials are used (i.e., those present on Cloud Function runner) ## Read configs and secrets yourself `dlt.secrets` and `dlt.config` provide dictionary-like access to configuration values and secrets, respectively. @@ -205,10 +205,10 @@ request value cast to a desired type. For example: ```python credentials = dlt.secrets.get("my_section.gcp_credentials", GcpServiceAccountCredentials) ``` -Creates `GcpServiceAccountCredentials` instance out of values (typically a dictionary) under **my_section.gcp_credentials** key. +Creates `GcpServiceAccountCredentials` instance out of values (typically a dictionary) under `my_section.gcp_credentials` key. ### Write configs and secrets in code -**dlt.config** and **dlt.secrets** can be also used as setters. For example: +`dlt.config` and `dlt.secrets` can also be used as setters. For example: ```python dlt.config["sheet_id"] = "23029402349032049" dlt.secrets["destination.postgres.credentials"] = BaseHook.get_connection('postgres_dsn').extra @@ -220,7 +220,7 @@ Will mock the **toml** provider to desired values. Config and secret values are added to the function arguments when a function decorated with `@dlt.source` or `@dlt.resource` is called. -The signature of such function (i.e. `google_sheets` above) is **also a specification of the configuration**. +The signature of such function (i.e., `google_sheets` above) is also a **specification of the configuration**. During runtime `dlt` takes the argument names in the signature and supplies (`inject`) the required values via various config providers. The injection rules are: @@ -236,23 +236,23 @@ The injection rules are: (or explicitly passed). If they are not found by the config providers, the code raises exception. The code in the functions always receives those arguments. 
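For example, here is a minimal sketch of how an injected argument can be supplied through an environment variable and read back with `dlt.config`; the key and value below are purely illustrative.

```python
import os
import dlt

# The argument name is the key; sections are upper-cased and joined with "__"
# in environment variable names. All section prefixes are optional, so a plain
# SPREADSHEET_ID variable would be found as well.
os.environ["SOURCES__GOOGLE_SHEETS__SPREADSHEET_ID"] = "1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580"

# The same value is visible through dlt.config using the "."-separated key and
# would be injected into google_sheets() when the source is called.
print(dlt.config["sources.google_sheets.spreadsheet_id"])
```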
-Additionally `dlt.secrets.value` tells `dlt` that supplied value is a secret, and it will be injected +Additionally, `dlt.secrets.value` tells `dlt` that supplied value is a secret, and it will be injected only from secure config providers. ## Secret and config values layout and name lookup `dlt` uses a layout of hierarchical sections to organize the config and secret values. This makes configurations and secrets easy to manage, and disambiguate values with the same keys by placing -them in the different sections. +them in different sections. :::note If you know how TOML files are organized -> this is the same concept! ::: -A lot of config values are dictionaries themselves (i.e. most of the credentials) and you want the +A lot of config values are dictionaries themselves (i.e., most of the credentials) and you want the values corresponding to one component to be close together. -You can have a separate credentials for your destinations and each of the sources your pipeline uses, +You can have separate credentials for your destinations and each of the sources your pipeline uses, if you have many pipelines in a single project, you can group them in separate sections. Here is the simplest default layout for our `google_sheets` example. @@ -325,7 +325,7 @@ tabs=["tab1", "tab2"] `dlt` arranges the sections into **default layout** that is expected by injection mechanism. This layout makes it easy to configure simple cases but also provides a room for more explicit sections and -complex cases i.e. having several sources with different credentials or even hosting several pipelines +complex cases i.e., having several sources with different credentials or even hosting several pipelines in the same project sharing the same config and credentials. ``` @@ -377,111 +377,4 @@ project_id = client_email = private_key = project_id = -``` - -Now when `dlt` looks for destination credentials, it will start with `destination.bigquery.credentials`, eliminate `bigquery` and stop at `destination.credentials`. - -When looking for `sources` credentials it will start with `sources.google_sheets.google_sheets.credentials`, eliminate `google_sheets` twice and stop at `sources.credentials` (we assume that `google_sheets` source was defined in `google_sheets` python module) - -Example: let's be even more explicit and use a full section path possible. - -```toml -# google sheet credentials -[sources.google_sheets.credentials] -client_email = -private_key = -project_id = - -# google analytics credentials -[sources.google_analytics.credentials] -client_email = -private_key = -project_id = - -# bigquery credentials -[destination.bigquery.credentials] -client_email = -private_key = -project_id = -``` - -Now we can separate credentials for different sources as well. - -**Rule 2:** You can use your pipeline name to have separate configurations for each pipeline in your -project. - -Pipeline created/obtained with `dlt.pipeline()` creates a global and optional namespace with the -value of `pipeline_name`. All config values will be looked with pipeline name first and then again -without it. - -Example: the pipeline is named `ML_sheets`. - -```toml -[ML_sheets.credentials] -client_email = -private_key = -project_id = -``` - -or maximum path: - -```toml -[ML_sheets.sources.google_sheets.credentials] -client_email = -private_key = -project_id = -``` - -### The `sources` section - -Config and secrets for decorated sources and resources are kept in -`sources..` section. **All sections are optional during lookup**. 
For example, -if source module is named `pipedrive` and the function decorated with `@dlt.source` is -`deals(api_key: str=...)` then `dlt` will look for API key in: - -1. `sources.pipedrive.deals.api_key` -1. `sources.pipedrive.api_key` -1. `sources.api_key` -1. `api_key` - -Step 2 in a search path allows all the sources/resources in a module to share the same set of -credentials. - -Also look at the [following test](https://github.com/dlt-hub/dlt/blob/devel/tests/extract/test_decorators.py#L303) `test_source_sections`. - -## Understanding the exceptions - -Now we can finally understand the `ConfigFieldMissingException`. - -Let's run `chess.py` example without providing the password: - -``` -$ CREDENTIALS="postgres://loader@localhost:5432/dlt_data" python chess.py -... -dlt.common.configuration.exceptions.ConfigFieldMissingException: Following fields are missing: ['password'] in configuration with spec PostgresCredentials - for field "password" config providers and keys were tried in following order: - In Environment Variables key CHESS_GAMES__DESTINATION__POSTGRES__CREDENTIALS__PASSWORD was not found. - In Environment Variables key CHESS_GAMES__DESTINATION__CREDENTIALS__PASSWORD was not found. - In Environment Variables key CHESS_GAMES__CREDENTIALS__PASSWORD was not found. - In secrets.toml key chess_games.destination.postgres.credentials.password was not found. - In secrets.toml key chess_games.destination.credentials.password was not found. - In secrets.toml key chess_games.credentials.password was not found. - In Environment Variables key DESTINATION__POSTGRES__CREDENTIALS__PASSWORD was not found. - In Environment Variables key DESTINATION__CREDENTIALS__PASSWORD was not found. - In Environment Variables key CREDENTIALS__PASSWORD was not found. - In secrets.toml key destination.postgres.credentials.password was not found. - In secrets.toml key destination.credentials.password was not found. - In secrets.toml key credentials.password was not found. -Please refer to https://dlthub.com/docs/general-usage/credentials for more information -``` - -It tells you exactly which paths `dlt` looked at, via which config providers and in which order. - -In the example above: - -1. First it looked in a big section `chess_games` which is name of the pipeline. -1. In each case it starts with full paths and goes to minimum path `credentials.password`. -1. First it looks into `environ` then in `secrets.toml`. It displays the exact keys tried. -1. Note that `config.toml` was skipped! It may not contain any secrets. -Read more about [Provider Hierarchy](./config_providers). \ No newline at end of file diff --git a/docs/website/docs/general-usage/customising-pipelines/pseudonymizing_columns.md b/docs/website/docs/general-usage/customising-pipelines/pseudonymizing_columns.md index 3f665bd0fb..379fc937cf 100644 --- a/docs/website/docs/general-usage/customising-pipelines/pseudonymizing_columns.md +++ b/docs/website/docs/general-usage/customising-pipelines/pseudonymizing_columns.md @@ -4,12 +4,9 @@ description: Pseudonymizing (or anonymizing) columns by replacing the special ch keywords: [pseudonymize, anonymize, columns, special characters] --- -# Pseudonymizing columns +# Pseudonymizing Columns -Pseudonymization is a deterministic way to hide personally identifiable info (PII), enabling us to -consistently achieve the same mapping. If instead you wish to anonymize, you can delete the data, or -replace it with a constant. 
In the example below, we create a dummy source with a PII column called -"name", which we replace with deterministic hashes (i.e. replacing the German umlaut). +Pseudonymization is a deterministic way to hide personally identifiable information (PII), enabling us to consistently achieve the same mapping. If you wish to anonymize instead, you can delete the data or replace it with a constant. In the example below, we create a dummy source with a PII column called "name", which we replace with deterministic hashes (i.e., replacing the German umlaut). ```python import dlt @@ -25,9 +22,9 @@ def dummy_source(prefix: str = None): def pseudonymize_name(doc): ''' - Pseudonmyisation is a deterministic type of PII-obscuring + Pseudonymization is a deterministic type of PII-obscuring. Its role is to allow identifying users by their hash, - without revealing the underlying info. + without revealing the underlying information. ''' # add a constant salt to generate salt = 'WI@N57%zZrmk#88c' @@ -46,7 +43,7 @@ for row in dummy_source().dummy_data.add_map(pseudonymize_name): #{'id': 1, 'name': '92d3972b625cbd21f28782fb5c89552ce1aa09281892a2ab32aee8feeb3544a1'} #{'id': 2, 'name': '443679926a7cff506a3b5d5d094dc7734861352b9e0791af5d39db5a7356d11a'} -# Or create an instance of the data source, modify the resource and run the source. +# Or create an instance of the data source, modify the resource, and run the source. # 1. Create an instance of the source so you can edit it. data_source = dummy_source() @@ -59,3 +56,5 @@ for row in data_source: pipeline = dlt.pipeline(pipeline_name='example', destination='bigquery', dataset_name='normalized_data') load_info = pipeline.run(data_source) ``` + + diff --git a/docs/website/docs/general-usage/customising-pipelines/removing_columns.md b/docs/website/docs/general-usage/customising-pipelines/removing_columns.md index 8493ffaec5..f331d5658a 100644 --- a/docs/website/docs/general-usage/customising-pipelines/removing_columns.md +++ b/docs/website/docs/general-usage/customising-pipelines/removing_columns.md @@ -6,9 +6,7 @@ keywords: [deleting, removing, columns, drop] # Removing columns -Removing columns before loading data into a database is a reliable method to eliminate sensitive or -unnecessary fields. For example, in the given scenario, a source is created with a "country_id" column, -which is then excluded from the database before loading. +Removing columns before loading data into a database is a reliable method to eliminate sensitive or unnecessary fields. For example, in the given scenario, a source is created with a "country_id" column, which is then excluded from the database before loading. Let's create a sample pipeline demonstrating the process of removing a column. @@ -27,7 +25,7 @@ Let's create a sample pipeline demonstrating the process of removing a column. return dummy_data() ``` - This function creates three columns `id`, `name` and `country_code`. + This function creates three columns: `id`, `name`, and `country_code`. 1. Next, create a function to filter out columns from the data before loading it into a database as follows: @@ -75,7 +73,7 @@ Let's create a sample pipeline demonstrating the process of removing a column. #{'id': 2, 'name': 'Jane Washington 2'} ``` -1. At last, create a pipeline: +1. 
Finally, create a pipeline: ```python # Integrating with a DLT pipeline diff --git a/docs/website/docs/general-usage/customising-pipelines/renaming_columns.md b/docs/website/docs/general-usage/customising-pipelines/renaming_columns.md index e58dae6d9d..44f1bd0a5c 100644 --- a/docs/website/docs/general-usage/customising-pipelines/renaming_columns.md +++ b/docs/website/docs/general-usage/customising-pipelines/renaming_columns.md @@ -8,14 +8,12 @@ keywords: [renaming, columns, special characters] ## Renaming columns by replacing the special characters -In the example below, we create a dummy source with special characters in the name. We then write a -function that we intend to apply to the resource to modify its output (i.e. replacing the German -umlaut): `replace_umlauts_in_dict_keys`. +In the example below, we create a dummy source with special characters in the name. We then write a function that we intend to apply to the resource to modify its output (i.e., replacing the German umlaut): `replace_umlauts_in_dict_keys`. ```python import dlt -# create a dummy source with umlauts (special characters) in key names (um) +# Create a dummy source with umlauts (special characters) in key names (um) @dlt.source def dummy_source(prefix: str = None): @dlt.resource @@ -53,3 +51,5 @@ for row in data_source: # {'Objekt_0': {'Groesse': 0, 'Aequivalenzpruefung': True}} # ... ``` + + diff --git a/docs/website/docs/general-usage/data-enrichments/currency_conversion_data_enrichment.md b/docs/website/docs/general-usage/data-enrichments/currency_conversion_data_enrichment.md index 6b09510f68..dcf3272038 100644 --- a/docs/website/docs/general-usage/data-enrichments/currency_conversion_data_enrichment.md +++ b/docs/website/docs/general-usage/data-enrichments/currency_conversion_data_enrichment.md @@ -7,16 +7,16 @@ keywords: [data enrichment, currency conversion, latest market rates] # Data enrichment part two: Currency conversion data enrichment Currency conversion data enrichment means adding additional information to currency-related data. -Often, you have a data set of monetary value in one currency. For various reasons such as reporting, +Often, you have a dataset of monetary value in one currency. For various reasons such as reporting, analysis, or global operations, it may be necessary to convert these amounts into different currencies. ## Currency conversion process -Here is step-by-step process for currency conversion data enrichment: +Here is a step-by-step process for currency conversion data enrichment: 1. Define base and target currencies. e.g., USD (base) to EUR (target). 1. Obtain current exchange rates from a reliable source like a financial data API. -1. Convert the monetary values at obtained exchange rates. +1. Convert the monetary values at the obtained exchange rates. 1. Include metadata like conversion rate, date, and time. 1. Save the updated dataset in a data warehouse or lake using a data pipeline. @@ -24,18 +24,18 @@ We use the [ExchangeRate-API](https://app.exchangerate-api.com/) to fetch the la conversion rates. However, you can use any service you prefer. :::note -ExchangeRate-API free tier offers 1500 free calls monthly. For production, consider +ExchangeRate-API's free tier offers 1500 free calls monthly. For production, consider upgrading to a higher plan. 
::: -## Creating data enrichment pipeline +## Creating a data enrichment pipeline You can either follow the example in the linked Colab notebook or follow this documentation to create the currency conversion data enrichment pipeline. ### A. Colab notebook -The Colab notebook combines three data enrichment processes for a sample dataset, it's second part +The Colab notebook combines three data enrichment processes for a sample dataset; its second part contains "Data enrichment part two: Currency conversion data enrichment". Here's the link to the notebook: @@ -53,13 +53,13 @@ currency_conversion_enrichment/ └── currency_enrichment_pipeline.py ``` -### 1. Creating resource +### 1. Creating a resource `dlt` works on the principle of [sources](../../general-usage/source.md) and [resources.](../../general-usage/resource.md) 1. The last part of our data enrichment ([part one](../../general-usage/data-enrichments/user_agent_device_data_enrichment.md)) - involved enriching the data with user-agent device data. This included adding two new columns to the dataset as folows: + involved enriching the data with user-agent device data. This included adding two new columns to the dataset as follows: - `device_price_usd`: average price of the device in USD. @@ -109,7 +109,7 @@ the `dlt` [state](../../general-usage/state.md). The first step is to register on [ExhangeRate-API](https://app.exchangerate-api.com/) and obtain the API token. -1. In the `.dlt`folder, there's a file called `secrets.toml`. It's where you store sensitive +1. In the `.dlt` folder, there's a file called `secrets.toml`. It's where you store sensitive information securely, like access tokens. Keep this file safe. Here's its format for service account authentication: @@ -200,7 +200,7 @@ API token. processing. `Transformers` are a form of `dlt resource` that takes input from other resources - via `data_from` argument to enrich or transform the data. + via the `data_from` argument to enrich or transform the data. [Click here.](../../general-usage/resource.md#process-resources-with-dlttransformer) Conversely, `add_map` used to customize a resource applies transformations at an item level @@ -264,3 +264,4 @@ API token. For example, the "pipeline_name" for the above pipeline example is `data_enrichment_two`; you can use any custom name instead. + diff --git a/docs/website/docs/general-usage/data-enrichments/url-parser-data-enrichment.md b/docs/website/docs/general-usage/data-enrichments/url-parser-data-enrichment.md index f4578d065f..84120f4483 100644 --- a/docs/website/docs/general-usage/data-enrichments/url-parser-data-enrichment.md +++ b/docs/website/docs/general-usage/data-enrichments/url-parser-data-enrichment.md @@ -6,28 +6,28 @@ keywords: [data enrichment, url parser, referer data enrichment] # Data enrichment part three: URL parser data enrichment -URL parser data enrichment is extracting various URL components to gain additional insights and +URL parser data enrichment involves extracting various URL components to gain additional insights and context about the URL. This extracted information can be used for data analysis, marketing, SEO, and more. ## URL parsing process -Here is step-by-step process for URL parser data enrichment : +Here is the step-by-step process for URL parser data enrichment: -1. Get the URL data that is needed to be parsed from a source or create one. -1. Send the URL data to an API like [URL Parser API](https://urlparse.com/). -1. Get the parsed URL data. -1. 
Include metadata like conversion rate, date, and time. -1. Save the updated dataset in a data warehouse or lake using a data pipeline. +1. Get the URL data that needs to be parsed from a source or create one. +2. Send the URL data to an API like [URL Parser API](https://urlparse.com/). +3. Get the parsed URL data. +4. Include metadata like conversion rate, date, and time. +5. Save the updated dataset in a data warehouse or lake using a data pipeline. We use **[URL Parse API](https://urlparse.com/)** to extract the information about the URL. However, you can use any API you prefer. :::tip -`URL Parse API` is free, with 1000 requests/hour limit, which can be increased on request. +`URL Parse API` is free, with a 1000 requests/hour limit, which can be increased upon request. ::: -By default the URL Parse API will return a JSON response like: +By default, the URL Parse API will return a JSON response like: ```text { @@ -51,7 +51,7 @@ By default the URL Parse API will return a JSON response like: } ``` -## Creating data enrichment pipeline +## Creating a data enrichment pipeline You can either follow the example in the linked Colab notebook or follow this documentation to create the URL-parser data enrichment pipeline. @@ -80,7 +80,7 @@ url_parser_enrichment/ └── url_enrichment_pipeline.py ``` -### 1. Creating resource +### 1. Creating a resource `dlt` works on the principle of [sources](../../general-usage/source.md) and [resources.](../../general-usage/resource.md) @@ -91,7 +91,7 @@ different tracking services. Let's examine a synthetic dataset created for this article. It includes: -- `user_id`: Web trackers typically assign unique ID to users for tracking their journeys and +- `user_id`: Web trackers typically assign unique IDs to users for tracking their journeys and interactions over time. - `device_name`: User device information helps in understanding the user base's device. @@ -140,7 +140,7 @@ Here's the resource that yields the sample data as discussed above: ### 2. Create `url_parser` function We use a free service called [URL Parse API](https://urlparse.com/), to parse the urls. You don’t -need to register to use this service neither get an API key. +need to register to use this service or get an API key. 1. Create a `url_parser` function as follows: ```python @@ -185,7 +185,7 @@ need to register to use this service neither get an API key. processing. `Transformers` are a form of `dlt resource` that takes input from other resources - via `data_from` argument to enrich or transform the data. + via the `data_from` argument to enrich or transform the data. [Click here.](../../general-usage/resource.md#process-resources-with-dlttransformer) Conversely, `add_map` used to customize a resource applies transformations at an item level @@ -222,7 +222,7 @@ need to register to use this service neither get an API key. ) ``` - This will execute the `url_parser` function with the tracked data and return parsed URL. + This will execute the `url_parser` function with the tracked data and return the parsed URL. ::: ### Run the pipeline @@ -248,3 +248,4 @@ need to register to use this service neither get an API key. For example, the "pipeline_name" for the above pipeline example is `data_enrichment_three`; you can use any custom name instead. 
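To recap, a minimal sketch of the complete `url_enrichment_pipeline.py` script could look like the following. Here `tracked_data` is assumed to be the resource from step 1 and `url_parser` the function from step 2; the dataset and table names are placeholders.

```python
import dlt

# Assumes `tracked_data` (step 1) and `url_parser` (step 2) are defined above.
pipeline = dlt.pipeline(
    pipeline_name="data_enrichment_three",
    destination="duckdb",
    dataset_name="url_parser_data",  # placeholder dataset name
)

# add_map applies url_parser to every item yielded by the resource.
load_info = pipeline.run(
    tracked_data().add_map(url_parser),
    table_name="parsed_urls",  # placeholder table name
)
print(load_info)
```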
+ diff --git a/docs/website/docs/general-usage/data-enrichments/user_agent_device_data_enrichment.md b/docs/website/docs/general-usage/data-enrichments/user_agent_device_data_enrichment.md index 8b33a852a8..598f1328ee 100644 --- a/docs/website/docs/general-usage/data-enrichments/user_agent_device_data_enrichment.md +++ b/docs/website/docs/general-usage/data-enrichments/user_agent_device_data_enrichment.md @@ -6,34 +6,26 @@ keywords: [data enrichment, user-agent data, device enrichment] # Data enrichment part one: User-agent device data enrichment -Data enrichment enhances raw data with valuable information from multiple sources, increasing its -analytical and decision-making value. +Data enrichment enhances raw data with valuable information from multiple sources, increasing its analytical and decision-making value. -This part covers enriching sample data with device price. Understanding the price segment -of the device that the user used to access your service can be helpful in personalized marketing, -customer segmentation, and many more. +This part covers enriching sample data with device price. Understanding the price segment of the device that the user used to access your service can be helpful in personalized marketing, customer segmentation, and many more. -This documentation will discuss how to enrich the user device information with the average market -price. +This documentation will discuss how to enrich the user device information with the average market price. ## Setup Guide -We use SerpAPI to retrieve device prices using Google Shopping, but alternative services or APIs are -viable. +We use SerpAPI to retrieve device prices using Google Shopping, but alternative services or APIs are viable. :::note -SerpAPI free tier offers 100 free calls monthly. For production, consider upgrading to a higher -plan. +SerpAPI's free tier offers 100 free calls monthly. For production, consider upgrading to a higher plan. ::: -## Creating data enrichment pipeline -You can either follow the example in the linked Colab notebook or follow this documentation to -create the user-agent device data enrichment pipeline. +## Creating a data enrichment pipeline +You can either follow the example in the linked Colab notebook or follow this documentation to create the user-agent device data enrichment pipeline. ### A. Colab notebook -The Colab notebook combines three data enrichment processes for a sample dataset, starting with "Data -enrichment part one: User-agent device data". +The Colab notebook combines three data enrichment processes for a sample dataset, starting with "Data enrichment part one: User-agent device data". Here's the link to the notebook: **[Colab Notebook](https://colab.research.google.com/drive/1ZKEkf1LRSld7CWQFS36fUXjhJKPAon7P?usp=sharing).** @@ -47,19 +39,15 @@ user_device_enrichment/ │ └── secrets.toml └── device_enrichment_pipeline.py ``` -### 1. Creating resource +### 1. Creating a resource - `dlt` works on the principle of [sources](https://dlthub.com/docs/general-usage/source) - and [resources.](https://dlthub.com/docs/general-usage/resource) + `dlt` works on the principle of [sources](https://dlthub.com/docs/general-usage/source) and [resources.](https://dlthub.com/docs/general-usage/resource) - This data resource yields data typical of what many web analytics and - tracking tools can collect. However, the specifics of what data is collected - and how it's used can vary significantly among different tracking services. 
+ This data resource yields data typical of what many web analytics and tracking tools can collect. However, the specifics of what data is collected and how it's used can vary significantly among different tracking services. Let's examine a synthetic dataset created for this article. It includes: - `user_id`: Web trackers typically assign unique ID to users for - tracking their journeys and interactions over time. + `user_id`: Web trackers typically assign a unique ID to users for tracking their journeys and interactions over time. `device_name`: User device information helps in understanding the user base's device. @@ -107,16 +95,11 @@ user_device_enrichment/ ### 2. Create `fetch_average_price` function -This particular function retrieves the average price of a device by utilizing SerpAPI and Google -shopping listings. To filter the data, the function uses `dlt` state, and only fetches prices -from SerpAPI for devices that have not been updated in the most recent run or for those that were -loaded more than 180 days in the past. +This particular function retrieves the average price of a device by utilizing SerpAPI and Google shopping listings. To filter the data, the function uses `dlt` state, and only fetches prices from SerpAPI for devices that have not been updated in the most recent run or for those that were loaded more than 180 days in the past. The first step is to register on [SerpAPI](https://serpapi.com/) and obtain the API token key. -1. In the `.dlt`folder, there's a file called `secrets.toml`. It's where you store sensitive - information securely, like access tokens. Keep this file safe. Here's its format for service - account authentication: +1. In the `.dlt` folder, there's a file called `secrets.toml`. It's where you store sensitive information securely, like access tokens. Keep this file safe. Here's its format for service account authentication: ```python [sources] @@ -230,19 +213,11 @@ The first step is to register on [SerpAPI](https://serpapi.com/) and obtain the - Transformer function - The `dlt` library's `transformer` and `add_map` functions serve distinct purposes in data - processing. + The `dlt` library's `transformer` and `add_map` functions serve distinct purposes in data processing. - `Transformers` used to process a resource and are ideal for post-load data transformations in a - pipeline, compatible with tools like `dbt`, the `dlt SQL client`, or Pandas for intricate data - manipulation. To read more: - [Click here.](../../general-usage/resource#process-resources-with-dlttransformer) + [Transformers](../../general-usage/resource#process-resources-with-dlttransformer) are used to process a resource and are ideal for post-load data transformations in a pipeline, compatible with tools like `dbt`, the `dlt SQL client`, or Pandas for intricate data manipulation. - Conversely, `add_map` used to customize a resource applies transformations at an item level - within a resource. It's useful for tasks like anonymizing individual data records. More on this - can be found under - [Customize resources](../../general-usage/resource#customize-resources) in the - documentation. + Conversely, `add_map` is used to customize a resource and applies transformations at an item level within a resource. It's useful for tasks like anonymizing individual data records. More on this can be found under [customize resources](../../general-usage/resource#customize-resources) in the documentation. 1. 
Here, we create the pipeline and use the `add_map` functionality: @@ -262,9 +237,7 @@ The first step is to register on [SerpAPI](https://serpapi.com/) and obtain the ``` :::info - Please note that the same outcome can be achieved by using the transformer function. To - do so, you need to add the transformer decorator at the top of the `fetch_average_price` function. - For `pipeline.run`, you can use the following code: + Please note that the same outcome can be achieved by using the transformer function. To do so, you need to add the transformer decorator at the top of the `fetch_average_price` function. For `pipeline.run`, you can use the following code: ```python # using fetch_average_price as a transformer function @@ -274,14 +247,12 @@ The first step is to register on [SerpAPI](https://serpapi.com/) and obtain the ) ``` - This will execute the `fetch_average_price` function with the tracked data and return the average - price. + This will execute the `fetch_average_price` function with the tracked data and return the average price. ::: ### Run the pipeline -1. Install necessary dependencies for the preferred - [destination](https://dlthub.com/docs/dlt-ecosystem/destinations/), For example, duckdb: +1. Install necessary dependencies for the preferred [destination](https://dlthub.com/docs/dlt-ecosystem/destinations/), For example, DuckDB: ``` pip install dlt[duckdb] @@ -299,7 +270,5 @@ The first step is to register on [SerpAPI](https://serpapi.com/) and obtain the dlt pipeline show ``` - For example, the "pipeline_name" for the above pipeline example is `data_enrichment_one`; you can use - any custom name instead. - + For example, the "pipeline_name" for the above pipeline example is `data_enrichment_one`; you can use any custom name instead. diff --git a/docs/website/docs/general-usage/destination-tables.md b/docs/website/docs/general-usage/destination-tables.md index 8e1f771e47..38f4791b06 100644 --- a/docs/website/docs/general-usage/destination-tables.md +++ b/docs/website/docs/general-usage/destination-tables.md @@ -118,7 +118,7 @@ load_info = pipeline.run(data, table_name="users") ``` Running this pipeline will create two tables in the destination, `users` and `users__pets`. The -`users` table will contain the top level data, and the `users__pets` table will contain the child +`users` table will contain the top-level data, and the `users__pets` table will contain the child data. Here is what the tables may look like: **mydata.users** @@ -141,21 +141,20 @@ creating and linking children and parent tables. This is how it works: -1. Each row in all (top level and child) data tables created by `dlt` contains UNIQUE column named +1. Each row in all (top-level and child) data tables created by `dlt` contains a UNIQUE column named `_dlt_id`. -1. Each child table contains FOREIGN KEY column `_dlt_parent_id` linking to a particular row +2. Each child table contains a FOREIGN KEY column `_dlt_parent_id` linking to a particular row (`_dlt_id`) of a parent table. -1. Rows in child tables come from the lists: `dlt` stores the position of each item in the list in +3. Rows in child tables come from the lists: `dlt` stores the position of each item in the list in `_dlt_list_idx`. -1. For tables that are loaded with the `merge` write disposition, we add a ROOT KEY column - `_dlt_root_id`, which links child table to a row in top level table. - +4. 
For tables that are loaded with the `merge` write disposition, we add a ROOT KEY column + `_dlt_root_id`, which links the child table to a row in the top-level table. :::note -If you define your own primary key in a child table, it will be used to link to parent table +If you define your own primary key in a child table, it will be used to link to the parent table, and the `_dlt_parent_id` and `_dlt_list_idx` will not be added. `_dlt_id` is always added even in -case the primary key or other unique columns are defined. +cases where the primary key or other unique columns are defined. ::: @@ -164,8 +163,8 @@ case the primary key or other unique columns are defined. During a pipeline run, dlt [normalizes both table and column names](schema.md#naming-convention) to ensure compatibility with the destination database's accepted format. All names from your source data will be transformed into snake_case and will only include alphanumeric characters. Please be aware that the names in the destination database may differ somewhat from those in your original input. ### Variant columns -If your data has inconsistent types, `dlt` will dispatch the data to several **variant columns**. For example, if you have a resource (ie json file) with a filed with name **answer** and your data contains boolean values, you will get get a column with name **answer** of type **BOOLEAN** in your destination. If for some reason, on next load you get integer value and string value in **answer**, the inconsistent data will go to **answer__v_bigint** and **answer__v_text** columns respectively. -The general naming rule for variant columns is `__v_` where `original_name` is the existing column name (with data type clash) and `type` is the name of data type stored in the variant. +If your data has inconsistent types, `dlt` will dispatch the data to several **variant columns**. For example, if you have a resource (i.e., a JSON file) with a field named `answer` and your data contains boolean values, you will get a column with the name `answer` of type `BOOLEAN` in your destination. If for some reason, on the next load you get integer value and string value in `answer`, the inconsistent data will go to `answer__v_bigint` and `answer__v_text` columns respectively. +The general naming rule for variant columns is `__v_` where `original_name` is the existing column name (with data type clash) and `type` is the name of the data type stored in the variant. ## Load Packages and Load IDs @@ -215,7 +214,7 @@ In that case, the user may see the partially loaded data. It is possible to filt row with a `load_id` that does not exist in `_dlt_loads` is not yet completed. The same procedure may be used to identify and delete data for packages that never got completed. -For each load, you can test and [alert](../running-in-production/alerting.md) on anomalies (e.g. +For each load, you can test and [alert](../running-in-production/alerting.md) on anomalies (e.g., no data, too much loaded to a table). There are also some useful load stats in the `Load info` tab of the [Streamlit app](../dlt-ecosystem/visualizations/exploring-the-data.md#exploring-the-data) mentioned above. @@ -227,7 +226,7 @@ of 1 and is then updated to 2. This can be repeated for every additional transfo ### Data lineage -Data lineage can be super relevant for architectures like the +Data lineage can be extremely relevant for architectures like the [data vault architecture](https://www.data-vault.co.uk/what-is-data-vault/) or when troubleshooting. 
The data vault architecture is a data warehouse that large organizations use when representing the same process across multiple systems, which adds data lineage requirements. Using the pipeline name @@ -270,7 +269,7 @@ load_info = pipeline.run(users) ``` Running this pipeline will create a schema in the destination database with the name `mydata_staging`. -If you inspect the tables in this schema, you will find `mydata_staging.users` table identical to the +If you inspect the tables in this schema, you will find the `mydata_staging.users` table identical to the `mydata.users` table in the previous example. Here is what the tables may look like after running the pipeline: @@ -295,7 +294,7 @@ the current one. ## Versioned datasets -When you set the `full_refresh` argument to `True` in `dlt.pipeline` call, dlt creates a versioned dataset. +When you set the `full_refresh` argument to `True` in the `dlt.pipeline` call, dlt creates a versioned dataset. This means that each time you run the pipeline, the data is loaded into a new dataset (a new database schema). The dataset name is the same as the `dataset_name` you provided in the pipeline definition with a datetime-based suffix. @@ -323,3 +322,4 @@ Every time you run this pipeline, a new schema will be created in the destinatio datetime-based suffix. The data will be loaded into tables in this schema. For example, the first time you run the pipeline, the schema will be named `mydata_20230912064403`, the second time it will be named `mydata_20230912064407`, and so on. + diff --git a/docs/website/docs/general-usage/destination.md b/docs/website/docs/general-usage/destination.md index c20aa62d16..8383a2a297 100644 --- a/docs/website/docs/general-usage/destination.md +++ b/docs/website/docs/general-usage/destination.md @@ -6,12 +6,12 @@ keywords: [destination, load data, configure destination, name destination] # Destination -[Destination](glossary.md#destination) is a location in which `dlt` creates and maintains the current version of the schema and loads your data. Destinations come in various forms: databases, datalakes, vector stores or files. `dlt` deals with this variety via modules which you declare when creating a pipeline. +A [Destination](glossary.md#destination) is a location in which `dlt` creates and maintains the current version of the schema and loads your data. Destinations come in various forms: databases, datalakes, vector stores, or files. `dlt` deals with this variety via modules which you declare when creating a pipeline. We maintain a set of [built-in destinations](../dlt-ecosystem/destinations/) that you can use right away. ## Declare the destination type -We recommend that you declare the destination type when creating a pipeline instance with `dlt.pipeline`. This allows the `run` method to synchronize your local pipeline state with destination and `extract` and `normalize` to create compatible load packages and schemas. You can also pass the destination to `run` and `load` methods. +We recommend that you declare the destination type when creating a pipeline instance with `dlt.pipeline`. This allows the `run` method to synchronize your local pipeline state with the destination and `extract` and `normalize` to create compatible load packages and schemas. You can also pass the destination to `run` and `load` methods. * Use destination **shorthand type** @@ -21,7 +21,7 @@ import dlt pipeline = dlt.pipeline("pipeline", destination="filesystem") ``` -Above we want to use **filesystem** built-in destination. 
You can use shorthand types only for built-ins. +Above, we want to use the **filesystem** built-in destination. You can use shorthand types only for built-ins. * Use full **destination class type** @@ -31,7 +31,7 @@ import dlt pipeline = dlt.pipeline("pipeline", destination="dlt.destinations.filesystem") ``` -Above we use built in **filesystem** destination by providing a class type `filesystem` from module `dlt.destinations`. You can pass [destinations from external modules](#declare-external-destination) as well. +Above, we use the built-in **filesystem** destination by providing a class type `filesystem` from the module `dlt.destinations`. You can pass [destinations from external modules](#declare-external-destination) as well. * Import **destination class** @@ -42,13 +42,13 @@ from dlt.destinations import filesystem pipeline = dlt.pipeline("pipeline", destination=filesystem) ``` -Above we import destination class for **filesystem** and pass it to the pipeline. +Above, we import the destination class for **filesystem** and pass it to the pipeline. All examples above will create the same destination class with default parameters and pull required config and secret values from [configuration](credentials/configuration.md) - they are equivalent. ### Pass explicit parameters and a name to a destination -You can instantiate **destination class** yourself to configure it explicitly. When doing this you work with destinations the same way you work with [sources](source.md) +You can instantiate a **destination class** yourself to configure it explicitly. When doing this, you work with destinations the same way you work with [sources](source.md) ```py import dlt @@ -57,14 +57,14 @@ azure_bucket = filesystem("az://dlt-azure-bucket", destination_name="production_ pipeline = dlt.pipeline("pipeline", destination=azure_bucket) ``` -Above we import and instantiate the `filesystem` destination class. We pass explicit url of the bucket and name the destination to `production_az_bucket`. +Above, we import and instantiate the `filesystem` destination class. We pass the explicit URL of the bucket and name the destination `production_az_bucket`. -If destination is not named, its shorthand type (the Python class name) serves as a destination name. Name your destination explicitly if you need several separate configurations of destinations of the same type (i.e. you wish to maintain credentials for development, staging and production storage buckets in the same config file). Destination name is also stored in the [load info](../running-in-production/running.md#inspect-and-save-the-load-info-and-trace) and pipeline traces so use them also when you need more descriptive names (other than, for example, `filesystem`). +If the destination is not named, its shorthand type (the Python class name) serves as a destination name. Name your destination explicitly if you need several separate configurations of destinations of the same type (i.e., you wish to maintain credentials for development, staging, and production storage buckets in the same config file). The destination name is also stored in the [load info](../running-in-production/running.md#inspect-and-save-the-load-info-and-trace) and pipeline traces, so use them also when you need more descriptive names (other than, for example, `filesystem`). ## Configure a destination -We recommend to pass the credentials and other required parameters to configuration via TOML files, environment variables or other [config providers](credentials/config_providers.md). 
This allows you, for example, to easily switch to production destinations after deployment. +We recommend passing the credentials and other required parameters to the configuration via TOML files, environment variables, or other [config providers](credentials/config_providers.md). This allows you, for example, to easily switch to production destinations after deployment. -We recommend to use the [default config section layout](credentials/configuration.md#default-layout-and-default-key-lookup-during-injection) as below: +We recommend using the [default config section layout](credentials/configuration.md#default-layout-and-default-key-lookup-during-injection) as below: ```toml [destination.filesystem] @@ -81,7 +81,7 @@ DESTINATION__FILESYSTEM__CREDENTIALS__AZURE_STORAGE_ACCOUNT_NAME=dltdata DESTINATION__FILESYSTEM__CREDENTIALS__AZURE_STORAGE_ACCOUNT_KEY="storage key" ``` -For named destinations you use their names in the config section +For named destinations, you use their names in the config section ```toml [destination.production_az_bucket] @@ -92,10 +92,10 @@ azure_storage_account_key="storage key" ``` -Note that when you use [`dlt init` command](../walkthroughs/add-a-verified-source.md) to create or add a data source, `dlt` creates a sample configuration for selected destination. +Note that when you use the [`dlt init` command](../walkthroughs/add-a-verified-source.md) to create or add a data source, `dlt` creates a sample configuration for the selected destination. ### Pass explicit credentials -You can pass credentials explicitly when creating destination class instance. This replaces the `credentials` argument in `dlt.pipeline` and `pipeline.load` methods - which is now deprecated. You can pass the required credentials object, its dictionary representation or the supported native form like below: +You can pass credentials explicitly when creating a destination class instance. This replaces the `credentials` argument in `dlt.pipeline` and `pipeline.load` methods - which is now deprecated. You can pass the required credentials object, its dictionary representation, or the supported native form like below: ```py import dlt @@ -110,7 +110,7 @@ pipeline = dlt.pipeline( :::tip -You can create and pass partial credentials and `dlt` will fill the missing data. Below we pass postgres connection string but without password and expect that it will be present in environment variables (or any other [config provider](credentials/config_providers.md)) +You can create and pass partial credentials and `dlt` will fill the missing data. Below, we pass a PostgreSQL connection string but without a password and expect that it will be present in environment variables (or any other [config provider](credentials/config_providers.md)) ```py import dlt @@ -138,19 +138,19 @@ pipeline = dlt.pipeline( ``` -Please read how to use [various built in credentials types](credentials/config_specs.md). +Please read how to use [various built-in credentials types](credentials/config_specs.md). ::: ## Access a destination When loading data, `dlt` will access the destination in two cases: 1. At the beginning of the `run` method to sync the pipeline state with the destination (or if you call `pipeline.sync_destination` explicitly). -2. In the `pipeline.load` method - to migrate schema and load the load package. +2. In the `pipeline.load` method - to migrate the schema and load the load package. -Obviously, dlt will access the destination when you instantiate [sql_client](../dlt-ecosystem/transformations/sql.md). 
+Obviously, `dlt` will access the destination when you instantiate an [sql_client](../dlt-ecosystem/transformations/sql.md). :::note -`dlt` will not import the destination dependencies or access destination configuration if access is not needed. You can build multi-stage pipelines where steps are executed in separate processes or containers - the `extract` and `normalize` step do not need destination dependencies, configuration and actual connection. +`dlt` will not import the destination dependencies or access destination configuration if access is not needed. You can build multi-stage pipelines where steps are executed in separate processes or containers - the `extract` and `normalize` step do not need destination dependencies, configuration, and actual connection. ```py @@ -159,11 +159,11 @@ from dlt.destinations import filesystem # just declare the destination. pipeline = dlt.pipeline("pipeline", destination="filesystem") -# no destination credentials not config needed to extract +# no destination credentials or config needed to extract pipeline.extract(["a", "b", "c"], table_name="letters") # same to normalize pipeline.normalize() -# here dependencies dependencies will be imported, secrets pulled and destination accessed +# here dependencies will be imported, secrets pulled and destination accessed # we pass bucket_url explicitly and expect credentials passed by config provider load_info = pipeline.load(destination=filesystem(bucket_url=bucket_url)) load_info.raise_on_failed_jobs() @@ -172,4 +172,5 @@ load_info.raise_on_failed_jobs() ::: ## Declare external destination -You can implement [your own destination](../walkthroughs/create-new-destination.md) and pass the destination class type or instance to `dlt` pipeline. \ No newline at end of file +You can implement [your own destination](../walkthroughs/create-new-destination.md) and pass the destination class type or instance to the `dlt` pipeline. + diff --git a/docs/website/docs/general-usage/full-loading.md b/docs/website/docs/general-usage/full-loading.md index 4651d156f0..89d541adb9 100644 --- a/docs/website/docs/general-usage/full-loading.md +++ b/docs/website/docs/general-usage/full-loading.md @@ -5,7 +5,7 @@ keywords: [full loading, loading methods, replace] --- # Full loading -Full loading is the act of fully reloading the data of your tables. All existing data +Full loading is the act of fully reloading the data in your tables. All existing data will be removed and replaced by whatever the source produced on this run. Resources that are not selected while performing a full load will not replace any data in the destination. @@ -27,7 +27,7 @@ p.run(issues, write_disposition="replace", primary_key="id", table_name="issues" ## Choosing the correct replace strategy for your full load -`dlt` implements three different strategies for doing a full load on your table: `truncate-and-insert`, `insert-from-staging` and `staging-optimized`. The exact behaviour of these strategies can also vary between the available destinations. +`dlt` implements three different strategies for doing a full load on your table: `truncate-and-insert`, `insert-from-staging`, and `staging-optimized`. The exact behavior of these strategies can also vary between the available destinations. You can select a strategy with a setting in your `config.toml` file. If you do not select a strategy, dlt will default to `truncate-and-insert`. 
@@ -41,8 +41,8 @@ replace_strategy = "staging-optimized" The `truncate-and-insert` replace strategy is the default and the fastest of all three strategies. If you load data with this setting, then the destination tables will be truncated at the beginning of the load and the new data will be inserted consecutively but not within the same transaction. -The downside of this strategy is, that your tables will have no data for a while until the load is completed. You -may end up with new data in some tables and no data in other tables if the load fails during the run. Such incomplete load may be however detected by checking if the +The downside of this strategy is that your tables will have no data for a while until the load is completed. You +may end up with new data in some tables and no data in other tables if the load fails during the run. Such an incomplete load may, however, be detected by checking if the [_dlt_loads table contains load id](destination-tables.md#load-packages-and-load-ids) from _dlt_load_id of the replaced tables. If you prefer to have no data downtime, please use one of the other strategies. ### The `insert-from-staging` strategy @@ -68,3 +68,4 @@ opportunities, you should use this strategy. The `staging-optimized` strategy be more about [table cloning on snowflake](https://docs.snowflake.com/en/user-guide/object-clone). For all other [destinations](../dlt-ecosystem/destinations/index.md), please look at their respective documentation pages to see if and how the `staging-optimized` strategy is implemented. If it is not implemented, `dlt` will fall back to the `insert-from-staging` strategy. + diff --git a/docs/website/docs/general-usage/glossary.md b/docs/website/docs/general-usage/glossary.md index fd88bc1e5f..b8de4c3bbe 100644 --- a/docs/website/docs/general-usage/glossary.md +++ b/docs/website/docs/general-usage/glossary.md @@ -8,13 +8,13 @@ keywords: [glossary, resource, source, pipeline] ## [Source](source) -Location that holds data with certain structure. Organized into one or more resources. +A location that holds data with a certain structure, organized into one or more resources. - If endpoints in an API are the resources, then the API is the source. -- If tabs in a spreadsheet are the resources, then the source is the spreadsheet. -- If tables in a database are the resources, then the source is the database. +- If tabs in a spreadsheet are the resources, then the spreadsheet is the source. +- If tables in a database are the resources, then the database is the source. -Within this documentation, **source** refers also to the software component (i.e. Python function) +Within this documentation, **source** also refers to the software component (i.e., Python function) that **extracts** data from the source location using one or more resource components. ## [Resource](resource) @@ -26,38 +26,39 @@ origin. - If the source is a spreadsheet, then a resource is a tab in that spreadsheet. - If the source is a database, then a resource is a table in that database. -Within this documentation, **resource** refers also to the software component (i.e. Python function) -that **extracts** the data from source location. +Within this documentation, **resource** also refers to the software component (i.e., Python function) +that **extracts** the data from the source location. ## [Destination](../dlt-ecosystem/destinations) -The data store where data from the source is loaded (e.g. Google BigQuery). +The data store where data from the source is loaded (e.g., Google BigQuery). 
## [Pipeline](pipeline) -Moves the data from the source to the destination, according to instructions provided in the schema -(i.e. extracting, normalizing, and loading the data). +Pipeline moves the data from the source to the destination, according to instructions provided in the schema +(i.e., extracting, normalizing, and loading the data). ## [Verified Source](../walkthroughs/add-a-verified-source) -A Python module distributed with `dlt init` that allows creating pipelines that extract data from a -particular **Source**. Such module is intended to be published in order for others to use it to +A Python module distributed with `dlt init` that allows creating pipelines to extract data from a +particular **Source**. This module is intended to be published so others can use it to build pipelines. -A source must be published to become "verified": which means that it has tests, test data, -demonstration scripts, documentation and the dataset produces was reviewed by a data engineer. +A source must be published to become "verified", which means that it has tests, test data, +demonstration scripts, documentation, and the dataset it produces has been reviewed by a data engineer. ## [Schema](schema) -Describes the structure of normalized data (e.g. unpacked tables, column types, etc.) and provides -instructions on how the data should be processed and loaded (i.e. it tells `dlt` about the content +This describes the structure of normalized data (e.g., unpacked tables, column types, etc.) and provides +instructions on how the data should be processed and loaded (i.e., it tells `dlt` about the content of the data and how to load it into the destination). ## [Config](credentials/configuration) -A set of values that are passed to the pipeline at run time (e.g. to change its behavior locally vs. +A set of values that are passed to the pipeline at run time (e.g., to change its behavior locally vs. in production). ## [Credentials](credentials/config_specs) -A subset of configuration whose elements are kept secret and never shared in plain text. +A subset of the configuration whose elements are kept secret and never shared in plain text. + diff --git a/docs/website/docs/general-usage/pipeline.md b/docs/website/docs/general-usage/pipeline.md index 095e03e96d..fcc912928e 100644 --- a/docs/website/docs/general-usage/pipeline.md +++ b/docs/website/docs/general-usage/pipeline.md @@ -6,14 +6,14 @@ keywords: [pipeline, source, full refresh] # Pipeline -A [pipeline](glossary.md#pipeline) is a connection that moves the data from your Python code to a +A [pipeline](glossary.md#pipeline) is a connection that moves data from your Python code to a [destination](glossary.md#destination). The pipeline accepts `dlt` [sources](source.md) or -[resources](resource.md) as well as generators, async generators, lists and any iterables. -Once the pipeline runs, all resources get evaluated and the data is loaded at destination. +[resources](resource.md), as well as generators, async generators, lists, and any iterables. +Once the pipeline runs, all resources are evaluated and the data is loaded at the destination. 
Example: -This pipeline will load a list of objects into `duckdb` table with a name "three": +This pipeline will load a list of objects into a `duckdb` table named "three": ```python import dlt @@ -25,33 +25,33 @@ info = pipeline.run([{'id':1}, {'id':2}, {'id':3}], table_name="three") print(info) ``` -You instantiate a pipeline by calling `dlt.pipeline` function with following arguments: +You instantiate a pipeline by calling the `dlt.pipeline` function with the following arguments: -- `pipeline_name` a name of the pipeline that will be used to identify it in trace and monitoring +- `pipeline_name` is the name of the pipeline that will be used to identify it in trace and monitoring events and to restore its state and data schemas on subsequent runs. If not provided, `dlt` will - create pipeline name from the file name of currently executing Python module. -- `destination` a name of the [destination](../dlt-ecosystem/destinations) to which dlt - will load the data. May also be provided to `run` method of the `pipeline`. -- `dataset_name` a name of the dataset to which the data will be loaded. A dataset is a logical - group of tables i.e. `schema` in relational databases or folder grouping many files. May also be - provided later to the `run` or `load` methods of the pipeline. If not provided at all then + create a pipeline name from the file name of the currently executing Python module. +- `destination` is the name of the [destination](../dlt-ecosystem/destinations) to which dlt + will load the data. It may also be provided to the `run` method of the `pipeline`. +- `dataset_name` is the name of the dataset to which the data will be loaded. A dataset is a logical + group of tables, i.e., schema in relational databases or a folder grouping many files. It may also be + provided later to the `run` or `load` methods of the pipeline. If not provided at all, it defaults to the `pipeline_name`. -To load the data you call the `run` method and pass your data in `data` argument. +To load the data, you call the `run` method and pass your data in the `data` argument. Arguments: - `data` (the first argument) may be a dlt source, resource, generator function, or any Iterator / - Iterable (i.e. a list or the result of `map` function). -- `write_disposition` controls how to write data to a table. Defaults to "append". + Iterable (i.e., a list or the result of a `map` function). +- `write_disposition` controls how to write data to a table. It defaults to "append". - `append` will always add new data at the end of the table. - `replace` will replace existing data with new data. - `skip` will prevent data from loading. - `merge` will deduplicate and merge data based on `primary_key` and `merge_key` hints. -- `table_name` - specified in case when table name cannot be inferred i.e. from the resources or name +- `table_name` is specified in cases when the table name cannot be inferred, i.e., from the resources or the name of the generator function. -Example: This pipeline will load the data the generator `generate_rows(10)` produces: +Example: This pipeline will load the data that the generator `generate_rows(10)` produces: ```python import dlt @@ -70,44 +70,44 @@ print(info) ## Pipeline working directory Each pipeline that you create with `dlt` stores extracted files, load packages, inferred schemas, -execution traces and the [pipeline state](state.md) in a folder in the local filesystem. The default -location for such folders is in user home directory: `~/.dlt/pipelines/`. 
+execution traces, and the [pipeline state](state.md) in a folder in the local filesystem. The default +location for such folders is in the user's home directory: `~/.dlt/pipelines/`. You can inspect stored artifacts using the command [dlt pipeline info](../reference/command-line-interface.md#dlt-pipeline) and [programmatically](../walkthroughs/run-a-pipeline.md#4-inspect-a-load-process). -> 💡 A pipeline with given name looks for its working directory in location above - so if you have two +> 💡 A pipeline with a given name looks for its working directory in the location above - so if you have two > pipeline scripts that create a pipeline with the same name, they will see the same working folder -> and share all the possible state. You may override the default location using `pipelines_dir` +> and share all the possible state. You may override the default location using the `pipelines_dir` > argument when creating the pipeline. -> 💡 You can attach `Pipeline` instance to an existing working folder, without creating a new -> pipeline with `dlt.attach`. +> 💡 You can attach a `Pipeline` instance to an existing working folder, without creating a new +> pipeline, with `dlt.attach`. ## Do experiments with full refresh -If you [create a new pipeline script](../walkthroughs/create-a-pipeline.md) you will be -experimenting a lot. If you want that each time the pipeline resets its state and loads data to a +If you [create a new pipeline script](../walkthroughs/create-a-pipeline.md), you will be +experimenting a lot. If you want each time the pipeline to reset its state and load data to a new dataset, set the `full_refresh` argument of the `dlt.pipeline` method to True. Each time the -pipeline is created, `dlt` adds datetime-based suffix to the dataset name. +pipeline is created, `dlt` adds a datetime-based suffix to the dataset name. ## Display the loading progress -You can add a progress monitor to the pipeline. Typically, its role is to visually assure user that -pipeline run is progressing. `dlt` supports 4 progress monitors out of the box: +You can add a progress monitor to the pipeline. Typically, its role is to visually assure the user that +the pipeline run is progressing. `dlt` supports four progress monitors out of the box: - [enlighten](https://github.com/Rockhopper-Technologies/enlighten) - a status bar with progress bars that also allows for logging. -- [tqdm](https://github.com/tqdm/tqdm) - most popular Python progress bar lib, proven to work in +- [tqdm](https://github.com/tqdm/tqdm) - the most popular Python progress bar lib, proven to work in Notebooks. - [alive_progress](https://github.com/rsalmei/alive-progress) - with the most fancy animations. -- **log** - dumps the progress information to log, console or text stream. **the most useful on - production** optionally adds memory and cpu usage stats. +- **log** - dumps the progress information to the log, console, or text stream. The most useful in + production. Optionally adds memory and CPU usage stats. > 💡 You must install the required progress bar library yourself. -You pass the progress monitor in `progress` argument of the pipeline. You can use a name from the +You pass the progress monitor in the `progress` argument of the pipeline. You can use a name from the list above as in the following example: ```python @@ -146,3 +146,4 @@ pipeline = dlt.pipeline( Note that the value of the `progress` argument is [configurable](../walkthroughs/run-a-pipeline.md#2-see-the-progress-during-loading). 
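Picking up the `dlt.attach` tip from the working-directory notes above, here is a minimal sketch of attaching to the working folder of a pipeline that was already run on the same machine. The pipeline name is a placeholder, and the printed attributes are just examples of what the attached instance exposes.

```python
import dlt

# attach to an existing working folder by pipeline name (placeholder name here);
# this does not create a new pipeline - it reuses the stored state and schemas
pipeline = dlt.attach("my_pipeline")

print(pipeline.dataset_name)
print(pipeline.default_schema_name)
```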
+ diff --git a/docs/website/docs/general-usage/resource.md b/docs/website/docs/general-usage/resource.md index 9b8d45982d..1542741f3d 100644 --- a/docs/website/docs/general-usage/resource.md +++ b/docs/website/docs/general-usage/resource.md @@ -13,9 +13,9 @@ resource, we add the `@dlt.resource` decorator to that function. Commonly used arguments: -- `name` The name of the table generated by this resource. Defaults to decorated function name. -- `write_disposition` How should the data be loaded at destination? Currently, supported: `append`, - `replace` and `merge`. Defaults to `append.` +- `name` The name of the table generated by this resource. Defaults to the decorated function name. +- `write_disposition` Determines how the data should be loaded at the destination. Currently supported: `append`, + `replace`, and `merge`. Defaults to `append.` Example: @@ -46,16 +46,16 @@ function. ### Define schema `dlt` will infer [schema](schema.md) for tables associated with resources from the resource's data. -You can modify the generation process by using the table and column hints. Resource decorator -accepts following arguments: +You can modify the generation process by using the table and column hints. The resource decorator +accepts the following arguments: -1. `table_name` the name of the table, if different from resource name. -1. `primary_key` and `merge_key` define name of the columns (compound keys are allowed) that will +1. `table_name` the name of the table, if different from the resource name. +1. `primary_key` and `merge_key` define the name of the columns (compound keys are allowed) that will receive those hints. Used in [incremental loading](incremental-loading.md). -1. `columns` let's you define one or more columns, including the data types, nullability and other - hints. The column definition is a `TypedDict`: `TTableSchemaColumns`. In example below, we tell - `dlt` that column `tags` (containing a list of tags) in `user` table should have type `complex` - which means that it will be loaded as JSON/struct and not as child table. +1. `columns` lets you define one or more columns, including the data types, nullability, and other + hints. The column definition is a `TypedDict`: `TTableSchemaColumns`. In the example below, we tell + `dlt` that column `tags` (containing a list of tags) in the `user` table should have type `complex` + which means that it will be loaded as JSON/struct and not as a child table. ```python @dlt.resource(name="user", columns={"tags": {"data_type": "complex"}}) @@ -67,14 +67,14 @@ accepts following arguments: ``` > 💡 You can pass dynamic hints which are functions that take the data item as input and return a -> hint value. This let's you create table and column schemas depending on the data. See example in -> next section. +> hint value. This lets you create table and column schemas depending on the data. See the example in +> the next section. > 💡 You can mark some resource arguments as [configuration and credentials](credentials) > values so `dlt` can pass them automatically to your functions. -### Put a contract on a tables, columns and data -Use the `schema_contract` argument to tell dlt how to [deal with new tables, data types and bad data types](schema-contracts.md). For example if you set it to **freeze**, `dlt` will not allow for any new tables, columns or data types to be introduced to the schema - it will raise an exception. 
Learn more in on available contract modes [here](schema-contracts.md#setting-up-the-contract) +### Put a contract on tables, columns, and data +Use the `schema_contract` argument to tell dlt how to [deal with new tables, data types, and bad data types](schema-contracts.md). For example, if you set it to **freeze**, `dlt` will not allow for any new tables, columns, or data types to be introduced to the schema - it will raise an exception. Learn more about available contract modes [here](schema-contracts.md#setting-up-the-contract) ### Define a schema with Pydantic @@ -106,16 +106,16 @@ def get_users(): ... ``` -The data types of the table columns are inferred from the types of the pydantic fields. These use the same type conversions +The data types of the table columns are inferred from the types of the Pydantic fields. These use the same type conversions as when the schema is automatically generated from the data. Pydantic models integrate well with [schema contracts](schema-contracts.md) as data validators. Things to note: -- Fields with an `Optional` type are marked as `nullable` -- Fields with a `Union` type are converted to the first (not `None`) type listed in the union. E.g. `status: Union[int, str]` results in a `bigint` column. -- `list`, `dict` and nested pydantic model fields will use the `complex` type which means they'll be stored as a JSON object in the database instead of creating child tables. +- Fields with an `Optional` type are marked as `nullable`. +- Fields with a `Union` type are converted to the first (not `None`) type listed in the union. E.g., `status: Union[int, str]` results in a `bigint` column. +- `list`, `dict`, and nested Pydantic model fields will use the `complex` type which means they'll be stored as a JSON object in the database instead of creating child tables. You can override this by configuring the Pydantic model @@ -132,14 +132,14 @@ def get_users(): ``` `"skip_complex_types"` omits any `dict`/`list`/`BaseModel` type fields from the schema, so dlt will fall back on the default -behaviour of creating child tables for these fields. +behavior of creating child tables for these fields. -We do not support `RootModel` that validate simple types. You can add such validator yourself, see [data filtering section](#filter-transform-and-pivot-data). +We do not support `RootModel` that validate simple types. You can add such a validator yourself, see the [data filtering section](#filter-transform-and-pivot-data). ### Dispatch data to many tables You can load data to many tables from a single resource. The most common case is a stream of events -of different types, each with different data schema. To deal with this, you can use `table_name` +of different types, each with different data schema. To deal with this, you can use the `table_name` argument on `dlt.resource`. You could pass the table name as a function with the data item as an argument and the `table_name` string as a return value. @@ -169,7 +169,7 @@ def repo_events() -> Iterator[TDataItems]: ### Parametrize a resource -You can add arguments to your resource functions like to any other. Below we parametrize our +You can add arguments to your resource functions like any other. Below we parametrize our `generate_rows` resource to generate the number of rows we request: ```python @@ -191,7 +191,7 @@ so `dlt` can pass them automatically to your functions. ### Process resources with `dlt.transformer` You can feed data from a resource into another one. 
The most common case is when you have an API -that returns a list of objects (i.e. users) in one endpoint and user details in another. You can deal +that returns a list of objects (i.e., users) in one endpoint and user details in another. You can deal with this by declaring a resource that obtains a list of users and another resource that receives items from the list and downloads the profiles. @@ -213,8 +213,8 @@ def users_details(user_item): # dlt figures out dependencies for you. pipeline.run(user_details) ``` -In the example above, `user_details` will receive data from default instance of `users` resource (with `limit` set to `None`). You can also use -**pipe |** operator to bind resources dynamically +In the example above, `user_details` will receive data from the default instance of the `users` resource (with `limit` set to `None`). You can also use +the **pipe |** operator to bind resources dynamically ```python # you can be more explicit and use a pipe operator. # with it you can create dynamic pipelines where the dependencies @@ -238,13 +238,10 @@ async def pokemon(id): # get bulbasaur and ivysaur (you need dlt 0.4.6 for pipe operator working with lists) print(list([1,2] | pokemon())) -``` -::: - ### Declare a standalone resource -A standalone resource is defined on a function that is top level in a module (not inner function) that accepts config and secrets values. Additionally -if `standalone` flag is specified, the decorated function signature and docstring will be preserved. `dlt.resource` will just wrap the -decorated function and user must call the wrapper to get the actual resource. Below we declare a `filesystem` resource that must be called before use. +A standalone resource is defined on a function that is top level in a module (not an inner function) that accepts config and secrets values. Additionally, +if the `standalone` flag is specified, the decorated function signature and docstring will be preserved. `dlt.resource` will just wrap the +decorated function and the user must call the wrapper to get the actual resource. Below we declare a `filesystem` resource that must be called before use. ```python @dlt.resource(standalone=True) def filesystem(bucket_url=dlt.config.value): @@ -255,7 +252,7 @@ def filesystem(bucket_url=dlt.config.value): pipeline.run(filesystem("s3://my-bucket/reports"), table_name="reports") ``` -Standalone may have dynamic name that depends on the arguments passed to the decorated function. For example:: +Standalone may have a dynamic name that depends on the arguments passed to the decorated function. For example:: ```python @dlt.resource(standalone=True, name=lambda args: args["stream_name"]) def kinesis(stream_name: str): @@ -265,52 +262,19 @@ kinesis_stream = kinesis("telemetry_stream") ``` `kinesis_stream` resource has a name **telemetry_stream** - -### Declare parallel and async resources -You can extract multiple resources in parallel threads or with async IO. 
-To enable this for a sync resource you can set the `parallelized` flag to `True` in the resource decorator: - - -```python -@dlt.resource(parallelized=True) -def get_users(): - for u in _get_users(): - yield u - -@dlt.resource(parallelized=True) -def get_orders(): - for o in _get_orders(): - yield o - -# users and orders will be iterated in parallel in two separate threads -pipeline.run(get_users(), get_orders()) -``` - -Async generators are automatically extracted concurrently with other resources: - -```python -@dlt.resource -async def get_users(): - async for u in _get_users(): # Assuming _get_users is an async generator - yield u -``` - -Please find more details in [extract performance](../reference/performance.md#extract) - ## Customize resources -### Filter, transform and pivot data +### Filter, transform, and pivot data -You can attach any number of transformations that are evaluated on item per item basis to your +You can attach any number of transformations that are evaluated on an item per item basis to your resource. The available transformation types: - map - transform the data item (`resource.add_map`). - filter - filter the data item (`resource.add_filter`). -- yield map - a map that returns iterator (so single row may generate many rows - +- yield map - a map that returns an iterator (so a single row may generate many rows - `resource.add_yield_map`). -Example: We have a resource that loads a list of users from an api endpoint. We want to customize it -so: +Example: We have a resource that loads a list of users from an API endpoint. We want to customize it so: 1. We remove users with `user_id == "me"`. 1. We anonymize user data. @@ -349,7 +313,7 @@ If your resource loads thousands of pages of data from a REST API or millions of table, you may want to just sample a fragment of it in order i.e. to quickly see the dataset with example data and test your transformations etc. In order to do that, you limit how many items will be yielded by a resource by calling `resource.add_limit` method. In the example below we load just -10 first items from and infinite counter - that would otherwise never end. +10 first items from an infinite counter - that would otherwise never end. ```python r = dlt.resource(itertools.count(), name="infinity").add_limit(10) @@ -357,20 +321,17 @@ assert list(r) == list(range(10)) ``` > 💡 We are not skipping any items. We are closing the iterator/generator that produces data after -> limit is reached. +> the limit is reached. > 💡 You cannot limit transformers. They should process all the data they receive fully to avoid > inconsistencies in generated datasets. -> 💡 If you are paremetrizing the value of `add_limit` and sometimes need it to be disabled, you can set `None` or `-1` -> to disable the limiting. You can also set the limit to `0` for the resource to not yield any items. - ### Set table name and adjust schema -You can change the schema of a resource, be it standalone or as a part of a source. Look for method -named `apply_hints` which takes the same arguments as resource decorator. Obviously you should call -this method before data is extracted from the resource. Example below converts an `append` resource -loading the `users` table into [merge](incremental-loading.md#merge-incremental_loading) resource +You can change the schema of a resource, be it standalone or as a part of a source. Look for the method +named `apply_hints` which takes the same arguments as the resource decorator. 
Obviously, you should call +this method before data is extracted from the resource. The example below converts an `append` resource +loading the `users` table into a [merge](incremental-loading.md#merge-incremental_loading) resource that will keep just one updated record per `user_id`. It also adds ["last value" incremental loading](incremental-loading.md#incremental_loading-with-last-value) on `created_at` column to prevent requesting again the already loaded records: @@ -385,7 +346,7 @@ tables.users.apply_hints( pipeline.run(tables) ``` -To just change a name of a table to which resource will load data, do the following: +To just change the name of a table to which the resource will load data, do the following: ```python tables = sql_database() tables.users.table_name = "other_users" @@ -393,10 +354,10 @@ tables.users.table_name = "other_users" ### Adjust schema when you yield data -You can set or update the table name, columns and other schema elements when your resource is executed and you already yield data. Such changes will be merged -with the existing schema in the same way `apply_hints` method above works. There are many reason to adjust schema at runtime. For example when using Airflow, you -should avoid lengthy operations (ie. reflecting database tables) during creation of the DAG so it is better do do it when DAG executes. You may also emit partial -hints (ie. precision and scale for decimal types) for column to help `dlt` type inference. +You can set or update the table name, columns, and other schema elements when your resource is executed and you already yield data. Such changes will be merged +with the existing schema in the same way the `apply_hints` method above works. There are many reasons to adjust the schema at runtime. For example, when using Airflow, you +should avoid lengthy operations (i.e., reflecting database tables) during the creation of the DAG so it is better to do it when the DAG executes. You may also emit partial +hints (i.e., precision and scale for decimal types) for a column to help `dlt` type inference. ```python @dlt.resource @@ -423,15 +384,15 @@ def sql_table(credentials, schema, table): ``` In the example above we use `dlt.mark.with_hints` and `dlt.mark.make_hints` to emit columns and primary key with the first extracted item. Table schema will -be adjusted after the `batch` is processed in the extract pipeline but before any schema contracts are applied and data is persisted in load package. +be adjusted after the `batch` is processed in the extract pipeline but before any schema contracts are applied and data is persisted in the load package. :::tip -You can emit columns as Pydantic model and use dynamic hints (ie. lambda for table name) as well. You should avoid redefining `Incremental` this way. +You can emit columns as a Pydantic model and use dynamic hints (i.e., lambda for table name) as well. You should avoid redefining `Incremental` this way. ::: ### Duplicate and rename resources -There are cases when you your resources are generic (ie. bucket filesystem) and you want to load several instances of it (ie. files from different folders) to separate tables. In example below we use `filesystem` source to load csvs from two different folders into separate tables: +There are cases when your resources are generic (i.e., bucket filesystem) and you want to load several instances of it (i.e., files from different folders) to separate tables. 
In the example below, we use the `filesystem` source to load CSVs from two different folders into separate tables: ```python @dlt.resource(standalone=True) def filesystem(bucket_url): @@ -443,7 +404,7 @@ def csv_reader(file_item): # load csv, parse and yield rows in file_item ... -# create two extract pipes that list files from the bucket and send to them to the reader. +# create two extract pipes that list files from the bucket and send them to the reader. # by default both pipes will load data to the same table (csv_reader) reports_pipe = filesystem("s3://my-bucket/reports") | load_csv() transactions_pipe = filesystem("s3://my-bucket/transactions") | load_csv() @@ -454,13 +415,13 @@ pipeline.run( ) ``` -`with_name` method returns a deep copy of the original resource, its data pipe and the data pipes of a parent resources. A renamed clone is fully separated from the original resource (and other clones) when loading: +The `with_name` method returns a deep copy of the original resource, its data pipe, and the data pipes of the parent resources. A renamed clone is fully separated from the original resource (and other clones) when loading: it maintains a separate [resource state](state.md#read-and-write-pipeline-state-in-a-resource) and will load to a table ## Load resources -You can pass individual resources or list of resources to the `dlt.pipeline` object. The resources -loaded outside the source context, will be added to the [default schema](schema.md) of the +You can pass individual resources or a list of resources to the `dlt.pipeline` object. The resources +loaded outside the source context will be added to the [default schema](schema.md) of the pipeline. ```python @@ -481,10 +442,11 @@ pipeline.run([generate_rows(10), generate_rows(20)]) ``` ### Do a full refresh -To do a full refresh of an `append` or `merge` resources you temporarily change the write -disposition to replace. You can use `apply_hints` method of a resource or just provide alternative +To do a full refresh of an `append` or `merge` resource, you temporarily change the write +disposition to replace. You can use the `apply_hints` method of a resource or just provide an alternative write disposition when loading: ```python p.run(merge_source(), write_disposition="replace") ``` + diff --git a/docs/website/docs/general-usage/schema-contracts.md b/docs/website/docs/general-usage/schema-contracts.md index 764b565beb..2640675e36 100644 --- a/docs/website/docs/general-usage/schema-contracts.md +++ b/docs/website/docs/general-usage/schema-contracts.md @@ -6,7 +6,7 @@ keywords: [data contracts, schema, dlt schema, pydantic] ## Schema and Data Contracts -`dlt` will evolve the schema at the destination by following the structure and data types of the extracted data. There are several modes +`dlt` evolves the schema at the destination by following the structure and data types of the extracted data. There are several modes that you can use to control this automatic schema evolution, from the default modes where all changes to the schema are accepted to a frozen schema that does not change at all. @@ -22,11 +22,11 @@ This resource will allow new tables (both child tables and [tables with dynamic ### Setting up the contract You can control the following **schema entities**: -* `tables` - contract is applied when a new table is created -* `columns` - contract is applied when a new column is created on an existing table -* `data_type` - contract is applied when data cannot be coerced into a data type associate with existing column. 
+* `tables` - the contract is applied when a new table is created +* `columns` - the contract is applied when a new column is created on an existing table +* `data_type` - the contract is applied when data cannot be coerced into a data type associated with an existing column. -You can use **contract modes** to tell `dlt` how to apply contract for a particular entity: +You can use **contract modes** to tell `dlt` how to apply the contract for a particular entity: * `evolve`: No constraints on schema changes. * `freeze`: This will raise an exception if data is encountered that does not fit the existing schema, so no data will be loaded to the destination * `discard_row`: This will discard any extracted row if it does not adhere to the existing schema, and this row will not be loaded to the destination. @@ -34,37 +34,37 @@ You can use **contract modes** to tell `dlt` how to apply contract for a particu :::note The default mode (**evolve**) works as follows: -1. New tables may be always created -2. New columns may be always appended to the existing table -3. Data that do not coerce to existing data type of a particular column will be sent to a [variant column](schema.md#variant-columns) created for this particular type. +1. New tables may always be created +2. New columns may always be appended to the existing table +3. Data that do not coerce to the existing data type of a particular column will be sent to a [variant column](schema.md#variant-columns) created for this particular type. ::: #### Passing schema_contract argument The `schema_contract` exists on the [dlt.source](source.md) decorator as a default for all resources in that source and on the [dlt.resource](source.md) decorator as a directive for the individual resource - and as a consequence - on all tables created by this resource. -Additionally it exists on the `pipeline.run()` method, which will override all existing settings. +Additionally, it exists on the `pipeline.run()` method, which will override all existing settings. The `schema_contract` argument accepts two forms: 1. **full**: a mapping of schema entities to contract modes -2. **shorthand** a contract mode (string) that will be applied to all schema entities. +2. **shorthand**: a contract mode (string) that will be applied to all schema entities. -For example setting `schema_contract` to *freeze* will expand to the full form: +For example, setting `schema_contract` to *freeze* will expand to the full form: ```python {"tables": "freeze", "columns": "freeze", "data_type": "freeze"} ``` -You can change the contract on the **source** instance via `schema_contract` property. For **resource** you can use [apply_hints](resource#set-table-name-and-adjust-schema). +You can change the contract on the **source** instance via the `schema_contract` property. For **resource**, you can use [apply_hints](resource#set-table-name-and-adjust-schema). #### Nuances of contract modes. -1. Contracts are applied **after names of tables and columns are normalized**. -2. Contract defined on a resource is applied to all tables and child tables created by that resource. -3. `discard_row` works on table level. So for example if you have two tables in parent-child relationship ie. *users* and *users__addresses* and contract is violated in *users__addresses* table, the row of that table is discarded while the parent row in *users* table will be loaded. +1. Contracts are applied **after the names of tables and columns are normalized**. +2. 
The contract defined on a resource is applied to all tables and child tables created by that resource. +3. `discard_row` works on the table level. So for example, if you have two tables in a parent-child relationship, i.e., *users* and *users__addresses*, and the contract is violated in the *users__addresses* table, the row of that table is discarded while the parent row in the *users* table will be loaded. ### Use Pydantic models for data validation Pydantic models can be used to [define table schemas and validate incoming data](resource.md#define-a-schema-with-pydantic). You can use any model you already have. `dlt` will internally synthesize (if necessary) new models that conform with the **schema contract** on the resource. -Just passing a model in `column` argument of the [dlt.resource](resource.md#define-a-schema-with-pydantic) sets a schema contract that conforms to default Pydantic behavior: +Just passing a model in the `column` argument of the [dlt.resource](resource.md#define-a-schema-with-pydantic) sets a schema contract that conforms to the default Pydantic behavior: ```python { "tables": "evolve", @@ -72,18 +72,18 @@ Just passing a model in `column` argument of the [dlt.resource](resource.md#defi "data_type": "freeze" } ``` -New tables are allowed, extra fields are ignored and invalid data raises an exception. +New tables are allowed, extra fields are ignored, and invalid data raises an exception. -If you pass schema contract explicitly the following happens to schema entities: +If you pass the schema contract explicitly, the following happens to schema entities: 1. **tables** do not impact the Pydantic models 2. **columns** modes are mapped into the **extra** modes of Pydantic (see below). `dlt` will apply this setting recursively if models contain other models. -3. **data_type** supports following modes for Pydantic: **evolve** will synthesize lenient model that allows for any data type. This may result with variant columns upstream. +3. **data_type** supports the following modes for Pydantic: **evolve** will synthesize a lenient model that allows for any data type. This may result in variant columns upstream. **freeze** will re-raise `ValidationException`. **discard_row** will remove the non-validating data items. **discard_value** is not currently supported. We may eventually do that on Pydantic v2. `dlt` maps column contract modes into the extra fields settings as follows. -Note that this works in two directions. If you use a model with such setting explicitly configured, `dlt` sets the column contract mode accordingly. This also avoids synthesizing modified models. +Note that this works in two directions. If you use a model with such a setting explicitly configured, `dlt` sets the column contract mode accordingly. This also avoids synthesizing modified models. | column mode | pydantic extra | | ------------- | -------------- | @@ -101,25 +101,25 @@ Model validation is added as a [transform step](resource.md#filter-transform-and :::note Pydantic models work on the **extracted** data **before names are normalized or child relationships are created**. Make sure to name model fields as in your input data and handle nested data with the nested models. -As a consequence, `discard_row` will drop the whole data item - even if nested model was affected. +As a consequence, `discard_row` will drop the whole data item - even if a nested model was affected. 
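As a rough sketch of that behavior (the `User`/`Address` models and the field values below are made up for illustration): with `data_type` set to `discard_row`, a validation failure deep inside a nested model removes the entire top-level item.

```python
from typing import List

import dlt
from pydantic import BaseModel

class Address(BaseModel):
    city: str
    zip_code: str

class User(BaseModel):
    user_id: int
    name: str
    addresses: List[Address]

@dlt.resource(name="users", columns=User, schema_contract={"data_type": "discard_row"})
def users():
    # validates and loads fine
    yield {"user_id": 1, "name": "alice", "addresses": [{"city": "Paris", "zip_code": "75001"}]}
    # zip_code cannot be validated, so the whole user row is dropped -
    # not just the offending nested address
    yield {"user_id": 2, "name": "bob", "addresses": [{"city": "Berlin", "zip_code": {"oops": True}}]}
```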
::: ### Set contracts on Arrow Tables and Pandas All contract settings apply to [arrow tables and panda frames](../dlt-ecosystem/verified-sources/arrow-pandas.md) as well. 1. **tables** mode the same - no matter what is the data item type -2. **columns** will allow new columns, raise an exception or modify tables/frames still in extract step to avoid re-writing parquet files. -3. **data_type** changes to data types in tables/frames are not allowed and will result in data type schema clash. We could allow for more modes (evolving data types in Arrow tables sounds weird but ping us on Slack if you need it.) +2. **columns** will allow new columns, raise an exception, or modify tables/frames still in the extract step to avoid re-writing parquet files. +3. **data_type** changes to data types in tables/frames are not allowed and will result in a data type schema clash. We could allow for more modes (evolving data types in Arrow tables sounds weird but ping us on Slack if you need it.) Here's how `dlt` deals with column modes: -1. **evolve** new columns are allowed (table may be reordered to put them at the end) -2. **discard_value** column will be deleted -3. **discard_row** rows with the column present will be deleted and then column will be deleted +1. **evolve** new columns are allowed (the table may be reordered to put them at the end) +2. **discard_value** the column will be deleted +3. **discard_row** rows with the column present will be deleted and then the column will be deleted 4. **freeze** exception on a new column ### Get context from DataValidationError in freeze mode -When contract is violated in freeze mode, `dlt` raises `DataValidationError` exception. This exception gives access to the full context and passes the evidence to the caller. -As with any other exception coming from pipeline run, it will be re-raised via `PipelineStepFailed` exception which you should catch in except: +When the contract is violated in freeze mode, `dlt` raises a `DataValidationError` exception. This exception gives access to the full context and passes the evidence to the caller. +As with any other exception coming from pipeline run, it will be re-raised via the `PipelineStepFailed` exception which you should catch in except: ```python try: @@ -136,22 +136,22 @@ except as pip_ex: ``` `DataValidationError` provides the following context: -1. `schema_name`, `table_name` and `column_name` provide the logical "location" at which the contract was violated. +1. `schema_name`, `table_name`, and `column_name` provide the logical "location" at which the contract was violated. 2. `schema_entity` and `contract_mode` tell which contract was violated -3. `table_schema` contains the schema against which the contract was validated. May be Pydantic model or `dlt` TTableSchema instance +3. `table_schema` contains the schema against which the contract was validated. May be a Pydantic model or `dlt` TTableSchema instance 4. `schema_contract` the full, expanded schema contract -5. `data_item` causing data item (Python dict, arrow table, pydantic model or list of there of) +5. `data_item` causing data item (Python dict, arrow table, pydantic model, or list thereof) ### Contracts on new tables -If a table is a **new table** that has not been created on the destination yet, dlt will allow the creation of new columns. For a single pipeline run, the column mode is changed (internally) to **evolve** and then reverted back to the original mode. 
This allows for initial schema inference to happen and then on subsequent run, the inferred contract will be applied to a new data. +If a table is a **new table** that has not been created on the destination yet, dlt will allow the creation of new columns. For a single pipeline run, the column mode is changed (internally) to **evolve** and then reverted back to the original mode. This allows for initial schema inference to happen and then on subsequent run, the inferred contract will be applied to new data. -Following tables are considered new: +The following tables are considered new: 1. Child tables inferred from the nested data 2. Dynamic tables created from the data during extraction -3. Tables containing **incomplete** columns - columns without data type bound to them. +3. Tables containing **incomplete** columns - columns without a data type bound to them. -For example such table is considered new because column **number** is incomplete (define primary key and NOT null but no data type) +For example, such a table is considered new because the column **number** is incomplete (define primary key and NOT null but no data type) ```yaml blocks: description: Ethereum blocks @@ -168,7 +168,7 @@ What tables are not considered new: ### Code Examples -The below code will silently ignore new subtables, allow new columns to be added to existing tables and raise an error if a variant of a column is discovered. +The code below will silently ignore new subtables, allow new columns to be added to existing tables, and raise an error if a variant of a column is discovered. ```py @dlt.resource(schema_contract={"tables": "discard_row", "columns": "evolve", "data_type": "freeze"}) @@ -176,13 +176,13 @@ def items(): ... ``` -The below Code will raise on any encountered schema change. Note: You can always set a string which will be interpreted as though all keys are set to these values. +The code below will raise an error on any encountered schema change. Note: You can always set a string which will be interpreted as though all keys are set to these values. ```py pipeline.run(my_source(), schema_contract="freeze") ``` -The below code defines some settings on the source which can be overwritten on the resource which in turn can be overwritten by the global override on the `run` method. +The code below defines some settings on the source which can be overwritten on the resource which in turn can be overwritten by the global override on the `run` method. Here for all resources variant columns are frozen and raise an error if encountered, on `items` new columns are allowed but `other_items` inherits the `freeze` setting from the source, thus new columns are frozen there. New tables are allowed. @@ -206,4 +206,5 @@ pipeline.run(source()) # this will freeze the whole schema, regardless of the decorator settings pipeline.run(source(), schema_contract="freeze") -``` \ No newline at end of file +``` + diff --git a/docs/website/docs/general-usage/schema.md b/docs/website/docs/general-usage/schema.md index 7ce1d959c9..8bf9ff5673 100644 --- a/docs/website/docs/general-usage/schema.md +++ b/docs/website/docs/general-usage/schema.md @@ -6,130 +6,130 @@ keywords: [schema, dlt schema, yaml] # Schema -The schema describes the structure of normalized data (e.g. tables, columns, data types, etc.) and +The schema describes the structure of normalized data (e.g., tables, columns, data types, etc.) and provides instructions on how the data should be processed and loaded. 
`dlt` generates schemas from -the data during the normalization process. User can affect this standard behavior by providing -**hints** that change how tables, columns and other metadata is generated and how the data is -loaded. Such hints can be passed in the code ie. to `dlt.resource` decorator or `pipeline.run` -method. Schemas can be also exported and imported as files, which can be directly modified. +the data during the normalization process. Users can affect this standard behavior by providing +**hints** that change how tables, columns, and other metadata are generated and how the data is +loaded. Such hints can be passed in the code, i.e., to `dlt.resource` decorator or `pipeline.run` +method. Schemas can also be exported and imported as files, which can be directly modified. > 💡 `dlt` associates a schema with a [source](source.md) and a table schema with a > [resource](resource.md). ## Schema content hash and version -Each schema file contains content based hash `version_hash` that is used to: +Each schema file contains a content-based hash `version_hash` that is used to: -1. Detect manual changes to schema (ie. user edits content). +1. Detect manual changes to the schema (i.e., user edits content). 1. Detect if the destination database schema is synchronized with the file schema. Each time the schema is saved, the version hash is updated. -Each schema contains a numeric version which increases automatically whenever schema is updated and -saved. Numeric version is meant to be human-readable. There are cases (parallel processing) where +Each schema contains a numeric version which increases automatically whenever the schema is updated and +saved. The numeric version is meant to be human-readable. There are cases (parallel processing) where the order is lost. -> 💡 Schema in the destination is migrated if its hash is not stored in `_dlt_versions` table. In -> principle many pipelines may send data to a single dataset. If table name clash then a single +> 💡 The schema in the destination is migrated if its hash is not stored in the `_dlt_versions` table. In +> principle, many pipelines may send data to a single dataset. If table names clash, then a single > table with the union of the columns will be created. If columns clash, and they have different -> types etc. then the load may fail if the data cannot be coerced. +> types, etc., then the load may fail if the data cannot be coerced. ## Naming convention -`dlt` creates tables, child tables and column schemas from the data. The data being loaded, -typically JSON documents, contains identifiers (i.e. key names in a dictionary) with any Unicode -characters, any lengths and naming styles. On the other hand the destinations accept very strict -namespaces for their identifiers. Like Redshift that accepts case-insensitive alphanumeric -identifiers with maximum 127 characters. +`dlt` creates tables, child tables, and column schemas from the data. The data being loaded, +typically JSON documents, contains identifiers (i.e., key names in a dictionary) with any Unicode +characters, any lengths, and naming styles. On the other hand, the destinations accept very strict +namespaces for their identifiers. Like Redshift, that accepts case-insensitive alphanumeric +identifiers with a maximum of 127 characters. -Each schema contains `naming convention` that tells `dlt` how to translate identifiers to the +Each schema contains a `naming convention` that tells `dlt` how to translate identifiers to the namespace that the destination understands. 
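+For example, a small sketch of what that translation means in practice (hypothetical identifiers, `duckdb` assumed as the destination; the exact rules are listed below):
+
+```py
+import dlt
+
+pipeline = dlt.pipeline(pipeline_name="naming_demo", destination="duckdb", dataset_name="demo")
+
+# "CamelCase" and "UserName" are translated by the default naming convention,
+# so the data lands in a table called "camel_case" with a column called "user_name"
+pipeline.run([{"UserName": "alice"}], table_name="CamelCase")
+```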
The default naming convention:

-1. Converts identifiers to snake_case, small caps. Removes all ascii characters except ascii
+1. Converts identifiers to snake_case, small caps. Removes all ASCII characters except ASCII
   alphanumerics and underscores.
-1. Adds `_` if name starts with number.
-1. Multiples of `_` are converted into single `_`.
+1. Adds `_` if the name starts with a number.
+1. Multiples of `_` are converted into a single `_`.
1. The parent-child relation is expressed as double `_` in names.
-1. It shorts the identifier if it exceed the length at the destination.
+1. It shortens the identifier if it exceeds the length at the destination.

-> 💡 Standard behavior of `dlt` is to **use the same naming convention for all destinations** so
-> users see always the same tables and columns in their databases.
+> 💡 The standard behavior of `dlt` is to **use the same naming convention for all destinations** so
+> users always see the same tables and columns in their databases.

-> 💡 If you provide any schema elements that contain identifiers via decorators or arguments (i.e.
-> `table_name` or `columns`) all the names used will be converted via the naming convention when
-> adding to the schema. For example if you execute `dlt.run(... table_name="CamelCase")` the data
+> 💡 If you provide any schema elements that contain identifiers via decorators or arguments (i.e.,
+> `table_name` or `columns`), all the names used will be converted via the naming convention when
+> adding to the schema. For example, if you execute `dlt.run(... table_name="CamelCase")`, the data
> will be loaded into `camel_case`.

-> 💡 Use simple, short small caps identifiers for everything!
+> 💡 Use simple, short, small caps identifiers for everything!

-The naming convention is configurable and users can easily create their own
-conventions that i.e. pass all the identifiers unchanged if the destination accepts that (i.e.
+The naming convention is configurable, and users can easily create their own
+conventions that, for example, pass all the identifiers unchanged if the destination accepts that (e.g.,
DuckDB).

## Data normalizer

-Data normalizer changes the structure of the input data, so it can be loaded into destination. The
+The data normalizer changes the structure of the input data, so it can be loaded into the destination. The
standard `dlt` normalizer creates a relational structure from Python dictionaries and lists.
Elements of that structure: table and column definitions, are added to the schema.

-The data normalizer is configurable and users can plug their own normalizers i.e. to handle the
+The data normalizer is configurable, and users can plug in their own normalizers, i.e., to handle the
parent-child table linking differently or generate parquet-like data structs instead of child
tables.

## Tables and columns

The key components of a schema are tables and columns. You can find a dictionary of tables in
-`tables` key or via `tables` property of Schema object.
+the `tables` key or via the `tables` property of the Schema object.

A table schema has the following properties:

1. `name` and `description`.
1. `parent` with a parent table name.
-1. `columns` with dictionary of table schemas.
+1. `columns` with a dictionary of column schemas.
1. `write_disposition` hint telling `dlt` how new data coming to the table is loaded.

-Table schema is extended by data normalizer. Standard data normalizer adds propagated columns to it.
+The table schema is extended by the data normalizer.
The standard data normalizer adds propagated columns to it.

-A column schema contains following properties:
+A column schema contains the following properties:

1. `name` and `description` of a column in a table.
1. `data_type` with a column data type.
1. `precision` a precision for **text**, **timestamp**, **time**, **bigint**, **binary**, and **decimal** types
-1. `scale` a scale for **decimal** type
-1. `is_variant` telling that column was generated as variant of another column.
+1. `scale` a scale for the **decimal** type
+1. `is_variant` telling that the column was generated as a variant of another column.

-A column schema contains following basic hints:
+A column schema contains the following basic hints:

-1. `nullable` tells if column is nullable or not.
-1. `primary_key` marks a column as a part of primary key.
-1. `merge_key` marks a column as a part of merge key used by
+1. `nullable` tells if the column is nullable or not.
+1. `primary_key` marks a column as a part of the primary key.
+1. `merge_key` marks a column as a part of the merge key used by
   [incremental load](./incremental-loading.md#merge-incremental_loading).
-1. `foreign_key` marks a column as a part of foreign key.
-1. `root_key` marks a column as a part of root key which is a type of foreign key always referring to the
+1. `foreign_key` marks a column as a part of the foreign key.
+1. `root_key` marks a column as a part of the root key which is a type of foreign key always referring to the
   root table.
-1. `unique` tells that column is unique. on some destination that generates unique index.
+1. `unique` tells that the column is unique. On some destinations, that generates a unique index.

`dlt` lets you define additional performance hints:

-1. `partition` marks column to be used to partition data.
-1. `cluster` marks column to be part to be used to cluster data
-1. `sort` marks column as sortable/having order. on some destinations that non-unique generates
-   index.
+1. `partition` marks the column to be used to partition data.
+1. `cluster` marks the column to be used to cluster data.
+1. `sort` marks the column as sortable/having order. On some destinations, that generates
+   a non-unique index.

:::note
-Each destination can interpret the hints in its own way. For example `cluster` hint is used by
-Redshift to define table distribution and by BigQuery to specify cluster column. DuckDB and
+Each destination can interpret the hints in its own way. For example, the `cluster` hint is used by
+Redshift to define table distribution and by BigQuery to specify the cluster column. DuckDB and
Postgres ignore it when creating tables.
:::

### Variant columns

-Variant columns are generated by a normalizer when it encounters data item with type that cannot be
-coerced in existing column. Please see our [`coerce_row`](https://github.com/dlt-hub/dlt/blob/7d9baf1b8fdf2813bcf7f1afe5bb3558993305ca/dlt/common/schema/schema.py#L205) if you are interested to see how internally it works.
+Variant columns are generated by a normalizer when it encounters a data item with a type that cannot be coerced into an existing column. Please see our [`coerce_row`](https://github.com/dlt-hub/dlt/blob/7d9baf1b8fdf2813bcf7f1afe5bb3558993305ca/dlt/common/schema/schema.py#L205) if you are interested in seeing how it works internally.
-Let's consider our [getting started](../getting-started#quick-start) example with slightly different approach, +Let's consider our [getting started](../getting-started#quick-start) example with a slightly different approach, where `id` is an integer type at the beginning ```py @@ -138,24 +138,24 @@ data = [ ] ``` -once pipeline runs we will have the following schema: +Once the pipeline runs, we will have the following schema: | name | data_type | nullable | | ------------- | ------------- | -------- | | id | bigint | true | | human_name | text | true | -Now imagine the data has changed and `id` field also contains strings +Now imagine the data has changed and the `id` field also contains strings ```py data = [ - {"id": 1, "human_name": "Alice"} + {"id": 1, "human_name": "Alice"}, {"id": "idx-nr-456", "human_name": "Bob"} ] ``` -So after you run the pipeline `dlt` will automatically infer type changes and will add a new field in the schema `id__v_text` -to reflect that new data type for `id` so for any type which is not compatible with integer it will create a new field. +So after you run the pipeline, `dlt` will automatically infer type changes and will add a new field in the schema `id__v_text` +to reflect that new data type for `id`. So for any type which is not compatible with integer, it will create a new field. | name | data_type | nullable | | ------------- | ------------- | -------- | @@ -163,10 +163,10 @@ to reflect that new data type for `id` so for any type which is not compatible w | human_name | text | true | | id__v_text | text | true | -On the other hand if `id` field was already a string then introducing new data with `id` containing other types -will not change schema because they can be coerced to string. +On the other hand, if the `id` field was already a string, then introducing new data with `id` containing other types +will not change the schema because they can be coerced to string. -Now go ahead and try to add a new record where `id` is float number, you should see a new field `id__v_double` in the schema. +Now go ahead and try to add a new record where `id` is a float number, you should see a new field `id__v_double` in the schema. ### Data types @@ -178,36 +178,36 @@ Now go ahead and try to add a new record where `id` is float number, you should | timestamp | `'2023-07-26T14:45:00Z'`, `datetime.datetime.now()` | Supports precision expressed as parts of a second | | date | `datetime.date(2023, 7, 26)` | | | time | `'14:01:02'`, `datetime.time(14, 1, 2)` | Supports precision - see **timestamp** | -| bigint | `9876543210` | Supports precision as number of bits | +| bigint | `9876543210` | Supports precision as a number of bits | | binary | `b'\x00\x01\x02\x03'` | Supports precision, like **text** | | complex | `[4, 5, 6]`, `{'a': 1}` | | | decimal | `Decimal('4.56')` | Supports precision and scale | | wei | `2**56` | | -`wei` is a datatype tries to best represent native Ethereum 256bit integers and fixed point +`wei` is a datatype that tries to best represent native Ethereum 256bit integers and fixed-point decimals. It works correctly on Postgres and BigQuery. All the other destinations have insufficient precision. -`complex` data type tells `dlt` to load that element as JSON or struct and do not attempt to flatten +The `complex` data type tells `dlt` to load that element as JSON or struct and not to attempt to flatten or create a child table out of it. -`time` data type is saved in destination without timezone info, if timezone is included it is stripped. E.g. 
`'14:01:02+02:00` -> `'14:01:02'`. +The `time` data type is saved in the destination without timezone info. If timezone is included, it is stripped. E.g., `'14:01:02+02:00` -> `'14:01:02'`. :::tip -The precision and scale are interpreted by particular destination and are validated when a column is created. Destinations that +The precision and scale are interpreted by the particular destination and are validated when a column is created. Destinations that do not support precision for a given data type will ignore it. -The precision for **timestamp** is useful when creating **parquet** files. Use 3 - for milliseconds, 6 for microseconds, 9 for nanoseconds +The precision for **timestamp** is useful when creating **parquet** files. Use 3 - for milliseconds, 6 for microseconds, 9 for nanoseconds. -The precision for **bigint** is mapped to available integer types ie. TINYINT, INT, BIGINT. The default is 64 bits (8 bytes) precision (BIGINT) +The precision for **bigint** is mapped to available integer types, i.e., TINYINT, INT, BIGINT. The default is 64 bits (8 bytes) precision (BIGINT). ::: ## Schema settings -The `settings` section of schema file lets you define various global rules that impact how tables +The `settings` section of the schema file lets you define various global rules that impact how tables and columns are inferred from data. -> 💡 It is the best practice to use those instead of providing the exact column schemas via `columns` +> 💡 It is best practice to use these instead of providing the exact column schemas via the `columns` > argument or by pasting them in `yaml`. ### Data type autodetectors @@ -226,10 +226,10 @@ settings: ### Column hint rules -You can define a global rules that will apply hints of a newly inferred columns. Those rules apply +You can define global rules that will apply hints to newly inferred columns. These rules apply to normalized column names. You can use column names directly or with regular expressions. -Example from ethereum schema: +Example from the Ethereum schema: ```yaml settings: @@ -252,10 +252,10 @@ settings: ### Preferred data types You can define rules that will set the data type for newly created columns. Put the rules under -`preferred_types` key of `settings`. On the left side there's a rule on a column name, on the right +the `preferred_types` key of `settings`. On the left side, there's a rule on a column name, on the right side is the data type. -> ❗See the column hint rules for naming convention! +> ❗See the column hint rules for the naming convention! Example: @@ -276,16 +276,15 @@ schema files in your pipeline. ## Attaching schemas to sources -We recommend to not create schemas explicitly. Instead, user should provide a few global schema -settings and then let the table and column schemas to be generated from the resource hints and the -data itself. +We recommend not creating schemas explicitly. Instead, users should provide a few global schema +settings and then let the table and column schemas be generated from the resource hints and the data itself. The `dlt.source` decorator accepts a schema instance that you can create yourself and modify in -whatever way you wish. The decorator also support a few typical use cases: +whatever way you wish. 
The decorator also supports a few typical use cases:

### Schema created implicitly by decorator

-If no schema instance is passed, the decorator creates a schema with the name set to source name and
+If no schema instance is passed, the decorator creates a schema with the name set to the source name and
all the settings to default.

### Automatically load schema file stored with source python module

@@ -294,16 +293,16 @@ If no schema instance is passed, and a file with a name `{source name}_schema.ym
same folder as the module with the decorated function, it will be automatically loaded and used as
the schema.

-This should make easier to bundle a fully specified (or pre-configured) schema with a source.
+This should make it easier to bundle a fully specified (or pre-configured) schema with a source.

### Schema is modified in the source function body

-What if you can configure your schema or add some tables only inside your schema function, when i.e.
-you have the source credentials and user settings available? You could for example add detailed
-schemas of all the database tables when someone requests a table data to be loaded. This information
-is available only at the moment source function is called.
+What if you can configure your schema or add some tables only inside your source function, when, for example,
+you have the source credentials and user settings available? You could, for example, add detailed
+schemas of all the database tables when someone requests table data to be loaded. This information
+is available only at the moment the source function is called.

-Similarly to the `source_state()` and `resource_state()` , source and resource function has current
+Similarly to `source_state()` and `resource_state()`, the source and resource functions have the current
schema available via `dlt.current.source_schema()`.

Example:

@@ -315,9 +314,10 @@ def textual(nesting_level: int):
    schema = dlt.current.source_schema()
    # remove date detector
    schema.remove_type_detection("iso_timestamp")
-    # convert UNIX timestamp (float, withing a year from NOW) into timestamp
+    # convert UNIX timestamp (float, within a year from NOW) into timestamp
    schema.add_type_detection("timestamp")
    schema.compile_settings()

    return dlt.resource(...)
```
+
diff --git a/docs/website/docs/general-usage/source.md b/docs/website/docs/general-usage/source.md
index 1b3d1ce0cc..11fdb266b4 100644
--- a/docs/website/docs/general-usage/source.md
+++ b/docs/website/docs/general-usage/source.md
@@ -6,24 +6,24 @@ keywords: [source, api, dlt.source]

# Source

-A [source](glossary.md#source) is a logical grouping of resources i.e. endpoints of a
+A [source](glossary.md#source) is a logical grouping of resources, i.e., endpoints of a
single API. The most common approach is to define it in a separate Python module.

- A source is a function decorated with `@dlt.source` that returns one or more resources.
-- A source can optionally define a [schema](schema.md) with tables, columns, performance hints and
+- A source can optionally define a [schema](schema.md) with tables, columns, performance hints, and
  more.
- The source Python module typically contains optional customizations and data transformations.
-- The source Python module typically contains the authentication and pagination code for particular
+- The source Python module typically contains the authentication and pagination code for a particular
  API.
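+For example, a minimal sketch of such a grouping (hypothetical endpoint and data; the sections below show a realistic, paginated example):
+
+```python
+import dlt
+
+@dlt.source
+def my_api():
+    # a source is just a decorated function that returns one or more resources
+    @dlt.resource(write_disposition="append")
+    def users():
+        yield [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
+
+    return users
+```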
## Declare sources -You declare source by decorating an (optionally async) function that return or yields one or more resource with `dlt.source`. Our -[Create a pipeline](../walkthroughs/create-a-pipeline.md) how to guide teaches you how to do that. +You declare a source by decorating a function that returns one or more resources with `dlt.source`. Our +[Create a pipeline](../walkthroughs/create-a-pipeline.md) guide teaches you how to do that. ### Create resources dynamically -You can create resources by using `dlt.resource` as a function. In an example below we reuse a +You can create resources by using `dlt.resource` as a function. In the example below, we reuse a single generator function to create a list of resources for several Hubspot endpoints. ```python @@ -36,28 +36,22 @@ def hubspot(api_key=dlt.secrets.value): yield requests.get(url + "/" + endpoint).json() for endpoint in endpoints: - # calling get_resource creates generator, + # calling get_resource creates a generator, # the actual code of the function will be executed in pipeline.run yield dlt.resource(get_resource(endpoint), name=endpoint) ``` ### Attach and configure schemas -You can [create, attach and configure schema](schema.md#attaching-schemas-to-sources) that will be +You can [create, attach, and configure a schema](schema.md#attaching-schemas-to-sources) that will be used when loading the source. -### Avoid long lasting operations in source function -Do not extract data in source function. Leave that task to your resources if possible. Source function is executed immediately when called (contrary to resources which delay execution - like Python generators). There are several benefits (error handling, execution metrics, parallelization) you get when you extract data in `pipeline.run` or `pipeline.extract`. - -If this is impractical (for example you want to reflect a database to create resources for tables) make sure you do not call source function too often. [See this note if you plan to deploy on Airflow](../walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer.md#2-modify-dag-file) - - ## Customize sources ### Access and select resources to load -You can access resources present in a source and select which of them you want to load. In case of -`hubspot` resource above we could select and load "companies", "deals" and "products" resources: +You can access resources present in a source and select which of them you want to load. In the case of +the `hubspot` resource above, we could select and load "companies", "deals", and "products" resources: ```python from hubspot import hubspot @@ -84,9 +78,9 @@ print(source.deals.selected) source.deals.selected = False ``` -### Filter, transform and pivot data +### Filter, transform, and pivot data -You can modify and filter data in resources, for example if we want to keep only deals after certain +You can modify and filter data in resources, for example, if we want to keep only deals after a certain date: ```python @@ -97,11 +91,11 @@ Find more on transforms [here](resource.md#filter-transform-and-pivot-data). ### Load data partially -You can limit the number of items produced by each resource by calling a `add_limit` method on a -source. This is useful for testing, debugging and generating sample datasets for experimentation. +You can limit the number of items produced by each resource by calling an `add_limit` method on a +source. This is useful for testing, debugging, and generating sample datasets for experimentation. 
You can easily get your test dataset in a few minutes, when otherwise you'd need to wait hours for the full loading to complete. Below we limit the `pipedrive` source to just get 10 pages of data -from each endpoint. Mind that the transformers will be evaluated fully: +from each endpoint. Keep in mind that the transformers will be evaluated fully: ```python from pipedrive import pipedrive_source @@ -113,12 +107,11 @@ print(load_info) Find more on sampling data [here](resource.md#sample-from-large-data). -### Add more resources to existing source +### Add more resources to an existing source -You can add a custom resource to source after it was created. Imagine that you want to score all the -deals with a keras model that will tell you if the deal is a fraud or not. In order to do that you -declare a new -[transformer that takes the data from](resource.md#feeding-data-from-one-resource-into-another) `deals` +You can add a custom resource to a source after it was created. Imagine that you want to score all the +deals with a keras model that will tell you if the deal is a fraud or not. In order to do that, you declare a new +[transformer that takes the data from](resource.md#feeding-data-from-one-resource-into-another) the `deals` resource and add it to the source. ```python @@ -148,7 +141,7 @@ or source.resources["deal_scores"] = source.deals | deal_scores ``` :::note -When adding resource to the source, `dlt` clones the resource so your existing instance is not affected. +When adding a resource to the source, `dlt` clones the resource so your existing instance is not affected. ::: ### Reduce the nesting level of generated tables @@ -162,12 +155,12 @@ def mongo_db(): ... ``` -In the example above we want only 1 level of child tables to be generates (so there are no child +In the example above, we want only 1 level of child tables to be generated (so there are no child tables of child tables). Typical settings: - `max_table_nesting=0` will not generate child tables at all and all nested data will be represented as json. -- `max_table_nesting=1` will generate child tables of top level tables and nothing more. All nested +- `max_table_nesting=1` will generate child tables of top-level tables and nothing more. All nested data in child tables will be represented as json. You can achieve the same effect after the source instance is created: @@ -179,23 +172,23 @@ source = mongo_db() source.max_table_nesting = 0 ``` -Several data sources are prone to contain semi-structured documents with very deep nesting i.e. +Several data sources are prone to contain semi-structured documents with very deep nesting, i.e., MongoDB databases. Our practical experience is that setting the `max_nesting_level` to 2 or 3 -produces the clearest and human-readable schemas. +produces the clearest and most human-readable schemas. ### Modify schema -The schema is available via `schema` property of the source. -[You can manipulate this schema i.e. add tables, change column definitions etc. before the data is loaded.](schema.md#schema-is-modified-in-the-source-function-body) +The schema is available via the `schema` property of the source. +[You can manipulate this schema, i.e., add tables, change column definitions, etc., before the data is loaded.](schema.md#schema-is-modified-in-the-source-function-body) -Source provides two other convenience properties: +The source provides two other convenience properties: 1. `max_table_nesting` to set the maximum nesting level of child tables -1. 
`root_key` to propagate the `_dlt_id` of from a root table to all child tables. +1. `root_key` to propagate the `_dlt_id` from a root table to all child tables. ## Load sources -You can pass individual sources or list of sources to the `dlt.pipeline` object. By default, all the +You can pass individual sources or a list of sources to the `dlt.pipeline` object. By default, all the sources will be loaded to a single dataset. You are also free to decompose a single source into several ones. For example, you may want to break @@ -225,3 +218,4 @@ With selected resources: ```python p.run(tables.with_resources("users"), write_disposition="replace") ``` + diff --git a/docs/website/docs/general-usage/state.md b/docs/website/docs/general-usage/state.md index 23625db27c..ceac17bf16 100644 --- a/docs/website/docs/general-usage/state.md +++ b/docs/website/docs/general-usage/state.md @@ -6,14 +6,11 @@ keywords: [state, metadata, dlt.current.resource_state, dlt.current.source_state # State -The pipeline state is a Python dictionary which lives alongside your data; you can store values in -it and, on next pipeline run, request them back. +The pipeline state is a Python dictionary that lives alongside your data. You can store values in it and, on the next pipeline run, request them back. ## Read and write pipeline state in a resource -You read and write the state in your resources. Below we use the state to create a list of chess -game archives which we then use to -[prevent requesting duplicates](incremental-loading.md#advanced-state-usage-storing-a-list-of-processed-entities). +You read and write the state in your resources. Below, we use the state to create a list of chess game archives, which we then use to [prevent requesting duplicates](incremental-loading.md#advanced-state-usage-storing-a-list-of-processed-entities). ```python @dlt.resource(write_disposition="append") @@ -35,103 +32,67 @@ def players_games(chess_url, player, start_month=None, end_month=None): yield r.json().get("games", []) ``` -Above, we request the resource-scoped state. The `checked_archives` list stored under `archives` -dictionary key is private and visible only to the `players_games` resource. +Above, we request the resource-scoped state. The `checked_archives` list stored under the `archives` dictionary key is private and visible only to the `players_games` resource. -The pipeline state is stored locally in -[pipeline working directory](pipeline.md#pipeline-working-directory) and as a consequence - it -cannot be shared with pipelines with different names. You must also make sure that data written into -the state is JSON Serializable. Except standard Python types, `dlt` handles `DateTime`, `Decimal`, -`bytes` and `UUID`. +The pipeline state is stored locally in the [pipeline working directory](pipeline.md#pipeline-working-directory) and, as a consequence, it cannot be shared with pipelines with different names. You must also ensure that data written into the state is JSON Serializable. In addition to standard Python types, `dlt` handles `DateTime`, `Decimal`, `bytes`, and `UUID`. ## Share state across resources and read state in a source -You can also access the source-scoped state with `dlt.current.source_state()` which can be shared -across resources of a particular source and is also available read-only in the source-decorated -functions. The most common use case for the source-scoped state is to store mapping of custom fields -to their displayable names. 
You can take a look at our -[pipedrive source](https://github.com/dlt-hub/verified-sources/blob/master/sources/pipedrive/__init__.py#L118) -for an example of state passed across resources. +You can also access the source-scoped state with `dlt.current.source_state()`, which can be shared across resources of a particular source and is also available read-only in the source-decorated functions. The most common use case for the source-scoped state is to store a mapping of custom fields to their displayable names. You can take a look at our [pipedrive source](https://github.com/dlt-hub/verified-sources/blob/master/sources/pipedrive/__init__.py#L118) for an example of state passed across resources. :::tip -[decompose your source](../reference/performance.md#source-decomposition-for-serial-and-parallel-resource-execution) -in order to, for example run it on Airflow in parallel. If you cannot avoid that, designate one of -the resources as state writer and all the other as state readers. This is exactly what `pipedrive` -pipeline does. With such structure you will still be able to run some of your resources in -parallel. +[Decompose your source](../reference/performance.md#source-decomposition-for-serial-and-parallel-resource-execution) in order to, for example, run it on Airflow in parallel. If you cannot avoid that, designate one of the resources as the state writer and all the others as state readers. This is exactly what the `pipedrive` pipeline does. With such a structure, you will still be able to run some of your resources in parallel. ::: :::caution -The `dlt.state()` is a deprecated alias to `dlt.current.source_state()` and will be soon -removed. +The `dlt.state()` is a deprecated alias to `dlt.current.source_state()` and will soon be removed. ::: ## Syncing state with destination -What if you run your pipeline on, for example, Airflow where every task gets a clean filesystem and -[pipeline working directory](pipeline.md#pipeline-working-directory) is always deleted? `dlt` loads -your state into the destination together with all other data and when faced with a clean start, it -will try to restore state from the destination. +What if you run your pipeline on, for example, Airflow, where every task gets a clean filesystem and the [pipeline working directory](pipeline.md#pipeline-working-directory) is always deleted? `dlt` loads your state into the destination along with all other data and, when faced with a clean start, it will try to restore the state from the destination. -The remote state is identified by pipeline name, the destination location (as given by the -credentials) and destination dataset. To re-use the same state, use the same pipeline name and -destination. +The remote state is identified by the pipeline name, the destination location (as given by the credentials), and the destination dataset. To reuse the same state, use the same pipeline name and destination. -The state is stored in the `_dlt_pipeline_state` table at the destination and contains information -about the pipeline, pipeline run (that the state belongs to) and state blob. +The state is stored in the `_dlt_pipeline_state` table at the destination and contains information about the pipeline, pipeline run (that the state belongs to), and state blob. -`dlt` has `dlt pipeline sync` command where you can -[request the state back from that table](../reference/command-line-interface.md#sync-pipeline-with-the-destination). 
+`dlt` has a `dlt pipeline sync` command where you can [request the state back from that table](../reference/command-line-interface.md#sync-pipeline-with-the-destination). -> 💡 If you can keep the pipeline working directory across the runs, you can disable the state sync -> by setting `restore_from_destination=false` i.e. in your `config.toml`. +> 💡 If you can keep the pipeline working directory across the runs, you can disable the state sync by setting `restore_from_destination=false`, for example, in your `config.toml`. ## When to use pipeline state -- `dlt` uses the state internally to implement - [last value incremental loading](incremental-loading.md#incremental_loading-with-last-value). This - use case should cover around 90% of your needs to use the pipeline state. -- [Store a list of already requested entities](incremental-loading.md#advanced-state-usage-storing-a-list-of-processed-entities) - if the list is not much bigger than 100k elements. -- [Store large dictionaries of last values](incremental-loading.md#advanced-state-usage-tracking-the-last-value-for-all-search-terms-in-twitter-api) - if you are not able to implement it with the standard incremental construct. -- Store the custom fields dictionaries, dynamic configurations and other source-scoped state. +- `dlt` uses the state internally to implement [last value incremental loading](incremental-loading.md#incremental_loading-with-last-value). This use case should cover around 90% of your needs to use the pipeline state. +- [Store a list of already requested entities](incremental-loading.md#advanced-state-usage-storing-a-list-of-processed-entities) if the list is not much larger than 100k elements. +- [Store large dictionaries of last values](incremental-loading.md#advanced-state-usage-tracking-the-last-value-for-all-search-terms-in-twitter-api) if you are unable to implement it with the standard incremental construct. +- Store the custom fields dictionaries, dynamic configurations, and other source-scoped state. ## When not to use pipeline state -Do not use dlt state when it may grow to millions of elements. Do you plan to store modification -timestamps of all of your millions of user records? This is probably a bad idea! In that case you -could: +Do not use the dlt state when it may grow to millions of elements. Do you plan to store modification timestamps of all of your millions of user records? This is probably a bad idea! In that case, you could: -- Store the state in dynamo-db, redis etc. taking into the account that if the extract stage fails - you'll end with invalid state. -- Use your loaded data as the state. `dlt` exposes the current pipeline via `dlt.current.pipeline()` - from which you can obtain - [sqlclient](../dlt-ecosystem/transformations/sql.md) - and load the data of interest. In that case try at least to process your user records in batches. +- Store the state in DynamoDB, Redis, etc., taking into account that if the extract stage fails, you'll end up with an invalid state. +- Use your loaded data as the state. `dlt` exposes the current pipeline via `dlt.current.pipeline()` from which you can obtain a [sqlclient](../dlt-ecosystem/transformations/sql.md) and load the data of interest. In that case, try at least to process your user records in batches. 
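+A rough sketch of the second option (hypothetical resource and table names, assuming a `users` table already exists at the destination from a previous run):
+
+```python
+import dlt
+
+def batches(items, size=10_000):
+    # simple batching helper so we never query or hold everything at once
+    for i in range(0, len(items), size):
+        yield items[i : i + size]
+
+@dlt.resource(write_disposition="append")
+def new_users(candidate_users):
+    # instead of keeping millions of user ids in the pipeline state,
+    # ask the destination what has already been loaded
+    pipeline = dlt.current.pipeline()
+    with pipeline.sql_client() as client:
+        for batch in batches(candidate_users):
+            ids = ", ".join(str(u["id"]) for u in batch)
+            loaded = {row[0] for row in client.execute_sql(f"SELECT id FROM users WHERE id IN ({ids})")}
+            yield [u for u in batch if u["id"] not in loaded]
+```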
## Inspect the pipeline state -You can inspect pipeline state with -[`dlt pipeline` command](../reference/command-line-interface.md#dlt-pipeline): +You can inspect the pipeline state with the [`dlt pipeline` command](../reference/command-line-interface.md#dlt-pipeline): ```sh dlt pipeline -v chess_pipeline info ``` -will display source and resource state slots for all known sources. +This command will display source and resource state slots for all known sources. ## Reset the pipeline state: full or partial **To fully reset the state:** - Drop the destination dataset to fully reset the pipeline. -- [Set the `full_refresh` flag when creating pipeline](pipeline.md#do-experiments-with-full-refresh). -- Use the `dlt pipeline drop --drop-all` command to - [drop state and tables for a given schema name](../reference/command-line-interface.md#selectively-drop-tables-and-reset-state). +- [Set the `full_refresh` flag when creating a pipeline](pipeline.md#do-experiments-with-full-refresh). +- Use the `dlt pipeline drop --drop-all` command to [drop state and tables for a given schema name](../reference/command-line-interface.md#selectively-drop-tables-and-reset-state). **To partially reset the state:** -- Use the `dlt pipeline drop ` command to - [drop state and tables for a given resource](../reference/command-line-interface.md#selectively-drop-tables-and-reset-state). -- Use the `dlt pipeline drop --state-paths` command to - [reset the state at given path without touching the tables and data](../reference/command-line-interface.md#selectively-drop-tables-and-reset-state). +- Use the `dlt pipeline drop ` command to [drop state and tables for a given resource](../reference/command-line-interface.md#selectively-drop-tables-and-reset-state). +- Use the `dlt pipeline drop --state-paths` command to [reset the state at a given path without touching the tables and data](../reference/command-line-interface.md#selectively-drop-tables-and-reset-state). + diff --git a/docs/website/docs/tutorial/grouping-resources.md b/docs/website/docs/tutorial/grouping-resources.md index a54ba97fe3..64fdd1ceac 100644 --- a/docs/website/docs/tutorial/grouping-resources.md +++ b/docs/website/docs/tutorial/grouping-resources.md @@ -6,8 +6,8 @@ keywords: [api, source, decorator, dynamic resource, github, tutorial] This tutorial continues the [previous](load-data-from-an-api) part. We'll use the same GitHub API example to show you how to: 1. Load data from other GitHub API endpoints. -1. Group your resources into sources for easier management. -2. Handle secrets and configuration. +2. Group your resources into sources for easier management. +3. Handle secrets and configuration. ## Use source decorator @@ -29,7 +29,7 @@ def get_comments( response.raise_for_status() yield response.json() - # get next page + # Get the next page if "next" not in response.links: break url = response.links["next"]["url"] @@ -43,7 +43,7 @@ def github_source(): return [get_issues, get_comments] ``` -`github_source()` groups resources into a [source](../general-usage/source). A dlt source is a logical grouping of resources. You use it to group resources that belong together, for example, to load data from the same API. Loading data from a source can be run in a single pipeline. Here's what our updated script looks like: +`github_source()` groups resources into a [source](../general-usage/source). A dlt source is a logical grouping of resources. You use it to group resources that belong together, for example, to load data from the same API. 
Loading data from a source can be run in a single pipeline. Here's what our updated script looks like:

```py
import dlt
@@ -68,7 +68,7 @@ def get_issues(
        response.raise_for_status()
        yield response.json()

-        # Get next page
+        # Get the next page
        if "next" not in response.links:
            break
        url = response.links["next"]["url"]
@@ -92,7 +92,7 @@ def get_comments(
        response.raise_for_status()
        yield response.json()

-        # Get next page
+        # Get the next page
        if "next" not in response.links:
            break
        url = response.links["next"]["url"]
@@ -132,7 +132,7 @@ def fetch_github_data(endpoint, params={}):
        response.raise_for_status()
        yield response.json()

-        # Get next page
+        # Get the next page
        if "next" not in response.links:
            break
        url = response.links["next"]["url"]
@@ -159,7 +159,7 @@ row_counts = pipeline.last_trace.last_normalize_info

## Handle secrets

-For the next step we'd want to get the [number of repository clones](https://docs.github.com/en/rest/metrics/traffic?apiVersion=2022-11-28#get-repository-clones) for our dlt repo from the GitHub API. However, the `traffic/clones` endpoint that returns the data requires [authentication](https://docs.github.com/en/rest/overview/authenticating-to-the-rest-api?apiVersion=2022-11-28).
+For the next step, we want to get the [number of repository clones](https://docs.github.com/en/rest/metrics/traffic?apiVersion=2022-11-28#get-repository-clones) for our dlt repo from the GitHub API. However, the `traffic/clones` endpoint that returns the data requires [authentication](https://docs.github.com/en/rest/overview/authenticating-to-the-rest-api?apiVersion=2022-11-28).

Let's handle this by changing our `fetch_github_data()` first:

@@ -175,7 +175,7 @@ def fetch_github_data(endpoint, params={}, access_token=None):
        response.raise_for_status()
        yield response.json()

-        # Get next page
+        # Get the next page
        if "next" not in response.links:
            break
        url = response.links["next"]["url"]
@@ -194,13 +194,13 @@ def github_source(access_token):
    ...
```

-Here, we added `access_token` parameter and now we can use it to pass the access token to the request:
+Here, we added the `access_token` parameter and now we can use it to pass the access token to the request:

```python
load_info = pipeline.run(github_source(access_token="ghp_XXXXX"))
```

-It's a good start. But we'd want to follow the best practices and not hardcode the token in the script. One option is to set the token as an environment variable, load it with `os.getenv()` and pass it around as a parameter. dlt offers a more convenient way to handle secrets and credentials: it lets you inject the arguments using a special `dlt.secrets.value` argument value.
+It's a good start. But we want to follow the best practices and not hardcode the token in the script. One option is to set the token as an environment variable, load it with `os.getenv()` and pass it around as a parameter. dlt offers a more convenient way to handle secrets and credentials: it lets you inject the arguments using a special `dlt.secrets.value` argument value.

To use it, change the `github_source()` function to:

@@ -217,7 +217,7 @@ When you add `dlt.secrets.value` as a default value for an argument, `dlt` will

1. Special environment variables.
2. `secrets.toml` file.

-The `secret.toml` file is located in the `~/.dlt` folder (for global configuration) or in the `.dlt` folder in the project folder (for project-specific configuration).
+The `secrets.toml` file is located in the `~/.dlt` folder (for global configuration) or in the `.dlt` folder in the project folder (for project-specific configuration). Let's add the token to the `~/.dlt/secrets.toml` file: @@ -246,7 +246,7 @@ def fetch_github_data(endpoint, params={}, access_token=None): response.raise_for_status() yield response.json() - # get next page + # Get the next page if "next" not in response.links: break url = response.links["next"]["url"] @@ -296,7 +296,7 @@ def fetch_github_data(repo_name, endpoint, params={}, access_token=None): response.raise_for_status() yield response.json() - # Get next page + # Get the next page if "next" not in response.links: break url = response.links["next"]["url"] @@ -345,7 +345,8 @@ Interested in learning more? Here are some suggestions: - [Create your resources dynamically from data](../general-usage/source#create-resources-dynamically). - [Transform your data before loading](../general-usage/resource#customize-resources) and see some [examples of customizations like column renames and anonymization](../general-usage/customising-pipelines/renaming_columns). - [Pass config and credentials into your sources and resources](../general-usage/credentials). - - [Run in production: inspecting, tracing, retry policies and cleaning up](../running-in-production/running). - - [Run resources in parallel, optimize buffers and local storage](../reference/performance.md) + - [Run in production: inspecting, tracing, retry policies, and cleaning up](../running-in-production/running). + - [Run resources in parallel, optimize buffers, and local storage](../reference/performance.md) 3. Check out our [how-to guides](../walkthroughs) to get answers to some common questions. -4. Explore the [Examples](../examples) section to see how dlt can be used in real-world scenarios +4. Explore the [Examples](../examples) section to see how dlt can be used in real-world scenarios. + diff --git a/docs/website/docs/tutorial/intro.md b/docs/website/docs/tutorial/intro.md index 2d53412ae0..32ddd0cb80 100644 --- a/docs/website/docs/tutorial/intro.md +++ b/docs/website/docs/tutorial/intro.md @@ -3,7 +3,7 @@ title: Tutorial description: Build a data pipeline with dlt keywords: [tutorial, api, github, duckdb, pipeline] --- -Welcome to the tutorial on how to efficiently use dlt to build a data pipeline. This tutorial will introduce you to the foundational concepts of dlt and guide you through basic and advanced usage scenarios. +Welcome to the tutorial on how to use dlt efficiently to build a data pipeline. This tutorial will introduce you to the foundational concepts of dlt and guide you through basic and advanced usage scenarios. As a practical example, we'll build a data pipeline that loads data from the GitHub API into DuckDB. @@ -16,6 +16,7 @@ As a practical example, we'll build a data pipeline that loads data from the Git - [Securely handling secrets](./grouping-resources.md#handle-secrets) - [Making reusable data sources](./grouping-resources.md#configurable-sources) -## Ready to dive in? +## Ready to Dive In? + +Let's begin by loading data from an API. -Let's begin by loading data from an API. 
\ No newline at end of file diff --git a/docs/website/docs/tutorial/load-data-from-an-api.md b/docs/website/docs/tutorial/load-data-from-an-api.md index f14566b5c0..e8ef04f13c 100644 --- a/docs/website/docs/tutorial/load-data-from-an-api.md +++ b/docs/website/docs/tutorial/load-data-from-an-api.md @@ -66,8 +66,8 @@ dlt pipeline github_issues show Try running the pipeline again with `python github_issues.py`. You will notice that the **issues** table contains two copies of the same data. This happens because the default load mode is `append`. It is very useful, for example, when you have a new folder created daily with `json` file logs, and you want to ingest them. -To get the latest data, we'd need to run the script again. But how to do that without duplicating the data? -One option is to tell `dlt` to replace the data in existing tables in the destination by using `replace` write disposition. Change the `github_issues.py` script to the following: +To get the latest data, we'd need to run the script again. But how do we do that without duplicating the data? +One option is to tell `dlt` to replace the data in existing tables in the destination by using the `replace` write disposition. Change the `github_issues.py` script to the following: ```py import dlt @@ -94,7 +94,7 @@ load_info = pipeline.run( print(load_info) ``` -Run this script twice to see that **issues** table still contains only one copy of the data. +Run this script twice to see that the `issues` table still contains only one copy of the data. :::tip What if the API has changed and new fields get added to the response? @@ -109,14 +109,14 @@ Learn more: ## Declare loading behavior -So far we have been passing the data to the `run` method directly. This is a quick way to get started. However, frequenly, you receive data in chunks, and you want to load it as it arrives. For example, you might want to load data from an API endpoint with pagination or a large file that does not fit in memory. In such cases, you can use Python generators as a data source. +So far we have been passing the data to the `run` method directly. This is a quick way to get started. However, frequently, you receive data in chunks, and you want to load it as it arrives. For example, you might want to load data from an API endpoint with pagination or a large file that does not fit in memory. In such cases, you can use Python generators as a data source. You can pass a generator to the `run` method directly or use the `@dlt.resource` decorator to turn the generator into a [dlt resource](../general-usage/resource). The decorator allows you to specify the loading behavior and relevant resource parameters. ### Load only new data (incremental loading) -Let's improve our GitHub API example and get only issues that were created since last load. -Instead of using `replace` write disposition and downloading all issues each time the pipeline is run, we do the following: +Let's improve our GitHub API example and get only issues that were created since the last load. +Instead of using the `replace` write disposition and downloading all issues each time the pipeline is run, we do the following: ```py @@ -127,8 +127,8 @@ from dlt.sources.helpers import requests def get_issues( created_at=dlt.sources.incremental("created_at", initial_value="1970-01-01T00:00:00Z") ): - # NOTE: we read only open issues to minimize number of calls to the API. - # There's a limit of ~50 calls for not authenticated Github users. 
+ # NOTE: we read only open issues to minimize the number of calls to the API. + # There's a limit of ~50 calls for non-authenticated Github users. url = ( "https://api.github.com/repos/dlt-hub/dlt/issues" "?per_page=100&sort=created&directions=desc&state=open" @@ -140,13 +140,13 @@ def get_issues( yield response.json() # Stop requesting pages if the last element was already - # older than initial value + # older than the initial value # Note: incremental will skip those items anyway, we just - # do not want to use the api limits + # do not want to use the API limits if created_at.start_out_of_range: break - # get next page + # get the next page if "next" not in response.links: break url = response.links["next"]["url"] @@ -170,9 +170,9 @@ Let's take a closer look at the code above. We use the `@dlt.resource` decorator to declare the table name into which data will be loaded and specify the `append` write disposition. -We request issues for dlt-hub/dlt repository ordered by **created_at** field (descending) and yield them page by page in `get_issues` generator function. +We request issues for the dlt-hub/dlt repository ordered by the **created_at** field (descending) and yield them page by page in the `get_issues` generator function. -We also use `dlt.sources.incremental` to track `created_at` field present in each issue to filter in the newly created. +We also use `dlt.sources.incremental` to track the `created_at` field present in each issue to filter in the newly created ones. Now run the script. It loads all the issues from our repo to `duckdb`. Run it again, and you can see that no issues got added (if no issues were created in the meantime). @@ -214,8 +214,8 @@ from dlt.sources.helpers import requests def get_issues( updated_at=dlt.sources.incremental("updated_at", initial_value="1970-01-01T00:00:00Z") ): - # NOTE: we read only open issues to minimize number of calls to - # the API. There's a limit of ~50 calls for not authenticated + # NOTE: we read only open issues to minimize the number of calls to + # the API. There's a limit of ~50 calls for non-authenticated # Github users url = ( "https://api.github.com/repos/dlt-hub/dlt/issues" @@ -228,7 +228,7 @@ def get_issues( response.raise_for_status() yield response.json() - # Get next page + # Get the next page if "next" not in response.links: break url = response.links["next"]["url"] @@ -247,11 +247,11 @@ print(load_info) ``` -Above we add `primary_key` argument to the `dlt.resource()` that tells `dlt` how to identify the issues in the database to find duplicates which content it will merge. +Above we add a `primary_key` argument to the `dlt.resource()` that tells `dlt` how to identify the issues in the database to find duplicates which content it will merge. Note that we now track the `updated_at` field — so we filter in all issues **updated** since the last pipeline run (which also includes those newly created). -Pay attention how we use **since** parameter from [GitHub API](https://docs.github.com/en/rest/issues/issues?apiVersion=2022-11-28#list-repository-issues) +Pay attention to how we use the `since` parameter from [GitHub API](https://docs.github.com/en/rest/issues/issues?apiVersion=2022-11-28#list-repository-issues) and `updated_at.last_value` to tell GitHub to return issues updated only **after** the date we pass. `updated_at.last_value` holds the last `updated_at` value from the previous run. [Learn more about merge write disposition](../general-usage/incremental-loading#merge-incremental_loading). 
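+As a quick, self-contained recap of the same merge mechanics (hypothetical data, `duckdb` assumed as the destination):
+
+```py
+import dlt
+
+@dlt.resource(table_name="tickets", primary_key="id", write_disposition="merge")
+def tickets():
+    # running the pipeline twice does not duplicate this row: the second load
+    # merges on "id" and simply updates the existing record
+    yield [{"id": 1, "status": "open", "updated_at": "2023-08-01T00:00:00Z"}]
+
+pipeline = dlt.pipeline(pipeline_name="merge_demo", destination="duckdb", dataset_name="demo")
+print(pipeline.run(tickets()))
+```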
@@ -263,9 +263,10 @@ Continue your journey with the [Resource Grouping and Secrets](grouping-resource If you want to take full advantage of the `dlt` library, then we strongly suggest that you build your sources out of existing **building blocks:** - Pick your [destinations](../dlt-ecosystem/destinations/). -- Check [verified sources](../dlt-ecosystem/verified-sources/) provided by us and community. +- Check [verified sources](../dlt-ecosystem/verified-sources/) provided by us and the community. - Access your data with [SQL](../dlt-ecosystem/transformations/sql) or [Pandas](../dlt-ecosystem/transformations/sql). - [Append, replace and merge your tables](../general-usage/incremental-loading). - [Set up "last value" incremental loading](../general-usage/incremental-loading#incremental_loading-with-last-value). -- [Set primary and merge keys, define the columns nullability and data types](../general-usage/resource#define-schema). -- [Use built-in requests client](../reference/performance#using-the-built-in-requests-client). \ No newline at end of file +- [Set primary and merge keys, define the columns' nullability and data types](../general-usage/resource#define-schema). +- [Use the built-in requests client](../reference/performance#using-the-built-in-requests-client). +