From eb4b1bae9a884054f34475fef823a318bb45683a Mon Sep 17 00:00:00 2001 From: Anton Burnashev Date: Sat, 14 Sep 2024 13:07:53 +0200 Subject: [PATCH] Docs: update the introduction, add the rest_api tutorial (#1729) Co-authored-by: Akela Drissner-Schmid <32450038+akelad@users.noreply.github.com> --- docs/website/docs/intro.md | 177 ++++---- docs/website/docs/reference/installation.md | 8 +- docs/website/docs/tutorial/filesystem.md | 58 +-- .../docs/tutorial/grouping-resources.md | 296 ------------- docs/website/docs/tutorial/intro.md | 21 - .../docs/tutorial/load-data-from-an-api.md | 402 +++++++++++++++++- docs/website/docs/tutorial/rest-api.md | 322 ++++++++++++++ .../{sql_database.md => sql-database.md} | 108 ++--- docs/website/sidebars.js | 16 +- docs/website/src/css/custom.css | 226 +++++++--- 10 files changed, 1077 insertions(+), 557 deletions(-) delete mode 100644 docs/website/docs/tutorial/grouping-resources.md delete mode 100644 docs/website/docs/tutorial/intro.md create mode 100644 docs/website/docs/tutorial/rest-api.md rename docs/website/docs/tutorial/{sql_database.md => sql-database.md} (85%) diff --git a/docs/website/docs/intro.md b/docs/website/docs/intro.md index c269c987b8..6660696cfb 100644 --- a/docs/website/docs/intro.md +++ b/docs/website/docs/intro.md @@ -6,138 +6,153 @@ keywords: [introduction, who, what, how] import snippets from '!!raw-loader!./intro-snippets.py'; -# Introduction +# Getting started ![dlt pacman](/img/dlt-pacman.gif) -## What is `dlt`? +## What is dlt? + +dlt is an open-source Python library that loads data from various, often messy data sources into well-structured, live datasets. It offers a lightweight interface for extracting data from [REST APIs](./tutorial/rest-api), [SQL databases](./tutorial/sql-database), [cloud storage](./tutorial/filesystem), [Python data structures](./tutorial/load-data-from-an-api), and [many more](./dlt-ecosystem/verified-sources). + +dlt is designed to be easy to use, flexible, and scalable: + +- dlt infers [schemas](./general-usage/schema) and [data types](./general-usage/schema/#data-types), [normalizes the data](./general-usage/schema/#data-normalizer), and handles nested data structures. +- dlt supports a variety of [popular destinations](./dlt-ecosystem/destinations/) and has an interface to add [custom destinations](./dlt-ecosystem/destinations/destination) to create reverse ETL pipelines. +- dlt can be deployed anywhere Python runs, be it on [Airflow](./walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer), [serverless functions](./walkthroughs/deploy-a-pipeline/deploy-with-google-cloud-functions) or any other cloud deployment of your choice. +- dlt automates pipeline maintenance with [schema evolution](./general-usage/schema-evolution) and [schema and data contracts](./general-usage/schema-contracts). + +To get started with dlt, install the library using pip: -`dlt` is an open-source library that you can add to your Python scripts to load data -from various and often messy data sources into well-structured, live datasets. To get started, install it with: ```sh pip install dlt ``` :::tip -We recommend using a clean virtual environment for your experiments! Here are [detailed instructions](/reference/installation). +We recommend using a clean virtual environment for your experiments! Read the [detailed instructions](./reference/installation) on how to set up one. ::: -Unlike other solutions, with dlt, there's no need to use any backends or containers. 
Simply import `dlt` in a Python file or a Jupyter Notebook cell, and create a pipeline to load data into any of the [supported destinations](dlt-ecosystem/destinations/). You can load data from any source that produces Python data structures, including APIs, files, databases, and more. `dlt` also supports building a [custom destination](dlt-ecosystem/destinations/destination.md), which you can use as reverse ETL. - -The library will create or update tables, infer data types, and handle nested data automatically. Here are a few example pipelines: +## Load data with dlt from … - + -:::tip -Looking to use a REST API as a source? Explore our new [REST API generic source](dlt-ecosystem/verified-sources/rest_api) for a declarative way to load data. -::: +Use dlt's [REST API source](./tutorial/rest-api) to extract data from any REST API. Define API endpoints you’d like to fetch data from, pagination method and authentication and dlt will handle the rest: - +```py +import dlt +from dlt.sources.rest_api import rest_api_source + +source = rest_api_source({ + "client": { + "base_url": "https://api.example.com/", + "auth": { + "token": dlt.secrets["your_api_token"], + }, + "paginator": { + "type": "json_response", + "next_url_path": "paging.next", + }, + }, + "resources": ["posts", "comments"], +}) +pipeline = dlt.pipeline( + pipeline_name="rest_api_example", + destination="duckdb", + dataset_name="rest_api_data", +) -Copy this example to a file or a Jupyter Notebook and run it. To make it work with the DuckDB destination, you'll need to install the **duckdb** dependency (the default `dlt` installation is really minimal): -```sh -pip install "dlt[duckdb]" +load_info = pipeline.run(source) ``` -Now **run** your Python file or Notebook cell. -How it works? The library extracts data from a [source](general-usage/glossary.md#source) (here: **chess.com REST API**), inspects its structure to create a -[schema](general-usage/glossary.md#schema), structures, normalizes, and verifies the data, and then -loads it into a [destination](general-usage/glossary.md#destination) (here: **duckdb**, into a database schema **player_data** and table name **player**). +Follow the [REST API source tutorial](./tutorial/rest-api) to learn more about the source configuration and pagination methods. + + +Use the [SQL source](./tutorial/sql-database) to extract data from the database like PostgreSQL, MySQL, SQLite, Oracle and more. - +```py +from dlt.sources.sql_database import sql_database - +source = sql_database( + "mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam" +) -Initialize the [Slack source](dlt-ecosystem/verified-sources/slack) with `dlt init` command: +pipeline = dlt.pipeline( + pipeline_name="sql_database_example", + destination="duckdb", + dataset_name="sql_data", +) -```sh -dlt init slack duckdb +load_info = pipeline.run(source) ``` -Create and run a pipeline: +Follow the [SQL source tutorial](./tutorial/sql-database) to learn more about the source configuration and supported databases. + + + + +[Filesystem](./tutorial/filesystem) source extracts data from AWS S3, Google Cloud Storage, Google Drive, Azure, or a local file system. 
```py -import dlt +from dlt.sources.filesystem import filesystem -from slack import slack_source +source = filesystem( + bucket_url="s3://example-bucket", + file_glob="*.csv" +) pipeline = dlt.pipeline( - pipeline_name="slack", + pipeline_name="filesystem_example", destination="duckdb", - dataset_name="slack_data" -) - -source = slack_source( - start_date=datetime(2023, 9, 1), - end_date=datetime(2023, 9, 8), - page_size=100, + dataset_name="filesystem_data", ) load_info = pipeline.run(source) -print(load_info) ``` - - - - Pass anything that you can load with Pandas to `dlt` - - - +Follow the [filesystem source tutorial](./tutorial/filesystem) to learn more about the source configuration and supported storage services. - + -:::tip -Use our verified [SQL database source](dlt-ecosystem/verified-sources/sql_database) -to sync your databases with warehouses, data lakes, or vector stores. -::: +dlt is able to load data from Python generators or directly from Python data structures: - +```py +import dlt +@dlt.resource +def foo(): + for i in range(10): + yield {"id": i, "name": f"This is item {i}"} -Install **pymysql** driver: -```sh -pip install sqlalchemy pymysql -``` +pipeline = dlt.pipeline( + pipeline_name="python_data_example", + destination="duckdb", +) - - +load_info = pipeline.run(foo) +``` +Check out the [Python data structures tutorial](./tutorial/load-data-from-an-api) to learn about dlt fundamentals and advanced usage scenarios. -## Why use `dlt`? + -- Automated maintenance - with schema inference and evolution and alerts, and with short declarative -code, maintenance becomes simple. -- Run it where Python runs - on Airflow, serverless functions, notebooks. No -external APIs, backends, or containers, scales on micro and large infra alike. -- User-friendly, declarative interface that removes knowledge obstacles for beginners -while empowering senior professionals. + -## Getting started with `dlt` -1. Dive into our [Getting started guide](getting-started.md) for a quick intro to the essentials of `dlt`. -2. Play with the -[Google Colab demo](https://colab.research.google.com/drive/1NfSB1DpwbbHX9_t5vlalBTf13utwpMGx?usp=sharing). -This is the simplest way to see `dlt` in action. -3. Read the [Tutorial](tutorial/intro) to learn how to build a pipeline that loads data from an API. -4. Check out the [How-to guides](walkthroughs/) for recipes on common use cases for creating, running, and deploying pipelines. -5. Ask us on -[Slack](https://dlthub.com/community) -if you have any questions about use cases or the library. +:::tip +If you'd like to try out dlt without installing it on your machine, check out the [Google Colab demo](https://colab.research.google.com/drive/1NfSB1DpwbbHX9_t5vlalBTf13utwpMGx?usp=sharing). +::: -## Join the `dlt` community +## Join the dlt community 1. Give the library a ⭐ and check out the code on [GitHub](https://github.com/dlt-hub/dlt). -1. Ask questions and share how you use the library on -[Slack](https://dlthub.com/community). +1. Ask questions and share how you use the library on [Slack](https://dlthub.com/community). 1. Report problems and make feature requests [here](https://github.com/dlt-hub/dlt/issues/new/choose). \ No newline at end of file diff --git a/docs/website/docs/reference/installation.md b/docs/website/docs/reference/installation.md index 8fd80e52ff..a19e01ae80 100644 --- a/docs/website/docs/reference/installation.md +++ b/docs/website/docs/reference/installation.md @@ -137,4 +137,10 @@ conda install -c conda-forge dlt ### 4. Done! 
-You are now ready to [build your first pipeline](../getting-started) :) \ No newline at end of file +You are now ready to build your first pipeline with `dlt`. Check out these tutorials to get started: + +- [Load data from a REST API](../tutorial/rest-api) +- [Load data from a SQL database](../tutorial/sql-database) +- [Load data from a cloud storage or a file system](../tutorial/filesystem) + +Or read a more detailed tutorial on how to build a [custom data pipeline with dlt](../tutorial/load-data-from-an-api.md). \ No newline at end of file diff --git a/docs/website/docs/tutorial/filesystem.md b/docs/website/docs/tutorial/filesystem.md index 48832a751e..b748f794d5 100644 --- a/docs/website/docs/tutorial/filesystem.md +++ b/docs/website/docs/tutorial/filesystem.md @@ -1,17 +1,17 @@ --- -title: Load data from Cloud Storage or a local filesystem -description: Load data from -keywords: [tutorial, filesystem, cloud storage, dlt, python, data pipeline, incremental loading] +title: Load data from a cloud storage or a file system +description: Learn how to load data files like JSON, JSONL, CSV, and Parquet from a cloud storage (AWS S3, Google Cloud Storage, Google Drive, Azure Blob Storage) or a local file system using dlt. +keywords: [dlt, tutorial, filesystem, cloud storage, file system, python, data pipeline, incremental loading, json, jsonl, csv, parquet, duckdb] --- -This tutorial is for you if you need to load data files like `jsonl`, `csv`, `parquet` from either Cloud Storage (ex. AWS S3, Google Cloud Storage, Google Drive, Azure) or a local filesystem. +This tutorial is for you if you need to load data files like JSONL, CSV, and Parquet from either Cloud Storage (ex. AWS S3, Google Cloud Storage, Google Drive, Azure Blob Storage) or a local file system. ## What you will learn -- How to set up a filesystem or cloud storage as a data source -- Configuration basics for filesystems and cloud storage +- How to set up a file system or cloud storage as a data source +- Configuration basics for file systems and cloud storage - Loading methods -- Incremental loading of data from filesystems or cloud storage +- Incremental loading of data from file systems or cloud storage - How to load data of any type ## 0. Prerequisites @@ -22,13 +22,13 @@ This tutorial is for you if you need to load data files like `jsonl`, `csv`, `pa ## 1. Setting up a new project -To help you get started quickly, `dlt` provides some handy CLI commands. One of these commands will help you set up a new `dlt` project: +To help you get started quickly, dlt provides some handy CLI commands. One of these commands will help you set up a new dlt project: ```sh dlt init filesystem duckdb ``` -This command creates a project that loads data from a filesystem into a DuckDB database. You can easily switch out duckdb for any other [supported destinations](../dlt-ecosystem/destinations). +This command creates a project that loads data from a file system into a DuckDB database. You can easily switch out duckdb for any other [supported destinations](../dlt-ecosystem/destinations). After running this command, your project will have the following structure: ```text @@ -41,14 +41,14 @@ requirements.txt Here’s what each file does: -- `filesystem_pipeline.py`: This is the main script where you'll define your data pipeline. It contains several different examples of loading data from a filesystem source. +- `filesystem_pipeline.py`: This is the main script where you'll define your data pipeline. 
It contains several different examples of loading data from the filesystem source. - `requirements.txt`: This file lists all the Python dependencies required for your project. - `.dlt/`: This directory contains the [configuration files](../general-usage/credentials/) for your project: - `secrets.toml`: This file stores your API keys, tokens, and other sensitive information. - `config.toml`: This file contains the configuration settings for your dlt project. :::note -When deploying your pipeline in a production environment, managing all configurations with files might not be convenient. In this case, we recommend you to use the environment variables to store secrets and configs instead. Read more about [configuration providers](../general-usage/credentials/setup#available-config-providers) available in `dlt`. +When deploying your pipeline in a production environment, managing all configurations with files might not be convenient. In this case, we recommend you to use the environment variables to store secrets and configs instead. Read more about [configuration providers](../general-usage/credentials/setup#available-config-providers) available in dlt. ::: ## 2. Creating the pipeline @@ -58,14 +58,14 @@ The filesystem source provides users with building blocks for loading data from 1. Listing the files in the bucket / directory. 2. Reading the files and yielding records. -`dlt`'s filesystem source includes several resources: +dlt's filesystem source includes several resources: - the `filesystem` resource lists files in the directory or bucket - several readers resources (`read_csv`, `read_parquet`, `read_jsonl`) read files and yield the records. These resources have a special type, they called [transformers](../general-usage/resource#process-resources-with-dlttransformer). Transformers expect items from another resource. In this particular case transformers expect `FileItem` object and transform it into multiple records. -Let's initialize a source and create a pipeline for loading `csv` files from Google Cloud Storage to DuckDB. You can replace code from `filesystem_pipeline.py` with the following: +Let's initialize a source and create a pipeline for loading CSV files from Google Cloud Storage to DuckDB. You can replace code from `filesystem_pipeline.py` with the following: ```py import dlt @@ -85,10 +85,10 @@ What's happening in the snippet above? 2. We pipe the files names yielded by the filesystem resource to the transformer resource `read_csv` to read each file and iterate over records from the file. We name this transformer resource `"encounters"` using the `with_name()`. dlt will use the resource name `"encounters"` as a table name when loading the data. :::note -A [transformer](../general-usage/resource#process-resources-with-dlttransformer) in `dlt` is a special type of resource that processes each record from another resource. This lets you chain multiple resources together. +A [transformer](../general-usage/resource#process-resources-with-dlttransformer) in dlt is a special type of resource that processes each record from another resource. This lets you chain multiple resources together. ::: -3. We create the `dlt` pipeline configuring with the name `hospital_data_pipeline` and DuckDB destination. +3. We create the dlt pipeline configuring with the name `hospital_data_pipeline` and DuckDB destination. 4. We call `pipeline.run()`. 
This is where the underlying generators are iterated: - dlt retrieves remote data, - normalizes data, @@ -170,7 +170,7 @@ files = filesystem( As you can see, all parameters of `filesystem` can be specified directly in the code or taken from the configuration. :::tip -`dlt` supports more ways of authorizing with the cloud storages, including identity-based and default credentials. To learn more about adding credentials to your pipeline, please refer to the [Configuration and secrets section](../general-usage/credentials/complex_types#aws-credentials). +dlt supports more ways of authorizing with the cloud storages, including identity-based and default credentials. To learn more about adding credentials to your pipeline, please refer to the [Configuration and secrets section](../general-usage/credentials/complex_types#aws-credentials). ::: ## 4. Running the pipeline @@ -192,7 +192,7 @@ Load package 1726074108.8017762 is LOADED and contains no failed jobs ## 5. Exploring the data -Now that the pipeline has run successfully, let's explore the data loaded into DuckDB. `dlt` comes with a built-in browser application that allows you to interact with the data. To enable it, run the following command: +Now that the pipeline has run successfully, let's explore the data loaded into DuckDB. dlt comes with a built-in browser application that allows you to interact with the data. To enable it, run the following command: ```sh pip install streamlit @@ -212,14 +212,14 @@ You can explore the loaded data, run queries, and see some pipeline execution de ## 6. Appending, replacing, and merging loaded data -If you try running the pipeline again with `python filesystem_pipeline.py`, you will notice that all the tables have duplicated data. This happens because by default, `dlt` appends the data to the destination table. It is very useful, for example, when you have daily data updates and you want to ingest them. With `dlt`, you can control how the data is loaded into the destination table by setting the `write_disposition` parameter in the resource configuration. The possible values are: +If you try running the pipeline again with `python filesystem_pipeline.py`, you will notice that all the tables have duplicated data. This happens because by default, dlt appends the data to the destination table. It is very useful, for example, when you have daily data updates and you want to ingest them. With dlt, you can control how the data is loaded into the destination table by setting the `write_disposition` parameter in the resource configuration. The possible values are: - `append`: Appends the data to the destination table. This is the default. - `replace`: Replaces the data in the destination table with the new data. - `merge`: Merges the new data with the existing data in the destination table based on the primary key. -To specify the `write_disposition`, you can set it in the `pipeline.run` command. Let's change the write disposition to `merge`. In this case, `dlt` will deduplicate the data before loading them into the destination. +To specify the `write_disposition`, you can set it in the `pipeline.run` command. Let's change the write disposition to `merge`. In this case, dlt will deduplicate the data before loading them into the destination. -To enable data deduplication, we also should specify a `primary_key` or `merge_key`, which will be used by `dlt` to define if two records are different. Both keys could consist of several columns. 
`dlt` will try to use `merge_key` and fallback to `primary_key` if it's not specified. To specify any hints about the data, including column types, primary keys, you can use the [`apply_hints`](../general-usage/resource#set-table-name-and-adjust-schema) method. +To enable data deduplication, we also should specify a `primary_key` or `merge_key`, which will be used by dlt to define if two records are different. Both keys could consist of several columns. dlt will try to use `merge_key` and fallback to `primary_key` if it's not specified. To specify any hints about the data, including column types, primary keys, you can use the [`apply_hints`](../general-usage/resource#set-table-name-and-adjust-schema) method. ```py import dlt @@ -241,7 +241,7 @@ You can learn more about `write_disposition` in the [write dispositions section] ## 7. Loading data incrementally -When loading data from files, you often only want to load files that have been modified. `dlt` makes this easy with [incremental loading](../general-usage/incremental-loading). To load only modified files, you can use the `apply_hint` method: +When loading data from files, you often only want to load files that have been modified. dlt makes this easy with [incremental loading](../general-usage/incremental-loading). To load only modified files, you can use the `apply_hint` method: ```py import dlt @@ -257,7 +257,7 @@ info = pipeline.run(reader, write_disposition="merge") print(info) ``` -Notice that we used `apply_hints` on the `files` resource, not on `reader`. Why did we do that? As mentioned before, the `filesystem` resource lists all files in the storage based on the `file_glob` parameter. So at this point, we can also specify additional conditions to filter out files. In this case, we only want to load files that have been modified since the last load. `dlt` will automatically keep the state of incremental load and manage the correct filtering. +Notice that we used `apply_hints` on the `files` resource, not on `reader`. Why did we do that? As mentioned before, the `filesystem` resource lists all files in the storage based on the `file_glob` parameter. So at this point, we can also specify additional conditions to filter out files. In this case, we only want to load files that have been modified since the last load. dlt will automatically keep the state of incremental load and manage the correct filtering. But what if we not only want to process modified files, but we also want to load only new records? In the `encounters` table, we can see the column named `STOP` indicating the timestamp of the end of the encounter. Let's modify our code to load only those records whose `STOP` timestamp was updated since our last load. @@ -275,7 +275,7 @@ info = pipeline.run(reader, write_disposition="merge") print(info) ``` -Notice that we applied incremental loading both for `files` and for `reader`. Therefore, `dlt` will first filter out only modified files and then filter out new records based on the `STOP` column. +Notice that we applied incremental loading both for `files` and for `reader`. Therefore, dlt will first filter out only modified files and then filter out new records based on the `STOP` column. If you run `dlt pipeline hospital_data_pipeline show`, you can see the pipeline now has new information in the state about the incremental variable: @@ -287,7 +287,7 @@ To learn more about incremental loading, check out the [filesystem incremental l Now let's add the file names to the actual records. 
This could be useful to connect the files' origins to the actual records. -Since the `filesystem` source yields information about files, we can modify the transformer to add any available metadata. Let's create a custom transformer function. We can just copy-paste the `read_csv` function from `dlt` code and add one column `file_name` to the dataframe: +Since the `filesystem` source yields information about files, we can modify the transformer to add any available metadata. Let's create a custom transformer function. We can just copy-paste the `read_csv` function from dlt code and add one column `file_name` to the dataframe: ```py from typing import Any, Iterator @@ -327,9 +327,9 @@ After executing this code, you'll see a new column in the `encounters` table: ## 9. Load any other type of files -`dlt` natively supports three file types: `csv`, `parquet`, and `jsonl` (more details in [filesystem transformer resource](../dlt-ecosystem/verified-sources/filesystem/basic#2-choose-the-right-transformer-resource)). But you can easily create your own. In order to do this, you just need a function that takes as input a `FileItemDict` iterator and yields a list of records (recommended for performance) or individual records. +dlt natively supports three file types: CSV, Parquet, and JSONL (more details in [filesystem transformer resource](../dlt-ecosystem/verified-sources/filesystem/basic#2-choose-the-right-transformer-resource)). But you can easily create your own. In order to do this, you just need a function that takes as input a `FileItemDict` iterator and yields a list of records (recommended for performance) or individual records. -Let's create and apply a transformer that reads `json` files instead of `csv` (the implementation for `json` is a little bit different from `jsonl`). +Let's create and apply a transformer that reads JSON files instead of CSV (the implementation for JSON is a little bit different from JSONL). ```py from typing import Iterator @@ -360,10 +360,10 @@ Check out [other examples](../dlt-ecosystem/verified-sources/filesystem/advanced ## What's next? -Congratulations on completing the tutorial! You've learned how to set up a filesystem source in `dlt` and run a data pipeline to load the data into DuckDB. +Congratulations on completing the tutorial! You've learned how to set up a filesystem source in dlt and run a data pipeline to load the data into DuckDB. -Interested in learning more about `dlt`? Here are some suggestions: +Interested in learning more about dlt? Here are some suggestions: - Learn more about the filesystem source configuration in [filesystem source](../dlt-ecosystem/verified-sources/filesystem) - Learn more about different credential types in [Built-in credentials](../general-usage/credentials/complex_types#built-in-credentials) -- Learn how to [create a custom source](./load-data-from-an-api.md) in the advanced tutorial \ No newline at end of file +- Learn how to [create a custom source](./load-data-from-an-api.md) in the advanced tutorial diff --git a/docs/website/docs/tutorial/grouping-resources.md b/docs/website/docs/tutorial/grouping-resources.md deleted file mode 100644 index 2bbfd231f2..0000000000 --- a/docs/website/docs/tutorial/grouping-resources.md +++ /dev/null @@ -1,296 +0,0 @@ ---- -title: Resource grouping and secrets -description: Advanced tutorial on loading data from an API -keywords: [api, source, decorator, dynamic resource, github, tutorial] ---- - -This tutorial continues the [previous](load-data-from-an-api) part. 
We'll use the same GitHub API example to show you how to: -1. Load data from other GitHub API endpoints. -1. Group your resources into sources for easier management. -2. Handle secrets and configuration. - -## Use source decorator - -In the previous tutorial, we loaded issues from the GitHub API. Now we'll prepare to load comments from the API as well. Here's a sample [dlt resource](../general-usage/resource) that does that: - -```py -import dlt -from dlt.sources.helpers.rest_client import paginate - -@dlt.resource( - table_name="comments", - write_disposition="merge", - primary_key="id", -) -def get_comments( - updated_at = dlt.sources.incremental("updated_at", initial_value="1970-01-01T00:00:00Z") -): - for page in paginate( - "https://api.github.com/repos/dlt-hub/dlt/comments", - params={"per_page": 100} - ): - yield page -``` - -We can load this resource separately from the issues resource, however loading both issues and comments in one go is more efficient. To do that, we'll use the `@dlt.source` decorator on a function that returns a list of resources: - -```py -@dlt.source -def github_source(): - return [get_issues, get_comments] -``` - -`github_source()` groups resources into a [source](../general-usage/source). A dlt source is a logical grouping of resources. You use it to group resources that belong together, for example, to load data from the same API. Loading data from a source can be run in a single pipeline. Here's what our updated script looks like: - -```py -import dlt -from dlt.sources.helpers.rest_client import paginate - -@dlt.resource( - table_name="issues", - write_disposition="merge", - primary_key="id", -) -def get_issues( - updated_at = dlt.sources.incremental("updated_at", initial_value="1970-01-01T00:00:00Z") -): - for page in paginate( - "https://api.github.com/repos/dlt-hub/dlt/issues", - params={ - "since": updated_at.last_value, - "per_page": 100, - "sort": "updated", - "directions": "desc", - "state": "open", - } - ): - yield page - - -@dlt.resource( - table_name="comments", - write_disposition="merge", - primary_key="id", -) -def get_comments( - updated_at = dlt.sources.incremental("updated_at", initial_value="1970-01-01T00:00:00Z") -): - for page in paginate( - "https://api.github.com/repos/dlt-hub/dlt/comments", - params={ - "since": updated_at.last_value, - "per_page": 100, - } - ): - yield page - - -@dlt.source -def github_source(): - return [get_issues, get_comments] - - -pipeline = dlt.pipeline( - pipeline_name='github_with_source', - destination='duckdb', - dataset_name='github_data', -) - -load_info = pipeline.run(github_source()) -print(load_info) -``` - -### Dynamic resources - -You've noticed that there's a lot of code duplication in the `get_issues` and `get_comments` functions. We can reduce that by extracting the common fetching code into a separate function and use it in both resources. Even better, we can use `dlt.resource` as a function and pass it the `fetch_github_data()` generator function directly. 
Here's the refactored code: - -```py -import dlt -from dlt.sources.helpers.rest_client import paginate - -BASE_GITHUB_URL = "https://api.github.com/repos/dlt-hub/dlt" - -def fetch_github_data(endpoint, params={}): - url = f"{BASE_GITHUB_URL}/{endpoint}" - return paginate(url, params=params) - -@dlt.source -def github_source(): - for endpoint in ["issues", "comments"]: - params = {"per_page": 100} - yield dlt.resource( - fetch_github_data(endpoint, params), - name=endpoint, - write_disposition="merge", - primary_key="id", - ) - -pipeline = dlt.pipeline( - pipeline_name='github_dynamic_source', - destination='duckdb', - dataset_name='github_data', -) -load_info = pipeline.run(github_source()) -row_counts = pipeline.last_trace.last_normalize_info -``` - -## Handle secrets - -For the next step we'd want to get the [number of repository clones](https://docs.github.com/en/rest/metrics/traffic?apiVersion=2022-11-28#get-repository-clones) for our dlt repo from the GitHub API. However, the `traffic/clones` endpoint that returns the data requires [authentication](https://docs.github.com/en/rest/overview/authenticating-to-the-rest-api?apiVersion=2022-11-28). - -Let's handle this by changing our `fetch_github_data()` first: - -```py -from dlt.sources.helpers.rest_client.auth import BearerTokenAuth - -def fetch_github_data(endpoint, params={}, access_token=None): - url = f"{BASE_GITHUB_URL}/{endpoint}" - return paginate( - url, - params=params, - auth=BearerTokenAuth(token=access_token) if access_token else None, - ) - - -@dlt.source -def github_source(access_token): - for endpoint in ["issues", "comments", "traffic/clones"]: - params = {"per_page": 100} - yield dlt.resource( - fetch_github_data(endpoint, params, access_token), - name=endpoint, - write_disposition="merge", - primary_key="id", - ) - -... -``` - -Here, we added `access_token` parameter and now we can use it to pass the access token to the request: - -```py -load_info = pipeline.run(github_source(access_token="ghp_XXXXX")) -``` - -It's a good start. But we'd want to follow the best practices and not hardcode the token in the script. One option is to set the token as an environment variable, load it with `os.getenv()` and pass it around as a parameter. dlt offers a more convenient way to handle secrets and credentials: it lets you inject the arguments using a special `dlt.secrets.value` argument value. - -To use it, change the `github_source()` function to: - -```py -@dlt.source -def github_source( - access_token: str = dlt.secrets.value, -): - ... -``` - -When you add `dlt.secrets.value` as a default value for an argument, `dlt` will try to load and inject this value from different configuration sources in the following order: - -1. Special environment variables. -2. `secrets.toml` file. - -The `secret.toml` file is located in the `~/.dlt` folder (for global configuration) or in the `.dlt` folder in the project folder (for project-specific configuration). - -Let's add the token to the `~/.dlt/secrets.toml` file: - -```toml -[github_with_source_secrets] -access_token = "ghp_A...3aRY" -``` - -Now we can run the script and it will load the data from the `traffic/clones` endpoint: - -```py -... 
- -@dlt.source -def github_source( - access_token: str = dlt.secrets.value, -): - for endpoint in ["issues", "comments", "traffic/clones"]: - params = {"per_page": 100} - yield dlt.resource( - fetch_github_data(endpoint, params, access_token), - name=endpoint, - write_disposition="merge", - primary_key="id", - ) - - -pipeline = dlt.pipeline( - pipeline_name="github_with_source_secrets", - destination="duckdb", - dataset_name="github_data", -) -load_info = pipeline.run(github_source()) -``` - -## Configurable sources - -The next step is to make our dlt GitHub source reusable so it can load data from any GitHub repo. We'll do that by changing both `github_source()` and `fetch_github_data()` functions to accept the repo name as a parameter: - -```py -import dlt -from dlt.sources.helpers.rest_client import paginate - -BASE_GITHUB_URL = "https://api.github.com/repos/{repo_name}" - - -def fetch_github_data(repo_name, endpoint, params={}, access_token=None): - """Fetch data from GitHub API based on repo_name, endpoint, and params.""" - url = BASE_GITHUB_URL.format(repo_name=repo_name) + f"/{endpoint}" - return paginate( - url, - params=params, - auth=BearerTokenAuth(token=access_token) if access_token else None, - ) - - -@dlt.source -def github_source( - repo_name: str = dlt.config.value, - access_token: str = dlt.secrets.value, -): - for endpoint in ["issues", "comments", "traffic/clones"]: - params = {"per_page": 100} - yield dlt.resource( - fetch_github_data(repo_name, endpoint, params, access_token), - name=endpoint, - write_disposition="merge", - primary_key="id", - ) - - -pipeline = dlt.pipeline( - pipeline_name="github_with_source_secrets", - destination="duckdb", - dataset_name="github_data", -) -load_info = pipeline.run(github_source()) -``` - -Next, create a `.dlt/config.toml` file in the project folder and add the `repo_name` parameter to it: - -```toml -[github_with_source_secrets] -repo_name = "dlt-hub/dlt" -``` - -That's it! Now you have a reusable source that can load data from any GitHub repo. - -## What’s next - -Congratulations on completing the tutorial! You've come a long way since the [getting started](../getting-started) guide. By now, you've mastered loading data from various GitHub API endpoints, organizing resources into sources, managing secrets securely, and creating reusable sources. You can use these skills to build your own pipelines and load data from any source. - -Interested in learning more? Here are some suggestions: -1. You've been running your pipelines locally. Learn how to [deploy and run them in the cloud](../walkthroughs/deploy-a-pipeline/). -2. Dive deeper into how dlt works by reading the [Using dlt](../general-usage) section. Some highlights: - - [Connect the transformers to the resources](../general-usage/resource#feeding-data-from-one-resource-into-another) to load additional data or enrich it. - - [Create your resources dynamically from data](../general-usage/source#create-resources-dynamically). - - [Transform your data before loading](../general-usage/resource#customize-resources) and see some [examples of customizations like column renames and anonymization](../general-usage/customising-pipelines/renaming_columns). - - [Pass config and credentials into your sources and resources](../general-usage/credentials). - - [Run in production: inspecting, tracing, retry policies and cleaning up](../running-in-production/running). 
- - [Run resources in parallel, optimize buffers and local storage](../reference/performance.md) - - [Use REST API client helpers](../general-usage/http/rest-client.md) to simplify working with REST APIs. -3. Check out our [how-to guides](../walkthroughs) to get answers to some common questions. -4. Explore the [Examples](../examples) section to see how dlt can be used in real-world scenarios diff --git a/docs/website/docs/tutorial/intro.md b/docs/website/docs/tutorial/intro.md deleted file mode 100644 index 2d53412ae0..0000000000 --- a/docs/website/docs/tutorial/intro.md +++ /dev/null @@ -1,21 +0,0 @@ ---- -title: Tutorial -description: Build a data pipeline with dlt -keywords: [tutorial, api, github, duckdb, pipeline] ---- -Welcome to the tutorial on how to efficiently use dlt to build a data pipeline. This tutorial will introduce you to the foundational concepts of dlt and guide you through basic and advanced usage scenarios. - -As a practical example, we'll build a data pipeline that loads data from the GitHub API into DuckDB. - -## What We'll Cover - -- [Fetching data from the GitHub API](./load-data-from-an-api.md) -- [Understanding and managing data loading behaviors](./load-data-from-an-api.md#append-or-replace-your-data) -- [Incrementally loading new data and deduplicating existing data](./load-data-from-an-api.md#load-only-new-data-incremental-loading) -- [Making our data fetch more dynamic and reducing code redundancy](./grouping-resources.md) -- [Securely handling secrets](./grouping-resources.md#handle-secrets) -- [Making reusable data sources](./grouping-resources.md#configurable-sources) - -## Ready to dive in? - -Let's begin by loading data from an API. \ No newline at end of file diff --git a/docs/website/docs/tutorial/load-data-from-an-api.md b/docs/website/docs/tutorial/load-data-from-an-api.md index 2b1bf1c62c..1718ef642b 100644 --- a/docs/website/docs/tutorial/load-data-from-an-api.md +++ b/docs/website/docs/tutorial/load-data-from-an-api.md @@ -1,19 +1,106 @@ --- -title: Load data from an API -description: quick start with dlt +title: "Build a dlt pipeline" +description: Build a data pipeline with dlt keywords: [getting started, quick start, basic examples] --- -In this section, we will retrieve and load data from the GitHub API into [DuckDB](https://duckdb.org). Specifically, we will load issues from our [dlt-hub/dlt](https://github.com/dlt-hub/dlt) repository. We picked DuckDB as our destination because it is a lightweight, in-process database that is easy to set up and use. +This tutorial introduces you to foundational dlt concepts, demonstrating how to build a custom data pipeline that loads data from pure Python data structures to DuckDB. It starts with a simple example and progresses to more advanced topics and usage scenarios. -Before we start, make sure you have installed `dlt` with the DuckDB dependency: +## What you will learn + +- Loading data from a list of Python dictionaries into DuckDB. +- Low level API usage with built-in HTTP client. +- Understand and manage data loading behaviors. +- Incrementally load new data and deduplicate existing data. +- Dynamic resource creation and reducing code redundancy. +- Group resources into sources. +- Securely handle secrets. +- Make reusable data sources. + +## Prerequisites + +- Python 3.9 or higher installed +- Virtual environment set up + +## Installing dlt + +Before we start, make sure you have a Python virtual environment set up. 
Follow the instructions in the [installation guide](../reference/installation) to create a new virtual environment and install dlt. + +Verify that dlt is installed by running the following command in your terminal: + +```sh +dlt --version +``` + +## Quick start + +For starters, let's load a list of Python dictionaries into DuckDB and inspect the created dataset. Here is the code: + +```py +import dlt + +data = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}] + +pipeline = dlt.pipeline( + pipeline_name="quick_start", destination="duckdb", dataset_name="mydata" +) +load_info = pipeline.run(data, table_name="users") + +print(load_info) +``` + +When you look at the code above, you can see that we: +1. Import the `dlt` library. +2. Define our data to load. +3. Create a pipeline that loads data into DuckDB. Here we also specify the `pipeline_name` and `dataset_name`. We'll use both in a moment. +4. Run the pipeline. + +Save this Python script with the name `quick_start_pipeline.py` and run the following command: ```sh -pip install "dlt[duckdb]" +python quick_start_pipeline.py ``` +The output should look like: + +```sh +Pipeline quick_start completed in 0.59 seconds +1 load package(s) were loaded to destination duckdb and into dataset mydata +The duckdb destination used duckdb:////home/user-name/quick_start/quick_start.duckdb location to store data +Load package 1692364844.460054 is LOADED and contains no failed jobs +``` + +`dlt` just created a database schema called **mydata** (the `dataset_name`) with a table **users** in it. + +### Explore the data + +To allow sneak peek and basic discovery you can take advantage of [built-in integration with Strealmit](reference/command-line-interface#show-tables-and-data-in-the-destination): + +```sh +dlt pipeline quick_start show +``` + +**quick_start** is the name of the pipeline from the script above. If you do not have Streamlit installed yet do: + +```sh +pip install streamlit +``` + +Now you should see the **users** table: + +![Streamlit Explore data](/img/streamlit-new.png) +Streamlit Explore data. Schema and data for a test pipeline “quick_start”. + :::tip -Need help with this tutorial? Join our [Slack community](https://dlthub.com/community) for quick support. +`dlt` works in Jupyter Notebook and Google Colab! See our [Quickstart Colab Demo.](https://colab.research.google.com/drive/1NfSB1DpwbbHX9_t5vlalBTf13utwpMGx?usp=sharing) + +Looking for source code of all the snippets? You can find and run them [from this repository](https://github.com/dlt-hub/dlt/blob/devel/docs/website/docs/getting-started-snippets.py). +::: + +Now that you have a basic understanding of how to get started with dlt, you might be eager to dive deeper. For that we need to switch to a more advanced data source - the GitHub API. We will load issues from our [dlt-hub/dlt](https://github.com/dlt-hub/dlt) repository. + +:::note +This tutorial uses GitHub REST API for demonstration purposes only. If you need to read data from a REST API, consider using the dlt's REST API source. Check out the [REST API source tutorial](./rest-api) for quick start or [REST API source reference](../dlt-ecosystem/verified-sources/rest_api) for more details. ::: ## Create a pipeline @@ -197,16 +284,299 @@ Let's zoom in on the changes: 2. `paginate()` takes the URL of the API endpoint and optional parameters. In this case, we pass the `since` parameter to get only issues updated after the last pipeline run. 3. We're not explicitly setting up pagination, `paginate()` handles it for us. Magic! 
Under the hood, `paginate()` analyzes the response and detects the pagination method used by the API. Read more about pagination in the [REST client documentation](../general-usage/http/rest-client.md#paginating-api-responses). -## Next steps +If you want to take full advantage of the `dlt` library, then we strongly suggest that you build your sources out of existing building blocks: +To make most of `dlt`, consider the following: + +## Use source decorator + +In the previous step, we loaded issues from the GitHub API. Now we'll load comments from the API as well. Here's a sample [dlt resource](../general-usage/resource) that does that: + +```py +import dlt +from dlt.sources.helpers.rest_client import paginate + +@dlt.resource( + table_name="comments", + write_disposition="merge", + primary_key="id", +) +def get_comments( + updated_at = dlt.sources.incremental("updated_at", initial_value="1970-01-01T00:00:00Z") +): + for page in paginate( + "https://api.github.com/repos/dlt-hub/dlt/comments", + params={"per_page": 100} + ): + yield page +``` + +We can load this resource separately from the issues resource, however loading both issues and comments in one go is more efficient. To do that, we'll use the `@dlt.source` decorator on a function that returns a list of resources: + +```py +@dlt.source +def github_source(): + return [get_issues, get_comments] +``` + +`github_source()` groups resources into a [source](../general-usage/source). A dlt source is a logical grouping of resources. You use it to group resources that belong together, for example, to load data from the same API. Loading data from a source can be run in a single pipeline. Here's what our updated script looks like: + +```py +import dlt +from dlt.sources.helpers.rest_client import paginate + +@dlt.resource( + table_name="issues", + write_disposition="merge", + primary_key="id", +) +def get_issues( + updated_at = dlt.sources.incremental("updated_at", initial_value="1970-01-01T00:00:00Z") +): + for page in paginate( + "https://api.github.com/repos/dlt-hub/dlt/issues", + params={ + "since": updated_at.last_value, + "per_page": 100, + "sort": "updated", + "directions": "desc", + "state": "open", + } + ): + yield page + + +@dlt.resource( + table_name="comments", + write_disposition="merge", + primary_key="id", +) +def get_comments( + updated_at = dlt.sources.incremental("updated_at", initial_value="1970-01-01T00:00:00Z") +): + for page in paginate( + "https://api.github.com/repos/dlt-hub/dlt/comments", + params={ + "since": updated_at.last_value, + "per_page": 100, + } + ): + yield page + + +@dlt.source +def github_source(): + return [get_issues, get_comments] + + +pipeline = dlt.pipeline( + pipeline_name='github_with_source', + destination='duckdb', + dataset_name='github_data', +) + +load_info = pipeline.run(github_source()) +print(load_info) +``` + +### Dynamic resources + +You've noticed that there's a lot of code duplication in the `get_issues` and `get_comments` functions. We can reduce that by extracting the common fetching code into a separate function and use it in both resources. Even better, we can use `dlt.resource` as a function and pass it the `fetch_github_data()` generator function directly. 
Here's the refactored code: + +```py +import dlt +from dlt.sources.helpers.rest_client import paginate + +BASE_GITHUB_URL = "https://api.github.com/repos/dlt-hub/dlt" + +def fetch_github_data(endpoint, params={}): + url = f"{BASE_GITHUB_URL}/{endpoint}" + return paginate(url, params=params) + +@dlt.source +def github_source(): + for endpoint in ["issues", "comments"]: + params = {"per_page": 100} + yield dlt.resource( + fetch_github_data(endpoint, params), + name=endpoint, + write_disposition="merge", + primary_key="id", + ) + +pipeline = dlt.pipeline( + pipeline_name='github_dynamic_source', + destination='duckdb', + dataset_name='github_data', +) +load_info = pipeline.run(github_source()) +row_counts = pipeline.last_trace.last_normalize_info +``` + +## Handle secrets + +For the next step we'd want to get the [number of repository clones](https://docs.github.com/en/rest/metrics/traffic?apiVersion=2022-11-28#get-repository-clones) for our dlt repo from the GitHub API. However, the `traffic/clones` endpoint that returns the data requires [authentication](https://docs.github.com/en/rest/overview/authenticating-to-the-rest-api?apiVersion=2022-11-28). + +Let's handle this by changing our `fetch_github_data()` first: + +```py +from dlt.sources.helpers.rest_client.auth import BearerTokenAuth + +def fetch_github_data(endpoint, params={}, access_token=None): + url = f"{BASE_GITHUB_URL}/{endpoint}" + return paginate( + url, + params=params, + auth=BearerTokenAuth(token=access_token) if access_token else None, + ) + + +@dlt.source +def github_source(access_token): + for endpoint in ["issues", "comments", "traffic/clones"]: + params = {"per_page": 100} + yield dlt.resource( + fetch_github_data(endpoint, params, access_token), + name=endpoint, + write_disposition="merge", + primary_key="id", + ) + +... +``` + +Here, we added `access_token` parameter and now we can use it to pass the access token to the request: + +```py +load_info = pipeline.run(github_source(access_token="ghp_XXXXX")) +``` + +It's a good start. But we'd want to follow the best practices and not hardcode the token in the script. One option is to set the token as an environment variable, load it with `os.getenv()` and pass it around as a parameter. dlt offers a more convenient way to handle secrets and credentials: it lets you inject the arguments using a special `dlt.secrets.value` argument value. + +To use it, change the `github_source()` function to: + +```py +@dlt.source +def github_source( + access_token: str = dlt.secrets.value, +): + ... +``` + +When you add `dlt.secrets.value` as a default value for an argument, `dlt` will try to load and inject this value from different configuration sources in the following order: + +1. Special environment variables. +2. `secrets.toml` file. + +The `secret.toml` file is located in the `~/.dlt` folder (for global configuration) or in the `.dlt` folder in the project folder (for project-specific configuration). + +Let's add the token to the `~/.dlt/secrets.toml` file: + +```toml +[github_with_source_secrets] +access_token = "ghp_A...3aRY" +``` + +Now we can run the script and it will load the data from the `traffic/clones` endpoint: + +```py +... 
+ +@dlt.source +def github_source( + access_token: str = dlt.secrets.value, +): + for endpoint in ["issues", "comments", "traffic/clones"]: + params = {"per_page": 100} + yield dlt.resource( + fetch_github_data(endpoint, params, access_token), + name=endpoint, + write_disposition="merge", + primary_key="id", + ) + + +pipeline = dlt.pipeline( + pipeline_name="github_with_source_secrets", + destination="duckdb", + dataset_name="github_data", +) +load_info = pipeline.run(github_source()) +``` + +## Configurable sources + +The next step is to make our dlt GitHub source reusable so it can load data from any GitHub repo. We'll do that by changing both `github_source()` and `fetch_github_data()` functions to accept the repo name as a parameter: + +```py +import dlt +from dlt.sources.helpers.rest_client import paginate + +BASE_GITHUB_URL = "https://api.github.com/repos/{repo_name}" + + +def fetch_github_data(repo_name, endpoint, params={}, access_token=None): + """Fetch data from GitHub API based on repo_name, endpoint, and params.""" + url = BASE_GITHUB_URL.format(repo_name=repo_name) + f"/{endpoint}" + return paginate( + url, + params=params, + auth=BearerTokenAuth(token=access_token) if access_token else None, + ) + + +@dlt.source +def github_source( + repo_name: str = dlt.config.value, + access_token: str = dlt.secrets.value, +): + for endpoint in ["issues", "comments", "traffic/clones"]: + params = {"per_page": 100} + yield dlt.resource( + fetch_github_data(repo_name, endpoint, params, access_token), + name=endpoint, + write_disposition="merge", + primary_key="id", + ) + + +pipeline = dlt.pipeline( + pipeline_name="github_with_source_secrets", + destination="duckdb", + dataset_name="github_data", +) +load_info = pipeline.run(github_source()) +``` + +Next, create a `.dlt/config.toml` file in the project folder and add the `repo_name` parameter to it: + +```toml +[github_with_source_secrets] +repo_name = "dlt-hub/dlt" +``` + +That's it! Now you have a reusable source that can load data from any GitHub repo. + +## What’s next + +Congratulations on completing the tutorial! You've come a long way since the [getting started](../getting-started) guide. By now, you've mastered loading data from various GitHub API endpoints, organizing resources into sources, managing secrets securely, and creating reusable sources. You can use these skills to build your own pipelines and load data from any source. + +Interested in learning more? Here are some suggestions: +1. You've been running your pipelines locally. Learn how to [deploy and run them in the cloud](../walkthroughs/deploy-a-pipeline/). +2. Dive deeper into how dlt works by reading the [Using dlt](../general-usage) section. Some highlights: + - [Set up "last value" incremental loading](../general-usage/incremental-loading#incremental_loading-with-last-value). + - Learn about data loading strategies: [append, replace and merge](../general-usage/incremental-loading). + - [Connect the transformers to the resources](../general-usage/resource#feeding-data-from-one-resource-into-another) to load additional data or enrich it. + - [Customize your data schema—set primary and merge keys, define column nullability, and specify data types](../general-usage/resource#define-schema). + - [Create your resources dynamically from data](../general-usage/source#create-resources-dynamically). 
+ - [Transform your data before loading](../general-usage/resource#customize-resources) and see some [examples of customizations like column renames and anonymization](../general-usage/customising-pipelines/renaming_columns). + - Employ data transformations using [SQL](../dlt-ecosystem/transformations/sql) or [Pandas](../dlt-ecosystem/transformations/sql). + - [Pass config and credentials into your sources and resources](../general-usage/credentials). + - [Run in production: inspecting, tracing, retry policies and cleaning up](../running-in-production/running). + - [Run resources in parallel, optimize buffers and local storage](../reference/performance.md) + - [Use REST API client helpers](../general-usage/http/rest-client.md) to simplify working with REST APIs. +3. Check out our [how-to guides](../walkthroughs) to get answers to some common questions. +4. Explore [destinations](../dlt-ecosystem/destinations/) and [sources](../dlt-ecosystem/verified-sources/) provided by us and community. +5. Explore the [Examples](../examples) section to see how dlt can be used in real-world scenarios -Continue your journey with the [Resource Grouping and Secrets](grouping-resources) tutorial. -If you want to take full advantage of the `dlt` library, then we strongly suggest that you build your sources out of existing **building blocks:** -- Pick your [destinations](../dlt-ecosystem/destinations/). -- Check [verified sources](../dlt-ecosystem/verified-sources/) provided by us and community. -- Access your data with [SQL](../dlt-ecosystem/transformations/sql) or [Pandas](../dlt-ecosystem/transformations/sql). -- [Append, replace and merge your tables](../general-usage/incremental-loading). -- [Set up "last value" incremental loading](../general-usage/incremental-loading#incremental_loading-with-last-value). -- [Set primary and merge keys, define the columns nullability and data types](../general-usage/resource#define-schema). -- [Use built-in requests client](../reference/performance#using-the-built-in-requests-client). \ No newline at end of file diff --git a/docs/website/docs/tutorial/rest-api.md b/docs/website/docs/tutorial/rest-api.md new file mode 100644 index 0000000000..3e214e0b55 --- /dev/null +++ b/docs/website/docs/tutorial/rest-api.md @@ -0,0 +1,322 @@ +--- +title: Load data from a REST API +description: How to extract data from a REST API using dlt's REST API source +keywords: [tutorial, api, github, duckdb, rest api, source, pagination, authentication] +--- + +This tutorial demonstrates how to extract data from a REST API using dlt's REST API source and load it into a destination. You will learn how to build a data pipeline that loads data from the [Pokemon](https://pokeapi.co/) and the [GitHub API](https://docs.github.com/en/) into a local DuckDB database. + +Extracting data from an API is straightforward with dlt: provide the base URL, define the resources you want to fetch, and dlt will handle the pagination, authentication, and data loading. + +## What you will learn + +- How to set up a REST API source +- Configuration basics for API endpoints +- Configuring the destination database +- Relationships between different resources +- How to append, replace, and merge data in the destination +- Loading data incrementally by fetching only new or updated data + +## Prerequisites + +- Python 3.9 or higher installed +- Virtual environment set up + +## Installing dlt + +Before we start, make sure you have a Python virtual environment set up. 
Follow the instructions in the [installation guide](../reference/installation) to create a new virtual environment and install dlt.
+
+Verify that dlt is installed by running the following command in your terminal:
+
+```sh
+dlt --version
+```
+
+If you see a version number (such as "dlt 0.5.3"), you're ready to proceed.
+
+## Setting up a new project
+
+Initialize a new dlt project with the REST API source and DuckDB destination:
+
+```sh
+dlt init rest_api duckdb
+```
+
+`dlt init` creates multiple files and a directory for your project. Let's take a look at the project structure:
+
+```sh
+rest_api_pipeline.py
+requirements.txt
+.dlt/
+    config.toml
+    secrets.toml
+```
+
+Here's what each file and directory contains:
+
+- `rest_api_pipeline.py`: This is the main script where you'll define your data pipeline. It contains two basic pipeline examples for the Pokemon and GitHub APIs. You can modify or rename this file as needed.
+- `requirements.txt`: This file lists all the Python dependencies required for your project.
+- `.dlt/`: This directory contains the [configuration files](../general-usage/credentials/) for your project:
+  - `secrets.toml`: This file stores your API keys, tokens, and other sensitive information.
+  - `config.toml`: This file contains the configuration settings for your dlt project.
+
+## Installing dependencies
+
+Before we proceed, let's install the required dependencies for this tutorial. Run the following command to install the dependencies listed in the `requirements.txt` file:
+
+```sh
+pip install -r requirements.txt
+```
+
+## Running the pipeline
+
+Let's verify that the pipeline is working as expected. Run the following command to execute the pipeline:
+
+```sh
+python rest_api_pipeline.py
+```
+
+You should see the output of the pipeline execution in the terminal. The output will also display the location of the DuckDB database file where the data is stored:
+
+```sh
+Pipeline rest_api_pokemon load step completed in 1.08 seconds
+1 load package(s) were loaded to destination duckdb and into dataset rest_api_data
+The duckdb destination used duckdb:////home/user-name/quick_start/rest_api_pokemon.duckdb location to store data
+Load package 1692364844.9254808 is LOADED and contains no failed jobs
+```
+
+## Exploring the data
+
+Now that the pipeline has run successfully, let's explore the data loaded into DuckDB. dlt comes with a built-in browser application that allows you to interact with the data. To use it, first install Streamlit:
+
+```sh
+pip install streamlit
+```
+
+Next, run the following command to start the data browser:
+
+```sh
+dlt pipeline rest_api_pokemon show
+```
+
+The command opens a new browser window with the data browser application. `rest_api_pokemon` is the name of the pipeline defined in the `rest_api_pipeline.py` file.
+You can explore the loaded data, run queries, and see some pipeline execution details:
+
+![Explore rest_api data in Streamlit App](https://dlt-static.s3.eu-central-1.amazonaws.com/images/docs-rest-api-tutorial-streamlit-screenshot.png)
+
+## Configuring the REST API source
+
+Now that your environment and the project are set up, let's take a closer look at the configuration of the REST API source.
Open the `rest_api_pipeline.py` file in your code editor and locate the following code snippet:
+
+```py
+def load_pokemon() -> None:
+    pipeline = dlt.pipeline(
+        pipeline_name="rest_api_pokemon",
+        destination="duckdb",
+        dataset_name="rest_api_data",
+    )
+
+    pokemon_source = rest_api_source(
+        {
+            "client": {
+                "base_url": "https://pokeapi.co/api/v2/"
+            },
+            "resource_defaults": {
+                "endpoint": {
+                    "params": {
+                        "limit": 1000,
+                    },
+                },
+            },
+            "resources": [
+                "pokemon",
+                "berry",
+                "location",
+            ],
+        }
+    )
+
+    ...
+
+    load_info = pipeline.run(pokemon_source)
+    print(load_info)
+```
+
+Here's what's happening in the code:
+
+1. With `dlt.pipeline()`, we define a new pipeline named `rest_api_pokemon` with DuckDB as the destination and `rest_api_data` as the dataset name.
+2. The `rest_api_source()` function creates a new REST API source object.
+3. We pass this source object to the `pipeline.run()` method to start the pipeline execution. Inside the `run()` method, dlt will fetch data from the API and load it into the DuckDB database.
+4. The `print(load_info)` outputs the pipeline execution details to the console.
+
+Let's break down the configuration of the REST API source. It consists of three main parts: `client`, `resource_defaults`, and `resources`.
+
+```py
+config: RESTAPIConfig = {
+    "client": {
+        ...
+    },
+    "resource_defaults": {
+        ...
+    },
+    "resources": [
+        ...
+    ],
+}
+```
+
+- The `client` configuration is used to connect to the web server and authenticate if necessary. For our simple example, we only need to specify the `base_url` of the API: `https://pokeapi.co/api/v2/`.
+- The `resource_defaults` configuration allows you to set default parameters for all resources. Normally you would set common parameters here, such as pagination limits. In our Pokemon API example, we set the `limit` parameter to 1000 for all resources to retrieve more data in a single request and reduce the number of HTTP API calls.
+- The `resources` list contains the names of the resources you want to load from the API. The REST API source uses naming conventions to determine the endpoint URL based on the resource name. For example, the resource name `pokemon` will be translated to the endpoint URL `https://pokeapi.co/api/v2/pokemon`.
+
+:::note
+### Pagination
+You may have noticed that we didn't specify any pagination configuration in the `rest_api_source()` function. That's because for REST APIs that follow best practices, dlt can automatically detect and handle pagination. Read more about [configuring pagination](../dlt-ecosystem/verified-sources/rest_api/basic#pagination) in the REST API source documentation.
+:::
+
+## Appending, replacing, and merging loaded data
+
+Try running the pipeline again with `python rest_api_pipeline.py`. You will notice that the data in all the tables is duplicated. This happens because, by default, dlt appends the data to the destination table. In dlt, you can control how the data is loaded into the destination table by setting the `write_disposition` parameter in the resource configuration. The possible values are:
+- `append`: Appends the data to the destination table. This is the default.
+- `replace`: Replaces the data in the destination table with the new data.
+- `merge`: Merges the new data with the existing data in the destination table based on the primary key.
+
+### Replacing the data
+
+In our case, we don't want to append the data every time we run the pipeline. Let's start with the simpler `replace` write disposition.
To change the write disposition to `replace`, update the `resource_defaults` configuration in the `rest_api_pipeline.py` file:
+
+```py
+...
+pokemon_source = rest_api_source(
+    {
+        "client": {
+            "base_url": "https://pokeapi.co/api/v2/",
+        },
+        "resource_defaults": {
+            "endpoint": {
+                "params": {
+                    "limit": 1000,
+                },
+            },
+            "write_disposition": "replace",  # Setting the write disposition to `replace`
+        },
+        "resources": [
+            "pokemon",
+            "berry",
+            "location",
+        ],
+    }
+)
+...
+```
+
+Run the pipeline again with `python rest_api_pipeline.py`. This time, the data will be replaced in the destination table instead of being appended.
+
+### Merging the data
+
+When you want to update the existing data as new data is loaded, you can use the `merge` write disposition. This requires specifying a primary key for the resource. The primary key is used to match the new data with the existing data in the destination table.
+
+Let's update our example to use the `merge` write disposition. We need to specify the primary key for the `pokemon` resource and set the write disposition to `merge`:
+
+```py
+...
+pokemon_source = rest_api_source(
+    {
+        "client": {
+            "base_url": "https://pokeapi.co/api/v2/",
+        },
+        "resource_defaults": {
+            "endpoint": {
+                "params": {
+                    "limit": 1000,
+                },
+            },
+            # For the `berry` and `location` resources, we keep
+            # the `replace` write disposition
+            "write_disposition": "replace",
+        },
+        "resources": [
+            # We create a specific configuration for the `pokemon` resource
+            # using a dictionary instead of a string to configure
+            # the primary key and write disposition
+            {
+                "name": "pokemon",
+                "primary_key": "id",
+                "write_disposition": "merge",
+            },
+            # The `berry` and `location` resources will use the default
+            # `replace` write disposition from `resource_defaults`
+            "berry",
+            "location",
+        ],
+    }
+)
+```
+
+Run the pipeline again with `python rest_api_pipeline.py`. This time, the data for the `pokemon` resource will be merged with the existing data in the destination table based on the `id` field.
+
+## Loading data incrementally
+
+When working with some APIs, you may need to load data incrementally to avoid fetching the entire dataset every time and to reduce the load time. APIs that support incremental loading usually provide a way to fetch only new or changed data (most often by using a timestamp field like `updated_at`, `created_at`, or incremental IDs).
+
+To illustrate incremental loading, let's consider the GitHub API. In the `rest_api_pipeline.py` file, you can find an example of how to load data from the GitHub API incrementally. Let's take a look at the configuration:
+
+```py
+pipeline = dlt.pipeline(
+    pipeline_name="rest_api_github",
+    destination="duckdb",
+    dataset_name="rest_api_data",
+)
+
+github_source = rest_api_source({
+    "client": {
+        "base_url": "https://api.github.com/repos/dlt-hub/dlt/",
+    },
+    "resource_defaults": {
+        "primary_key": "id",
+        "write_disposition": "merge",
+        "endpoint": {
+            "params": {
+                "per_page": 100,
+            },
+        },
+    },
+    "resources": [
+        {
+            "name": "issues",
+            "endpoint": {
+                "path": "issues",
+                "params": {
+                    "sort": "updated",
+                    "direction": "desc",
+                    "state": "open",
+                    "since": {
+                        "type": "incremental",
+                        "cursor_path": "updated_at",
+                        "initial_value": "2024-01-25T11:21:28Z",
+                    },
+                },
+            },
+        },
+    ],
+})
+
+load_info = pipeline.run(github_source)
+print(load_info)
+```
+
+In this configuration, the `since` parameter is defined as a special incremental parameter.
The `cursor_path` field specifies the JSON path to the field that will be used to fetch the updated data and we use the `initial_value` for the initial value for the incremental parameter. This value will be used in the first request to fetch the data. + +When the pipeline runs, dlt will automatically update the `since` parameter with the latest value from the response data. This way, you can fetch only the new or updated data from the API. + +Read more about [incremental loading](../dlt-ecosystem/verified-sources/rest_api/basic#incremental-loading) in the REST API source documentation. + +## What's next? + +Congratulations on completing the tutorial! You've learned how to set up a REST API source in dlt and run a data pipeline to load the data into DuckDB. + +Interested in learning more about dlt? Here are some suggestions: + +- Learn more about the REST API source configuration in [REST API source documentation](../dlt-ecosystem/verified-sources/rest_api/) +- Learn how to [create a custom source](./load-data-from-an-api.md) in the advanced tutorial \ No newline at end of file diff --git a/docs/website/docs/tutorial/sql_database.md b/docs/website/docs/tutorial/sql-database.md similarity index 85% rename from docs/website/docs/tutorial/sql_database.md rename to docs/website/docs/tutorial/sql-database.md index 4ff9b5baef..1a7702b637 100644 --- a/docs/website/docs/tutorial/sql_database.md +++ b/docs/website/docs/tutorial/sql-database.md @@ -4,9 +4,9 @@ description: How to extract data from a SQL Database using dlt's SQL Database co keywords: [sql connector, sql database pipeline, sql database] --- -This tutorial will show you how you can use dlt to load data from a SQL Database (PostgreSQL, MySQL, Microsoft SQL Server, Oracle, IBM DB2, etc.) into any dlt-compatible destination (Postgres, BigQuery, Snowflake, DuckDB, etc.). - -To make it easy to reproduce, we will be loading data from the [public MySQL RFam database](https://docs.rfam.org/en/latest/database.html) into a local DuckDB instance. +This tutorial will show you how you can use dlt to load data from a SQL Database (PostgreSQL, MySQL, Microsoft SQL Server, Oracle, IBM DB2, etc.) into any dlt-compatible destination (Postgres, BigQuery, Snowflake, DuckDB, etc.). + +To make it easy to reproduce, we will be loading data from the [public MySQL RFam database](https://docs.rfam.org/en/latest/database.html) into a local DuckDB instance. ## What you will learn @@ -28,7 +28,7 @@ Initialize a new dlt project in your current working directory using the `dlt in dlt init sql_database duckdb ``` -This is a handy CLI command that creates files and folders required for a SQL Database to DuckDB pipeline. You can easily replace `duckdb` with any other [supported destinations](../dlt-ecosystem/destinations). +This is a handy CLI command that creates files and folders required for a SQL Database to DuckDB pipeline. You can easily replace `duckdb` with any other [supported destinations](../dlt-ecosystem/destinations). After running this command, your project will have the following structure: @@ -50,19 +50,19 @@ Here’s what each file does: :::note When deploying your pipeline in a production environment, managing all configurations with the TOML files might not be convenient. In this case, we highly recommend using environment variables or other [configuration providers](../general-usage/credentials/setup#available-config-providers) available in dlt to store secrets and configs instead. -::: - +::: + ## 2. 
Configure the pipeline script -With the necessary files in place, we can now start writing our pipeline script. The existing file `sql_database_pipeline.py` already contains many pre-configured example functions that can help you get started with different data loading scenarios. However, for the purpose of this tutorial, we will be writing a new function from scratch. - +With the necessary files in place, we can now start writing our pipeline script. The existing file `sql_database_pipeline.py` already contains many pre-configured example functions that can help you get started with different data loading scenarios. However, for the purpose of this tutorial, we will be writing a new function from scratch. + :::note Running the script as it is will execute the function `load_standalone_table_resource()`, so remember to comment out the function call from inside the main block. ::: - -The following function will load the tables `family` and `genome`. - + +The following function will load the tables `family` and `genome`. + ```py def load_tables_family_and_genome(): @@ -72,7 +72,7 @@ def load_tables_family_and_genome(): # Create a dlt pipeline object pipeline = dlt.pipeline( pipeline_name="sql_to_duckdb_pipeline", # custom name for the pipeline - destination="duckdb", # dlt destination to which the data will be loaded + destination="duckdb", # dlt destination to which the data will be loaded dataset_name="sql_to_duckdb_pipeline_data" # custom name for the dataset created in the destination ) @@ -89,15 +89,15 @@ if __name__ == '__main__': Explanation: - The `sql_database` source has two built-in helper functions: `sql_database()` and `sql_table()`: - - `sql_database()` is a [dlt source function](https://dlthub.com/docs/general-usage/source) that iteratively loads the tables (in this example, `"family"` and `"genome"`) passed inside the `with_resource()` method. - - `sql_table()` is a [dlt resource function](https://dlthub.com/docs/general-usage/resource) that loads standalone tables. For example, if we wanted to only load the table `"family"`, then we could have done it using `sql_table(table="family")`. + - `sql_database()` is a [dlt source function](../general-usage/source) that iteratively loads the tables (in this example, `"family"` and `"genome"`) passed inside the `with_resource()` method. + - `sql_table()` is a [dlt resource function](../general-usage/resource) that loads standalone tables. For example, if we wanted to only load the table `"family"`, then we could have done it using `sql_table(table="family")`. - `dlt.pipeline()` creates a `dlt` pipeline with the name `"sql_to_duckdb_pipeline"` with the destination DuckDB. -- `pipeline.run()` method loads the data into the destination. +- `pipeline.run()` method loads the data into the destination. + +## 3. Add credentials + +To sucessfully connect to your SQL database, you will need to pass credentials into your pipeline. dlt automatically looks for this information inside the generated TOML files. -## 3. Add credentials - -To sucessfully connect to your SQL database, you will need to pass credentials into your pipeline. dlt automatically looks for this information inside the generated TOML files. 
- Simply paste the [connection details](https://docs.rfam.org/en/latest/database.html) inside `secrets.toml` as follows: ```toml [sources.sql_database.credentials] @@ -107,38 +107,38 @@ password = "" username = "rfamro" host = "mysql-rfam-public.ebi.ac.uk" port = 4497 -``` - -Alternatively, you can also paste the credentials as a connection string: +``` + +Alternatively, you can also paste the credentials as a connection string: ```toml sources.sql_database.credentials="mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam" -``` - +``` + For more details on the credentials format and other connection methods read the section on [configuring connection to the SQL Database](https://dlthub.com/docs/dlt-ecosystem/verified-sources/sql_database#credentials-format). -## 4. Install dependencies - +## 4. Install dependencies + Before running the pipeline, make sure to install all the necessary dependencies: -1. **General dependencies**: These are the general dependencies needed by the `sql_database` source. +1. **General dependencies**: These are the general dependencies needed by the `sql_database` source. ```sh pip install -r requirements.txt ``` -2. **Database-specific dependencies**: In addition to the general dependencies, you will also need to install `pymysql` to connect to the MySQL database in this tutorial: +2. **Database-specific dependencies**: In addition to the general dependencies, you will also need to install `pymysql` to connect to the MySQL database in this tutorial: ```sh pip install pymysql ``` Explanation: dlt uses SQLAlchemy to connect to the source database and hence, also requires the database-specific SQLAlchemy dialect, such as `pymysql` (MySQL), `psycopg2` (Postgres), `pymssql` (MSSQL), `snowflake-sqlalchemy` (Snowflake), etc. See the [SQLAlchemy docs](https://docs.sqlalchemy.org/en/20/dialects/#external-dialects) for a full list of available dialects. -## 5. Run the pipeline - -After performing steps 1-4, you should now be able to successfully run the pipeline by executing the following command: - +## 5. Run the pipeline + +After performing steps 1-4, you should now be able to successfully run the pipeline by executing the following command: + ```sh python sql_database_pipeline.py ``` -This will create the file `sql_to_duckdb_pipeline.duckdb` in your dlt project directory which contains the loaded data. +This will create the file `sql_to_duckdb_pipeline.duckdb` in your dlt project directory which contains the loaded data. ## 6. Explore the data @@ -152,10 +152,10 @@ Next, run the following command to launch the data browser app: ```sh dlt pipeline sql_to_duckdb_pipeline show -``` +``` + +You can explore the loaded data, run queries and see some pipeline execution details. -You can explore the loaded data, run queries and see some pipeline execution details. - ![streamlit-screenshot](https://storage.googleapis.com/dlt-blog-images/docs-sql-database-tutorial-streamlit-screenshot.png) ## 7. 
Append, replace, or merge loaded data @@ -177,9 +177,9 @@ def load_tables_family_and_genome(): source = sql_database().with_resources("family", "genome") pipeline = dlt.pipeline( - pipeline_name="sql_to_duckdb_pipeline", - destination="duckdb", - dataset_name="sql_to_duckdb_pipeline_data" + pipeline_name="sql_to_duckdb_pipeline", + destination="duckdb", + dataset_name="sql_to_duckdb_pipeline_data" ) load_info = pipeline.run(source, write_disposition="replace") # Set write_disposition to load the data with "replace" @@ -195,8 +195,8 @@ Run the pipeline again with `sql_database_pipeline.py`. This time, the data will ### Load with merge -When you want to update the existing data as new data is loaded, you can use the `merge` write disposition. This requires specifying a primary key for the table. The primary key is used to match the new data with the existing data in the destination table. - +When you want to update the existing data as new data is loaded, you can use the `merge` write disposition. This requires specifying a primary key for the table. The primary key is used to match the new data with the existing data in the destination table. + In the previous example, we set `write_disposition="replace"` inside `pipeline.run()` which caused all the tables to be loaded with `replace`. However, it's also possible to define the `write_disposition` strategy separately for each tables using the `apply_hints` method. In the example below, we use `apply_hints` on each table to specify different primary keys for merge: ```py @@ -209,12 +209,12 @@ def load_tables_family_and_genome(): source.genome.apply_hints(write_disposition="merge", primary_key="upid") # merge table "genome" on column "upid" pipeline = dlt.pipeline( - pipeline_name="sql_to_duckdb_pipeline", - destination="duckdb", - dataset_name="sql_to_duckdb_pipeline_data" + pipeline_name="sql_to_duckdb_pipeline", + destination="duckdb", + dataset_name="sql_to_duckdb_pipeline_data" ) - load_info = pipeline.run(source) + load_info = pipeline.run(source) print(load_info) @@ -224,8 +224,8 @@ if __name__ == '__main__': ## 8. Load data incrementally -Often you don't want to load the whole data in each load, but rather only the new or modified data. dlt makes this easy with [incremental loading](../general-usage/incremental-loading). - +Often you don't want to load the whole data in each load, but rather only the new or modified data. dlt makes this easy with [incremental loading](../general-usage/incremental-loading). + In the example below, we configure the table `"family"` to load incrementally based on the column `"updated"`: ```py @@ -237,12 +237,12 @@ def load_tables_family_and_genome(): source.family.apply_hints(incremental=dlt.sources.incremental("updated")) pipeline = dlt.pipeline( - pipeline_name="sql_to_duckdb_pipeline", - destination="duckdb", - dataset_name="sql_to_duckdb_pipeline_data" + pipeline_name="sql_to_duckdb_pipeline", + destination="duckdb", + dataset_name="sql_to_duckdb_pipeline_data" ) - load_info = pipeline.run(source) + load_info = pipeline.run(source) print(load_info) @@ -256,9 +256,9 @@ In the first run of the pipeline `python sql_database_pipeline.py`, the entire t ## What's next? -Congratulations on completing the tutorial! You learned how to set up a SQL Database source in dlt and run a data pipeline to load the data into DuckDB. - +Congratulations on completing the tutorial! You learned how to set up a SQL Database source in dlt and run a data pipeline to load the data into DuckDB. 
+ Interested in learning more about dlt? Here are some suggestions: - Learn more about the SQL Database source configuration in [the SQL Database source reference](../dlt-ecosystem/verified-sources/sql_database) - Learn more about different credential types in [Built-in credentials](../general-usage/credentials/complex_types#built-in-credentials) -- Learn how to [create a custom source](./load-data-from-an-api.md) in the advanced tutorial +- Learn how to [create a custom source](./load-data-from-an-api.md) in the advanced tutorial diff --git a/docs/website/sidebars.js b/docs/website/sidebars.js index 217e5d30c2..299bb2a642 100644 --- a/docs/website/sidebars.js +++ b/docs/website/sidebars.js @@ -28,20 +28,19 @@ function *walkSync(dir) { /** @type {import('@docusaurus/plugin-content-docs').SidebarsConfig} */ const sidebars = { tutorialSidebar: [ - 'intro', - 'getting-started', { type: 'category', - label: 'Tutorial', + label: 'Getting started', link: { type: 'doc', - id: 'tutorial/intro', + id: 'intro', }, items: [ + 'reference/installation', + 'tutorial/rest-api', + 'tutorial/sql-database', 'tutorial/filesystem', 'tutorial/load-data-from-an-api', - 'tutorial/grouping-resources', - 'tutorial/sql_database' ] }, { @@ -156,10 +155,10 @@ const sidebars = { }, { type: 'category', - label: 'Using dlt', + label: 'Core concepts', link: { type: 'generated-index', - title: 'Using dlt', + title: 'Core concepts', slug: 'general-usage', keywords: ['concepts', 'usage'], }, @@ -350,7 +349,6 @@ const sidebars = { keywords: ['reference'], }, items: [ - 'reference/installation', 'reference/command-line-interface', 'reference/telemetry', 'reference/frequently-asked-questions', diff --git a/docs/website/src/css/custom.css b/docs/website/src/css/custom.css index e4d7793372..4d016a9a7f 100644 --- a/docs/website/src/css/custom.css +++ b/docs/website/src/css/custom.css @@ -521,183 +521,309 @@ html[data-theme='dark'] .slack-navbar::after { * Sidebar icons ****************/ + +/* Master version */ /* Introduction */ -.theme-doc-sidebar-menu.menu__list>li:nth-child(1)>a::before { +html.docs-version-master .theme-doc-sidebar-menu.menu__list>li:nth-child(1)>a::before { background-image: url(../../static/img/Introduction-Inactive.svg); } -.theme-doc-sidebar-menu.menu__list>li:hover:nth-child(1)>a::before, +html.docs-version-master .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(1)>a::before, .theme-doc-sidebar-menu.menu__list>li:nth-child(1)>a.menu__link--active::before { background-image: url(../../static/img/Introduction-Active.svg); } -html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(1)>a::before { +html[data-theme='dark'].docs-version-master .theme-doc-sidebar-menu.menu__list>li:nth-child(1)>a::before { background-image: url(../../static/img/Introduction-Inactive-1.svg); } -html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(1)>a::before, -html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(1)>a.menu__link--active::before { +html[data-theme='dark'].docs-version-master .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(1)>a::before, +html[data-theme='dark'].docs-version-master .theme-doc-sidebar-menu.menu__list>li:nth-child(1)>a.menu__link--active::before { background-image: url(../../static/img/Introduction-Active-1.svg); } /* Getting started */ -.theme-doc-sidebar-menu.menu__list>li:nth-child(2)>a::before { +html.docs-version-master .theme-doc-sidebar-menu.menu__list>li:nth-child(2)>a::before { background-image: 
url(../../static/img/GettingStarted-Inactive.svg); } -.theme-doc-sidebar-menu.menu__list>li:hover:nth-child(2)>a::before, +html.docs-version-master .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(2)>a::before, .theme-doc-sidebar-menu.menu__list>li:nth-child(2)>a.menu__link--active::before { background-image: url(../../static/img/GettingStarted-Active.svg); } -html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(2)>a::before { +html[data-theme='dark'].docs-version-master .theme-doc-sidebar-menu.menu__list>li:nth-child(2)>a::before { background-image: url(../../static/img/GettingStarted-Inactive-1.svg); } -html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(2)>a::before, -html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(2)>a.menu__link--active::before { +html[data-theme='dark'].docs-version-master .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(2)>a::before, +html[data-theme='dark'].docs-version-master .theme-doc-sidebar-menu.menu__list>li:nth-child(2)>a.menu__link--active::before { background-image: url(../../static/img/GettingStarted-Active-1.svg); } /* Tutorial */ -.theme-doc-sidebar-menu.menu__list>li:nth-child(3)>div>a::before { +html.docs-version-master .theme-doc-sidebar-menu.menu__list>li:nth-child(3)>div>a::before { background-image: url(../../static/img/Pipelines-Inactive.svg); } -.theme-doc-sidebar-menu.menu__list>li:hover:nth-child(3)>div>a::before, +html.docs-version-master .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(3)>div>a::before, .theme-doc-sidebar-menu.menu__list>li:nth-child(3)>div>[aria-expanded="true"]::before { background-image: url(../../static/img/Pipelines-Active.svg); } -html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(3)>div>a::before { +html[data-theme='dark'].docs-version-master .theme-doc-sidebar-menu.menu__list>li:nth-child(3)>div>a::before { background-image: url(../../static/img/Pipelines-Inactive-1.svg); } -html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(3)>div>a::before, +html[data-theme='dark'].docs-version-master .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(3)>div>a::before, html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(3)>div>[aria-expanded="true"]::before { background-image: url(../../static/img/Pipelines-Active-1.svg); } +/* Integrations */ + +html.docs-version-master .theme-doc-sidebar-menu.menu__list>li:nth-child(4)>div>a::before { + background-image: url(../../static/img/UsingLoadedData-Inactive.svg); +} + +html.docs-version-master .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(4)>div>a::before, +.theme-doc-sidebar-menu.menu__list>li:nth-child(4)>div>[aria-expanded="true"]::before { + background-image: url(../../static/img/UsingLoadedData-Active.svg); +} + +html[data-theme='dark'].docs-version-master .theme-doc-sidebar-menu.menu__list>li:nth-child(4)>div>a::before { + background-image: url(../../static/img/UsingLoadedData-Inactive-1.svg); +} + +html[data-theme='dark'].docs-version-master .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(4)>div>a::before, +html[data-theme='dark'].docs-version-master .theme-doc-sidebar-menu.menu__list>li:nth-child(4)>div>[aria-expanded="true"]::before { + background-image: url(../../static/img/UsingLoadedData-Active-1.svg); +} + +/* Using dlt */ + +html.docs-version-master .theme-doc-sidebar-menu.menu__list>li:nth-child(5)>div>a::before { + background-image: url(../../static/img/GeneralUsage-Inactive.svg); +} + 
+html.docs-version-master .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(5)>div>a::before,
+.theme-doc-sidebar-menu.menu__list>li:nth-child(5)>div>[aria-expanded="true"]::before {
+  background-image: url(../../static/img/GeneralUsage-Active.svg);
+}
+
+html[data-theme='dark'].docs-version-master .theme-doc-sidebar-menu.menu__list>li:nth-child(5)>div>a::before {
+  background-image: url(../../static/img/GeneralUsage-Inactive-1.svg);
+}
+
+html[data-theme='dark'].docs-version-master .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(5)>div>a::before,
+html[data-theme='dark'].docs-version-master .theme-doc-sidebar-menu.menu__list>li:nth-child(5)>div>[aria-expanded="true"]::before {
+  background-image: url(../../static/img/GeneralUsage-Active-1.svg);
+}
+
+/* How-to Guides */
+
+html.docs-version-master .theme-doc-sidebar-menu.menu__list>li:nth-child(6)>div>a::before {
+  background-image: url(../../static/img/Walkthrough-Inactive.svg);
+}
+
+html.docs-version-master .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(6)>div>a::before,
+.theme-doc-sidebar-menu.menu__list>li:nth-child(6)>div>[aria-expanded="true"]::before {
+  background-image: url(../../static/img/Walkthrough-Active.svg);
+}
+
+html[data-theme='dark'].docs-version-master .theme-doc-sidebar-menu.menu__list>li:nth-child(6)>div>a::before {
+  background-image: url(../../static/img/Walkthrough-Inactive-1.svg);
+}
+
+html[data-theme='dark'].docs-version-master .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(6)>div>a::before,
+html[data-theme='dark'].docs-version-master .theme-doc-sidebar-menu.menu__list>li:nth-child(6)>div>[aria-expanded="true"]::before {
+  background-image: url(../../static/img/Walkthrough-Active-1.svg);
+}
+
+/* Code Examples */
+
+html.docs-version-master .theme-doc-sidebar-menu.menu__list>li:nth-child(7)>div>a::before {
+  background-image: url(../../static/img/Howdltworks-Inactive.svg);
+}
+
+html.docs-version-master .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(7)>div>a::before,
+.theme-doc-sidebar-menu.menu__list>li:nth-child(7)>div>[aria-expanded="true"]::before {
+  background-image: url(../../static/img/Howdltworks-Active.svg);
+}
+
+html[data-theme='dark'].docs-version-master .theme-doc-sidebar-menu.menu__list>li:nth-child(7)>div>a::before {
+  background-image: url(../../static/img/Howdltworks-Inactive-1.svg);
+}
+
+html[data-theme='dark'].docs-version-master .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(7)>div>a::before,
+html[data-theme='dark'].docs-version-master .theme-doc-sidebar-menu.menu__list>li:nth-child(7)>div>[aria-expanded="true"]::before {
+  background-image: url(../../static/img/Howdltworks-Active-1.svg);
+}
+
+/* Reference */
+
+html.docs-version-master .theme-doc-sidebar-menu.menu__list>li:nth-child(8)>div>a::before {
+  background-image: url(../../static/img/Reference-Inactive.svg);
+}
+
+html.docs-version-master .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(8)>div>a::before,
+.theme-doc-sidebar-menu.menu__list>li:nth-child(8)>div>[aria-expanded="true"]::before {
+  background-image: url(../../static/img/Reference-Active.svg);
+}
+
+html[data-theme='dark'].docs-version-master .theme-doc-sidebar-menu.menu__list>li:nth-child(8)>div>a::before {
+  background-image: url(../../static/img/Reference-Inactive-1.svg);
+}
+
+html[data-theme='dark'].docs-version-master .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(8)>div>a::before,
+html[data-theme='dark'].docs-version-master
.theme-doc-sidebar-menu.menu__list>li:nth-child(8)>div>[aria-expanded="true"]::before { + background-image: url(../../static/img/Reference-Active-1.svg); +} + +/* End of Master version */ + +/* Development version */ + +/* Getting started */ + +.theme-doc-sidebar-menu.menu__list>li:nth-child(1)>div>a::before { + background-image: url(../../static/img/GettingStarted-Inactive.svg); +} + +.theme-doc-sidebar-menu.menu__list>li:hover:nth-child(1)>div>a::before, +.theme-doc-sidebar-menu.menu__list>li:nth-child(1)>div>[aria-expanded="true"]::before { + background-image: url(../../static/img/GettingStarted-Active.svg); +} + +html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(1)>div>a::before { + background-image: url(../../static/img/GettingStarted-Inactive-1.svg); +} + +html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(1)>div>a::before, +html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(1)>div>[aria-expanded="true"]::before { + background-image: url(../../static/img/GettingStarted-Active-1.svg); +} + /* Sources */ -.theme-doc-sidebar-menu.menu__list>li:nth-child(4)>div>a::before { +.theme-doc-sidebar-menu.menu__list>li:nth-child(2)>div>a::before { background-image: url(../../static/img/Sources-Inactive.svg); } -.theme-doc-sidebar-menu.menu__list>li:hover:nth-child(4)>div>a::before, -.theme-doc-sidebar-menu.menu__list>li:nth-child(4)>div>[aria-expanded="true"]::before { +.theme-doc-sidebar-menu.menu__list>li:hover:nth-child(2)>div>a::before, +.theme-doc-sidebar-menu.menu__list>li:nth-child(2)>div>[aria-expanded="true"]::before { background-image: url(../../static/img/Sources-Active.svg); } -html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(4)>div>a::before { +html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(2)>div>a::before { background-image: url(../../static/img/Sources-Inactive-1.svg); } -html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(4)>div>a::before, -html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(4)>div>[aria-expanded="true"]::before { +html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(2)>div>a::before, +html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(2)>div>[aria-expanded="true"]::before { background-image: url(../../static/img/Sources-Active-1.svg); } /* Destinations */ -.theme-doc-sidebar-menu.menu__list>li:nth-child(5)>div>a::before { +.theme-doc-sidebar-menu.menu__list>li:nth-child(3)>div>a::before { background-image: url(../../static/img/Destinations-Inactive.svg); } -.theme-doc-sidebar-menu.menu__list>li:hover:nth-child(5)>div>a::before, -.theme-doc-sidebar-menu.menu__list>li:nth-child(5)>div>[aria-expanded="true"]::before { +.theme-doc-sidebar-menu.menu__list>li:hover:nth-child(3)>div>a::before, +.theme-doc-sidebar-menu.menu__list>li:nth-child(3)>div>[aria-expanded="true"]::before { background-image: url(../../static/img/Destinations-Active.svg); } -html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(5)>div>a::before { +html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(3)>div>a::before { background-image: url(../../static/img/Destinations-Inactive-1.svg); } -html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(5)>div>a::before, -html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(5)>div>[aria-expanded="true"]::before { +html[data-theme='dark'] 
.theme-doc-sidebar-menu.menu__list>li:hover:nth-child(3)>div>a::before, +html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(3)>div>[aria-expanded="true"]::before { background-image: url(../../static/img/Destinations-Active-1.svg); } /* Using dlt */ -.theme-doc-sidebar-menu.menu__list>li:nth-child(6)>div>a::before { +.theme-doc-sidebar-menu.menu__list>li:nth-child(4)>div>a::before { background-image: url(../../static/img/GeneralUsage-Inactive.svg); } -.theme-doc-sidebar-menu.menu__list>li:hover:nth-child(6)>div>a::before, -.theme-doc-sidebar-menu.menu__list>li:nth-child(6)>div>[aria-expanded="true"]::before { +.theme-doc-sidebar-menu.menu__list>li:hover:nth-child(4)>div>a::before, +.theme-doc-sidebar-menu.menu__list>li:nth-child(4)>div>[aria-expanded="true"]::before { background-image: url(../../static/img/GeneralUsage-Active.svg); } -html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(6)>div>a::before { +html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(4)>div>a::before { background-image: url(../../static/img/GeneralUsage-Inactive-1.svg); } -html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(6)>div>a::before, -html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(6)>div>[aria-expanded="true"]::before { +html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(4)>div>a::before, +html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(4)>div>[aria-expanded="true"]::before { background-image: url(../../static/img/GeneralUsage-Active-1.svg); } /* How-to Guides */ -.theme-doc-sidebar-menu.menu__list>li:nth-child(7)>div>a::before { +.theme-doc-sidebar-menu.menu__list>li:nth-child(5)>div>a::before { background-image: url(../../static/img/Walkthrough-Inactive.svg); } -.theme-doc-sidebar-menu.menu__list>li:hover:nth-child(7)>div>a::before, -.theme-doc-sidebar-menu.menu__list>li:nth-child(7)>div>[aria-expanded="true"]::before { +.theme-doc-sidebar-menu.menu__list>li:hover:nth-child(5)>div>a::before, +.theme-doc-sidebar-menu.menu__list>li:nth-child(5)>div>[aria-expanded="true"]::before { background-image: url(../../static/img/Walkthrough-Active.svg); } -html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(7)>div>a::before { +html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(5)>div>a::before { background-image: url(../../static/img/Walkthrough-Inactive-1.svg); } -html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(7)>div>a::before, -html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(7)>div>[aria-expanded="true"]::before { +html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(5)>div>a::before, +html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(5)>div>[aria-expanded="true"]::before { background-image: url(../../static/img/Walkthrough-Active-1.svg); } /* Code Examples */ -.theme-doc-sidebar-menu.menu__list>li:nth-child(8)>div>a::before { +.theme-doc-sidebar-menu.menu__list>li:nth-child(6)>div>a::before { background-image: url(../../static/img/Howdltworks-Inactive.svg); } -.theme-doc-sidebar-menu.menu__list>li:hover:nth-child(8)>div>a::before, -.theme-doc-sidebar-menu.menu__list>li:nth-child(8)>div>[aria-expanded="true"]::before { +.theme-doc-sidebar-menu.menu__list>li:hover:nth-child(6)>div>a::before, +.theme-doc-sidebar-menu.menu__list>li:nth-child(6)>div>[aria-expanded="true"]::before { background-image: 
url(../../static/img/Howdltworks-Active.svg); } -html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(8)>div>a::before { +html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(6)>div>a::before { background-image: url(../../static/img/Howdltworks-Inactive-1.svg); } -html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(8)>div>a::before, -html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(8)>div>[aria-expanded="true"]::before { +html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(6)>div>a::before, +html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(6)>div>[aria-expanded="true"]::before { background-image: url(../../static/img/Howdltworks-Active-1.svg); } /* Reference */ -.theme-doc-sidebar-menu.menu__list>li:nth-child(9)>div>a::before { +.theme-doc-sidebar-menu.menu__list>li:nth-child(7)>div>a::before { background-image: url(../../static/img/Reference-Inactive.svg); } -.theme-doc-sidebar-menu.menu__list>li:hover:nth-child(9)>div>a::before, -.theme-doc-sidebar-menu.menu__list>li:nth-child(9)>div>[aria-expanded="true"]::before { +.theme-doc-sidebar-menu.menu__list>li:hover:nth-child(7)>div>a::before, +.theme-doc-sidebar-menu.menu__list>li:nth-child(7)>div>[aria-expanded="true"]::before { background-image: url(../../static/img/Reference-Active.svg); } -html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(9)>div>a::before { +html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(7)>div>a::before { background-image: url(../../static/img/Reference-Inactive-1.svg); } -html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(9)>div>a::before, -html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(9)>div>[aria-expanded="true"]::before { +html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:hover:nth-child(7)>div>a::before, +html[data-theme='dark'] .theme-doc-sidebar-menu.menu__list>li:nth-child(7)>div>[aria-expanded="true"]::before { background-image: url(../../static/img/Reference-Active-1.svg); }