From ad31994d7791bf44719629b23c3be9122238289f Mon Sep 17 00:00:00 2001 From: Anton Burnashev Date: Thu, 22 Aug 2024 14:00:24 +0200 Subject: [PATCH] Add intro, and rest api tutorial --- docs/website/docs/intro.md | 158 +++++---- docs/website/docs/tutorial/filesystem.md | 26 ++ docs/website/docs/tutorial/intro.md | 2 +- .../docs/tutorial/load-data-from-an-api.md | 2 +- docs/website/docs/tutorial/rest-api.md | 327 ++++++++++++++++++ docs/website/docs/tutorial/sql-database.md | 25 ++ docs/website/sidebars.js | 5 +- 7 files changed, 480 insertions(+), 65 deletions(-) create mode 100644 docs/website/docs/tutorial/filesystem.md create mode 100644 docs/website/docs/tutorial/rest-api.md create mode 100644 docs/website/docs/tutorial/sql-database.md diff --git a/docs/website/docs/intro.md b/docs/website/docs/intro.md index c269c987b8..644f3da51a 100644 --- a/docs/website/docs/intro.md +++ b/docs/website/docs/intro.md @@ -10,112 +10,146 @@ import snippets from '!!raw-loader!./intro-snippets.py'; ![dlt pacman](/img/dlt-pacman.gif) -## What is `dlt`? +## What is dlt? + +dlt is a Python library that simplifies how you move data between various sources and destinations. It offers a lightweight interface for extracting data from [REST APIs](./tutorial/rest-api), [SQL databases](./tutorial/sql-database), [cloud storage](./tutorial/filesystem), [Python data structures](getting-started), and more. + +dlt is designed to be easy to use, flexible, and scalable: + +- dlt infers [data types](./general-usage/schema/#data-types) and [schemas](./general-usage/schema), normalizes the data, and handles nested data structures. +- dlt supports a variety of [popular destinations](./dlt-ecosystem/destinations/) and has an interface to add [custom destinations](./dlt-ecosystem/destinations/destination) to create reverse ETL pipelines. +- Use dlt locally or [in the cloud](./walkthroughs/deploy-a-pipeline) to build data pipelines, data lakes, and data warehouses.
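To make "handles nested data structures" more concrete, here is a minimal plain-Python sketch of flattening a nested record into column-like keys. It is an illustration only, not dlt's implementation: the `flatten` helper and the `__` separator are assumptions chosen for the example, and lists are left untouched.

```python
from typing import Any


def flatten(record: dict[str, Any], parent: str = "", sep: str = "__") -> dict[str, Any]:
    """Collapse nested dicts into a single level of column-like keys (hypothetical helper)."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        # Build the column name from the path walked so far.
        column = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the parent path along.
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat


row = flatten({"id": 1, "address": {"city": "Berlin", "geo": {"lat": 52.5}}})
print(row)
```

Each nested key ends up as its own top-level entry, which is roughly the shape a relational destination needs before a row can be written.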
+ +To get started with dlt, install the library using pip: -`dlt` is an open-source library that you can add to your Python scripts to load data -from various and often messy data sources into well-structured, live datasets. To get started, install it with: ```sh pip install dlt ``` :::tip -We recommend using a clean virtual environment for your experiments! Here are [detailed instructions](/reference/installation). +We recommend using a clean virtual environment for your experiments! Here are [detailed instructions](/reference/installation) on how to set up one. ::: -Unlike other solutions, with dlt, there's no need to use any backends or containers. Simply import `dlt` in a Python file or a Jupyter Notebook cell, and create a pipeline to load data into any of the [supported destinations](dlt-ecosystem/destinations/). You can load data from any source that produces Python data structures, including APIs, files, databases, and more. `dlt` also supports building a [custom destination](dlt-ecosystem/destinations/destination.md), which you can use as reverse ETL. - -The library will create or update tables, infer data types, and handle nested data automatically. Here are a few example pipelines: +## Load data with dlt from … - + -:::tip -Looking to use a REST API as a source? Explore our new [REST API generic source](dlt-ecosystem/verified-sources/rest_api) for a declarative way to load data. -::: +Use dlt's [REST API source](tutorial/rest-api) to extract data from any REST API. 
Define the API endpoints you’d like to fetch data from, the pagination method, and authentication, and dlt will handle the rest: - +```py +# from dlt.sources import rest_api + +source = rest_api({ + "client": { + "base_url": "https://api.example.com/", + "auth": { + "token": dlt.secrets["your_api_token"], + }, + "paginator": { + "type": "json_response", + "next_url_path": "paging.next", + }, + }, + "resources": [ + "posts", + "comments" + ] +}) +pipeline = dlt.pipeline( + pipeline_name="rest_api_example", + destination="duckdb", + dataset_name="rest_api_data", +) -Copy this example to a file or a Jupyter Notebook and run it. To make it work with the DuckDB destination, you'll need to install the **duckdb** dependency (the default `dlt` installation is really minimal): -```sh -pip install "dlt[duckdb]" +load_info = pipeline.run(source) ``` -Now **run** your Python file or Notebook cell. -How it works? The library extracts data from a [source](general-usage/glossary.md#source) (here: **chess.com REST API**), inspects its structure to create a -[schema](general-usage/glossary.md#schema), structures, normalizes, and verifies the data, and then -loads it into a [destination](general-usage/glossary.md#destination) (here: **duckdb**, into a database schema **player_data** and table name **player**). +Follow the [REST API source tutorial](tutorial/rest-api) to learn more about the source configuration and pagination methods. + + +Use the [SQL source](tutorial/sql-database) to extract data from databases like PostgreSQL, MySQL, SQLite, Oracle, and more.
- +```py +# from dlt.sources.sql import sql_database - +source = sql_database( + "mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam" +) -Initialize the [Slack source](dlt-ecosystem/verified-sources/slack) with `dlt init` command: +pipeline = dlt.pipeline( + pipeline_name="sql_database_example", + destination="duckdb", + dataset_name="sql_data", +) -```sh -dlt init slack duckdb +load_info = pipeline.run(source) ``` -Create and run a pipeline: +Follow the [SQL source tutorial](tutorial/sql-database) to learn more about the source configuration and supported databases. + + + + +[Filesystem](./tutorial/filesystem) source extracts data from AWS S3, Google Cloud Storage, Google Drive, Azure, or a local file system. ```py -import dlt +# from dlt.sources.filesystem import filesystem -from slack import slack_source +source = filesystem( + bucket_url="s3://example-bucket", + file_glob="*.csv" +) pipeline = dlt.pipeline( - pipeline_name="slack", + pipeline_name="filesystem_example", destination="duckdb", - dataset_name="slack_data" -) - -source = slack_source( - start_date=datetime(2023, 9, 1), - end_date=datetime(2023, 9, 8), - page_size=100, + dataset_name="filesystem_data", ) load_info = pipeline.run(source) -print(load_info) ``` - - - - Pass anything that you can load with Pandas to `dlt` - - - +Follow the [filesystem source tutorial](./tutorial/filesystem) to learn more about the source configuration and supported storage services. - + -:::tip -Use our verified [SQL database source](dlt-ecosystem/verified-sources/sql_database) -to sync your databases with warehouses, data lakes, or vector stores. 
-::: +dlt is able to load data from Python generators or directly from Python data structures: - +```py +import dlt +@dlt.resource +def foo(): + for i in range(10): + yield {"id": i, "name": f"This is item {i}"} -Install **pymysql** driver: -```sh -pip install sqlalchemy pymysql +pipeline = dlt.pipeline( + pipeline_name="python_data_example", + destination="duckdb", +) + +load_info = pipeline.run(foo) ``` +Check out the [getting started guide](getting-started) to learn more about working with Python data. + + -## Why use `dlt`? +## Why use dlt? - Automated maintenance - with schema inference and evolution and alerts, and with short declarative code, maintenance becomes simple. @@ -124,18 +158,18 @@ external APIs, backends, or containers, scales on micro and large infra alike. - User-friendly, declarative interface that removes knowledge obstacles for beginners while empowering senior professionals. -## Getting started with `dlt` -1. Dive into our [Getting started guide](getting-started.md) for a quick intro to the essentials of `dlt`. +## Getting started with dlt +1. Dive into our [Getting started guide](getting-started.md) for a quick intro to the essentials of dlt. 2. Play with the [Google Colab demo](https://colab.research.google.com/drive/1NfSB1DpwbbHX9_t5vlalBTf13utwpMGx?usp=sharing). -This is the simplest way to see `dlt` in action. +This is the simplest way to see dlt in action. 3. Read the [Tutorial](tutorial/intro) to learn how to build a pipeline that loads data from an API. 4. Check out the [How-to guides](walkthroughs/) for recipes on common use cases for creating, running, and deploying pipelines. 5. Ask us on [Slack](https://dlthub.com/community) if you have any questions about use cases or the library. -## Join the `dlt` community +## Join the dlt community 1. Give the library a ⭐ and check out the code on [GitHub](https://github.com/dlt-hub/dlt). 1. 
Ask questions and share how you use the library on diff --git a/docs/website/docs/tutorial/filesystem.md b/docs/website/docs/tutorial/filesystem.md new file mode 100644 index 0000000000..b560f90870 --- /dev/null +++ b/docs/website/docs/tutorial/filesystem.md @@ -0,0 +1,26 @@ +--- +title: Load data from Filesystem or Cloud Storage +description: How to extract and load data from a filesystem or cloud storage using dlt +keywords: [tutorial, filesystem, cloud storage, dlt, python, data pipeline, incremental loading] +--- + +## What you will learn + +- How to set up a filesystem or cloud storage source +- Configuration basics for filesystems and cloud storage +- Loading methods +- Incremental loading of data from filesystems or cloud storage + +## Prerequisites + +- Python 3.9 or higher +- Virtual environment set up + +## Installing dlt +## Setting up a new project +## Creating a new pipeline +## Configuring filesystem source as the data source +## Running the pipeline +## Append, replace, and merge loading methods +## Incremental loading +## What's next? 
diff --git a/docs/website/docs/tutorial/intro.md b/docs/website/docs/tutorial/intro.md index 2d53412ae0..c15e123239 100644 --- a/docs/website/docs/tutorial/intro.md +++ b/docs/website/docs/tutorial/intro.md @@ -1,5 +1,5 @@ --- -title: Tutorial +title: Tutorials description: Build a data pipeline with dlt keywords: [tutorial, api, github, duckdb, pipeline] --- diff --git a/docs/website/docs/tutorial/load-data-from-an-api.md b/docs/website/docs/tutorial/load-data-from-an-api.md index ec6136b6d3..93e7d1696f 100644 --- a/docs/website/docs/tutorial/load-data-from-an-api.md +++ b/docs/website/docs/tutorial/load-data-from-an-api.md @@ -1,5 +1,5 @@ --- -title: Load data from an API +title: "Building a custom dlt pipeline" description: quick start with dlt keywords: [getting started, quick start, basic examples] --- diff --git a/docs/website/docs/tutorial/rest-api.md b/docs/website/docs/tutorial/rest-api.md new file mode 100644 index 0000000000..9f87d9ac48 --- /dev/null +++ b/docs/website/docs/tutorial/rest-api.md @@ -0,0 +1,327 @@ +--- +title: Load data from a REST API +description: How to extract data from a REST API using dlt's generic REST API source +keywords: [tutorial, api, github, duckdb, rest api, source, pagination, authentication] +--- + +This tutorial shows how to extract data from a REST API using dlt's generic REST API source. The tutorial will guide you through the basics of setting up and configuring the source to load data from the API into a destination. + +As a practical example, we'll build a data pipeline that loads data from the Pokemon API into DuckDB.
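For orientation, much of what a generic REST API source has to automate is a fetch-and-follow loop over paginated responses. The sketch below shows that loop in plain Python with no network access. It is not dlt's implementation: `FAKE_API`, `fetch`, and `iterate_records` are hypothetical names, and the response shape (a `results` list plus a `next` URL) is an assumption chosen for illustration.

```python
from typing import Iterator, Optional

# Hypothetical in-memory stand-in for a paginated API: URL -> JSON body.
FAKE_API = {
    "https://api.example.com/pokemon": {
        "results": [{"name": "bulbasaur"}, {"name": "ivysaur"}],
        "next": "https://api.example.com/pokemon?offset=2",
    },
    "https://api.example.com/pokemon?offset=2": {
        "results": [{"name": "venusaur"}],
        "next": None,
    },
}


def fetch(url: str) -> dict:
    """Pretend HTTP GET; a real client would perform a network request here."""
    return FAKE_API[url]


def iterate_records(start_url: str) -> Iterator[dict]:
    """Yield records page by page, following the `next` link until it is empty."""
    url: Optional[str] = start_url
    while url:
        page = fetch(url)
        yield from page["results"]
        url = page.get("next")


names = [record["name"] for record in iterate_records("https://api.example.com/pokemon")]
print(names)
```

With a declarative source, this loop is described by configuration (endpoint names, paginator type) rather than written by hand.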
+ +## What you will learn + +- How to set up a REST API source +- Configuration basics for API endpoints +- Handling pagination, authentication, and relationships between different resources +- Loading methods +- Incremental loading of data from REST APIs + +## Prerequisites + +- Python 3.9 or higher +- Virtual environment set up + +## Installing dlt + +Before we start, make sure you have a Python virtual environment set up. Follow the instructions in the [installation guide](https://dlthub.com/docs/reference/installation) to create a new virtual environment and install dlt. + +Verify that dlt is installed by running: + +```sh +dlt --version +``` + +If you see the version number (such as "dlt 0.5.3"), you're ready to proceed. + +## Setting up a new project + +Initialize a new dlt project with DuckDB as the destination database: + +```sh +dlt init rest_api duckdb +``` + +`dlt init` creates multiple files and a directory for your project. Let's take a look at the project structure: + +```sh +rest_api_pipeline.py +requirements.txt +.dlt/ + config.toml + secrets.toml +``` + +Here's what each file and directory contains: + +- `rest_api_pipeline.py`: This is the main script where you'll define your data pipeline. It contains two basic pipeline examples for Pokemon and GitHub APIs. You can modify or rename this file as needed. +- `requirements.txt`: This file lists all the Python dependencies required for your project. +- `.dlt/`: This directory contains the [configuration files](../general-usage/credentials/) for your project: + - `secrets.toml`: This file stores your API keys, tokens, and other sensitive information. + - `config.toml`: This file contains the configuration settings for your dlt project. + +## Installing dependencies + +Before we proceed, let's install the required dependencies for this tutorial. 
Run the following command to install the dependencies listed in the `requirements.txt` file: + +```sh +pip install -r requirements.txt +``` + +## Running the pipeline + +Let's verify that the pipeline is working as expected. Run the following command to execute the pipeline: + +```sh +python rest_api_pipeline.py +``` + +You should see the output of the pipeline execution in the terminal. The output will also display the location of the DuckDB database file where the data is stored: + +```sh +Pipeline rest_api_pokemon load step completed in 1.08 seconds +1 load package(s) were loaded to destination duckdb and into dataset rest_api_data +The duckdb destination used duckdb:////home/user-name/quick_start/rest_api_pokemon.duckdb location to store data +Load package 1692364844.9254808 is LOADED and contains no failed jobs +``` + +## Exploring the data + +Now that the pipeline has run successfully, let's explore the data loaded into DuckDB. dlt comes with a built-in browser application that allows you to interact with the data. To enable it, run the following command: + +```sh +pip install streamlit +``` + +Next, run the following command to start the data browser: + +```sh +dlt pipeline rest_api_pokemon show +``` + +The command opens a new browser window with the data browser application. You can explore the loaded data, run queries, and see some pipeline execution details: + +![Streamlit Explore data](/img/streamlit-new.png) + +## Configuring the REST API source + +Now that your environment and the project are set up, let's take a closer look at the configuration of the REST API source.
Open the `rest_api_pipeline.py` file in your code editor and locate the following code snippet: + +```py +pipeline = dlt.pipeline( + pipeline_name="rest_api_pokemon", + destination="duckdb", + dataset_name="rest_api_data", +) + +pokemon_source = rest_api_source( + { + "client": { + "base_url": "https://pokeapi.co/api/v2/", + }, + "resource_defaults": { + "endpoint": { + "params": { + "limit": 1000, + }, + }, + }, + "resources": [ + "pokemon", + "berry", + "location", + ], + } +) + +load_info = pipeline.run(pokemon_source) +print(load_info) +``` + +The `rest_api_source` function creates a new REST API source object. It uses the configuration object with the following structure: + + +```py +config: RESTAPIConfig = { + "client": { + ... + }, + "resource_defaults": { + ... + }, + "resources": [ + ... + ], +} +``` + +- The `client` configuration is used to connect to the API's endpoints. Here we specify the base URL of the Pokemon API (`https://pokeapi.co/api/v2/`). +- The `resource_defaults` configuration allows you to set default parameters for all resources. Normally you would set common parameters here, such as pagination limits. In this example, we set the `limit` parameter to 1000 for all resources to retrieve more data in a single request and reduce the number of API calls. +- The `resources` list contains the names of the resources you want to load from the API. REST API will use some conventions to determine the endpoint URL based on the resource name. For example, the resource name `pokemon` will be translated to the endpoint URL `https://pokeapi.co/api/v2/pokemon`. + +## Append, replace, and merge loading methods + +Try running the pipeline again with `python rest_api_pipeline.py`. You will notice that +all the tables have data duplicated. This is because the default load mode is `append`. It is very useful, for example, when you have daily data updates and you want to ingest them. 
But in this case, we want to replace the data in the destination table with the new data. + +To do that, you can change the loading method in the pipeline configuration. Open the `rest_api_pipeline.py` and change the pipeline configuration to use the `replace` write disposition: + +```py +pipeline = dlt.pipeline( + pipeline_name="rest_api_pokemon", + destination="duckdb", + dataset_name="rest_api_data", +) + +pokemon_source = rest_api_source( + { + "client": { + "base_url": "https://pokeapi.co/api/v2/", + }, + "resource_defaults": { + "endpoint": { + "params": { + "limit": 1000, + }, + }, + "write_disposition": "replace", # Change the write disposition to replace + }, + "resources": [ + "pokemon", + "berry", + "location", + ], + } +) + +load_info = pipeline.run(pokemon_source) +print(load_info) +``` + +### Define resource relationships + +When you have a resource that depends on another resource, you can define the relationship using the `resolve` configuration. This configuration allows you to link a path parameter in the child resource to a field in the parent resource's data. + +For our Pokemon API example, let's consider the `pokemon` resource which depends on the `location` resource. Suppose we want to retrieve details about Pokémon encounters based on their location ID. The `location_id` parameter in the `pokemon` endpoint configuration is resolved from the `id` field of the `location` resource: + +```py +{ + "resources": [ + { + "name": "location", + "endpoint": { + "path": "location", + # ... + }, + }, + { + "name": "pokemon", + "endpoint": { + "path": "location/{location_id}/pokemon", + "params": { + "location_id": { + "type": "resolve", + "resource": "location", + "field": "id", + } + }, + }, + "include_from_parent": ["name"], + }, + ], +} +``` + +This configuration tells the source to get location IDs from the `location` resource and use them to fetch Pokémon encounter details for each location. 
So if the `location` resource yields the following data: + +```json +[ + {"id": 1, "name": "Kanto"}, + {"id": 2, "name": "Johto"}, + {"id": 3, "name": "Hoenn"} +] +``` + +The `pokemon` resource will make requests to the following endpoints: + +- `location/1/pokemon` +- `location/2/pokemon` +- `location/3/pokemon` + +The syntax for the `resolve` field in parameter configuration is: + +```py +{ + "": { + "type": "resolve", + "resource": "", + "field": "", + } +} +``` + +The `field` value can be specified as a [JSONPath](https://github.com/h2non/jsonpath-ng?tab=readme-ov-file#jsonpath-syntax) to select a nested field in the parent resource data. For example: `"field": "items[0].id"`. + +Under the hood, dlt handles this by using a [transformer resource](../general-usage/resource.md#process-resources-with-dlttransformer). + +## Load data incrementally + +When working with some APIs, you may need to load data incrementally to avoid fetching the entire dataset every time and to reduce the load time. APIs that support incremental loading usually provide a way to fetch only new or changed data (most often by using a timestamp field like `updated_at`, `created_at`, or incremental IDs). + +To illustrate incremental loading, let's consider the GitHub API. In the `rest_api_pipeline.py` file, you can find an example of how to load data from the GitHub API incrementally.
Let's take a look at the configuration: + +```py +pipeline = dlt.pipeline( + pipeline_name="rest_api_github", + destination="duckdb", + dataset_name="rest_api_data", +) + +github_source = rest_api_source({ + "client": { + "base_url": "https://api.github.com/repos/dlt-hub/dlt/", + }, + "resource_defaults": { + "primary_key": "id", + "write_disposition": "merge", + "endpoint": { + "params": { + "per_page": 100, + }, + }, + }, + "resources": [ + { + "name": "issues", + "endpoint": { + "path": "issues", + "params": { + "sort": "updated", + "direction": "desc", + "state": "open", + "since": { + "type": "incremental", + "cursor_path": "updated_at", + "initial_value": "2024-01-25T11:21:28Z", + }, + }, + }, + }, + ], +}) + +load_info = pipeline.run(github_source) +print(load_info) +``` + +In this configuration, the `since` parameter is defined as a special incremental parameter. The `cursor_path` field specifies the JSON path to the field in the response data that dlt tracks to detect new or updated records, and `initial_value` sets the starting value of the incremental parameter. This value will be used in the first request to fetch the data. + +When the pipeline runs, dlt will automatically update the `since` parameter with the latest value from the response data. This way, you can fetch only the new or updated data from the API. + +## What's next? + +Congratulations on completing the tutorial! You've learned how to set up a REST API source in dlt and run a data pipeline to load the data into DuckDB. + +Interested in learning more about dlt?
Here are some suggestions: + +- Learn more about the REST API source configuration in the [REST API source documentation](../dlt-ecosystem/verified-sources/rest_api.md) +- Learn how to create a custom source in the [Creating a custom source tutorial](./load-data-from-an-api.md) \ No newline at end of file diff --git a/docs/website/docs/tutorial/sql-database.md b/docs/website/docs/tutorial/sql-database.md new file mode 100644 index 0000000000..c4ae200b8c --- /dev/null +++ b/docs/website/docs/tutorial/sql-database.md @@ -0,0 +1,25 @@ +--- +title: Load data from a SQL database +description: How to extract and load data from a SQL database using dlt +keywords: [tutorial, sql database, dlt, python, data pipeline, incremental loading] +--- + +## What you will learn + +- How to set up a SQL database source +- Configuration basics for SQL databases +- Loading methods +- Incremental loading of data from SQL databases + +## Prerequisites + +- Python 3.9 or higher +- Virtual environment set up + +## Installing dlt +## Setting up a new project +## Creating a new pipeline +## Running the pipeline +## Append, replace, and merge loading methods +## Incremental loading +## What's next? \ No newline at end of file diff --git a/docs/website/sidebars.js b/docs/website/sidebars.js index 921c3c0dc4..fe9bd208d3 100644 --- a/docs/website/sidebars.js +++ b/docs/website/sidebars.js @@ -32,12 +32,15 @@ const sidebars = { 'getting-started', { type: 'category', - label: 'Tutorial', + label: 'Tutorials', link: { type: 'doc', id: 'tutorial/intro', }, items: [ + 'tutorial/rest-api', + 'tutorial/sql-database', + 'tutorial/filesystem', 'tutorial/load-data-from-an-api', 'tutorial/grouping-resources', ]