Docs: update the introduction, add the rest_api tutorial (#1729)
Co-authored-by: Akela Drissner-Schmid <[email protected]>
Showing 10 changed files with 1,077 additions and 557 deletions.

The updated introduction page reads as follows:

import snippets from '!!raw-loader!./intro-snippets.py';

# Getting started

![dlt pacman](/img/dlt-pacman.gif)

## What is dlt?

dlt is an open-source Python library that loads data from various, often messy data sources into well-structured, live datasets. It offers a lightweight interface for extracting data from [REST APIs](./tutorial/rest-api), [SQL databases](./tutorial/sql-database), [cloud storage](./tutorial/filesystem), [Python data structures](./tutorial/load-data-from-an-api), and [many more](./dlt-ecosystem/verified-sources).

dlt is designed to be easy to use, flexible, and scalable:

- dlt infers [schemas](./general-usage/schema) and [data types](./general-usage/schema/#data-types), [normalizes the data](./general-usage/schema/#data-normalizer), and handles nested data structures.
- dlt supports a variety of [popular destinations](./dlt-ecosystem/destinations/) and has an interface to add [custom destinations](./dlt-ecosystem/destinations/destination) to create reverse ETL pipelines.
- dlt can be deployed anywhere Python runs, be it on [Airflow](./walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer), [serverless functions](./walkthroughs/deploy-a-pipeline/deploy-with-google-cloud-functions), or any other cloud deployment of your choice.
- dlt automates pipeline maintenance with [schema evolution](./general-usage/schema-evolution) and [schema and data contracts](./general-usage/schema-contracts).

To get started with dlt, install the library using pip:

```sh
pip install dlt
```

:::tip
We recommend using a clean virtual environment for your experiments! Read the [detailed instructions](./reference/installation) on how to set one up.
:::
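
As a quick sketch of that setup (the environment name `.venv` below is just an illustration, not a requirement):

```sh
# create and activate a fresh virtual environment
python -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate

# install dlt into the clean environment
pip install dlt
```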

## Load data with dlt from …

<Tabs
  groupId="source-type"
  defaultValue="rest-api"
  values={[
    {"label": "REST APIs", "value": "rest-api"},
    {"label": "SQL databases", "value": "sql-database"},
    {"label": "Cloud storages or files", "value": "filesystem"},
    {"label": "Python data structures", "value": "python-data"},
  ]}>
  <TabItem value="rest-api">

Use dlt's [REST API source](./tutorial/rest-api) to extract data from any REST API. Define the API endpoints you'd like to fetch data from, the pagination method, and the authentication, and dlt will handle the rest:

```py
import dlt
from dlt.sources.rest_api import rest_api_source

source = rest_api_source({
    "client": {
        "base_url": "https://api.example.com/",
        "auth": {
            "token": dlt.secrets["your_api_token"],
        },
        "paginator": {
            "type": "json_response",
            "next_url_path": "paging.next",
        },
    },
    "resources": ["posts", "comments"],
})

pipeline = dlt.pipeline(
    pipeline_name="rest_api_example",
    destination="duckdb",
    dataset_name="rest_api_data",
)

load_info = pipeline.run(source)
```
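
This example loads data into DuckDB. The default `dlt` installation is minimal, so to run it you'll also need the DuckDB extra:

```sh
pip install "dlt[duckdb]"
```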

Follow the [REST API source tutorial](./tutorial/rest-api) to learn more about the source configuration and pagination methods.
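
Once the pipeline has run, one way to peek at the loaded tables is dlt's built-in Streamlit viewer (an optional extra shown here for illustration; it requires the `streamlit` package):

```sh
pip install streamlit
dlt pipeline rest_api_example show
```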

  </TabItem>
  <TabItem value="sql-database">

Use the [SQL source](./tutorial/sql-database) to extract data from databases like PostgreSQL, MySQL, SQLite, Oracle, and more.

```py
import dlt
from dlt.sources.sql_database import sql_database

source = sql_database(
    "mysql+pymysql://[email protected]:4497/Rfam"
)

pipeline = dlt.pipeline(
    pipeline_name="sql_database_example",
    destination="duckdb",
    dataset_name="sql_data",
)

load_info = pipeline.run(source)
```
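
Connecting to a database also requires the matching SQLAlchemy driver; for the MySQL example above, install:

```sh
pip install sqlalchemy pymysql
```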

Follow the [SQL source tutorial](./tutorial/sql-database) to learn more about the source configuration and supported databases.

  </TabItem>
  <TabItem value="filesystem">

The [Filesystem](./tutorial/filesystem) source extracts data from AWS S3, Google Cloud Storage, Google Drive, Azure, or a local file system.

```py
import dlt
from dlt.sources.filesystem import filesystem

source = filesystem(
    bucket_url="s3://example-bucket",
    file_glob="*.csv"
)

pipeline = dlt.pipeline(
    pipeline_name="filesystem_example",
    destination="duckdb",
    dataset_name="filesystem_data",
)

load_info = pipeline.run(source)
print(load_info)
```
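
The `filesystem` resource above lists the matching files; to load the rows inside the CSVs, you can pipe it into a reader transformer. A sketch assuming the same bucket and the `read_csv` helper that ships with the filesystem source:

```py
import dlt
from dlt.sources.filesystem import filesystem, read_csv

# list the CSV files, then parse each one into rows
files = filesystem(bucket_url="s3://example-bucket", file_glob="*.csv")
reader = (files | read_csv()).with_name("csv_rows")

pipeline = dlt.pipeline(
    pipeline_name="filesystem_example",
    destination="duckdb",
    dataset_name="filesystem_data",
)

load_info = pipeline.run(reader)
print(load_info)
```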

Follow the [filesystem source tutorial](./tutorial/filesystem) to learn more about the source configuration and supported storage services.

  </TabItem>
  <TabItem value="python-data">

dlt is able to load data from Python generators or directly from Python data structures:

```py
import dlt

@dlt.resource
def foo():
    for i in range(10):
        yield {"id": i, "name": f"This is item {i}"}

pipeline = dlt.pipeline(
    pipeline_name="python_data_example",
    destination="duckdb",
)

load_info = pipeline.run(foo)
```
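
`pipeline.run()` also accepts options that control where and how the data lands. As an illustrative sketch (the table name and write disposition below are arbitrary choices, not part of the example above):

```py
# load the same resource into an explicitly named table and
# replace its contents on each run instead of appending new rows
load_info = pipeline.run(
    foo,
    table_name="items",
    write_disposition="replace",
)
print(load_info)
```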

Check out the [Python data structures tutorial](./tutorial/load-data-from-an-api) to learn about dlt fundamentals and advanced usage scenarios.

  </TabItem>
</Tabs>

:::tip
If you'd like to try out dlt without installing it on your machine, check out the [Google Colab demo](https://colab.research.google.com/drive/1NfSB1DpwbbHX9_t5vlalBTf13utwpMGx?usp=sharing).
:::

## Join the dlt community

1. Give the library a ⭐ and check out the code on [GitHub](https://github.com/dlt-hub/dlt).
1. Ask questions and share how you use the library on [Slack](https://dlthub.com/community).
1. Report problems and make feature requests [here](https://github.com/dlt-hub/dlt/issues/new/choose).