Docs: update the introduction, add the rest_api tutorial (#1729)
Co-authored-by: Akela Drissner-Schmid <[email protected]>
burnash and akelad authored Sep 14, 2024
1 parent 9580baf commit eb4b1ba
Showing 10 changed files with 1,077 additions and 557 deletions.
177 changes: 96 additions & 81 deletions docs/website/docs/intro.md
@@ -6,138 +6,153 @@ keywords: [introduction, who, what, how]

import snippets from '!!raw-loader!./intro-snippets.py';

# Getting started

![dlt pacman](/img/dlt-pacman.gif)

## What is dlt?

dlt is an open-source Python library that loads data from various, often messy data sources into well-structured, live datasets. It offers a lightweight interface for extracting data from [REST APIs](./tutorial/rest-api), [SQL databases](./tutorial/sql-database), [cloud storage](./tutorial/filesystem), [Python data structures](./tutorial/load-data-from-an-api), and [many more](./dlt-ecosystem/verified-sources).

dlt is designed to be easy to use, flexible, and scalable:

- dlt infers [schemas](./general-usage/schema) and [data types](./general-usage/schema/#data-types), [normalizes the data](./general-usage/schema/#data-normalizer), and handles nested data structures (see the example below).
- dlt supports a variety of [popular destinations](./dlt-ecosystem/destinations/) and has an interface to add [custom destinations](./dlt-ecosystem/destinations/destination) to create reverse ETL pipelines.
- dlt can be deployed anywhere Python runs, be it on [Airflow](./walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer), [serverless functions](./walkthroughs/deploy-a-pipeline/deploy-with-google-cloud-functions) or any other cloud deployment of your choice.
- dlt automates pipeline maintenance with [schema evolution](./general-usage/schema-evolution) and [schema and data contracts](./general-usage/schema-contracts).

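For example, a record that contains a nested list needs no extra handling: dlt infers the schema and unpacks the nested items into a separate child table on load. A minimal sketch (the resource and field names are illustrative; loading to DuckDB requires `pip install "dlt[duckdb]"`):

```py
import dlt

@dlt.resource
def users():
    # a record with a nested list of pets
    yield {"id": 1, "name": "Alice", "pets": [{"kind": "cat"}, {"kind": "dog"}]}

pipeline = dlt.pipeline(pipeline_name="nested_example", destination="duckdb")

# dlt infers column types and typically unpacks the nested list into a
# child table (e.g. users__pets) linked back to the parent row
load_info = pipeline.run(users)
print(load_info)
```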
To get started with dlt, install the library using pip:

```sh
pip install dlt
```
:::tip
We recommend using a clean virtual environment for your experiments! Read the [detailed instructions](./reference/installation) on how to set one up.
:::
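The examples on this page load data into DuckDB. The default `dlt` installation is minimal, so install the DuckDB destination as an extra:

```sh
pip install "dlt[duckdb]"
```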

## Load data with dlt from …

<Tabs
  groupId="source-type"
  defaultValue="rest-api"
  values={[
    {"label": "REST APIs", "value": "rest-api"},
    {"label": "SQL databases", "value": "sql-database"},
    {"label": "Cloud storage or files", "value": "filesystem"},
    {"label": "Python data structures", "value": "python-data"},
  ]}>
<TabItem value="rest-api">

Use dlt's [REST API source](./tutorial/rest-api) to extract data from any REST API. Define the API endpoints you'd like to fetch data from, the pagination method, and the authentication, and dlt will handle the rest:

```py
import dlt
from dlt.sources.rest_api import rest_api_source

source = rest_api_source({
    "client": {
        "base_url": "https://api.example.com/",
        "auth": {
            "token": dlt.secrets["your_api_token"],
        },
        "paginator": {
            "type": "json_response",
            "next_url_path": "paging.next",
        },
    },
    "resources": ["posts", "comments"],
})

pipeline = dlt.pipeline(
    pipeline_name="rest_api_example",
    destination="duckdb",
    dataset_name="rest_api_data",
)

load_info = pipeline.run(source)
```

Follow the [REST API source tutorial](./tutorial/rest-api) to learn more about the source configuration and pagination methods.
</TabItem>
<TabItem value="sql-database">

Use the [SQL source](./tutorial/sql-database) to extract data from databases like PostgreSQL, MySQL, SQLite, Oracle, and more:

```py
import dlt
from dlt.sources.sql_database import sql_database

source = sql_database(
    "mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam"
)

pipeline = dlt.pipeline(
    pipeline_name="sql_database_example",
    destination="duckdb",
    dataset_name="sql_data",
)

load_info = pipeline.run(source)
```

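If you only need part of the database, you can narrow the source down to specific tables before running the pipeline. A small sketch, assuming the `family` and `genome` tables exist in the connected Rfam database:

```py
from dlt.sources.sql_database import sql_database

# pick individual tables instead of loading the whole database;
# "family" and "genome" are assumed to exist in the Rfam database
source = sql_database(
    "mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam"
).with_resources("family", "genome")

# run it with the same pipeline as above: pipeline.run(source)
```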
Follow the [SQL source tutorial](./tutorial/sql-database) to learn more about the source configuration and supported databases.

</TabItem>
<TabItem value="filesystem">

The [filesystem source](./tutorial/filesystem) extracts data from AWS S3, Google Cloud Storage, Google Drive, Azure, or a local file system:

```py
import dlt
from dlt.sources.filesystem import filesystem

source = filesystem(
    bucket_url="s3://example-bucket",
    file_glob="*.csv"
)

pipeline = dlt.pipeline(
    pipeline_name="filesystem_example",
    destination="duckdb",
    dataset_name="filesystem_data",
)

load_info = pipeline.run(source)
```

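The snippet above loads the file listings (names, sizes, timestamps). To load the contents of the matched CSV files instead, pipe the files into a reader transformer. A sketch assuming the `read_csv` transformer that ships with the filesystem source:

```py
import dlt
from dlt.sources.filesystem import filesystem, read_csv

# list matching CSV files, then parse their rows with the read_csv transformer
files = filesystem(bucket_url="s3://example-bucket", file_glob="*.csv")
csv_rows = (files | read_csv()).with_name("csv_rows")

pipeline = dlt.pipeline(
    pipeline_name="filesystem_example",
    destination="duckdb",
    dataset_name="filesystem_data",
)

load_info = pipeline.run(csv_rows)
```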
Follow the [filesystem source tutorial](./tutorial/filesystem) to learn more about the source configuration and supported storage services.

</TabItem>
<TabItem value="python-data">

dlt is able to load data from Python generators or directly from Python data structures:

```py
import dlt

@dlt.resource
def foo():
    for i in range(10):
        yield {"id": i, "name": f"This is item {i}"}

pipeline = dlt.pipeline(
    pipeline_name="python_data_example",
    destination="duckdb",
)

load_info = pipeline.run(foo)
```

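Resources can also control what happens on repeated loads. For example, declaring a primary key and a `merge` write disposition makes re-runs update existing rows instead of appending duplicates (a sketch based on the snippet above):

```py
import dlt

@dlt.resource(primary_key="id", write_disposition="merge")
def items():
    for i in range(10):
        yield {"id": i, "name": f"This is item {i}"}

pipeline = dlt.pipeline(
    pipeline_name="python_data_example",
    destination="duckdb",
)

# running this pipeline twice keeps a single row per id
load_info = pipeline.run(items)
```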
Check out the [Python data structures tutorial](./tutorial/load-data-from-an-api) to learn about dlt fundamentals and advanced usage scenarios.

</TabItem>

</Tabs>

:::tip
If you'd like to try out dlt without installing it on your machine, check out the [Google Colab demo](https://colab.research.google.com/drive/1NfSB1DpwbbHX9_t5vlalBTf13utwpMGx?usp=sharing).
:::
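All of the examples above load into DuckDB, so once a pipeline has run you can inspect the result straight from Python. A minimal sketch using the pipeline's SQL client; the `posts` table is just an example from the REST API snippet, so substitute whatever tables you loaded:

```py
import dlt

# attach to the pipeline created earlier by reusing its name and destination
pipeline = dlt.pipeline(
    pipeline_name="rest_api_example",
    destination="duckdb",
    dataset_name="rest_api_data",
)

# query the loaded tables through the destination's SQL client
with pipeline.sql_client() as client:
    rows = client.execute_sql("SELECT * FROM posts LIMIT 5")
    print(rows)
```

You can also browse the same tables in a local app with `dlt pipeline rest_api_example show`, which requires `streamlit` to be installed.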

## Join the dlt community

1. Give the library a ⭐ and check out the code on [GitHub](https://github.com/dlt-hub/dlt).
1. Ask questions and share how you use the library on [Slack](https://dlthub.com/community).
1. Report problems and make feature requests [here](https://github.com/dlt-hub/dlt/issues/new/choose).
8 changes: 7 additions & 1 deletion docs/website/docs/reference/installation.md
@@ -137,4 +137,10 @@ conda install -c conda-forge dlt

### 4. Done!

You are now ready to build your first pipeline with `dlt`. Check out these tutorials to get started:

- [Load data from a REST API](../tutorial/rest-api)
- [Load data from a SQL database](../tutorial/sql-database)
- [Load data from a cloud storage or a file system](../tutorial/filesystem)

Or read a more detailed tutorial on how to build a [custom data pipeline with dlt](../tutorial/load-data-from-an-api.md).
