Docs: update the introduction, add the rest_api tutorial (#1729)
Co-authored-by: Akela Drissner-Schmid <[email protected]>
Showing 10 changed files with 1,077 additions and 557 deletions.

The updated introduction page reads as follows:

import snippets from '!!raw-loader!./intro-snippets.py';

# Getting started

![dlt pacman](/img/dlt-pacman.gif)

## What is dlt?

dlt is an open-source Python library that loads data from various, often messy data sources into well-structured, live datasets. It offers a lightweight interface for extracting data from [REST APIs](./tutorial/rest-api), [SQL databases](./tutorial/sql-database), [cloud storage](./tutorial/filesystem), [Python data structures](./tutorial/load-data-from-an-api), and [many more](./dlt-ecosystem/verified-sources).

dlt is designed to be easy to use, flexible, and scalable:

- dlt infers [schemas](./general-usage/schema) and [data types](./general-usage/schema/#data-types), [normalizes the data](./general-usage/schema/#data-normalizer), and handles nested data structures.
- dlt supports a variety of [popular destinations](./dlt-ecosystem/destinations/) and has an interface to add [custom destinations](./dlt-ecosystem/destinations/destination) to create reverse ETL pipelines.
- dlt can be deployed anywhere Python runs, be it on [Airflow](./walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer), [serverless functions](./walkthroughs/deploy-a-pipeline/deploy-with-google-cloud-functions), or any other cloud deployment of your choice.
- dlt automates pipeline maintenance with [schema evolution](./general-usage/schema-evolution) and [schema and data contracts](./general-usage/schema-contracts).

To get started with dlt, install the library using pip:

```sh
pip install dlt
```

:::tip
We recommend using a clean virtual environment for your experiments! Read the [detailed instructions](./reference/installation) on how to set one up.
:::
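
As a quick sketch of that setup (the environment name `.venv` below is just an illustration, not a requirement):

```sh
# create and activate a fresh virtual environment
python -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate

# install dlt into the clean environment
pip install dlt
```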

## Load data with dlt from …

<Tabs
  groupId="source-type"
  defaultValue="rest-api"
  values={[
    {"label": "REST APIs", "value": "rest-api"},
    {"label": "SQL databases", "value": "sql-database"},
    {"label": "Cloud storages or files", "value": "filesystem"},
    {"label": "Python data structures", "value": "python-data"},
  ]}>
  <TabItem value="rest-api">

Use dlt's [REST API source](./tutorial/rest-api) to extract data from any REST API. Define the API endpoints you'd like to fetch data from, the pagination method, and the authentication, and dlt will handle the rest:

```py
import dlt
from dlt.sources.rest_api import rest_api_source

source = rest_api_source({
    "client": {
        "base_url": "https://api.example.com/",
        "auth": {
            "token": dlt.secrets["your_api_token"],
        },
        "paginator": {
            "type": "json_response",
            "next_url_path": "paging.next",
        },
    },
    "resources": ["posts", "comments"],
})

pipeline = dlt.pipeline(
    pipeline_name="rest_api_example",
    destination="duckdb",
    dataset_name="rest_api_data",
)

load_info = pipeline.run(source)
```
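
This example loads data into DuckDB. The default `dlt` installation is minimal, so to run it you'll also need the DuckDB extra:

```sh
pip install "dlt[duckdb]"
```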

Follow the [REST API source tutorial](./tutorial/rest-api) to learn more about the source configuration and pagination methods.
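
Once the pipeline has run, one way to peek at the loaded tables is dlt's built-in Streamlit viewer (an optional extra shown here for illustration; it requires the `streamlit` package):

```sh
pip install streamlit
dlt pipeline rest_api_example show
```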

  </TabItem>
  <TabItem value="sql-database">

Use the [SQL source](./tutorial/sql-database) to extract data from databases like PostgreSQL, MySQL, SQLite, Oracle, and more.

```py
import dlt
from dlt.sources.sql_database import sql_database

source = sql_database(
    "mysql+pymysql://[email protected]:4497/Rfam"
)

pipeline = dlt.pipeline(
    pipeline_name="sql_database_example",
    destination="duckdb",
    dataset_name="sql_data",
)

load_info = pipeline.run(source)
```
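
Connecting to a database also requires the matching SQLAlchemy driver; for the MySQL example above, install:

```sh
pip install sqlalchemy pymysql
```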

Follow the [SQL source tutorial](./tutorial/sql-database) to learn more about the source configuration and supported databases.

  </TabItem>
  <TabItem value="filesystem">

The [Filesystem](./tutorial/filesystem) source extracts data from AWS S3, Google Cloud Storage, Google Drive, Azure, or a local file system.

```py
import dlt
from dlt.sources.filesystem import filesystem

source = filesystem(
    bucket_url="s3://example-bucket",
    file_glob="*.csv"
)

pipeline = dlt.pipeline(
    pipeline_name="filesystem_example",
    destination="duckdb",
    dataset_name="filesystem_data",
)

load_info = pipeline.run(source)
print(load_info)
```
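
The `filesystem` resource above lists the matching files; to load the rows inside the CSVs, you can pipe it into a reader transformer. A sketch assuming the same bucket and the `read_csv` helper that ships with the filesystem source:

```py
import dlt
from dlt.sources.filesystem import filesystem, read_csv

# list the CSV files, then parse each one into rows
files = filesystem(bucket_url="s3://example-bucket", file_glob="*.csv")
reader = (files | read_csv()).with_name("csv_rows")

pipeline = dlt.pipeline(
    pipeline_name="filesystem_example",
    destination="duckdb",
    dataset_name="filesystem_data",
)

load_info = pipeline.run(reader)
print(load_info)
```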

Follow the [filesystem source tutorial](./tutorial/filesystem) to learn more about the source configuration and supported storage services.

  </TabItem>
  <TabItem value="python-data">

dlt is able to load data from Python generators or directly from Python data structures:

```py
import dlt

@dlt.resource
def foo():
    for i in range(10):
        yield {"id": i, "name": f"This is item {i}"}

pipeline = dlt.pipeline(
    pipeline_name="python_data_example",
    destination="duckdb",
)

load_info = pipeline.run(foo)
```
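
`pipeline.run()` also accepts options that control where and how the data lands. As an illustrative sketch (the table name and write disposition below are arbitrary choices, not part of the example above):

```py
# load the same resource into an explicitly named table and
# replace its contents on each run instead of appending new rows
load_info = pipeline.run(
    foo,
    table_name="items",
    write_disposition="replace",
)
print(load_info)
```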

Check out the [Python data structures tutorial](./tutorial/load-data-from-an-api) to learn about dlt fundamentals and advanced usage scenarios.

  </TabItem>
</Tabs>

:::tip
If you'd like to try out dlt without installing it on your machine, check out the [Google Colab demo](https://colab.research.google.com/drive/1NfSB1DpwbbHX9_t5vlalBTf13utwpMGx?usp=sharing).
:::

## Join the dlt community

1. Give the library a ⭐ and check out the code on [GitHub](https://github.com/dlt-hub/dlt).
1. Ask questions and share how you use the library on [Slack](https://dlthub.com/community).
1. Report problems and make feature requests [here](https://github.com/dlt-hub/dlt/issues/new/choose).