
prepare dataset release & docs updates #2126

Merged
merged 36 commits on Dec 15, 2024
Commits
eda6ad3
remove standalone dataset from exports
sh-rp Dec 9, 2024
fae7a2b
make pipeline dataset factory public
sh-rp Dec 9, 2024
86964ea
rework transformation section
sh-rp Dec 9, 2024
581de8e
fix some linting errors
sh-rp Dec 9, 2024
c4f19a4
add row counts feature for readabledataset
sh-rp Nov 19, 2024
1687a40
Merge branch 'devel' into feat/prepare-dataset-release
sh-rp Dec 11, 2024
c8fc3c0
add dataset access example to getting started scripts
sh-rp Dec 11, 2024
9962669
add notes about row_counts special query to datasets docs
sh-rp Dec 11, 2024
1fc9891
fix internal docusaurus links
sh-rp Dec 11, 2024
d6ceab0
Update docs/website/docs/intro.md
AstrakhantsevaAA Dec 13, 2024
6a13391
Update docs/website/docs/tutorial/load-data-from-an-api.md
AstrakhantsevaAA Dec 13, 2024
8068b7b
Update docs/website/docs/tutorial/load-data-from-an-api.md
AstrakhantsevaAA Dec 13, 2024
d8f5cf7
Update docs/website/docs/tutorial/load-data-from-an-api.md
AstrakhantsevaAA Dec 13, 2024
c2647a5
Update docs/website/docs/general-usage/dataset-access/dataset.md
AstrakhantsevaAA Dec 13, 2024
afa1633
Update docs/website/docs/general-usage/dataset-access/dataset.md
AstrakhantsevaAA Dec 13, 2024
65d3c74
Update docs/website/docs/dlt-ecosystem/transformations/index.md
AstrakhantsevaAA Dec 13, 2024
949fdca
Update docs/website/docs/dlt-ecosystem/transformations/index.md
AstrakhantsevaAA Dec 13, 2024
5918440
Update docs/website/docs/dlt-ecosystem/transformations/index.md
AstrakhantsevaAA Dec 13, 2024
ee36d4d
Update docs/website/docs/dlt-ecosystem/transformations/index.md
AstrakhantsevaAA Dec 13, 2024
b475cca
Update docs/website/docs/dlt-ecosystem/destinations/duckdb.md
AstrakhantsevaAA Dec 13, 2024
68c1db5
Update docs/website/docs/dlt-ecosystem/transformations/index.md
AstrakhantsevaAA Dec 13, 2024
986785f
Update docs/website/docs/dlt-ecosystem/transformations/index.md
AstrakhantsevaAA Dec 13, 2024
0c21712
Update docs/website/docs/dlt-ecosystem/transformations/python.md
AstrakhantsevaAA Dec 13, 2024
ce4c2ff
Update docs/website/docs/dlt-ecosystem/transformations/python.md
AstrakhantsevaAA Dec 13, 2024
3306e9a
Update docs/website/docs/dlt-ecosystem/transformations/python.md
AstrakhantsevaAA Dec 13, 2024
39f68d8
Update docs/website/docs/dlt-ecosystem/transformations/python.md
AstrakhantsevaAA Dec 13, 2024
4914b1c
Update docs/website/docs/dlt-ecosystem/transformations/python.md
AstrakhantsevaAA Dec 13, 2024
1464341
Update docs/website/docs/dlt-ecosystem/transformations/python.md
AstrakhantsevaAA Dec 13, 2024
316c549
Update docs/website/docs/dlt-ecosystem/transformations/python.md
AstrakhantsevaAA Dec 13, 2024
9471cdb
Update docs/website/docs/dlt-ecosystem/transformations/python.md
AstrakhantsevaAA Dec 13, 2024
615fdf4
Update docs/website/docs/dlt-ecosystem/transformations/sql.md
AstrakhantsevaAA Dec 13, 2024
4690bcf
Update docs/website/docs/dlt-ecosystem/transformations/sql.md
AstrakhantsevaAA Dec 13, 2024
e0f65cd
Update docs/website/docs/dlt-ecosystem/transformations/sql.md
AstrakhantsevaAA Dec 13, 2024
d537b1f
Update docs/website/docs/dlt-ecosystem/transformations/sql.md
AstrakhantsevaAA Dec 13, 2024
3d3b638
Update docs/website/docs/dlt-ecosystem/transformations/sql.md
AstrakhantsevaAA Dec 13, 2024
16954fb
Update docs/website/docs/general-usage/dataset-access/dataset.md
AstrakhantsevaAA Dec 13, 2024
2 changes: 0 additions & 2 deletions dlt/__init__.py
@@ -42,7 +42,6 @@
)
from dlt.pipeline import progress
from dlt import destinations
from dlt.destinations.dataset import dataset as _dataset

pipeline = _pipeline
current = _current
@@ -80,7 +79,6 @@
"TCredentials",
"sources",
"destinations",
"_dataset",
]

# verify that no injection context was created
2 changes: 1 addition & 1 deletion dlt/pipeline/pipeline.py
@@ -1750,7 +1750,7 @@ def __getstate__(self) -> Any:
# pickle only the SupportsPipeline protocol fields
return {"pipeline_name": self.pipeline_name}

def _dataset(
def dataset(
self, schema: Union[Schema, str, None] = None, dataset_type: TDatasetType = "dbapi"
) -> SupportsReadableDataset:
"""Access helper to dataset"""
27 changes: 17 additions & 10 deletions docs/website/docs/build-a-pipeline-tutorial.md
@@ -262,20 +262,30 @@ In this example, the first pipeline loads the data using `pipedrive_source()`. T

#### [Using the `dlt` SQL client](dlt-ecosystem/transformations/sql.md)

Another option is to leverage the `dlt` SQL client to query the loaded data and perform transformations using SQL statements. You can execute SQL statements that change the database schema or manipulate data within tables. Here's an example of inserting a row into the `customers` table using the `dlt` SQL client:
Another option is to leverage the `dlt` SQL client to query the loaded data and perform transformations using SQL statements. You can execute SQL statements that change the database schema or manipulate data within tables. Here's an example of creating a new table with aggregated sales data in DuckDB:

```py
pipeline = dlt.pipeline(destination="bigquery", dataset_name="crm")
pipeline = dlt.pipeline(destination="duckdb", dataset_name="crm")

with pipeline.sql_client() as client:
client.execute_sql(
"INSERT INTO customers VALUES (%s, %s, %s)", 10, "Fred", "[email protected]"
)
        """CREATE TABLE aggregated_sales AS
        SELECT
            category,
            region,
            SUM(amount) AS total_sales,
            AVG(amount) AS average_sales
        FROM
            sales
        GROUP BY
            category,
            region;
        """)
```

In this example, the `execute_sql` method of the SQL client allows you to execute SQL statements. The statement creates a new `aggregated_sales` table with the total and average sales per category and region.

#### [Using Pandas](dlt-ecosystem/transformations/pandas.md)
#### [Using Pandas](dlt-ecosystem/transformations/python.md)

You can fetch query results as Pandas data frames and perform transformations using Pandas functionalities. Here's an example of reading data from the `issues` table in DuckDB and counting reaction types using Pandas:

@@ -287,11 +297,8 @@ pipeline = dlt.pipeline(
dev_mode=True
)

with pipeline.sql_client() as client:
with client.execute_query(
'SELECT "reactions__+1", "reactions__-1", reactions__laugh, reactions__hooray, reactions__rocket FROM issues'
) as cursor:
reactions = cursor.df()
# get a dataframe of all reactions from the dataset
reactions = pipeline.dataset().issues.select("reactions__+1", "reactions__-1", "reactions__laugh", "reactions__hooray", "reactions__rocket").df()

counts = reactions.sum(0).sort_values(0, ascending=False)
```
2 changes: 1 addition & 1 deletion docs/website/docs/dlt-ecosystem/destinations/duckdb.md
Original file line number Diff line number Diff line change
Expand Up @@ -118,7 +118,7 @@ to disable tz adjustments.

## Destination configuration

By default, a DuckDB database will be created in the current working directory with a name `<pipeline_name>.duckdb` (`chess.duckdb` in the example above). After loading, it is available in `read/write` mode via `with pipeline.sql_client() as con:`, which is a wrapper over `DuckDBPyConnection`. See [duckdb docs](https://duckdb.org/docs/api/python/overview#persistent-storage) for details.
By default, a DuckDB database will be created in the current working directory with a name `<pipeline_name>.duckdb` (`chess.duckdb` in the example above). After loading, it is available in `read/write` mode via `with pipeline.sql_client() as con:`, which is a wrapper over `DuckDBPyConnection`. See [duckdb docs](https://duckdb.org/docs/api/python/overview#persistent-storage) for details. If you want to read data, use [datasets](../general-usage/dataset-access/dataset) instead of the sql client.

The `duckdb` credentials do not require any secret values. [You are free to pass the credentials and configuration explicitly](../../general-usage/destination.md#pass-explicit-credentials). For example:
```py
8 changes: 4 additions & 4 deletions docs/website/docs/dlt-ecosystem/transformations/dbt/dbt.md
@@ -1,10 +1,10 @@
---
title: Transform the data with dbt
title: Transforming data with dbt
description: Transforming the data loaded by a dlt pipeline with dbt
keywords: [transform, dbt, runner]
---

# Transform the data with dbt
# Transforming data with dbt

[dbt](https://github.com/dbt-labs/dbt-core) is a framework that allows for the simple structuring of your transformations into DAGs. The benefits of using dbt include:

@@ -105,8 +105,8 @@ You can run the example with dbt debug log: `RUNTIME__LOG_LEVEL=DEBUG python dbt

## Other transforming tools

If you want to transform the data before loading, you can use Python. If you want to transform the data after loading, you can use dbt or one of the following:
If you want to transform your data before loading, you can use Python. If you want to transform your data after loading, you can use dbt or one of the following:

1. [`dlt` SQL client.](../sql.md)
2. [Pandas.](../pandas.md)
2. [Python with dataframes or arrow tables.](../python.md)

27 changes: 27 additions & 0 deletions docs/website/docs/dlt-ecosystem/transformations/index.md
@@ -0,0 +1,27 @@
---
title: Transforming your data
description: How to transform your data
keywords: [datasets, data, access, transformations]
---
import DocCardList from '@theme/DocCardList';

# Transforming data

If you'd like to transform your data after a pipeline load, you have three options:

* [Using dbt](./dbt/dbt.md) - dlt provides a convenient dbt wrapper to make integration easier
* [Using the `dlt` SQL client](./sql.md) - dlt exposes a SQL client to transform data on your destination directly using SQL
* [Using Python with dataframes or arrow tables](./python.md) - you can also transform your data using arrow tables and dataframes in Python

If you need to preprocess some of your data before it is loaded, you can learn about strategies to:

* [Rename columns](../general-usage/customising-pipelines/renaming_columns)
* [Pseudonymize columns](../general-usage/customising-pipelines/pseudonymizing_columns)
* [Remove columns](../general-usage/customising-pipelines/removing_columns)

This is particularly useful if you need to remove PII or other sensitive data, drop columns that are not needed for your use case, or work around a destination that does not support certain data types in your source data.
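As a sketch of what such a preprocessing step can look like: a pseudonymization is just a map function applied to each row before loading. The helper and salt below are illustrative, not part of dlt; the example assumes a resource whose rows have an `email` field.

```python
import hashlib

def pseudonymize_email(row):
    # replace the raw email with a salted SHA-256 hash: the value can still
    # be grouped and joined on, but the original address is not recoverable
    # (the salt value here is purely illustrative)
    salt = "dlt_salt"
    row["email"] = hashlib.sha256((salt + row["email"]).encode("utf-8")).hexdigest()
    return row

# attached to a resource, the function runs on every row before it is loaded:
# dlt.resource(users, name="users").add_map(pseudonymize_email)
```

See the pseudonymizing-columns guide linked above for the full `add_map` workflow.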


## Learn more
<DocCardList />

42 changes: 0 additions & 42 deletions docs/website/docs/dlt-ecosystem/transformations/pandas.md

This file was deleted.

109 changes: 109 additions & 0 deletions docs/website/docs/dlt-ecosystem/transformations/python.md
@@ -0,0 +1,109 @@
---
title: Transforming data in Python with arrow tables or dataframes
description: Transforming data loaded by a dlt pipeline with pandas dataframes or arrow tables
keywords: [transform, pandas]
---

# Transforming data in Python with dataframes or arrow tables

You can transform your data in Python using pandas dataframes or arrow tables. To get started, please read the [dataset docs](../general-usage/dataset-access/dataset).


## Interactively transforming your data in Python

Using the methods explained in the [dataset docs](../general-usage/dataset-access/dataset), you can fetch data from your destination into a dataframe or arrow table in your local Python process and work with it interactively. This even works for filesystem destinations:


The example below reads GitHub reactions data from the `issues` table and
counts the reaction types.

```py
pipeline = dlt.pipeline(
    pipeline_name="github_pipeline",
    destination="duckdb",
    dataset_name="github_reactions",
    dev_mode=True
)

# get a dataframe of all reactions from the dataset
reactions = pipeline.dataset().issues.select("reactions__+1", "reactions__-1", "reactions__laugh", "reactions__hooray", "reactions__rocket").df()

# calculate and print out the sum of all reactions
counts = reactions.sum(0).sort_values(0, ascending=False)
print(counts)

# alternatively, you can fetch the data as an arrow table
reactions = pipeline.dataset().issues.select("reactions__+1", "reactions__-1", "reactions__laugh", "reactions__hooray", "reactions__rocket").arrow()
# ... do transformations on the arrow table
```

## Persisting your transformed data

Since dlt supports dataframes and arrow tables from resources directly, you can use the same pipeline to load the transformed data back into the destination.


### A simple example

The following simple example creates a new table from an existing `users` table, keeping only columns that do not contain private information. Note that we use the `iter_arrow()` method on the relation to iterate over the arrow table in chunks instead of fetching it all at once.

```py
pipeline = dlt.pipeline(
    pipeline_name="users_pipeline",
    destination="duckdb",
    dataset_name="users_raw",
    dev_mode=True
)

# get user relation with only a few columns selected, but omitting email and name
users = pipeline.dataset().users.select("age", "amount_spent", "country")

# load the data into a new table called users_clean in the same dataset
pipeline.run(users.iter_arrow(chunk_size=1000), table_name="users_clean")
```

### A more complex example

The example above could easily be done in SQL. Let's assume you'd instead like to do some arrow transformations in Python. For this, we create a resource from which we can yield the modified arrow tables. The same is possible with dataframes.

```py
import pyarrow.compute as pc

pipeline = dlt.pipeline(
    pipeline_name="users_pipeline",
    destination="duckdb",
    dataset_name="users_raw",
    dev_mode=True
)

# NOTE: this resource will work like a regular resource and supports write_disposition, primary_key, etc.
# NOTE: for selecting only users above 18, we could also use the filter method on the relation with ibis expressions
@dlt.resource(table_name="users_clean")
def users_clean():
    users = pipeline.dataset().users
    for arrow_table in users.iter_arrow(chunk_size=1000):
        # we want to filter out users under 18
        age_filter = pc.greater_equal(arrow_table["age"], 18)
        arrow_table = arrow_table.filter(age_filter)

        # we want to hash the email column
        arrow_table = arrow_table.append_column("email_hash", pc.sha256(arrow_table["email"]))

        # we want to remove the email column and name column
        arrow_table = arrow_table.drop(["email", "name"])

        # yield the transformed arrow table
        yield arrow_table


pipeline.run(users_clean())
```

## Other transforming tools

If you want to transform your data before loading, you can use Python. If you want to transform your data after loading, you can use the dataframe and arrow table approaches described above or one of the following:

1. [dbt.](dbt/dbt.md) (recommended)
2. [`dlt` SQL client.](sql.md)

55 changes: 37 additions & 18 deletions docs/website/docs/dlt-ecosystem/transformations/sql.md
@@ -1,33 +1,52 @@
---
title: Transform the data with SQL
title: Transforming data with SQL
description: Transforming the data loaded by a dlt pipeline with the dlt SQL client
keywords: [transform, sql]
---

# Transform the data using the `dlt` SQL client
# Transforming data using the `dlt` SQL client

A simple alternative to dbt is to query the data using the `dlt` SQL client and then perform the
transformations using Python. The `execute_sql` method allows you to execute any SQL statement,
transformations using SQL statements in Python. The `execute_sql` method allows you to execute any SQL statement,
including statements that change the database schema or data in the tables. Note that the syntax is the same as for any standard `dbapi` connection.

:::info
* This method will work for all SQL destinations supported by `dlt`, but not for the filesystem destination.
* Read the [sql client docs](../general-usage/dataset-access/dataset) for more information on how to access data with the SQL client.
* If you are simply trying to read data, use the powerful [dataset interface](../general-usage/dataset-access/dataset) instead.
:::


Typically, you will use this type of transformation if you can create or update tables directly from existing tables, without needing to insert data from your Python environment.

The example below creates a new table `aggregated_sales` that contains the total and average sales for each category and region.


```py
pipeline = dlt.pipeline(destination="bigquery", dataset_name="crm")
try:
with pipeline.sql_client() as client:
client.execute_sql(
"INSERT INTO customers VALUES (%s, %s, %s)",
10,
"Fred",
"[email protected]"
)
except Exception:
...
pipeline = dlt.pipeline(destination="duckdb", dataset_name="crm")

# NOTE: this is the duckdb sql dialect, other destinations may use different expressions
with pipeline.sql_client() as client:
    client.execute_sql(
        """CREATE OR REPLACE TABLE aggregated_sales AS
        SELECT
            category,
            region,
            SUM(amount) AS total_sales,
            AVG(amount) AS average_sales
        FROM
            sales
        GROUP BY
            category,
            region;
        """)
```
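The SQL pattern above is destination-agnostic apart from dialect details. As a quick, self-contained illustration outside dlt, the same `CREATE TABLE … AS SELECT` aggregation can be run against an in-memory SQLite database (stdlib only; the table and data are toy stand-ins):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (category TEXT, region TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("toys", "eu", 10.0), ("toys", "eu", 30.0), ("books", "us", 5.0)],
)

# same shape as the aggregated_sales statement above (SQLite dialect)
con.execute("""
    CREATE TABLE aggregated_sales AS
    SELECT category, region, SUM(amount) AS total_sales, AVG(amount) AS average_sales
    FROM sales
    GROUP BY category, region
""")

rows = con.execute(
    "SELECT category, region, total_sales, average_sales FROM aggregated_sales ORDER BY category"
).fetchall()
print(rows)
```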

In the case of SELECT queries, the data is returned as a list of rows, with the elements of a row
corresponding to selected columns.
You can also use the `execute_sql` method to run SELECT queries. The data is returned as a list of rows, with the elements of a row corresponding to the selected columns. A more convenient way to extract data is to use dlt datasets.

```py
try:
@@ -44,9 +63,9 @@ except Exception:

## Other transforming tools

If you want to transform the data before loading, you can use Python. If you want to transform the
If you want to transform your data before loading, you can use Python. If you want to transform your data after loading, you can use SQL or one of the following:

1. [dbt](dbt/dbt.md) (recommended).
2. [Pandas](pandas.md).
2. [Python with dataframes or arrow tables](python.md).

@@ -306,7 +306,7 @@ A resource configuration is used to define a [dlt resource](../../../general-usa
- `write_disposition`: The write disposition for the resource.
- `primary_key`: The primary key for the resource.
- `include_from_parent`: A list of fields from the parent resource to be included in the resource output. See the [resource relationships](#include-fields-from-the-parent-resource) section for more details.
- `processing_steps`: A list of [processing steps](#processing-steps-filter-and-transform-data) to filter and transform the data.
- `processing_steps`: A list of [processing steps](#processing-steps-filter-and-transform-data) to filter and transform your data.
- `selected`: A flag to indicate if the resource is selected for loading. This could be useful when you want to load data only from child resources and not from the parent resource.
- `auth`: An optional `AuthConfig` instance. If passed, is used over the one defined in the [client](#client) definition. Example:
```py