Fix grammar
VioletM committed Mar 20, 2024
1 parent be22dd5 commit 65af4f1
Showing 2 changed files with 24 additions and 38 deletions.
52 changes: 19 additions & 33 deletions docs/website/docs/dlt-ecosystem/destinations/destination.md

# Custom destination: Reverse ETL

The `dlt` destination decorator allows you to receive all data passing through your pipeline in a simple function. This can be extremely useful for reverse ETL, where you are pushing data back to an API.

You can also use this for sending data to a queue or a simple database destination that is not yet supported by `dlt`, although be aware that you will have to manually handle your own migrations in this case.

It will also allow you to simply get a path to the files of your normalized data. So, if you need direct access to parquet or jsonl files to copy them somewhere or push them to a database, you can do this here too.

## Install `dlt` for reverse ETL

```sh
pip install dlt
```

## Set up a destination function for your pipeline

The custom destination decorator differs from other destinations in that you do not need to provide connection credentials, but rather you provide a function which gets called for all items loaded during a pipeline run or load operation. With the `@dlt.destination`, you can convert any function that takes two arguments into a `dlt` destination.

A very simple dlt pipeline that pushes a list of items into a destination function might look like this:
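Here is a minimal sketch (the function name and what it does with the incoming items are illustrative):

```py
import dlt
from dlt.common.typing import TDataItems
from dlt.common.schema import TTableSchema

# the function receives batches of normalized rows plus the schema
# of the table they belong to
@dlt.destination(batch_size=10)
def print_sink(items: TDataItems, table: TTableSchema) -> None:
    print(f"received {len(items)} items for table {table['name']}")

pipeline = dlt.pipeline("custom_destination_pipeline", destination=print_sink)
pipeline.run([1, 2, 3], table_name="items")
```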

Expand All @@ -45,11 +40,11 @@ pipeline.run([1, 2, 3], table_name="items")
```

:::tip
1. You can also remove the typing information (`TDataItems` and `TTableSchema`) from this example. Typing is generally useful to know the shape of the incoming objects, though.
2. There are a few other ways to declare custom destination functions for your pipeline; these are described below.
:::

## `@dlt.destination`, custom destination function, and signature

The full signature of the destination decorator plus its function is the following:
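A sketch with the parameters described in the next section spelled out (the values shown are illustrative; the defaults are explained below):

```py
@dlt.destination(
    batch_size=10,
    loader_file_format="jsonl",
    name="my_destination",
    naming_convention="direct",
    max_nesting_level=0,
    skip_dlt_columns_and_tables=True
)
def my_destination(items: TDataItems, table: TTableSchema) -> None:
    ...
```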

Expand All @@ -67,27 +62,23 @@ def my_destination(items: TDataItems, table: TTableSchema) -> None:
```

### Decorator
* The `batch_size` parameter on the destination decorator defines how many items per function call are batched together and sent as an array. If you set a `batch_size` of `0`, instead of passing in actual data items, you will receive one call per load job with the path of the file as the items argument. You can then open and process that file in any way you like (see the sketch after the note below).
* The `loader_file_format` parameter on the destination decorator defines in which format files are stored in the load package before being sent to the destination function. This can be `jsonl` or `parquet`.
* The `name` parameter on the destination decorator defines the name of the destination that gets created by the destination decorator.
* The `naming_convention` parameter on the destination decorator defines the naming convention used by the destination. This controls how table and column names are normalized. The default is `direct`, which will keep all names the same.
* The `max_nesting_level` parameter on the destination decorator defines how deep the normalizer will go to normalize complex fields on your data to create subtables. This overwrites any settings on your `source` and is set to zero to not create any nested tables by default.
* The `skip_dlt_columns_and_tables` parameter on the destination decorator defines whether `dlt`'s internal tables and columns are filtered out before data is sent to the custom destination function. This is set to `True` by default, which means internal tables and columns are skipped.

:::note
* The custom destination sets the `max_nesting_level` to 0 by default, which means no subtables will be generated during the normalization phase.
* The custom destination also skips all internal tables and columns by default. If you need these, set `skip_dlt_columns_and_tables` to False.
:::
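As referenced in the `batch_size` bullet above, with `batch_size=0` the function receives file paths instead of data items. A minimal sketch that copies the load package files to a local folder (the folder and function name are assumptions):

```py
import os
import shutil

@dlt.destination(batch_size=0, loader_file_format="parquet")
def copy_parquet_files(items: TDataItems, table: TTableSchema) -> None:
    # with batch_size=0, items is the local path of a load package file
    target_dir = os.path.join("/tmp/my_files", table["name"])
    os.makedirs(target_dir, exist_ok=True)
    shutil.copy(items, target_dir)
```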

### Custom destination function
* The `items` parameter on the custom destination function contains the items being sent into the destination function.
* The `table` parameter contains the schema table the current call belongs to, including all table hints and columns. For example, the table name can be accessed with `table["name"]`.
* You can also add config values and secrets to the function arguments; see below!


## Adding config variables and secrets
The destination decorator supports settings and secrets variables. If you, for example, plan to connect to a service that requires an API secret or a login, you can do the following:
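A sketch of a destination function that receives an API key from `dlt` secrets (the destination name and the matching secrets section are assumptions):

```py
@dlt.destination(batch_size=10, name="my_destination")
def my_destination(items: TDataItems, table: TTableSchema, api_key: str = dlt.secrets.value) -> None:
    # api_key is injected from secrets.toml or from environment variables
    ...
```

with a corresponding entry in `.dlt/secrets.toml`:

```toml
[destination.my_destination]
api_key="<my-api-key>"
```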


## Destination state

The destination keeps a local record of how many `DataItems` were processed. So, if you, for example, use the custom destination to push `DataItems` to a remote API, and this API becomes unavailable during the load resulting in a failed `dlt` pipeline run, you can repeat the run of your pipeline at a later stage, and the destination will continue where it left off. For this reason, it makes sense to choose a batch size that you can process in one transaction (say one API request or one database transaction) so that if this request or transaction fails repeatedly, you can repeat it at the next run without pushing duplicate data to your remote location.

## Concurrency

Calls to the destination function by default will be executed on multiple threads, so you need to make sure you are not using any non-thread-safe nonlocal or global variables from outside your destination function. If you need to have all calls be executed from the same thread, you can set the `workers` config variable of the load step to 1.
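For example, in `config.toml` (assuming the standard `[load]` section that configures the load stage):

```toml
[load]
workers=1
```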

:::tip
For performance reasons, we recommend keeping the multithreaded approach and making sure that you, for example, are using threadsafe connection pools to a remote database or queue.
:::

There are multiple ways to reference the custom destination function you want to use:
- Via the decorated destination function directly:
```py
# reference function directly
p = dlt.pipeline("my_pipe", destination=local_destination_func)
```
- Directly via destination reference. In this case, don't use the decorator for the destination function.
```py
# file my_destination.py

from dlt.common.destination import Destination

# don't use the decorator
def local_destination_func(items: TDataItems, table: TTableSchema) -> None:
...

# build the destination from a reference and pass the callable explicitly
p = dlt.pipeline(
    "my_pipe",
    destination=Destination.from_reference(
        "destination",
        destination_callable=local_destination_func
)
)
```
- Via a fully qualified string to function location (can be used from `config.toml` or ENV vars). The destination function should be located in another file.
```py
# file my_pipeline.py

from dlt.common.destination import Destination

# reference the function via its fully qualified import path
p = dlt.pipeline(
    "my_pipe",
    destination=Destination.from_reference(
        "destination",
        destination_callable="my_destination.local_destination_func"
    )
)
```


## Write disposition

`@dlt.destination` will forward all normalized `DataItems` encountered during a pipeline run to the custom destination function, so there is no notion of "write dispositions".
## What's next

* Check out our [Custom BigQuery Destination](../../examples/custom_destination_bigquery/) example.
* Need help with building a custom destination? Ask your questions in our [Slack Community](https://dlthub.com/community) technical help channel.
10 changes: 5 additions & 5 deletions docs/website/docs/intro.md
Install `dlt` with:
```sh
pip install dlt
```
Unlike other solutions, with dlt, there's no need to use any backends or containers. Simply import `dlt` in a Python file or a Jupyter Notebook cell, and create a pipeline to load data into any of the [supported destinations](dlt-ecosystem/destinations/). You can load data from any source that produces Python data structures, including APIs, files, databases, and more. `dlt` also supports building a [custom destination](dlt-ecosystem/destinations/destination.md), which you can use for reverse ETL.

The library will create or update tables, infer data types, and handle nested data automatically. Here are a few example pipelines:
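For instance, a sketch of a pipeline that loads one player profile from the chess.com REST API into DuckDB (the endpoint and names are illustrative):

```py
import dlt
from dlt.sources.helpers import requests

# fetch a single player profile from the public chess.com API
data = requests.get("https://api.chess.com/pub/player/magnuscarlsen").json()

pipeline = dlt.pipeline(
    pipeline_name="chess_pipeline",
    destination="duckdb",
    dataset_name="player_data",
)
pipeline.run([data], table_name="player")
```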

To run the example locally with the DuckDB destination, install the `duckdb` extra:
```sh
pip install "dlt[duckdb]"
```
Now **run** your Python file or Notebook cell.

How does it work? The library extracts data from a [source](general-usage/glossary.md#source) (here: **chess.com REST API**), inspects its structure to create a
[schema](general-usage/glossary.md#schema), structures, normalizes, and verifies the data, and then
loads it into a [destination](general-usage/glossary.md#destination) (here: **duckdb**, into a database schema **player_data** and table name **player**).


- Automated maintenance - with schema inference and evolution and alerts, and with short declarative
code, maintenance becomes simple.
- Run it where Python runs - on Airflow, serverless functions, notebooks. No
external APIs, backends, or containers, scales on micro and large infra alike.
- User-friendly, declarative interface that removes knowledge obstacles for beginners
while empowering senior professionals.

2. Play with the
[Google Colab demo](https://colab.research.google.com/drive/1NfSB1DpwbbHX9_t5vlalBTf13utwpMGx?usp=sharing).
This is the simplest way to see `dlt` in action.
3. Read the [Tutorial](tutorial/intro) to learn how to build a pipeline that loads data from an API.
4. Check out the [How-to guides](walkthroughs/) for recipes on common use cases for creating, running, and deploying pipelines.
5. Ask us on
[Slack](https://dlthub.com/community)
if you have any questions about use cases or the library.
1. Give the library a ⭐ and check out the code on [GitHub](https://github.com/dlt-hub/dlt).
1. Ask questions and share how you use the library on
[Slack](https://dlthub.com/community).
1. Report problems and make feature requests [here](https://github.com/dlt-hub/dlt/issues/new/choose).
