Add an example for the incremental configuration to the rest_api docs
burnash committed Jun 24, 2024
1 parent 8eba834 commit 9b1004d
Showing 1 changed file with 65 additions and 33 deletions.
98 changes: 65 additions & 33 deletions docs/website/docs/dlt-ecosystem/verified-sources/rest_api.md
@@ -560,49 +560,81 @@ This will include the `id`, `title`, and `created_at` fields from the `issues` r
Some APIs provide a way to fetch only new or changed data (most often by using a timestamp field like `updated_at`, `created_at`, or incremental IDs).
This is called [incremental loading](../../general-usage/incremental-loading.md). It reduces both the load time and the amount of data transferred.

When the API endpoint supports incremental loading, you can configure the source to load only the new or changed data using these two methods:
When the API endpoint supports incremental loading, you can configure dlt to load only the new or changed data using these two methods:

1. Defining a special parameter in the `params` section of the [endpoint configuration](#endpoint-configuration):
1. Defining a special parameter in the `params` section.
2. Specifying the `incremental` field.

```py
{
"<parameter_name>": {
"type": "incremental",
"cursor_path": "<path_to_cursor_field>",
"initial_value": "<initial_value>",
},
}
```
Both are configured in the [endpoint configuration](#endpoint-configuration). Let's start with the first method.

For example, in the `issues` resource configuration in the GitHub example, we have:
### Incremental loading in `params`

```py
{
"since": {
Imagine an endpoint, `https://api.example.com/posts`, that:
1. Accepts a `created_since` query parameter to fetch posts created after a certain date.
2. Returns a list of posts with the `created_at` field for each post.

For example, if we query the endpoint with `https://api.example.com/posts?created_since=2024-01-25`, we get the following response:

```json
{
    "results": [
        {"id": 1, "title": "Post 1", "created_at": "2024-01-26"},
        {"id": 2, "title": "Post 2", "created_at": "2024-01-27"},
        {"id": 3, "title": "Post 3", "created_at": "2024-01-28"}
    ]
}
```

To enable incremental loading for this endpoint, you can use the following configuration:

```py
{
"path": "posts",
"data_selector": "results", # Optional JSONPath to select the list of posts
"params": {
"created_since": {
"type": "incremental",
"cursor_path": "updated_at",
"initial_value": "2024-01-25T11:21:28Z",
"cursor_path": "created_at", # The JSONPath to the field we want to track in each post
"initial_value": "2024-01-25",
},
}
```
},
}
```

This configuration tells the source to create an incremental object that will keep track of the `updated_at` field in the response and use it as a value for the `since` parameter in subsequent requests.
After you run the pipeline, dlt keeps track of the latest `created_at` value among all the posts fetched and uses it as the `created_since` parameter in the next request.
In our case, the next request goes to `https://api.example.com/posts?created_since=2024-01-28` and fetches only the posts created after `2024-01-28`.

2. Specifying the `incremental` field in the [endpoint configuration](#endpoint-configuration):
Now, let's break down the configuration. The `created_since` parameter is defined as an incremental parameter with the following fields:

```py
{
"incremental": {
"start_param": "<parameter_name>",
"end_param": "<parameter_name>",
"cursor_path": "<path_to_cursor_field>",
"initial_value": "<initial_value>",
"end_value": "<end_value>",
}
}
```
```py
{
    "<parameter_name>": {
        "type": "incremental",
        "cursor_path": "<path_to_cursor_field>",
        "initial_value": "<initial_value>",
    },
}
```

- `type`: The type of the incremental parameter. Set to `incremental`.
- `cursor_path`: The JSONPath to the field within each item in the list that will be used as the cursor value. In this case, it's `created_at`. Note that the path starts from the root of the item (dict) and not from the root of the response.
- `initial_value`: The initial value for the cursor. This is the value that will initialize the state of incremental loading. In this case, it's `2024-01-25`.
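
To make the roles of `cursor_path` and `initial_value` concrete, here is a minimal sketch in plain Python that mimics the behavior described above using the sample response. It does not call dlt; `next_cursor_value` is a hypothetical helper, and it assumes the cursor advances to the largest `created_at` value seen so far.

```py
from typing import Any, Dict, List

# Sample response from the example above.
response: Dict[str, Any] = {
    "results": [
        {"id": 1, "title": "Post 1", "created_at": "2024-01-26"},
        {"id": 2, "title": "Post 2", "created_at": "2024-01-27"},
        {"id": 3, "title": "Post 3", "created_at": "2024-01-28"},
    ]
}


def next_cursor_value(
    items: List[Dict[str, Any]], cursor_field: str, initial_value: str
) -> str:
    """Hypothetical helper: return the largest cursor value seen so far.

    Falls back to initial_value when no item carries the cursor field.
    ISO 8601 dates compare correctly as plain strings.
    """
    values = [item[cursor_field] for item in items if cursor_field in item]
    return max([initial_value, *values])


# "results" plays the role of data_selector: it picks the list of posts.
posts = response["results"]

# The cursor is read from each post (the item), not from the response root.
cursor = next_cursor_value(posts, "created_at", "2024-01-25")

# The stored cursor becomes the query parameter of the next request.
next_url = f"https://api.example.com/posts?created_since={cursor}"
print(next_url)
```

With an empty result list the helper returns `initial_value` unchanged, which corresponds to the very first run or to a run that finds no new posts.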

### Incremental loading using the `incremental` field

This configuration is more flexible and allows you to specify the start and end conditions for the incremental loading.
The alternative method is to use the `incremental` field in the [endpoint configuration](#endpoint-configuration). This method is more flexible and allows you to specify both the start and the end conditions for incremental loading:

```py
{
    "incremental": {
        "start_param": "<parameter_name>",
        "end_param": "<parameter_name>",
        "cursor_path": "<path_to_cursor_field>",
        "initial_value": "<initial_value>",
        "end_value": "<end_value>",
    }
}
```
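
As an illustration, the `posts` endpoint from the previous section could be expressed with the `incremental` field roughly as follows. This is a sketch, not a verified configuration: the name `posts_endpoint` is hypothetical, and it assumes that `end_param` and `end_value` can be left out when only a start condition is needed.

```py
# Hypothetical endpoint configuration for https://api.example.com/posts.
posts_endpoint = {
    "path": "posts",
    "data_selector": "results",  # JSONPath to the list of posts in the response
    "incremental": {
        "start_param": "created_since",  # query parameter that receives the cursor value
        "cursor_path": "created_at",     # field tracked in each post
        "initial_value": "2024-01-25",   # cursor value used on the first run
        # "end_param" and "end_value" are omitted here (assumed optional).
    },
}
```

Compared with the `params` method, the query parameter name moves from a key under `params` into `start_param`, while `cursor_path` and `initial_value` keep the same meaning.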

See the [incremental loading](../../general-usage/incremental-loading.md#incremental-loading-with-a-cursor-field) guide for more details.
