From 931111ad406d52ea2fb2033a74ffe630586de16a Mon Sep 17 00:00:00 2001
From: Sultan Iman
Date: Wed, 22 May 2024 12:00:07 +0200
Subject: [PATCH 1/6] Add more seo keywords

---
 docs/website/docs/general-usage/http/rest-client.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/website/docs/general-usage/http/rest-client.md b/docs/website/docs/general-usage/http/rest-client.md
index 19cc95bf78..45f7ce2fe7 100644
--- a/docs/website/docs/general-usage/http/rest-client.md
+++ b/docs/website/docs/general-usage/http/rest-client.md
@@ -1,7 +1,7 @@
 ---
 title: RESTClient
 description: Learn how to use the RESTClient class to interact with RESTful APIs
-keywords: [api, http, rest, request, extract, restclient, client, pagination, json, response, data_selector, session, auth, paginator, jsonresponsepaginator, headerlinkpaginator, offsetpaginator, jsonresponsecursorpaginator, queryparampaginator, bearer, token, authentication]
+keywords: [api, http, rest, request, extract, restclient, client, pagination, json, response, data_selector, session, auth, paginator, jsonresponsepaginator, headerlinkpaginator, offsetpaginator, jsonresponsecursorpaginator, queryparampaginator, bearer, token, authentication, reverse etl, json path, openapi, swagger]
 ---
 
 The `RESTClient` class offers an interface for interacting with RESTful APIs, including features like:

From 373b8b3dfeeabfee380da5163f8e2e7b6104548d Mon Sep 17 00:00:00 2001
From: Sultan Iman
Date: Wed, 22 May 2024 12:08:39 +0200
Subject: [PATCH 2/6] Add a new section about resource_defaults

---
 .../docs/general-usage/http/rest-client.md | 24 +++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/docs/website/docs/general-usage/http/rest-client.md b/docs/website/docs/general-usage/http/rest-client.md
index 45f7ce2fe7..8c2317f6f9 100644
--- a/docs/website/docs/general-usage/http/rest-client.md
+++ b/docs/website/docs/general-usage/http/rest-client.md
@@ -80,7 +80,7 @@ For example, if the API
response looks like this:
 }
 ```
 
-The `data_selector` should be set to `"posts"` to extract the list of posts from the response.
+The `data_selector` should be set to `"posts"` or `"$.posts"` to extract the list of posts from the response.
 
 For a nested structure like this:
 
@@ -96,7 +96,7 @@ For a nested structure like this:
 }
 ```
 
-The `data_selector` needs to be set to `"results.posts"`. Read more about [JSONPath syntax](https://github.com/h2non/jsonpath-ng?tab=readme-ov-file#jsonpath-syntax) to learn how to write selectors.
+The `data_selector` needs to be set to `"results.posts"` or `"$.results.posts"`. Read more about [JSONPath syntax](https://github.com/h2non/jsonpath-ng?tab=readme-ov-file#jsonpath-syntax) to learn how to write selectors.
 
 ### PageData
 
@@ -432,6 +432,26 @@ for page in client.paginate("/protected/resource"):
     print(page)
 ```
 
+## Common resource defaults
+
+In `RESTAPIConfig`, you can provide common settings via `resource_defaults`, which are then applied to all requests.
+
+```py
+my_params = {
+    "from_year": 2018,
+    "end_year": 2024,
+}
+
+source_config: RESTAPIConfig = {
+    "client": {...},
+    "resource_defaults": {
+        "endpoint": {
+            "params": my_params,
+        }
+    }
+}
+```
+
 ### API key authentication
 
 API Key Authentication (`ApiKeyAuth`) is an auth method where the client sends an API key in a custom header (e.g. `X-API-Key: `, or as a query parameter).
From 8b88d653f12a53bcd7649f846f440eb7a6de007a Mon Sep 17 00:00:00 2001
From: Sultan Iman
Date: Wed, 22 May 2024 12:27:51 +0200
Subject: [PATCH 3/6] Add new section about incremental loading

---
 docs/website/docs/general-usage/http/rest-client.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/docs/website/docs/general-usage/http/rest-client.md b/docs/website/docs/general-usage/http/rest-client.md
index 8c2317f6f9..f6ea552614 100644
--- a/docs/website/docs/general-usage/http/rest-client.md
+++ b/docs/website/docs/general-usage/http/rest-client.md
@@ -536,6 +536,10 @@ def custom_response_handler(response):
 client.paginate("/posts", hooks={"response": [custom_response_handler]})
 ```
 
+### Incremental loading
+
+TODO
+
 The handler function may raise `IgnoreResponseException` to exit the pagination loop early. This is useful for the enpoints that return a 404 status code when there are no items to paginate.
 
 ## Shortcut for paginating API responses

From 415c3c6c9f160343e908e3c89c3fdd37e7c2e16b Mon Sep 17 00:00:00 2001
From: Sultan Iman
Date: Wed, 22 May 2024 14:20:37 +0200
Subject: [PATCH 4/6] Add incremental loading examples

---
 .../docs/general-usage/http/rest-client.md | 124 ++++++++++++++----
 1 file changed, 99 insertions(+), 25 deletions(-)

diff --git a/docs/website/docs/general-usage/http/rest-client.md b/docs/website/docs/general-usage/http/rest-client.md
index f6ea552614..97a0668c63 100644
--- a/docs/website/docs/general-usage/http/rest-client.md
+++ b/docs/website/docs/general-usage/http/rest-client.md
@@ -1,10 +1,18 @@
 ---
 title: RESTClient
 description: Learn how to use the RESTClient class to interact with RESTful APIs
-keywords: [api, http, rest, request, extract, restclient, client, pagination, json, response, data_selector, session, auth, paginator, jsonresponsepaginator, headerlinkpaginator, offsetpaginator, jsonresponsecursorpaginator, queryparampaginator, bearer, token, authentication, reverse etl, json path, openapi, swagger]
+keywords:
+  [
+    api,
http, rest, request, extract, restclient, client,
+    pagination, json, response, data_selector, session, auth,
+    paginator, jsonresponsepaginator, headerlinkpaginator, offsetpaginator,
+    jsonresponsecursorpaginator, queryparampaginator, bearer, token,
+    authentication, reverse etl, json path, openapi, swagger
+  ]
 ---
 
 The `RESTClient` class offers an interface for interacting with RESTful APIs, including features like:
+
 - automatic pagination,
 - various authentication mechanisms,
 - customizable request/response handling.
@@ -72,11 +80,11 @@ For example, if the API response looks like this:
 
 ```json
 {
-    "posts": [
-        {"id": 1, "title": "Post 1"},
-        {"id": 2, "title": "Post 2"},
-        {"id": 3, "title": "Post 3"}
-    ]
+  "posts": [
+    { "id": 1, "title": "Post 1" },
+    { "id": 2, "title": "Post 2" },
+    { "id": 3, "title": "Post 3" }
+  ]
 }
 ```
 
@@ -86,13 +94,13 @@ For a nested structure like this:
 
 ```json
 {
-    "results": {
-        "posts": [
-            {"id": 1, "title": "Post 1"},
-            {"id": 2, "title": "Post 2"},
-            {"id": 3, "title": "Post 3"}
-        ]
-    }
+  "results": {
+    "posts": [
+      { "id": 1, "title": "Post 1" },
+      { "id": 2, "title": "Post 2" },
+      { "id": 3, "title": "Post 3" }
+    ]
+  }
 }
 ```
 
@@ -133,14 +141,14 @@ Suppose the API response for `https://api.example.com/posts` looks like this:
 
 ```json
 {
-    "data": [
-        {"id": 1, "title": "Post 1"},
-        {"id": 2, "title": "Post 2"},
-        {"id": 3, "title": "Post 3"}
-    ],
-    "pagination": {
-        "next": "https://api.example.com/posts?page=2"
-    }
+  "data": [
+    { "id": 1, "title": "Post 1" },
+    { "id": 2, "title": "Post 2" },
+    { "id": 3, "title": "Post 3" }
+  ],
+  "pagination": {
+    "next": "https://api.example.com/posts?page=2"
+  }
 }
 ```
 
@@ -161,7 +169,6 @@ def get_data():
         yield page
 ```
 
-
 #### HeaderLinkPaginator
 
 This paginator handles pagination based on a link to the next page in the response headers (e.g., the `Link` header, as used by GitHub).
@@ -536,11 +543,78 @@ def custom_response_handler(response):
 client.paginate("/posts", hooks={"response": [custom_response_handler]})
 ```
 
+The handler function may raise `IgnoreResponseException` to exit the pagination loop early. This is useful for the endpoints that return a 404 status code when there are no items to paginate.
+
 ### Incremental loading
 
-TODO
+It is often needed to load only the new data based on some incremental property be it timestamp, date and time, integer identifier or a cursor value.
+Fortunately our `RESTClient` allows you to elegantly express this behavior.
 
-The handler function may raise `IgnoreResponseException` to exit the pagination loop early. This is useful for the enpoints that return a 404 status code when there are no items to paginate.
+Let's use our example response json and we want to load new posts as they appear without complete reload of data.
+
+```json
+{
+  "data": [
+    { "id": 1, "title": "Post 1", "created_at": "2010-08-21T17:11:27-0400" },
+    { "id": 2, "title": "Post 2", "created_at": "2010-09-21T17:11:27-0400" },
+    { "id": 3, "title": "Post 3", "created_at": "2010-10-21T17:11:27-0400" }
+  ],
+  "pagination": {
+    "next": "https://api.example.com/posts?page=2"
+  }
+}
+```
+
+To achieve our objective, we need to use `endpoint.params` by adding the incremental type.
+In the following examples we use `id` - primary key and `created_at` - creation datetime.
+
+**Incremental loading by id**
+
+```py
+source_config: RESTAPIConfig = {
+    "resources": [
+        {
+            "name": "get_posts_list",
+            "table_name": "posts",
+            "endpoint": {
+                "data_selector": "$.data",
+                "path": "/posts",
+                "params": {
+                    "post_id": {
+                        "type": "incremental",
+                        "cursor_path": "id",
+                        "initial_value": 1,
+                    }
+                },
+            },
+        }
+    ]
+}
+```
+
+**Incremental loading by creation data**
+
+```py
+source_config: RESTAPIConfig = {
+    "resources": [
+        {
+            "name": "get_posts_list",
+            "table_name": "posts",
+            "endpoint": {
+                "data_selector": "$.data",
+                "path": "/posts",
+                "params": {
+                    "creation_date": {
+                        "type": "incremental",
+                        "cursor_path": "created_at",
+                        "initial_value": "2010-08-21T17:11:27-0400",
+                    }
+                },
+            },
+        }
+    ]
+}
+```
 
 ## Shortcut for paginating API responses
 
@@ -584,7 +658,7 @@ RUNTIME__LOG_LEVEL=INFO python my_script.py
 ```
 
 2. Use the [`PageData`](#pagedata) instance to inspect the [request](https://docs.python-requests.org/en/latest/api/#requests.Request)
-and [response](https://docs.python-requests.org/en/latest/api/#requests.Response) objects:
+   and [response](https://docs.python-requests.org/en/latest/api/#requests.Response) objects:
 
 ```py
 from dlt.sources.helpers.rest_client import RESTClient

From 0e1e7b129da248870d4c3519dcc2dea89693c0e1 Mon Sep 17 00:00:00 2001
From: Sultan Iman
Date: Wed, 22 May 2024 14:34:37 +0200
Subject: [PATCH 5/6] Add custom combined auth example

---
 .../docs/general-usage/http/rest-client.md | 31 ++++++++++++++-----
 1 file changed, 24 insertions(+), 7 deletions(-)

diff --git a/docs/website/docs/general-usage/http/rest-client.md b/docs/website/docs/general-usage/http/rest-client.md
index 97a0668c63..9ea87ecb9a 100644
--- a/docs/website/docs/general-usage/http/rest-client.md
+++ b/docs/website/docs/general-usage/http/rest-client.md
@@ -508,11 +508,13 @@ response = client.get("/protected/resource")
 
 You can implement custom authentication by subclassing the `AuthConfigBase` class and implementing the `__call__` method:
 
+**Custom
bearer auth:**
+
 ```py
 from dlt.sources.helpers.rest_client.auth import AuthConfigBase
 
 class CustomAuth(AuthConfigBase):
-    def __init__(self, token):
+    def __init__(self, token: str):
         self.token = token
 
     def __call__(self, request):
@@ -521,6 +523,24 @@ class CustomAuth(AuthConfigBase):
         return request
 ```
 
+**Custom combined auth:**
+Sometimes you need to pass authentication parameters via headers as well as query params.
+
+```py
+from dlt.sources.helpers.rest_client.auth import AuthConfigBase
+
+class CombinedAuth(AuthConfigBase):
+    def __init__(self, client_id: str, client_secret: str):
+        self.client_id = client_id
+        self.client_secret = client_secret
+
+    def __call__(self, request):
+        # Modify the request object to include the necessary authentication headers and request params
+        request.headers["Authorization"] = f"Bearer {self.client_secret}"
+        request.prepare_url(request.url, {"client_id": self.client_id})
+        return request
+
+
 Then, you can use your custom authentication class with the `RESTClient`:
 
 ```py
@@ -550,7 +570,7 @@ The handler function may raise `IgnoreResponseException` to exit the pagination
 It is often needed to load only the new data based on some incremental property be it timestamp, date and time, integer identifier or a cursor value.
 Fortunately our `RESTClient` allows you to elegantly express this behavior.
 
-Let's use our example response json and we want to load new posts as they appear without complete reload of data.
+Let's use a slightly modified version of our example JSON response: we want to load new posts as they appear, without a complete reload of the data.
 
 ```json
 {
@@ -558,10 +578,7 @@ Let's use our example response json and we want to load new posts as they appear
     { "id": 1, "title": "Post 1", "created_at": "2010-08-21T17:11:27-0400" },
     { "id": 2, "title": "Post 2", "created_at": "2010-09-21T17:11:27-0400" },
     { "id": 3, "title": "Post 3", "created_at": "2010-10-21T17:11:27-0400" }
-  ],
-  "pagination": {
-    "next": "https://api.example.com/posts?page=2"
-  }
+  ]
 }
 ```
 
@@ -592,7 +609,7 @@ source_config: RESTAPIConfig = {
 }
 ```
 
-**Incremental loading by creation data**
+**Incremental loading by creation date**
 
 ```py
 source_config: RESTAPIConfig = {

From 7c6dcbd4ec69f470c7feaf7b6436127e805dab4e Mon Sep 17 00:00:00 2001
From: Sultan Iman
Date: Wed, 22 May 2024 14:35:16 +0200
Subject: [PATCH 6/6] Close code block

---
 docs/website/docs/general-usage/http/rest-client.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/website/docs/general-usage/http/rest-client.md b/docs/website/docs/general-usage/http/rest-client.md
index 9ea87ecb9a..8aa4d93da7 100644
--- a/docs/website/docs/general-usage/http/rest-client.md
+++ b/docs/website/docs/general-usage/http/rest-client.md
@@ -539,7 +539,7 @@ class CombinedAuth(AuthConfigBase):
         request.headers["Authorization"] = f"Bearer {self.client_secret}"
         request.prepare_url(request.url, {"client_id": self.client_id})
         return request
-
+```
 
 Then, you can use your custom authentication class with the `RESTClient`:
 
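
A note on the `CombinedAuth` example from patch 5/6: the sketch below may help check what that `__call__` does to an outgoing request. It is not dlt code. It assumes plain `requests` and uses `requests.auth.AuthBase` as a stand-in for dlt's `AuthConfigBase`, so it runs without dlt installed. The URL and credentials are placeholders, and nothing is sent over the network.

```py
import requests
from requests.auth import AuthBase


class CombinedAuth(AuthBase):
    """Stand-in for the patch's CombinedAuth, built on plain requests (assumption)."""

    def __init__(self, client_id: str, client_secret: str):
        self.client_id = client_id
        self.client_secret = client_secret

    def __call__(self, request):
        # `request` is a requests.PreparedRequest at this point.
        request.headers["Authorization"] = f"Bearer {self.client_secret}"
        # prepare_url() rebuilds the URL and appends the extra query params.
        request.prepare_url(request.url, {"client_id": self.client_id})
        return request


# Placeholder URL and credentials; prepare() only builds the request locally.
prepared = requests.Request(
    "GET",
    "https://api.example.com/protected/resource",
    auth=CombinedAuth(client_id="my-client-id", client_secret="my-secret"),
).prepare()

print(prepared.url)
print(prepared.headers["Authorization"])
```

Preparing the request locally shows both effects at once: the `Authorization` header is set and `client_id` is appended to the query string, which is the behavior the "custom combined auth" section describes.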