Restructure the rest-client docs and add new sections covering data_selector and custom auth
burnash committed May 7, 2024
1 parent 8c72b69 commit 9a3d9bb
Showing 5 changed files with 600 additions and 78 deletions.
87 changes: 87 additions & 0 deletions docs/website/docs/general-usage/extract-data-from-api/overview.md
@@ -0,0 +1,87 @@
---
title: Extract data from an API
description: Learn how to extract data from an API using dlt
keywords: [api, http, rest, restful, requests, restclient, paginate, pagination, json]
---

dlt has built-in support for fetching data from APIs:
- [RESTClient](./rest-client.md) for interacting with RESTful APIs and paginating the results
- [Requests wrapper](./requests.md) for making simple HTTP requests with automatic retries and timeouts

## Quick example

Here's a simple pipeline that reads issues from the [dlt GitHub repository](https://github.com/dlt-hub/dlt/issues). The API endpoint is https://api.github.com/repos/dlt-hub/dlt/issues. The result is "paginated", meaning that the API returns a limited number of issues per page. The `paginate()` method iterates over all pages and yields the results, which are then processed by the pipeline.

```py
import dlt
from dlt.sources.helpers.rest_client import RESTClient

github_client = RESTClient(base_url="https://api.github.com") # (1)

@dlt.resource
def get_issues():
    for page in github_client.paginate( # (2)
        "/repos/dlt-hub/dlt/issues", # (3)
        params={ # (4)
            "per_page": 100,
            "sort": "updated",
            "direction": "desc",
        },
    ):
        yield page # (5)


pipeline = dlt.pipeline(
    pipeline_name="github_issues",
    destination="duckdb",
    dataset_name="github_data",
)
load_info = pipeline.run(get_issues)
print(load_info)
```

Here's what the code does:
1. We create a `RESTClient` instance with the base URL of the API: in this case, the GitHub API (https://api.github.com).
2. The issues endpoint returns a list of issues. Since there could be hundreds of issues, the API "paginates" the results: it returns a limited number of issues in each response, along with a link to the next batch of issues (or "page"). The `paginate()` method iterates over all pages and yields the batches of issues.
3. Here we specify the address of the endpoint we want to read from: `/repos/dlt-hub/dlt/issues`.
4. We pass the parameters to the actual API call to control the data we get back. In this case, we ask for 100 issues per page (`"per_page": 100`), sorted by the last update date (`"sort": "updated"`) in descending order (`"direction": "desc"`).
5. We yield the page from the resource function to the pipeline. The `page` is an instance of [`PageData`](#pagedata) and contains the data from the current page of the API response along with some metadata (see the sketch below).
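
For instance, the metadata mentioned in step 5 can be inspected inside the resource. Here is a minimal sketch, assuming `PageData` behaves like a regular list of records and exposes the originating response and the detected paginator as attributes (see the `PageData` reference linked above for the exact fields):

```py
@dlt.resource
def get_issues_with_logging():
    for page in github_client.paginate("/repos/dlt-hub/dlt/issues"):
        # The page behaves like a plain list of issue records...
        print(f"Fetched {len(page)} issues")
        # ...and (assumed) carries metadata about the request that produced it.
        print(f"Response status: {page.response.status_code}")
        print(f"Detected paginator: {page.paginator}")
        yield page
```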

Note that we do not explicitly specify the pagination parameters in the example. The `paginate()` method handles pagination automatically: it detects the pagination mechanism used by the API from the response. What if you need to specify the pagination method and parameters explicitly? Let's see how to do that in a different example below.

## Explicitly specifying pagination parameters

```py
import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import JSONResponsePaginator

pokemon_client = RESTClient(
    base_url="https://pokeapi.co/api/v2",
    paginator=JSONResponsePaginator(next_url_path="next"), # (1)
    data_selector="results", # (2)
)

@dlt.resource
def get_pokemons():
    for page in pokemon_client.paginate(
        "/pokemon",
        params={
            "limit": 100, # (3)
        },
    ):
        yield page

pipeline = dlt.pipeline(
    pipeline_name="get_pokemons",
    destination="duckdb",
    dataset_name="pokemon_data",
)
load_info = pipeline.run(get_pokemons)
print(load_info)
```

In the example above:
1. We create a `RESTClient` instance with the base URL of the API: in this case, the [PokéAPI](https://pokeapi.co/). We also specify the paginator to use explicitly: `JSONResponsePaginator` with the `next_url_path` set to `"next"`. This tells the paginator to look for the next page URL in the `next` key of the JSON response.
2. In `data_selector` we specify the JSON path to the data within the response body; for the PokéAPI, the list of records lives under the `results` key (see the sketch after this list).
3. By default, the number of items per page is limited to 20. We override this by specifying the `limit` parameter in the API call.
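
To make `data_selector` and `next_url_path` concrete, here is a trimmed sketch of a PokéAPI list response and the parts each setting points at (the field values are illustrative):

```py
# A trimmed PokéAPI list response (values shortened for illustration):
response_json = {
    "count": 1302,
    "next": "https://pokeapi.co/api/v2/pokemon?offset=100&limit=100",
    "previous": None,
    "results": [
        {"name": "bulbasaur", "url": "https://pokeapi.co/api/v2/pokemon/1/"},
        {"name": "ivysaur", "url": "https://pokeapi.co/api/v2/pokemon/2/"},
    ],
}

# data_selector="results" -> the list of records the client yields as a page
records = response_json["results"]

# next_url_path="next" -> where JSONResponsePaginator looks for the next page URL
next_page_url = response_json["next"]
```
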
100 changes: 100 additions & 0 deletions docs/website/docs/general-usage/extract-data-from-api/requests.md
@@ -0,0 +1,100 @@
---
title: Requests wrapper
description: Use the dlt requests wrapper to make HTTP requests with automatic retries and timeouts
keywords: [http, requests, retry, timeout]
---

`dlt` provides a customized [Python Requests](https://requests.readthedocs.io/en/latest/) client with automatic retries and configurable timeouts.

We recommend using it for API calls in your sources, as it makes your pipeline more resilient to intermittent network errors and other transient failures that could otherwise cause the whole pipeline to fail.

The dlt requests client will additionally set the default user-agent header to `dlt/{DLT_VERSION_NAME}`.

For most use cases, this is a drop-in replacement for `requests`, so in places where you would normally write:

```py
import requests
```

You can instead do:

```py
from dlt.sources.helpers import requests
```

And use it just like you would use `requests`:

```py
response = requests.get(
    'https://example.com/api/contacts',
    headers={'Authorization': MY_API_KEY}
)
data = response.json()
...
```

### Retry rules

By default, failing requests are retried up to 5 times with an exponentially increasing delay: the first retry waits 1 second and the fifth retry waits 16 seconds.
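
As a rough sketch of that schedule, assuming the default backoff factor of 1 and a delay of `backoff_factor * 2 ** (attempt - 1)` seconds per retry:

```py
# Back-of-the-envelope illustration of the default retry delays described above
# (assumed formula: delay = backoff_factor * 2 ** (attempt - 1)):
backoff_factor = 1  # the default request_backoff_factor

for attempt in range(1, 6):
    delay = backoff_factor * 2 ** (attempt - 1)
    print(f"retry {attempt}: wait ~{delay}s")

# retry 1: ~1s, retry 2: ~2s, retry 3: ~4s, retry 4: ~8s, retry 5: ~16s
```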

If all retry attempts fail, the corresponding requests exception is raised, e.g. `requests.HTTPError` or `requests.ConnectionTimeout`.

All standard HTTP server errors trigger a retry. This includes:

* Error status codes:

  All status codes in the `500` range and `429` (too many requests).
  Commonly, servers include a `Retry-After` header with `429` and `503` responses.
  When detected, this value supersedes the standard retry delay.

* Connection and timeout errors:

  When the remote server is unreachable, the connection is unexpectedly dropped, or the request takes longer than the configured `timeout`.

### Customizing retry settings

Many requests settings can be added to the runtime section in your `config.toml`. For example:

```toml
[runtime]
request_max_attempts = 10 # Stop after 10 retry attempts instead of 5
request_backoff_factor = 1.5 # Multiplier applied to the exponential delays. Default is 1
request_timeout = 120 # Timeout in seconds
request_max_retry_delay = 30 # Cap exponential delay to 30 seconds
```

For more control, you can create your own instance of `dlt.sources.helpers.requests.Client` and use that instead of the global client.

This lets you customize which status codes and exceptions to retry on:

```py
from dlt.sources.helpers import requests

http_client = requests.Client(
    status_codes=(403, 500, 502, 503),
    exceptions=(requests.ConnectionError, requests.ChunkedEncodingError)
)
```
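
A hypothetical usage sketch, assuming the custom `Client` instance exposes the same requests-style helpers (`get`, `post`, ...) as the module-level client:

```py
# Hypothetical usage of the custom client defined above (method names assumed):
response = http_client.get("https://example.com/api/contacts")
response.raise_for_status()
data = response.json()
```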

You may also supply a custom retry condition in the form of a predicate.
This is sometimes needed when loading from non-standard APIs that don't use HTTP error codes.

For example:

```py
from typing import Optional

from dlt.sources.helpers import requests

def retry_if_error_key(response: Optional[requests.Response], exception: Optional[BaseException]) -> bool:
    """Decide whether to retry the request based on whether
    the JSON response contains an `error` key
    """
    if response is None:
        # Fall back on the default exception predicate.
        return False
    data = response.json()
    return 'error' in data

http_client = requests.Client(
    retry_condition=retry_if_error_key
)
```
