Add the first version of RESTClient topic guide
Showing 1 changed file with 137 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,137 @@ | ||
# Reading data from RESTful APIs | ||
|
||
RESTful APIs are a common way to interact with web services. They are built on HTTP and are used to read data from and write data to a web server. dlt provides two helpers for reading data from RESTful APIs: a wrapper around the `requests` library and a `RESTClient` class. | ||
|
||
:::tip | ||
There's also a shorthand function for reading from paginated APIs. Check out the [paginate()](#shortcut-for-paginating-api-responses) function. | ||
::: | ||
|
||
|
||
The `RESTClient` class offers a powerful interface for interacting with RESTful APIs, supporting features like | ||
- automatic pagination, | ||
- various authentication mechanisms, | ||
- customizable request/response handling. | ||
|
||
This guide demonstrates how to use the `RESTClient` class to read data from APIs, focusing on its `paginate()` method for fetching data from paginated API responses. | ||
|
||
## Quick example | ||
|
||
Here's a simple pipeline that reads issues from the [dlt GitHub repository](https://github.com/dlt-hub/dlt/issues). The API endpoint is `https://api.github.com/repos/dlt-hub/dlt/issues`. The result is "paginated", meaning that the API returns a limited number of issues per page. The `paginate()` method iterates over all pages and yields the results, which are then processed by the pipeline. | ||
|
||
```python | ||
import dlt | ||
from dlt.sources.helpers.rest_client import RESTClient | ||
|
||
github_client = RESTClient(base_url="https://api.github.com") # (1) | ||
|
||
@dlt.resource | ||
def get_issues(): | ||
    for page in github_client.paginate( # (2) | ||
        "/repos/dlt-hub/dlt/issues", # (3) | ||
        params={ # (4) | ||
            "per_page": 100, | ||
            "sort": "updated", | ||
            "direction": "desc", | ||
        }, | ||
    ): | ||
        yield page # (5) | ||
|
||
|
||
pipeline = dlt.pipeline( | ||
    pipeline_name="github_issues", | ||
    destination="duckdb", | ||
    dataset_name="github_data", | ||
) | ||
load_info = pipeline.run(get_issues) | ||
print(load_info) | ||
``` | ||
|
||
Here's what the code does: | ||
1. We create a `RESTClient` instance with the base URL of the API: in this case, the GitHub API (https://api.github.com). | ||
2. The issues endpoint returns a list of issues. Since there could be hundreds of issues, the API "paginates" the results: it returns a limited number of issues in each response along with a link to the next batch of issues (or "page"). The `paginate()` method iterates over all pages and yields the batches of issues. Note that we do not explicitly specify the pagination parameters here; the `paginate()` method handles this automatically. | ||
3. Here we specify the address of the endpoint we want to read from: `/repos/dlt-hub/dlt/issues`. | ||
4. We pass the parameters to the actual API call to control the data we get back. In this case, we ask for 100 issues per page (`"per_page": 100`), sorted by the last update date (`"sort": "updated"`) in descending order (`"direction": "desc"`). | ||
5. We yield the page from the resource function to the pipeline. | ||
|
||
## Understanding the `RESTClient` Class | ||
|
||
The `RESTClient` class is initialized with parameters that define its behavior for making API requests: | ||
|
||
- `base_url`: The root URL of the API. All requests will be made relative to this URL. | ||
- `headers`: Default headers to include in every request. This can be used to set common headers like `User-Agent` or other custom headers. | ||
- `auth`: The authentication configuration. See the [Authentication](#authentication) section for more details. | ||
- `paginator`: A paginator instance for handling paginated responses. See the [Paginators](#paginators) section below. | ||
- `data_selector`: A [JSONPath selector](https://github.com/h2non/jsonpath-ng?tab=readme-ov-file#jsonpath-syntax) for extracting data from the responses. This defines a way to extract the data from the response JSON. Only used when paginating. | ||
- `session`: An HTTP session for making requests. This is a custom session object that can be used to set up custom behavior for requests. | ||
|
||
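To make the role of `data_selector` more concrete, here is a minimal sketch in plain Python of what selecting records from a response body amounts to. The `select_data` helper and the sample payload are hypothetical. The helper only understands dotted keys, a small subset of JSONPath, and it is not how dlt implements selection.

```python
from typing import Any


def select_data(body: Any, selector: str) -> Any:
    """Follow a dotted path such as "result.items" into nested dicts."""
    current = body
    for key in selector.split("."):
        current = current[key]
    return current


# Hypothetical response body: the records sit under "result.items".
body = {"result": {"items": [{"id": 1}, {"id": 2}]}, "next": None}
records = select_data(body, "result.items")
print(records)
```

A selector is useful when the records are wrapped in an envelope next to metadata such as a `next` link, so that only the records reach the pipeline.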
## Making Basic Requests | ||
|
||
To perform basic GET and POST requests, use the `get()` and `post()` methods respectively. This works similarly to the `requests` library: | ||
|
||
```python | ||
from dlt.sources.helpers.rest_client import RESTClient | ||
client = RESTClient(base_url="https://api.example.com") | ||
response = client.get("/posts/1") | ||
``` | ||
|
||
## Paginating API Responses | ||
|
||
The `RESTClient.paginate()` method is specifically designed to handle paginated responses, yielding `PageData` objects for each page: | ||
|
||
```python | ||
for page in client.paginate("/posts"): | ||
    print(page) | ||
``` | ||
|
||
If `paginator` is not specified, the `paginate()` method will attempt to automatically detect the pagination mechanism used by the API. If the API uses a standard pagination mechanism like having a `next` link in the response's headers or JSON body, the `paginate()` method will handle this automatically. Otherwise, you can specify a paginator object explicitly or implement a custom paginator. | ||
|
||
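The detection idea described above can be sketched in plain Python: look for a `rel="next"` entry in the `Link` header first, then fall back to a `next` key in the JSON body. The `find_next_url` function and the example URLs are hypothetical, and dlt's actual detection logic may differ.

```python
import re
from typing import Any, Mapping, Optional


def find_next_url(headers: Mapping[str, str], body: Any) -> Optional[str]:
    """Return the next-page URL from the Link header or JSON body, if any."""
    # 1. Look for a `<url>; rel="next"` entry in the Link header.
    link_header = headers.get("Link", "")
    match = re.search(r'<([^>]+)>\s*;\s*rel="next"', link_header)
    if match:
        return match.group(1)
    # 2. Fall back to a top-level "next" key in the JSON body.
    if isinstance(body, dict) and isinstance(body.get("next"), str):
        return body["next"]
    # No recognizable pointer to a next page: treat this as the last page.
    return None


link_headers = {"Link": '<https://api.example.com/posts?page=2>; rel="next"'}
json_body = {"results": [], "next": "https://api.example.com/posts?page=3"}
print(find_next_url(link_headers, []))
print(find_next_url({}, json_body))
print(find_next_url({}, {"results": []}))
```

When neither signal is present, a heuristic like this cannot tell a single-page response from an unsupported pagination scheme, which is why passing an explicit paginator is the safer choice for non-standard APIs.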
### PageData Object | ||
|
||
Each `PageData` object contains the data for a single page, along with context like the original request and response objects, allowing for detailed inspection. `PageData` is a list-like object with the following attributes: | ||
|
||
- `request`: The original request object. | ||
- `response`: The response object. | ||
- `paginator`: The paginator object used to paginate the response. | ||
- `auth`: The authentication object used for the request. | ||
|
||
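As an illustration of what "list-like with attributes" means, the sketch below defines a hypothetical `Page` class that behaves like a list of records while carrying extra context. It is not dlt's `PageData` implementation, and plain strings stand in for the request and response objects.

```python
from typing import Any, List


class Page(list):
    """A list of records that also carries request/response context."""

    def __init__(self, data: List[Any], request: Any = None, response: Any = None):
        super().__init__(data)
        self.request = request
        self.response = response


# Strings stand in for real request/response objects in this sketch.
page = Page([{"id": 1}, {"id": 2}], request="GET /posts", response="200 OK")
print(len(page), page[0], page.response)
```

Because the object is itself a list, it can be yielded from a resource directly, while the attributes stay available for debugging or custom logic.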
### Paginators | ||
|
||
Paginators are used to handle paginated responses. The `RESTClient` class comes with built-in paginators for common pagination mechanisms: | ||
- `JSONResponsePaginator`: Handles pagination based on a link to the next page in the JSON response. | ||
- `HeaderLinkPaginator`: Handles pagination based on a link to the next page in the response headers (e.g., the `Link` header, as used by GitHub). | ||
- `OffsetPaginator`: Handles pagination based on an offset and limit in the query parameters. This works only if the API returns the total number of items in the response. | ||
- `JSONResponseCursorPaginator`: Handles pagination based on a cursor in the JSON response. | ||
|
||
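To show why offset-based pagination depends on the total item count, here is a minimal, self-contained sketch. The `fake_api` function stands in for an HTTP call, and the `items` and `total` keys are assumptions made for this example. It illustrates the general technique rather than dlt's `OffsetPaginator`.

```python
from typing import Any, Dict, Iterator, List

ITEMS = [{"id": i} for i in range(1, 8)]  # fake backend with 7 records


def fake_api(offset: int, limit: int) -> Dict[str, Any]:
    """Stand-in for an HTTP call returning one page plus the total count."""
    return {"items": ITEMS[offset : offset + limit], "total": len(ITEMS)}


def paginate_by_offset(limit: int) -> Iterator[List[Dict[str, int]]]:
    """Yield pages until the offset reaches the total reported by the API."""
    offset = 0
    while True:
        body = fake_api(offset, limit)
        yield body["items"]
        offset += limit
        # The reported total is the only signal that there are no more pages.
        if offset >= body["total"]:
            break


pages = list(paginate_by_offset(limit=3))
print([len(page) for page in pages])
```

Without the total, the loop would need another stop condition, such as an empty page, which costs one extra request.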
### Authentication | ||
|
||
The `RESTClient` supports various authentication strategies, such as bearer tokens, API keys, and HTTP basic auth, configured through the `auth` parameter of both the `RESTClient` and the `paginate()` method. | ||
|
||
The available authentication methods are: | ||
- `BearerTokenAuth`: For authenticating with a bearer token in the `Authorization` header. Example header: `Authorization: Bearer <token>` | ||
- `ApiKeyAuth`: For authenticating with an API key like `X-API-Key`. | ||
- `HttpBasicAuth`: For authenticating with HTTP basic auth. | ||
|
||
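For reference, the three strategies differ mainly in the header they attach to each request. The sketch below builds those headers by hand with the standard library. The function names are hypothetical and the code does not use dlt's auth classes; it only shows what ends up on the wire.

```python
import base64
from typing import Dict


def bearer_header(token: str) -> Dict[str, str]:
    """Bearer token: sent as-is in the Authorization header."""
    return {"Authorization": f"Bearer {token}"}


def api_key_header(api_key: str, name: str = "X-API-Key") -> Dict[str, str]:
    """API key: sent in a custom header whose name depends on the API."""
    return {name: api_key}


def basic_auth_header(username: str, password: str) -> Dict[str, str]:
    """HTTP basic auth: base64-encoded "username:password"."""
    credentials = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return {"Authorization": f"Basic {credentials.decode('ascii')}"}


print(bearer_header("my-token"))
print(api_key_header("my-key"))
print(basic_auth_header("user", "pass"))
```

Note that base64 is an encoding, not encryption, so basic auth should only be used over HTTPS.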
## Advanced Usage | ||
|
||
`RESTClient.paginate()` accepts custom hook functions that can inspect or modify the response objects. For example, to handle specific HTTP status codes gracefully: | ||
|
||
```python | ||
def custom_response_handler(response): | ||
    if response.status_code == 404: | ||
        # Handle not found | ||
        pass | ||
|
||
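# paginate() is lazy: iterate over it so the requests are actually made | ||
for page in client.paginate("/posts", hooks={"response": [custom_response_handler]}): | ||
    print(page) | ||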
``` | ||
|
||
The handler function may raise `IgnoreResponseException` to exit the pagination loop early. This is useful for the endpoints | ||
|
||
## Shortcut for Paginating API Responses | ||
|
||
The `paginate()` helper function provides a shorthand for paginating API responses. It takes the same parameters as the `RESTClient.paginate()` method but automatically creates a `RESTClient` instance with the specified base URL: | ||
|
||
```python | ||
from dlt.sources.helpers.rest_client import paginate | ||
|
||
for page in paginate("https://api.example.com/posts"): | ||
    print(page) | ||
``` |