master merge for 0.5.4 release #1756

Merged on Aug 28, 2024. 37 commits in total; the diff below shows changes from 10 commits.

Commits
8b4fc8c
RangePaginator: Stops pagination in case of page without data items
willi-mueller Aug 8, 2024
8d4ffa9
Defaults RangePaginator to stop after having received an empty page
willi-mueller Aug 9, 2024
e9ecf88
Documents how to stop paginator, updates docs on json_link
willi-mueller Aug 12, 2024
5e78dcc
Either total_path or maximum_value or stop_after_empty_pages is required
willi-mueller Aug 12, 2024
44b8274
updates docs to new type signature
willi-mueller Aug 12, 2024
e42f4d7
Updated the docs: Using pipeline.default_schema.toprettyyaml() (#1660)
dat-a-man Aug 14, 2024
a9c2958
Add `storage_options` to `DeltaTable.create` (#1686)
jorritsandbrink Aug 14, 2024
61fc190
documents pluggable custom auth
willi-mueller Aug 14, 2024
9bd0b2e
bumps to pre release 0.5.4a0 (#1689)
burnash Aug 14, 2024
122fc7f
Allow different from credentials project_id for BigQuery (#1680)
VioletM Aug 14, 2024
982b448
improves formatting in error message
willi-mueller Aug 15, 2024
a4dbd5d
fix delta table dangling parquet file bug (#1695)
jorritsandbrink Aug 15, 2024
01423f7
Add `delta` table partitioning support (#1696)
jorritsandbrink Aug 15, 2024
49b45fb
sets default argument to None
willi-mueller Aug 16, 2024
1f26fe7
passes non-empty list to paginator.update_state() and interprets both…
willi-mueller Aug 16, 2024
5bf78ae
fixes load job counter (#1702)
rudolfix Aug 16, 2024
2b9a422
Merge pull request #1690 from dlt-hub/feat/524-rest_api-pluggable-cus…
willi-mueller Aug 19, 2024
83bab15
refactors magic to telling name
willi-mueller Aug 19, 2024
d448122
Merge pull request #1677 from dlt-hub/feat/1637_stop-pagination-after…
willi-mueller Aug 19, 2024
843b658
Enable `scd2` record reinsert (#1707)
jorritsandbrink Aug 21, 2024
6f778eb
`scd2` custom "valid from" / "valid to" value feature (#1709)
jorritsandbrink Aug 22, 2024
49dabb8
Make `make lint` fail on `black` format diff (#1716)
jorritsandbrink Aug 22, 2024
c51445c
Docs/issue 1661 add tip to source docs and update weaviate docs (#1662)
dat-a-man Aug 23, 2024
6f7591e
Add custom parent-child relationships example (#1678)
dat-a-man Aug 23, 2024
d9a7b93
Correct the library name for mem stats to `psutil` (#1733)
deepyaman Aug 25, 2024
7d7c14f
Replaced "full_refresh" with "dev_mode" (#1735)
dat-a-man Aug 25, 2024
011d7ff
feat/1681 collects load job metrics and adds remote uri (#1708)
rudolfix Aug 25, 2024
2788235
Update snowflake.md
akelad Aug 26, 2024
935dc09
Feat/1711 create with not exists dlt tables (#1740)
rudolfix Aug 26, 2024
08e5e7a
Enable schema evolution for `merge` write disposition with `delta` ta…
jorritsandbrink Aug 27, 2024
e337cca
provides detail exception messages when cursor stored value cannot be…
rudolfix Aug 27, 2024
817d51d
Merge pull request #1747 from dlt-hub/akelad-patch-1
akelad Aug 28, 2024
98ca505
Expose staging tables truncation to config (#1717)
VioletM Aug 28, 2024
4e1c607
enables external location and named credential in databricks (#1755)
rudolfix Aug 28, 2024
63f8954
bumps dlt version to 0.5.4
rudolfix Aug 28, 2024
b48c7c3
runs staging tests on athena (#1764)
rudolfix Aug 28, 2024
e9c9ecf
fixes staging tests for athena
rudolfix Aug 28, 2024
2 changes: 1 addition & 1 deletion dlt/sources/helpers/rest_client/client.py
@@ -225,7 +225,7 @@ def raise_for_status(response: Response, *args: Any, **kwargs: Any) -> None:

             if paginator is None:
                 paginator = self.detect_paginator(response, data)
-            paginator.update_state(response)
+            paginator.update_state(response, data)
             paginator.update_request(request)

             # yield data with context
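
The changed call now passes two positional arguments. One consequence, assuming a project has custom paginators that still override `update_state` with the old one-argument signature, is a `TypeError` at this call. The sketch below uses hypothetical stand-in classes and a plain dict in place of `requests.Response`; none of these names come from dlt.

```python
from typing import Any, Dict, List, Optional


class OldStylePaginator:
    """Hypothetical custom paginator written against the previous signature."""

    def update_state(self, response: Dict[str, Any]) -> None:
        self.has_next_page = bool(response.get("next"))


class NewStylePaginator:
    """The same paginator, updated to accept the extracted page items."""

    def update_state(self, response: Dict[str, Any], data: Optional[List[Any]] = None) -> None:
        # `data` can be ignored when the stop condition lives in the response body.
        self.has_next_page = bool(response.get("next"))


def call_like_client(paginator: Any, response: Dict[str, Any], data: List[Any]) -> bool:
    """Calls update_state with (response, data) and reports whether the call succeeded."""
    try:
        paginator.update_state(response, data)
    except TypeError:
        return False
    return True


response = {"next": "https://example.com/items?page=2"}
old_ok = call_like_client(OldStylePaginator(), response, [1, 2])
new_ok = call_like_client(NewStylePaginator(), response, [1, 2])
```

Giving `data` a default of `None`, as the hunks in `paginators.py` do, keeps direct calls such as `update_state(response)` working.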
86 changes: 55 additions & 31 deletions dlt/sources/helpers/rest_client/paginators.py
@@ -1,6 +1,6 @@
 import warnings
 from abc import ABC, abstractmethod
-from typing import Optional, Dict, Any
+from typing import Any, Dict, List, Optional
 from urllib.parse import urlparse, urljoin

 from requests import Response, Request
@@ -39,7 +39,7 @@ def init_request(self, request: Request) -> None: # noqa: B027, optional overri
         pass

     @abstractmethod
-    def update_state(self, response: Response) -> None:
+    def update_state(self, response: Response, data: Optional[List[Any]] = None) -> None:
         """Updates the paginator's state based on the response from the API.

         This method should extract necessary pagination details (like next page
@@ -73,7 +73,7 @@ def __str__(self) -> str:
 class SinglePagePaginator(BasePaginator):
     """A paginator for single-page API responses."""

-    def update_state(self, response: Response) -> None:
+    def update_state(self, response: Response, data: Optional[List[Any]] = None) -> None:
         self._has_next_page = False

     def update_request(self, request: Request) -> None:
@@ -96,6 +96,7 @@ def __init__(
         maximum_value: Optional[int] = None,
         total_path: Optional[jsonpath.TJsonPath] = None,
         error_message_items: str = "items",
+        stop_after_empty_page: Optional[bool] = True,
     ):
         """
         Args:
@@ -116,44 +117,55 @@ def __init__(
                 If not provided, `maximum_value` must be specified.
             error_message_items (str): The name of the items in the error message.
                 Defaults to 'items'.
+            stop_after_empty_page (bool): Whether pagination should stop when
+                a page contains no result items. Defaults to `True`.
         """
         super().__init__()
-        if total_path is None and maximum_value is None:
-            raise ValueError("Either `total_path` or `maximum_value` must be provided.")
+        if total_path is None and maximum_value is None and not stop_after_empty_page:
+            raise ValueError(
+                "Either `total_path` or `maximum_value` or `stop_after_empty_page` must be provided."
+            )
         self.param_name = param_name
         self.current_value = initial_value
         self.value_step = value_step
         self.base_index = base_index
         self.maximum_value = maximum_value
         self.total_path = jsonpath.compile_path(total_path) if total_path else None
         self.error_message_items = error_message_items
+        self.stop_after_empty_page = stop_after_empty_page

     def init_request(self, request: Request) -> None:
         if request.params is None:
             request.params = {}

         request.params[self.param_name] = self.current_value

-    def update_state(self, response: Response) -> None:
-        total = None
-        if self.total_path:
-            response_json = response.json()
-            values = jsonpath.find_values(self.total_path, response_json)
-            total = values[0] if values else None
-            if total is None:
-                self._handle_missing_total(response_json)
-
-            try:
-                total = int(total)
-            except ValueError:
-                self._handle_invalid_total(total)
-
-        self.current_value += self.value_step
-
-        if (total is not None and self.current_value >= total + self.base_index) or (
-            self.maximum_value is not None and self.current_value >= self.maximum_value
-        ):
+    def update_state(self, response: Response, data: Optional[List[Any]] = None) -> None:
+        if self._stop_after_this_page(data):
             self._has_next_page = False
+        else:
+            total = None
+            if self.total_path:
+                response_json = response.json()
+                values = jsonpath.find_values(self.total_path, response_json)
+                total = values[0] if values else None
+                if total is None:
+                    self._handle_missing_total(response_json)
+
+                try:
+                    total = int(total)
+                except ValueError:
+                    self._handle_invalid_total(total)
+
+            self.current_value += self.value_step
+
+            if (total is not None and self.current_value >= total + self.base_index) or (
+                self.maximum_value is not None and self.current_value >= self.maximum_value
+            ):
+                self._has_next_page = False
+
+    def _stop_after_this_page(self, data: Optional[List[Any]]=None) -> bool:
+        return self.stop_after_empty_page and not data

     def _handle_missing_total(self, response_json: Dict[str, Any]) -> None:
         raise ValueError(
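
The new `update_state` checks for an empty page first and only then consults the total and `maximum_value`. That branch order can be sketched as a pure function. The name `has_next_page` and its flat argument list are illustrative assumptions and are not part of dlt; the real method reads these values from the paginator instance and the response.

```python
from typing import Any, List, Optional


def has_next_page(
    current_value: int,
    value_step: int,
    data: Optional[List[Any]],
    total: Optional[int] = None,
    maximum_value: Optional[int] = None,
    base_index: int = 0,
    stop_after_empty_page: bool = True,
) -> bool:
    """Mirrors the branch order of the new update_state as a pure function."""
    # An empty or missing page ends pagination before any counter is checked.
    if stop_after_empty_page and not data:
        return False
    next_value = current_value + value_step
    # Stop once the next value reaches the reported total...
    if total is not None and next_value >= total + base_index:
        return False
    # ...or the caller-supplied cap.
    if maximum_value is not None and next_value >= maximum_value:
        return False
    return True
```

As in the hunk, `not data` is true for `None` as well as for an empty list, so a caller that omits `data` while `stop_after_empty_page` is enabled stops after the first page.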
@@ -229,6 +241,7 @@ def __init__(
         page_param: str = "page",
         total_path: jsonpath.TJsonPath = "total",
         maximum_page: Optional[int] = None,
+        stop_after_empty_page: Optional[bool] = True,
     ):
         """
         Args:
@@ -246,9 +259,13 @@ def __init__(
                 will stop once this page is reached or exceeded, even if more
                 data is available. This allows you to limit the maximum number
                 of pages for pagination. Defaults to None.
+            stop_after_empty_page (bool): Whether pagination should stop when
+                a page contains no result items. Defaults to `True`.
         """
-        if total_path is None and maximum_page is None:
-            raise ValueError("Either `total_path` or `maximum_page` must be provided.")
+        if total_path is None and maximum_page is None and not stop_after_empty_page:
+            raise ValueError(
+                "Either `total_path` or `maximum_page` or `stop_after_empty_page` must be provided."
+            )

         page = page if page is not None else base_page
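
The relaxed check only rejects a configuration in which nothing could ever end pagination. A minimal sketch of that rule, using a hypothetical helper (`check_stop_condition` is not a dlt function) with a generic `maximum` standing in for `maximum_page`, `maximum_offset`, or `maximum_value`:

```python
from typing import Optional


def check_stop_condition(
    total_path: Optional[str],
    maximum: Optional[int],
    stop_after_empty_page: Optional[bool],
) -> None:
    """Raises ValueError when no stop condition is configured."""
    if total_path is None and maximum is None and not stop_after_empty_page:
        raise ValueError(
            "Either `total_path` or `maximum` or `stop_after_empty_page` must be provided."
        )


def is_valid(
    total_path: Optional[str],
    maximum: Optional[int],
    stop_after_empty_page: Optional[bool],
) -> bool:
    """Convenience wrapper that turns the exception into a boolean."""
    try:
        check_stop_condition(total_path, maximum, stop_after_empty_page)
    except ValueError:
        return False
    return True
```

Because `total_path` defaults to `"total"` and `stop_after_empty_page` defaults to `True` in the signature above, the error is only reachable when a caller explicitly passes `total_path=None`, leaves the maximum unset, and disables `stop_after_empty_page`.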

@@ -260,6 +277,7 @@ def __init__(
             value_step=1,
             maximum_value=maximum_page,
             error_message_items="pages",
+            stop_after_empty_page=stop_after_empty_page,
         )

     def __str__(self) -> str:
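
For a page-numbered API that reports no total, the new flag should allow a resource configuration along the lines of the sketch below. The key names follow the constructor arguments shown in the hunks above; the `posts` path and the `base_page` value are placeholders, and passing `None` for `total_path` in a dict configuration is an assumption.

```python
# Hypothetical resource configuration for the rest_api source.
posts_resource = {
    "path": "posts",
    "paginator": {
        "type": "page_number",
        "base_page": 1,
        # No total in the response body: disable the default "total" lookup.
        "total_path": None,
        # Rely on the first empty page to end pagination (this is the default).
        "stop_after_empty_page": True,
    },
}
```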
@@ -330,6 +348,7 @@ def __init__(
         limit_param: str = "limit",
         total_path: jsonpath.TJsonPath = "total",
         maximum_offset: Optional[int] = None,
+        stop_after_empty_page: Optional[bool] = True,
     ) -> None:
         """
         Args:
@@ -347,15 +366,20 @@ def __init__(
                 pagination will stop once this offset is reached or exceeded,
                 even if more data is available. This allows you to limit the
                 maximum range for pagination. Defaults to None.
+            stop_after_empty_page (bool): Whether pagination should stop when
+                a page contains no result items. Defaults to `True`.
         """
-        if total_path is None and maximum_offset is None:
-            raise ValueError("Either `total_path` or `maximum_offset` must be provided.")
+        if total_path is None and maximum_offset is None and not stop_after_empty_page:
+            raise ValueError(
+                "Either `total_path` or `maximum_offset` or `stop_after_empty_page` must be provided."
+            )
         super().__init__(
             param_name=offset_param,
             initial_value=offset,
             total_path=total_path,
             value_step=limit,
             maximum_value=maximum_offset,
+            stop_after_empty_page=stop_after_empty_page,
         )
         self.limit_param = limit_param
         self.limit = limit
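
To illustrate what stopping on an empty page means for offset pagination, here is a small simulation against an in-memory "API" that reports no total. It is a sketch of the behaviour rather than dlt code: `fetch_all`, `fake_api`, and `records` are hypothetical names.

```python
from typing import Any, Callable, List, Optional


def fetch_all(
    fetch_page: Callable[[int, int], List[Any]],
    limit: int,
    maximum_offset: Optional[int] = None,
) -> List[Any]:
    """Collects items page by page until an empty page or the offset cap is hit."""
    offset = 0
    items: List[Any] = []
    while True:
        page = fetch_page(offset, limit)
        items.extend(page)
        # Empty page: nothing more to fetch.
        if not page:
            break
        offset += limit
        # Optional cap, checked after advancing the offset.
        if maximum_offset is not None and offset >= maximum_offset:
            break
    return items


records = list(range(25))


def fake_api(offset: int, limit: int) -> List[int]:
    """Returns a slice of `records`; an offset past the end yields an empty list."""
    return records[offset : offset + limit]
```

With `limit=10` the loop requests offsets 0, 10, 20, and 30, and the empty page at offset 30 ends the run. The price of not knowing the total is that one extra request.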
@@ -484,7 +508,7 @@ def __init__(self, links_next_key: str = "next") -> None:
         super().__init__()
         self.links_next_key = links_next_key

-    def update_state(self, response: Response) -> None:
+    def update_state(self, response: Response, data: Optional[List[Any]] = None) -> None:
         """Extracts the next page URL from the 'Link' header in the response."""
         self._next_reference = response.links.get(self.links_next_key, {}).get("url")

@@ -539,7 +563,7 @@ def __init__(
         super().__init__()
         self.next_url_path = jsonpath.compile_path(next_url_path)

-    def update_state(self, response: Response) -> None:
+    def update_state(self, response: Response, data: Optional[List[Any]] = None) -> None:
         """Extracts the next page URL from the JSON response."""
         values = jsonpath.find_values(self.next_url_path, response.json())
         self._next_reference = values[0] if values else None
@@ -618,7 +642,7 @@ def __init__(
         self.cursor_path = jsonpath.compile_path(cursor_path)
         self.cursor_param = cursor_param

-    def update_state(self, response: Response) -> None:
+    def update_state(self, response: Response, data: Optional[List[Any]] = None) -> None:
         """Extracts the cursor value from the JSON response."""
         values = jsonpath.find_values(self.cursor_path, response.json())
         self._next_reference = values[0] if values else None
8 changes: 4 additions & 4 deletions docs/website/docs/dlt-ecosystem/verified-sources/rest_api.md
@@ -371,7 +371,7 @@ You can configure the pagination for the `posts` resource like this:
 {
     "path": "posts",
     "paginator": {
-        "type": "json_response",
+        "type": "json_link",
         "next_url_path": "pagination.next",
     }
 }
@@ -380,7 +380,7 @@ You can configure the pagination for the `posts` resource like this:
 Alternatively, you can use the paginator instance directly:

 ```py
-from dlt.sources.helpers.rest_client.paginators import JSONResponsePaginator
+from dlt.sources.helpers.rest_client.paginators import JSONLinkPaginator

 # ...

@@ -402,8 +402,8 @@ These are the available paginators:
 | ------------ | -------------- | ----------- |
 | `json_link` | [JSONLinkPaginator](../../general-usage/http/rest-client.md#jsonresponsepaginator) | The link to the next page is in the body (JSON) of the response.<br/>*Parameters:*<ul><li>`next_url_path` (str) - the JSONPath to the next page URL</li></ul> |
 | `header_link` | [HeaderLinkPaginator](../../general-usage/http/rest-client.md#headerlinkpaginator) | The links to the next page are in the response headers.<br/>*Parameters:*<ul><li>`link_header` (str) - the name of the header containing the links. Default is "next".</li></ul> |
-| `offset` | [OffsetPaginator](../../general-usage/http/rest-client.md#offsetpaginator) | The pagination is based on an offset parameter. With total items count either in the response body or explicitly provided.<br/>*Parameters:*<ul><li>`limit` (int) - the maximum number of items to retrieve in each request</li><li>`offset` (int) - the initial offset for the first request. Defaults to `0`</li><li>`offset_param` (str) - the name of the query parameter used to specify the offset. Defaults to "offset"</li><li>`limit_param` (str) - the name of the query parameter used to specify the limit. Defaults to "limit"</li><li>`total_path` (str) - a JSONPath expression for the total number of items. If not provided, pagination is controlled by `maximum_offset`</li><li>`maximum_offset` (int) - optional maximum offset value. Limits pagination even without total count</li></ul> |
-| `page_number` | [PageNumberPaginator](../../general-usage/http/rest-client.md#pagenumberpaginator) | The pagination is based on a page number parameter. With total pages count either in the response body or explicitly provided.<br/>*Parameters:*<ul><li>`base_page` (int) - the starting page number. Defaults to `0`</li><li>`page_param` (str) - the query parameter name for the page number. Defaults to "page"</li><li>`total_path` (str) - a JSONPath expression for the total number of pages. If not provided, pagination is controlled by `maximum_page`</li><li>`maximum_page` (int) - optional maximum page number. Stops pagination once this page is reached</li></ul> |
+| `offset` | [OffsetPaginator](../../general-usage/http/rest-client.md#offsetpaginator) | The pagination is based on an offset parameter. With total items count either in the response body or explicitly provided.<br/>*Parameters:*<ul><li>`limit` (int) - the maximum number of items to retrieve in each request</li><li>`offset` (int) - the initial offset for the first request. Defaults to `0`</li><li>`offset_param` (str) - the name of the query parameter used to specify the offset. Defaults to "offset"</li><li>`limit_param` (str) - the name of the query parameter used to specify the limit. Defaults to "limit"</li><li>`total_path` (str) - a JSONPath expression for the total number of items. If not provided, pagination is controlled by `maximum_offset` and `stop_after_empty_page`</li><li>`maximum_offset` (int) - optional maximum offset value. Limits pagination even without total count</li><li>`stop_after_empty_page` (bool) - Whether pagination should stop when a page contains no result items. Defaults to `True`</li></ul> |
+| `page_number` | [PageNumberPaginator](../../general-usage/http/rest-client.md#pagenumberpaginator) | The pagination is based on a page number parameter. With total pages count either in the response body or explicitly provided.<br/>*Parameters:*<ul><li>`base_page` (int) - the starting page number. Defaults to `0`</li><li>`page_param` (str) - the query parameter name for the page number. Defaults to "page"</li><li>`total_path` (str) - a JSONPath expression for the total number of pages. If not provided, pagination is controlled by `maximum_page` and `stop_after_empty_page`</li><li>`maximum_page` (int) - optional maximum page number. Stops pagination once this page is reached</li><li>`stop_after_empty_page` (bool) - Whether pagination should stop when a page contains no result items. Defaults to `True`</li></ul> |
 | `cursor` | [JSONResponseCursorPaginator](../../general-usage/http/rest-client.md#jsonresponsecursorpaginator) | The pagination is based on a cursor parameter. The value of the cursor is in the response body (JSON).<br/>*Parameters:*<ul><li>`cursor_path` (str) - the JSONPath to the cursor value. Defaults to "cursors.next"</li><li>`cursor_param` (str) - the query parameter name for the cursor. Defaults to "after"</li></ul> |
 | `single_page` | SinglePagePaginator | The response will be interpreted as a single-page response, ignoring possible pagination metadata. |
 | `auto` | `None` | Explicitly specify that the source should automatically detect the pagination method. |