
Simple source rate limiting #2149

Open · sh-rp wants to merge 5 commits into devel from feat/simple-rate-limiting

Conversation

sh-rp (Collaborator) commented Dec 15, 2024

Description

This PR adds a simple rate-limiting mechanism controlled by the PipeIterator, which should be very useful for certain API-based use cases.
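As an illustration of the mechanism (a minimal sketch, not the PR's actual code), a per-pipe rate limit is typically enforced by spacing out item yields, assuming `rate_limit` is expressed in items per second:

```python
import time
from typing import Iterator, TypeVar

T = TypeVar("T")

def throttle(items: Iterator[T], rate_limit: float) -> Iterator[T]:
    """Yield items no faster than `rate_limit` items per second."""
    min_interval = 1.0 / rate_limit
    last = 0.0
    for item in items:
        now = time.monotonic()
        wait = min_interval - (now - last)
        if wait > 0:
            # sleep just long enough to keep the average rate under the limit
            time.sleep(wait)
        last = time.monotonic()
        yield item
```

With `rate_limit=None` (as in the configuration below), such a wrapper would simply be skipped and items would flow unthrottled.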

netlify bot commented Dec 15, 2024

Deploy Preview for dlt-hub-docs canceled.

Latest commit: 9b9c9ec
Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/675f4962dd7ecd000859ce8c

@sh-rp sh-rp marked this pull request as ready for review December 15, 2024 19:14
@sh-rp sh-rp force-pushed the feat/simple-rate-limiting branch from bac447c to 9b9c9ec Compare December 15, 2024 21:25
sh-rp (Collaborator, Author) commented Dec 16, 2024

The timing on macOS again behaves quite differently from the other platforms. Maybe we should add something to exec_info so we can detect macOS, and allow more generous timeframes when tests run there.

@sh-rp sh-rp requested a review from rudolfix December 16, 2024 09:35
@sh-rp sh-rp self-assigned this Dec 16, 2024
@@ -52,6 +54,8 @@ class PipeIteratorConfiguration(BaseConfiguration):
futures_poll_interval: float = 0.01
copy_on_fork: bool = False
next_item_mode: str = "round_robin"
rate_limit: Optional[float] = None
joscha (Contributor) commented:

Could this be a callback? Or, even better, a generic RateLimiter instance that can be shared between different pipes and dynamically updated?

I am thinking about #1485 (comment)

which describes an API that defines a global rate limit across all api endpoints.

If I have different resources (real-world example: https://github.com/dlt-hub/verified-sources/pull/587/files#diff-0fc4db143e89ab087ef737341bc4d93a631141a0bb633c31831ca22bb072c146R100-R105) that live in different pipes but share one limit, I can't easily express that with this API, because one pipe has no way of knowing how many requests the other pipes are making (you can see from the code above that the set of pipes may even depend on user configuration).
It also wouldn't be possible to make the rate limit dynamic. E.g. say I start a pipeline run that takes 2h, but during that time an external system consumes part of the quota for a while; I'd then like to adjust the rate limiting here. I.e.:

t0: I start pipe 1, I have 900 requests/s
t1: another system starts working, uses 100 requests/s, I want to reduce pipe 1 to 800r/s
t2: the other system stops its work, I want to raise pipe 1 back up to 900 requests/s again

or

t0: I start pipe1 and pipe2 in parallel, I have 900 req/s in total, so they share 450 req/s each
t1: pipe2 is much faster and finishes, I want to increase the req/s for pipe1 to 900
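The shared, dynamically adjustable limiter these scenarios call for could be sketched as a token bucket guarded by a lock. This is illustrative only; `SharedRateLimiter`, `set_rate`, and `acquire` are hypothetical names, not part of the dlt API:

```python
import threading
import time

class SharedRateLimiter:
    """Token-bucket limiter shareable between pipes; the rate can change at runtime."""

    def __init__(self, rate: float) -> None:
        self._rate = rate            # tokens (requests) added per second
        self._tokens = rate          # start with a full bucket
        self._last = time.monotonic()
        self._lock = threading.Lock()

    def _refill(self) -> None:
        # add tokens for the elapsed time, capped at one second's worth of burst
        now = time.monotonic()
        self._tokens = min(self._rate, self._tokens + (now - self._last) * self._rate)
        self._last = now

    def set_rate(self, rate: float) -> None:
        """Adjust the limit while pipes are running (e.g. 900 -> 800 -> 900 req/s)."""
        with self._lock:
            self._refill()
            self._rate = rate

    def acquire(self) -> None:
        """Block until one request's worth of budget is available."""
        while True:
            with self._lock:
                self._refill()
                if self._tokens >= 1.0:
                    self._tokens -= 1.0
                    return
                wait = (1.0 - self._tokens) / self._rate
            time.sleep(wait)
```

In the scenarios above, every pipe would call `limiter.acquire()` before each request, and an external controller could call `limiter.set_rate(800)` at t1 and `limiter.set_rate(900)` at t2.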

sh-rp (Collaborator, Author) commented:

@joscha could you maybe collect your requirements for rate limiting in a ticket? I'm not sure this change will make it into the code soon, but it is meant as a fairly simple rate-limiting mechanism addressing requirements we hear a lot from the community. The idea of using exponential backoff and respecting well-defined rate-limiting headers is also under consideration and would go into the REST API layer.

joscha (Contributor) commented:

I can most certainly put it in a ticket. I am also not suggesting that having rate limiting here is bad; I am just not sure the current implementation is flexible enough for more complex systems. Having one way to define rate limits and exponential backoff would be amazing: it would mean there is only one way to define it, and the next time we add the feature to another layer (like the RESTClient, for example) we can reuse what's already known.

Thinking in terms of how sources and/or pipes are structured and limiting at that level also does not always translate easily to the underlying system the sources/resources pull data from. For resources backed by an external REST API, for example, we may not be interested in limiting the work inside the resource, but only how many requests leave the system (possibly even per endpoint, since APIs often define different limits for different endpoints while a single dlt resource might use more than one).
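That per-endpoint idea could be sketched like this; the class name, endpoint paths, and rates are all hypothetical, and each endpoint here simply gets a minimum interval between outgoing requests:

```python
import time
from collections import defaultdict
from typing import Dict

class PerEndpointLimiter:
    """Throttle outgoing requests per endpoint, since APIs often set per-endpoint limits."""

    def __init__(self, limits: Dict[str, float]) -> None:
        # limits maps endpoint path -> max requests per second
        self._intervals = {ep: 1.0 / rps for ep, rps in limits.items()}
        self._last: Dict[str, float] = defaultdict(float)

    def wait(self, endpoint: str) -> None:
        """Block until this endpoint may be called again; unknown endpoints are unlimited."""
        interval = self._intervals.get(endpoint, 0.0)
        elapsed = time.monotonic() - self._last[endpoint]
        if elapsed < interval:
            time.sleep(interval - elapsed)
        self._last[endpoint] = time.monotonic()
```

A resource that hits several endpoints would call `limiter.wait("/users")` or `limiter.wait("/events")` before each request, so the limit follows what leaves the system rather than how the pipes are structured.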
