[KV] "failed to get rate limit token" when running under load #6645

Open
arielshaqed opened this issue Sep 23, 2023 · 5 comments
Labels: no stale

Comments

@arielshaqed
Contributor

Under load, KV gets "DynamoDB: GetItem, failed to get rate limit token", even though capacity appears to be available on this DynamoDB table.

Message:

{
  "error": "get item: operation error DynamoDB: GetItem, failed to get rate limit token, retry quota exceeded, 4 available, 5 requested",
  "file": "build/pkg/api/controller.go:2118",
  "func": "pkg/api.(*Controller).handleAPIErrorCallback",
  "host": "***.us-east-1.lakefscloud.io",
  "level": "error",
  "method": "GET",
  "msg": "API call returned status internal server error",
  "operation_id": "StatObject",
  "path": "/api/v1/repositories/REPO/refs/REF/objects/stat?path=/PATH/TO/SOME/PARTITIONED/PARQUET&user_metadata=false&presign=false",
  "request_id": "12345-678",
  "service": "api_gateway",
  "service_name": "rest_api",
  "time": "2023-09-23THH:MM:SS",
  "user": "USER"
}

AFAIU this is an internal client-side error coming from the AWS SDK. I don't understand how client-side rate limiting can work without knowing the desired rate, which of course depends on how many RCUs we've provisioned, or even on how many RCUs DynamoDB autoscaling has currently granted us.
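
For context, the failure is raised by the SDK's standard retryer, which charges every retried attempt against a client-side "retry quota" token bucket; "retry quota exceeded, 4 available, 5 requested" means that bucket has drained. Here is a minimal sketch (not lakeFS code) of giving that retryer a larger bucket, assuming the stock aws-sdk-go-v2 config and retry packages; the bucket size below is illustrative only:

package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/aws/ratelimit"
	"github.com/aws/aws-sdk-go-v2/aws/retry"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO(),
		config.WithRetryer(func() aws.Retryer {
			return retry.NewStandard(func(o *retry.StandardOptions) {
				// The default bucket holds retry.DefaultRetryRateTokens (500)
				// tokens; retried attempts spend tokens and only successful
				// responses pay them back, so the bucket drains under
				// sustained throttling. Illustrative: a much larger bucket.
				o.RateLimiter = ratelimit.NewTokenRateLimit(100000)
			})
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	_ = dynamodb.NewFromConfig(cfg) // this client now uses the relaxed retryer
}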

@arielshaqed
Contributor Author

arielshaqed commented Sep 23, 2023

We do see limited throttling, but it happens before this message appears. I believe that a client-side rate limit is actively harmful here. But if we do keep it, we obviously need a much more generous back-off.
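
Something like this is what I mean by a more generous back-off (a sketch only, against the standard retryer's options; the attempt count and back-off cap are illustrative, not tuned values):

package example

import (
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/aws/retry"
)

// newPatientRetryer is a hypothetical helper: it relaxes the SDK defaults
// (retry.DefaultMaxAttempts = 3, retry.DefaultMaxBackoff = 20s) so throttled
// requests wait and retry instead of failing the user's operation.
func newPatientRetryer() aws.Retryer {
	return retry.NewStandard(func(o *retry.StandardOptions) {
		o.MaxAttempts = 10
		o.MaxBackoff = 60 * time.Second
	})
}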

@N-o-Z
Member

N-o-Z commented Sep 28, 2023

@arielshaqed How is this different from #6621 (aws/aws-sdk-go-v2#1665)?
Is there a different call to action here?


This issue is now marked as stale after 90 days of inactivity, and will be closed soon. To keep it, mark it with the "no stale" label.

github-actions bot added the stale label Dec 28, 2023
@arielshaqed
Contributor Author

This issue is now marked as stale after 90 days of inactivity, and will be closed soon. To keep it, mark it with the "no stale" label.

SRSLY?

arielshaqed added the no stale label Jan 3, 2024
@arielshaqed
Contributor Author

@arielshaqed How is this different from #6621 (aws/aws-sdk-go-v2#1665)? Is there a different call to action here?

That added some retries, which fixed a P0: it was happening all the time under load. The underlying issue remains: lakeFS should manage client-side rate limiting in a better way. It is incorrect to fail a user operation because of client-side rate limiting.
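
To make "a better way" concrete: newer aws-sdk-go-v2 versions let us switch the client-side limiter off entirely, so a user operation can only be slowed by DynamoDB's own throttling, which the SDK still retries with back-off. A sketch, assuming ratelimit.None exists in the SDK version we pin; the helper name is hypothetical:

package example

import (
	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/aws/ratelimit"
	"github.com/aws/aws-sdk-go-v2/aws/retry"
)

// newNoQuotaRetryer disables the client-side retry token bucket, so requests
// are never failed locally with "failed to get rate limit token"; server-side
// throttling responses are still retried with the usual back-off.
func newNoQuotaRetryer() aws.Retryer {
	return retry.NewStandard(func(o *retry.StandardOptions) {
		o.RateLimiter = ratelimit.None
	})
}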

github-actions bot removed the stale label Mar 20, 2024