QPS keeps falling to zero until the fault recovers when a 500ms IO delay is injected on the PD leader for 10 minutes; we need some tuning parameters for this issue #8852

Open
Lily2025 opened this issue Nov 26, 2024 · 3 comments
Labels
type/enhancement The issue or PR belongs to an enhancement.

Comments

@Lily2025

Bug Report

What did you do?

1. Run TPC-C.
2. Inject a 500ms IO delay on the PD leader for 10 minutes.

What did you expect to see?

QPS recovers within 5 minutes.

What did you see instead?

QPS keeps falling to zero for the full 10 minutes, until the injected fault is removed.

What version of PD are you using (pd-server -V)?

./pd-server -V
Release Version: v8.5.0-alpha-32-g90cc61b4
Edition: Community
Git Commit Hash: 90cc61b
Git Branch: HEAD
UTC Build Time: 2024-11-21 08:41:25
2024-11-22T03:48:28.199+0800

@Lily2025 Lily2025 added the type/bug The issue is confirmed as a bug. label Nov 26, 2024
@Lily2025
Author

/type enhancement
/remove-type bug

@ti-chi-bot ti-chi-bot bot added type/enhancement The issue or PR belongs to an enhancement. and removed type/bug The issue is confirmed as a bug. labels Nov 26, 2024
@Lily2025
Author

/assign JmPotato

@Lily2025 Lily2025 changed the title qps keep falling to zero until the fault recovered when injection pd leader io delay 500ms last for 10mins qps keep falling to zero until the fault recovered when injection pd leader io delay 500ms last for 10mins,we need some tuning parameters for this issue Nov 26, 2024
@JmPotato
Member

After investigating the logs and metrics: in this case, QPS kept dropping to zero because pd-1, the node injected with IO latency, never lost its etcd leader status. As a result, the PD leader was repeatedly elected on this faulty node. In addition, the 3 re-elections happened to take longer than 5 minutes in total, so they did not trigger the condition of "being evicted as etcd leader if elected 3 consecutive times within 5 minutes". Ultimately, the unavailability persisted until the IO injection ended.

Perhaps we should provide a configurable option for the election circuit breaker threshold, e.g., 3 times within 10 minutes rather than just 5 minutes.
