puller: fix retry logic when check store version failed #11903

Merged
merged 7 commits into from
Dec 24, 2024

lidezhu
Collaborator

@lidezhu lidezhu commented Dec 17, 2024

What problem does this PR solve?

Issue Number: close #11766

What is changed and how it works?

Change the retry logic so that the region is reloaded when client.GetStore fails.
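
A minimal Go sketch of that idea, under stated assumptions: `storeGetter`, `regionReloader`, `resolveStore`, and the in-memory fakes are hypothetical names introduced here for illustration. They are not the PD client, the TiKV client, or the code changed in this PR. The point is only that a failed store lookup triggers a region reload before the next attempt, rather than retrying with the same stale store ID.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical errors used by the in-memory fakes below.
var (
	errStoreNotFound  = errors.New("store not found")
	errRegionNotFound = errors.New("region not found")
)

// storeGetter and regionReloader are illustrative stand-ins, not real client interfaces.
type storeGetter interface {
	GetStore(storeID uint64) (string, error)
}

type regionReloader interface {
	// ReloadRegion refreshes the region and returns its current store ID.
	ReloadRegion(regionID uint64) (uint64, error)
}

// resolveStore looks up the store address for a region. When the lookup
// fails, it reloads the region to obtain a fresh store ID and tries again.
func resolveStore(sg storeGetter, rr regionReloader, regionID, storeID uint64, maxAttempts int) (string, error) {
	lastErr := errors.New("no attempts made")
	for attempt := 0; attempt < maxAttempts; attempt++ {
		addr, err := sg.GetStore(storeID)
		if err == nil {
			return addr, nil
		}
		lastErr = err
		// The cached store ID may be stale: reload the region before retrying.
		newID, reloadErr := rr.ReloadRegion(regionID)
		if reloadErr != nil {
			return "", fmt.Errorf("reload region %d: %w", regionID, reloadErr)
		}
		storeID = newID
	}
	return "", fmt.Errorf("resolve store for region %d: %w", regionID, lastErr)
}

// fakeStores maps a store ID to its address.
type fakeStores map[uint64]string

func (f fakeStores) GetStore(id uint64) (string, error) {
	addr, ok := f[id]
	if !ok {
		return "", errStoreNotFound
	}
	return addr, nil
}

// fakeRegions maps a region ID to the store ID that currently serves it.
type fakeRegions map[uint64]uint64

func (f fakeRegions) ReloadRegion(regionID uint64) (uint64, error) {
	id, ok := f[regionID]
	if !ok {
		return 0, errRegionNotFound
	}
	return id, nil
}
```

For example, calling `resolveStore(fakeStores{2: "tikv-2:20160"}, fakeRegions{10: 2}, 10, 1, 3)` starts with the stale store ID 1, fails the first lookup, reloads region 10, and then resolves store 2.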

Check List

Tests

  • Unit test

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Fix the issue that a changefeed might get stuck after scaling out with new TiKV nodes.

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. release-note-none Denotes a PR that doesn't merit a release note. and removed release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Dec 17, 2024

codecov bot commented Dec 17, 2024

Codecov Report

Attention: Patch coverage is 67.56757% with 12 lines in your changes missing coverage. Please review.

Project coverage is 55.2238%. Comparing base (0bb4977) to head (2441bec).
Report is 9 commits behind head on master.

✅ All tests successful. No failed tests found.

Additional details and impacted files
Components   Coverage Δ
cdc          59.7324% <67.5675%> (+0.1366%) ⬆️
dm           50.0278% <ø> (-0.0366%) ⬇️
engine       53.2223% <ø> (+0.0112%) ⬆️

Flag         Coverage Δ
unit         55.2238% <67.5675%> (+0.0551%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

@@               Coverage Diff                @@
##             master     #11903        +/-   ##
================================================
+ Coverage   55.1686%   55.2238%   +0.0551%     
================================================
  Files          1003       1003                
  Lines        137493     137524        +31     
================================================
+ Hits          75853      75946        +93     
+ Misses        56092      56019        -73     
- Partials       5548       5559        +11     

@ti-chi-bot ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 17, 2024
@lidezhu
Collaborator Author

lidezhu commented Dec 18, 2024

/retest

@lidezhu lidezhu changed the title fix retry logic when check store version failed puller: fix retry logic when check store version failed Dec 18, 2024
@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. and removed release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/needs-linked-issue labels Dec 18, 2024
@lidezhu lidezhu added needs-cherry-pick-release-6.5 Should cherry pick this PR to release-6.5 branch. needs-cherry-pick-release-7.1 Should cherry pick this PR to release-7.1 branch. needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. needs-cherry-pick-release-8.1 Should cherry pick this PR to release-8.1 branch. labels Dec 18, 2024
@lidezhu lidezhu requested review from hicqu and asddongmen December 18, 2024 03:04
@asddongmen
Contributor

Below is my understanding of how this PR works. Please correct me if I am wrong.

Initial Problem:

  1. When the region fails to connect to the current store, the system attempts to switch to the next store address.

  2. After switching, the system enters the newStream function and calls stream.run.

  3. Since stream.run fails, the system retrieves the next store address from the region cache.

  4. However, because the information in the region cache is incorrect, the retrieved store is always unreachable, so the system falls into an infinite loop, repeating steps 1 to 3.

Fix:

  1. When the region fails to connect to the current store, log the error and attempt to switch to the next store address.

  2. After switching to the next store address, proceed to the newStream function.

  3. Within the newStream function, call stream.run. If stream.run fails, handle the error based on its type.

  4. If the error type indicates that the information in the region cache might be incorrect, reset the region cache to ensure that the next retrieved store address is accurate.
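
The fix described above can be sketched in Go as follows. This is a simplified illustration under assumptions, not the code in cdc/kv/shared_stream.go: `regionCache`, `shouldResetCache`, `connectWithRetry`, and the error values are hypothetical names, and which error types count as "cache might be incorrect" is chosen here only for the example. Errors outside that set are simply returned to the caller in this sketch.

```go
package main

import "errors"

// Hypothetical error classes for illustration only.
var (
	errStoreUnreachable  = errors.New("store unreachable")
	errStoreVersionCheck = errors.New("check store version failed")
	errStreamCanceled    = errors.New("stream canceled")
	errRegionNotFound    = errors.New("region not found")
	errNoAttempts        = errors.New("no attempts made")
)

// regionCache is a toy cache from region ID to store address.
// source stands in for the authoritative placement information.
type regionCache struct {
	cached  map[uint64]string
	source  map[uint64]string
	reloads int
}

// storeAddr returns the cached address, reloading from source on a miss.
func (c *regionCache) storeAddr(regionID uint64) (string, bool) {
	if addr, ok := c.cached[regionID]; ok {
		return addr, true
	}
	addr, ok := c.source[regionID]
	if ok {
		c.cached[regionID] = addr
		c.reloads++
	}
	return addr, ok
}

// invalidate drops the cached entry so the next lookup reloads it.
func (c *regionCache) invalidate(regionID uint64) {
	delete(c.cached, regionID)
}

// shouldResetCache reports whether err suggests the cached region info may be wrong.
func shouldResetCache(err error) bool {
	return errors.Is(err, errStoreUnreachable) || errors.Is(err, errStoreVersionCheck)
}

// connectWithRetry runs the stream against the store for regionID. On an error
// that hints at stale cache data, it resets the cache entry before retrying.
func connectWithRetry(c *regionCache, regionID uint64, run func(addr string) error, maxAttempts int) (string, error) {
	lastErr := errNoAttempts
	for i := 0; i < maxAttempts; i++ {
		addr, ok := c.storeAddr(regionID)
		if !ok {
			return "", errRegionNotFound
		}
		err := run(addr)
		if err == nil {
			return addr, nil
		}
		lastErr = err
		if !shouldResetCache(err) {
			return "", err // not a cache problem: leave the cache untouched
		}
		c.invalidate(regionID)
	}
	return "", lastErr
}
```

Without the `invalidate` call, every attempt in this sketch would read the same stale address from `cached`, which mirrors the infinite loop described under "Initial Problem".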

@lidezhu lidezhu requested a review from hicqu December 24, 2024 00:48
@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Dec 24, 2024
@ti-chi-bot ti-chi-bot bot added the approved label Dec 24, 2024
@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Dec 24, 2024
Contributor

ti-chi-bot bot commented Dec 24, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-12-24 03:08:32.268387808 +0000 UTC m=+1531102.357190352: ☑️ agreed by asddongmen.
  • 2024-12-24 03:17:16.622940252 +0000 UTC m=+1531626.711742792: ☑️ agreed by 3AceShowHand.

Contributor

ti-chi-bot bot commented Dec 24, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 3AceShowHand, asddongmen, hicqu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [3AceShowHand,asddongmen,hicqu]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@lidezhu
Collaborator Author

lidezhu commented Dec 24, 2024

/retest

@lidezhu
Collaborator Author

lidezhu commented Dec 24, 2024

/test dm-integration-test

@ti-chi-bot ti-chi-bot bot merged commit 4624acb into master Dec 24, 2024
28 checks passed
@ti-chi-bot ti-chi-bot bot deleted the fix-get-store-fail branch December 24, 2024 05:38
ti-chi-bot pushed a commit to ti-chi-bot/tiflow that referenced this pull request Dec 24, 2024
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-6.5: #11928.

@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-7.1: #11929.

@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-7.5: #11930.

@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-8.1: #11931.

@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-8.5: #11932.

Labels
approved lgtm needs-cherry-pick-release-6.5 Should cherry pick this PR to release-6.5 branch. needs-cherry-pick-release-7.1 Should cherry pick this PR to release-7.1 branch. needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. needs-cherry-pick-release-8.1 Should cherry pick this PR to release-8.1 branch. needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

cdc: fix usage of tikv go-client