[fix](move-memtable) immediately return error when close wait failed #44344
Conversation
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
run buildall
clang-tidy review says "All clean, LGTM! 👍"
TeamCity be ut coverage result:
LGTM
PR approved by at least one committer and no changes requested.
LGTM
…44344) Problem Summary: #38003 introduced a problem where the last sink node could report success even when close wait timed out, which may cause data loss. Previously we made that change hoping to tolerate minority replica failure in this step. However, it turns out the last sink node could miss tablet reports from downstreams in case of close wait failure. This PR fixes the problem by returning the close_wait error immediately. The most common error in close wait is timeout, and it should not be fault-tolerated on a per-replica basis anyway.
…wait failed #44344 (#44386) Cherry-picked from #44344 Co-authored-by: Kaijie Chen <[email protected]>
…wait failed #44344 (#44387) Cherry-picked from #44344 Co-authored-by: Kaijie Chen <[email protected]>
…44552) `test_writer_v2_fault_injection` did not assert that an Exception was thrown, which caused false positives: the code was bugged but the test still passed. This PR fixes that problem. Run this test without #44344, and it now reports errors as expected:
```
2024-11-25 17:03:25.463 ERROR [non-concurrent-thread-1] (ScriptContext.groovy:122) - Run test_writer_v2_fault_injection in ./doris/regression-test/suites/fault_injection_p0/test_writer_v2_fault_injection.groovy failed org.opentest4j.AssertionFailedError: expected Exception 'load timed out before close waiting', actual success ==> expected: <true> but was: <false>
```
Related PR: #44344 `VTabletWriterV2::_select_streams()` is already checking if there is enough downstream BE to meet the replication requirements. `VTabletWriterV2::close()` should tolerate those non-open streams on close wait. Debug point `VTabletWriterV2._open_streams.skip_two_backends` is added along with `VTabletWriterV2._open_streams.skip_one_backend` to check this behavior.
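The tolerance described above can be pictured with a minimal sketch (hypothetical names and types; the real logic lives in Doris's `VTabletWriterV2`): since `_select_streams()` has already verified that enough downstream BEs satisfy the replication requirement, `close()` may skip streams that never opened rather than failing the whole load on them, while still propagating failures from streams that did open.

```cpp
#include <cassert>
#include <vector>

// Illustrative sketch only; not the actual Doris implementation.
struct Stream {
    bool opened;
    // Stub for waiting on the downstream to finish; here, opened
    // streams are assumed to succeed their close wait.
    bool close_wait() const { return opened; }
};

// Returns false only if an *opened* stream fails its close wait;
// non-open streams are tolerated, mirroring the behavior above.
bool close_all(const std::vector<Stream>& streams) {
    for (const auto& s : streams) {
        if (!s.opened) continue;              // tolerate non-open streams
        if (!s.close_wait()) return false;    // propagate real failures
    }
    return true;
}
```

With one of three backends skipped at open time (roughly what the `skip_one_backend` debug point simulates), `close_all` still succeeds because replication was already validated at stream selection.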
What problem does this PR solve?
Related PR: #38003
Problem Summary:
#38003 introduced a problem where the last sink node could report success even when close wait timed out, which may cause data loss.
Previously we made that change hoping to tolerate minority replica failure in this step.
However, it turns out the last sink node could miss tablet reports from downstreams in case of close wait failure.
This PR fixes the problem by returning the close_wait error immediately.
The most common error in close wait is timeout, and it should not be fault-tolerated on a per-replica basis anyway.
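The shape of the fix can be sketched as follows (hypothetical names; a minimal stand-in for Doris's `Status`, not the actual code): instead of swallowing a close_wait failure and falling through to a success report, the sink now returns the error to the caller immediately.

```cpp
#include <cassert>
#include <string>

// Minimal stand-in for Doris's Status type (illustrative only).
struct Status {
    bool ok;
    std::string msg;
    static Status OK() { return {true, ""}; }
    static Status TimedOut(std::string m) { return {false, std::move(m)}; }
};

// Hypothetical close path of the last sink node. Before the fix, a
// close_wait failure could be tolerated (hoping a replica majority had
// succeeded), so the node could go on to report success even though
// tablet reports from downstreams were missing. After the fix, the
// close_wait error is returned immediately.
Status close_load(bool close_wait_failed) {
    Status st = close_wait_failed
                    ? Status::TimedOut("load timed out before close waiting")
                    : Status::OK();
    if (!st.ok) {
        return st;  // the fix: propagate the error, never report success
    }
    // Only reached when every close wait succeeded.
    return Status::OK();
}
```

A timed-out close wait now surfaces as a load error rather than a silent success, which is what the regression test in #44552 asserts.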
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)