[fix](move-memtable) immediately return error when close wait failed #44344
Conversation
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
run buildall
clang-tidy review says "All clean, LGTM! 👍"
TeamCity be ut coverage result:
LGTM
PR approved by at least one committer and no changes requested.
LGTM
…44344) Problem Summary: #38003 introduced a problem where the last sink node could report success even when close wait timed out, which may cause data loss. Previously we made that change hoping to tolerate minority replica failure in this step. However, it turns out the last sink node could miss tablet reports from downstreams in case of close wait failure. This PR fixes the problem by returning the close_wait error immediately. The most common error in close wait is timeout, and it should not be fault-tolerated on a per-replica basis anyway.
…wait failed #44344 (#44386) Cherry-picked from #44344 Co-authored-by: Kaijie Chen <[email protected]>
…wait failed #44344 (#44387) Cherry-picked from #44344 Co-authored-by: Kaijie Chen <[email protected]>
…44552) `test_writer_v2_fault_injection` did not assert that an Exception was thrown, which caused false positives: the code was bugged but the test still passed. This PR fixes that problem. Run this test without #44344, and it now reports errors as expected:
```
2024-11-25 17:03:25.463 ERROR [non-concurrent-thread-1] (ScriptContext.groovy:122) - Run test_writer_v2_fault_injection in ./doris/regression-test/suites/fault_injection_p0/test_writer_v2_fault_injection.groovy failed org.opentest4j.AssertionFailedError: expected Exception 'load timed out before close waiting', actual success ==> expected: <true> but was: <false>
```
Related PR: #44344 `VTabletWriterV2::_select_streams()` is already checking if there is enough downstream BE to meet the replication requirements. `VTabletWriterV2::close()` should tolerate those non-open streams on close wait. Debug point `VTabletWriterV2._open_streams.skip_two_backends` is added along with `VTabletWriterV2._open_streams.skip_one_backend` to check this behavior.
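The tolerance described above can be pictured with a minimal sketch (hypothetical names and types; the real logic lives in Doris's `VTabletWriterV2`): since `_select_streams()` has already verified that enough downstream BEs satisfy the replication requirement, `close()` may skip streams that never opened rather than failing the whole load on them, while still propagating failures from streams that did open.

```cpp
#include <cassert>
#include <vector>

// Illustrative sketch only; not the actual Doris implementation.
struct Stream {
    bool opened;
    // Stub for waiting on the downstream to finish; here, opened
    // streams are assumed to succeed their close wait.
    bool close_wait() const { return opened; }
};

// Returns false only if an *opened* stream fails its close wait;
// non-open streams are tolerated, mirroring the behavior above.
bool close_all(const std::vector<Stream>& streams) {
    for (const auto& s : streams) {
        if (!s.opened) continue;              // tolerate non-open streams
        if (!s.close_wait()) return false;    // propagate real failures
    }
    return true;
}
```

With one of three backends skipped at open time (roughly what the `skip_one_backend` debug point simulates), `close_all` still succeeds because replication was already validated at stream selection.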
What problem does this PR solve?
Related PR: #38003
Problem Summary:
#38003 introduced a problem where the last sink node could report success even when close wait timed out, which may cause data loss.
Previously we made that change hoping to tolerate minority replica failure in this step.
However, it turns out the last sink node could miss tablet reports from downstreams in case of close wait failure.
This PR fixes the problem by returning the close_wait error immediately.
The most common error in close wait is timeout, and it should not be fault-tolerated on a per-replica basis anyway.
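The shape of the fix can be sketched as follows (hypothetical names; a minimal stand-in for Doris's `Status`, not the actual code): instead of swallowing a close_wait failure and falling through to a success report, the sink now returns the error to the caller immediately.

```cpp
#include <cassert>
#include <string>

// Minimal stand-in for Doris's Status type (illustrative only).
struct Status {
    bool ok;
    std::string msg;
    static Status OK() { return {true, ""}; }
    static Status TimedOut(std::string m) { return {false, std::move(m)}; }
};

// Hypothetical close path of the last sink node. Before the fix, a
// close_wait failure could be tolerated (hoping a replica majority had
// succeeded), so the node could go on to report success even though
// tablet reports from downstreams were missing. After the fix, the
// close_wait error is returned immediately.
Status close_load(bool close_wait_failed) {
    Status st = close_wait_failed
                    ? Status::TimedOut("load timed out before close waiting")
                    : Status::OK();
    if (!st.ok) {
        return st;  // the fix: propagate the error, never report success
    }
    // Only reached when every close wait succeeded.
    return Status::OK();
}
```

A timed-out close wait now surfaces as a load error rather than a silent success, which is what the regression test in #44552 asserts.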
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)