Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SendBatch: Retry retryable errors #235

Merged
merged 2 commits into from
Oct 23, 2023
Merged

Conversation

aaronbee
Copy link
Collaborator

@aaronbee aaronbee commented Oct 6, 2023

To match the behavior of SendRPC, SendBatch should retry RPCs that hit retryable errors: region.RetryableError, region.ServerError, and region.NotServingRegionError.

SendBatch will now retry each RPC that hits a retryable error. What used to be a single step through of assigning regions to RPCs, grouping them by region server and then dispatching the RPCs to their respective servers, is now done in a loop. The first iteration of the loop operates on the entire batch. Later iterations operate on the set of RPCs that failed with retryable errors in the previous batch.

@codecov
Copy link

codecov bot commented Oct 6, 2023

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Comparison is base (f750f62) 70.10% compared to head (2711130) 70.24%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #235      +/-   ##
==========================================
+ Coverage   70.10%   70.24%   +0.14%     
==========================================
  Files          27       27              
  Lines        3733     3771      +38     
==========================================
+ Hits         2617     2649      +32     
- Misses       1001     1003       +2     
- Partials      115      119       +4     
Files Coverage Δ
rpc.go 84.86% <92.85%> (+0.12%) ⬆️

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

TestConcurrentRetryableError leaks an establishRegion goroutine trying
to discover the meta region. It repeats endlessly in a loop because
zookeeper is mocked to always return an error.

This leaked goroutine breaks a test I add in the next change.
func (c *client) waitForCompletion(ctx context.Context, rc hrpc.RegionClient,
rpcs []hrpc.Call, results []hrpc.RPCResult, rpcToRes map[hrpc.Call]int) bool {
rpcs []hrpc.Call, results []hrpc.RPCResult, rpcToRes map[hrpc.Call]int) (
retryables []hrpc.Call, shouldBackoff, unretryableError, ok bool) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be a comment about about the unretryableError in the doc comment?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

rpc.go Outdated
}()

// Exit retry loop if no RPCs are retryable because they all
// succeeded or hit unretryable errors, or the context is done
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Want to clarify - if we have seen an unretryable error, should this be checked here and also a condition to break out of the loop? Does checking len(retries) encompass this, or do we also need to check unretryableSeen? This comment makes me think we do want to break here if there are any unretryable errors seen, but previous comments make it unclear if the intention is to still retry any retryable errors (even if another rpc has seen an unretryable error). Although I don't think we should retry a batch that has any unretryable errors, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided not to break here if unretryableErrorSeen because we can still retry the retryable errors. That way as many RPCs as possible in the batch are able to succeed.
In other words if your batch has 4 RPCs, the first 2 get retryable errors and the last 2 get non-retryable errors, we will retry the first 2 until they succeed and only return errors for the last 2.

Checking for len(retries) == 0 makes sure we only exit the loop when there are no more RPCs with retryable errors.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

To match the behavior of SendRPC, SendBatch should retry RPCs that hit
retryable errors: region.RetryableError, region.ServerError, and
region.NotServingRegionError.

SendBatch will now retry each RPC that hits a retryable error. What
used to be a single step through of assigning regions to RPCs,
grouping them by region server and then dispatching the RPCs to their
respective servers, is now done in a loop. The first iteration of
the loop operates on the entire batch. Later iterations operate on the
set of RPCs that failed with retryable errors in the previous batch.
@aaronbee aaronbee merged commit 75354d5 into tsuna:master Oct 23, 2023
4 checks passed
@aaronbee aaronbee deleted the batchretry branch October 23, 2023 20:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants