gohbase doesn't handle active hbasemaster going away/changing #265
dethi added a commit that referenced this issue on Jul 12, 2024
When receiving ServerNotRunningYetException, a client shouldn't retry sending the request to the same server. Instead, the client should be closed and the region lookup should happen again.

There are two cases in which ServerNotRunningYetException is returned to the client:
- When the RegionServer is not online yet. In that case, retrying the RPC on the same server may succeed if the RegionServer starts and the region is still assigned to it, but most likely the region will have been reassigned to another RegionServer, so the retry would return NotServingRegionException. If the RegionServer is stuck in its startup phase, the client could also get stuck in a retry loop even though the HBase master may already have detected the issue and correctly moved the region to another RegionServer.
- When the HBase master is currently not the active one. In that case, retrying the RPC on the same server is almost guaranteed never to succeed, and the client will be stuck retrying until the context is canceled.

Also add MasterStoppedException, which was missing from the list.

Fix #265
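The rule described in the commit message above can be sketched in Go as a small classification step. This is a minimal illustration, not gohbase's actual code: the `action` type, the `serverErrors` set, and the `classify` and `simpleName` helpers are all hypothetical names, and only the exception names themselves come from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// action is what a hypothetical client does after a failed RPC.
type action int

const (
	otherHandling    action = iota // not covered by this sketch
	closeAndRelookup               // close the client and look the region up again
)

// serverErrors holds the simple class names of exceptions that mean
// "this server cannot serve the request right now". The two names come
// from the commit message; the set itself is an illustrative assumption.
var serverErrors = map[string]bool{
	"ServerNotRunningYetException": true,
	"MasterStoppedException":       true,
}

// simpleName strips the Java package prefix from a class name.
func simpleName(javaClass string) string {
	if i := strings.LastIndex(javaClass, "."); i >= 0 {
		return javaClass[i+1:]
	}
	return javaClass
}

// classify decides what to do for the exception class returned by HBase.
func classify(javaClass string) action {
	if serverErrors[simpleName(javaClass)] {
		return closeAndRelookup
	}
	return otherHandling
}

func main() {
	cls := "org.apache.hadoop.hbase.ipc.ServerNotRunningYetException"
	if classify(cls) == closeAndRelookup {
		fmt.Println("close the client and redo the region lookup")
	}
}
```

Matching on the simple class name keeps the sketch independent of the exact Java package of each exception; a real client would more likely match the fully qualified names.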
dethi added a commit that referenced this issue on Jul 13, 2024
When receiving ServerNotRunningYetException, a client shouldn't retry sending the request to the same server. Instead, the client should be closed and the region lookup should happen again.

There are two cases in which ServerNotRunningYetException is returned:
- When the RegionServer is listening but not online yet. In that case, retrying the RPC on the same server may succeed if the RegionServer becomes ready and the region is indeed assigned to it. But most likely the region will have been reassigned to another RegionServer, so the following request will return NotServingRegionException. If the RegionServer is stuck in its startup phase, the client could also get stuck in a retry loop even though the HBase master may already have detected the issue and correctly moved the region to another RegionServer.
- When the HBase master is currently not active. In that case, retrying the RPC on the same server is guaranteed to fail until a failover, and the client will be stuck retrying forever.

Also add MasterStoppedException, which was missing from the list.

Fix #265
dethi added a commit that referenced this issue on Jul 13, 2024
When receiving ServerNotRunningYetException, a client shouldn't retry sending the request to the same server. Instead, the client should be closed and the region lookup should happen again.

There are two cases in which ServerNotRunningYetException is returned:
- When the RegionServer is listening but not online yet. In that case, retrying the RPC on the same server may succeed if the RegionServer becomes ready and the region is indeed assigned to it. But most likely the region will have been reassigned to another RegionServer, so the following request will return NotServingRegionException. If the RegionServer is stuck in its startup phase, the client could also get stuck in a retry loop even though the HBase master may already have detected the issue and correctly moved the region to another RegionServer.
- When the HBase master is currently not active. In that case, retrying the RPC on the same server is guaranteed to fail until a failover, and the client will be stuck retrying forever.

If we receive multiple ServerErrors for the same RPC, we back off before retrying, to avoid overwhelming HBase. One scenario where this could happen is a cluster recovering from a catastrophic failure, with all HBase masters still trying to start (for example, while recovering WALs).

Also add MasterStoppedException and PleaseHoldException to the list of known exceptions that can be returned by HBase.

Fix #265
This sometimes causes the hbase-k8s-operator to get stuck in a gohbase retry loop forever when the active master is restarted.
timoha/hbase-k8s-operator#19