region/client: re-establish connection on ServerNotRunningYetException #266

dethi · 2024-07-13T00:36:47Z

When receiving ServerNotRunningYetException, a client shouldn't retry sending the request to the same server. Instead, the client should be closed and the region lookup should happen again.

There are two cases in which ServerNotRunningYetException is returned:

- When the RegionServer is listening but not yet online: retrying the RPC on the same server may succeed if the RegionServer becomes ready and the region is indeed assigned to it. Most likely, though, the region will have been reassigned to another RegionServer, so the next request will return NotServingRegionException. If the RegionServer is stuck in its startup phase, the client could also get stuck in a retry loop even though the HBase master may already have detected the issue and moved the region to another RegionServer.
- When the HBase master is currently not active: retrying the RPC on the same server is guaranteed to fail until a failover, so the client will be stuck retrying forever.
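
To illustrate the decision described above, here is a minimal sketch. It is not the code in this PR: `action`, `reestablish`, `otherHandling` and `classify` are hypothetical names, and matching on the simple class name at the end of the Java exception string is an assumption made for the example.

```go
package main

import "strings"

// action is what a hypothetical client does after an RPC comes back with
// an exception from HBase. Illustrative only, not gohbase's actual types.
type action int

const (
	// otherHandling stands for whatever the client already does for
	// exceptions that this sketch does not cover.
	otherHandling action = iota
	// reestablish means: close the client for that server, then redo the
	// region lookup so the RPC can be sent to whichever server owns the
	// region now.
	reestablish
)

// classify maps the Java exception class name carried in an error response
// to a follow-up action. Only ServerNotRunningYetException is treated as a
// reason to drop the connection, because retrying on the same server is
// either unlikely to help (RegionServer still starting) or cannot help
// (master not active).
func classify(exceptionClass string) action {
	if strings.HasSuffix(exceptionClass, ".ServerNotRunningYetException") {
		return reestablish
	}
	return otherHandling
}
```

With this shape, the retry loop never resends to a server that just answered ServerNotRunningYetException; the next attempt always goes through a fresh lookup.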

If we receive multiple ServerError responses for the same RPC, we back off before retrying, to avoid overwhelming HBase. This could happen when a cluster is recovering from a catastrophic failure, with all HBase masters still trying to start (for example, while recovering WALs).

Also add MasterStoppedException and PleaseHoldException to the list of known exceptions that can be returned by HBase.

Fix #265


dethi · 2024-07-13T00:42:10Z

Ready for review but not for merging: I want to try to add a test for this first (which is a bit painful because it needs Hadoop HDFS running in order to have multiple HBase masters).

tsuna approved these changes Jul 15, 2024
