gohbase doesn't handle active hbasemaster going away/changing #265

Open
dethi opened this issue Jul 12, 2024 · 0 comments · May be fixed by #266
dethi commented Jul 12, 2024

This sometimes causes the hbase-k8s-operator to get stuck forever in gohbase's retry loop when the active master is restarted:

timoha/hbase-k8s-operator#19

```
2024-01-29 11:08:27.276    goroutine 406 [sync.Cond.Wait, 46 minutes]:
...
2024-01-29 11:08:27.194    github.com/timoha/hbase-k8s-operator/controllers.(*HBaseReconciler).pickRegionServerToDelete(0xc0007f1880, {0x1a91cb8?, 0xc001616b70?}, {0xc001d42c00, 0x4?, 0x4?}, {0x0, 0x0, 0xc0004ff590?})
2024-01-29 11:08:27.194        /go/pkg/mod/github.com/tsuna/gohbase@.../admin_client.go:277 +0x27 fp=0xc0000eccd8 sp=0xc0000eccb0 pc=0x14acd07
2024-01-29 11:08:27.194    github.com/tsuna/gohbase.(*client).SetBalancer(0x11?, 0x1?)
2024-01-29 11:08:27.194        /go/pkg/mod/github.com/tsuna/gohbase@.../rpc.go:100 +0x31e fp=0xc0000eccb0 sp=0xc0000ecb38 pc=0x14b04de
2024-01-29 11:08:27.194    github.com/tsuna/gohbase.(*client).SendRPC(0xc000c2da00, {0x1a982c0, 0xc000a65380})
2024-01-29 11:08:27.194        /go/pkg/mod/github.com/tsuna/gohbase@.../rpc.go:602 +0x96 fp=0xc0000ecb38 sp=0xc0000ecac8 pc=0x14b5096
2024-01-29 11:08:27.194    github.com/tsuna/gohbase.sleepAndIncreaseBackoff({0x1a91cb8, 0xc000a3de90}, 0x7ba65ba00)
2024-01-29 11:08:27.194        /usr/local/go/src/runtime/select.go:328 +0x7bc fp=0xc0000ecac8 sp=0xc0000ec988 pc=0x44afdc
2024-01-29 11:08:27.194    runtime.selectgo(0xc0000ecb08, 0xc0000ecaf8, 0xc000a65380?, 0x0, 0x1a982c0?, 0x1)
2024-01-29 11:08:27.194        /usr/local/go/src/runtime/proc.go:363 +0xd6 fp=0xc0000ec988 sp=0xc0000ec968 pc=0x43bb76
2024-01-29 11:08:27.194    runtime.gopark(0xc0000ecb08?, 0x2?, 0x0?, 0x30?, 0xc0000ecafc?)
```
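
Until a fix lands, callers can at least bound gohbase's internal retry loop with a context deadline so they are not parked forever. A minimal sketch using the public Get API, which goes through the same SendRPC path as the SetBalancer call above; the quorum address, table, and row key are placeholders:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/tsuna/gohbase"
	"github.com/tsuna/gohbase/hrpc"
)

func main() {
	client := gohbase.NewClient("zookeeper-quorum:2181") // placeholder quorum
	defer client.Close()

	// Without a deadline, an RPC that keeps getting
	// ServerNotRunningYetException from a no-longer-active master retries
	// forever, as in the goroutine dump above. A deadline bounds the loop.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	get, err := hrpc.NewGetStr(ctx, "my-table", "row-key") // placeholder table/key
	if err != nil {
		log.Fatal(err)
	}
	if _, err := client.Get(get); err != nil {
		log.Printf("RPC gave up: %v", err) // e.g. context deadline exceeded
	}
}
```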
dethi added the bug label Jul 12, 2024
dethi self-assigned this Jul 12, 2024
dethi added a commit that referenced this issue Jul 12, 2024
When receiving ServerNotRunningYetException, a client shouldn't retry
sending the request to the same server. Instead, the client should be
closed and the region lookup should happen again.

There are two cases where ServerNotRunningYetException is returned to the client:
- when the RegionServer is not online yet: in that case, retrying the RPC
  on the same server may succeed if the RegionServer starts and the
  region is still assigned to it, but most likely the region will have
  been reassigned to another RegionServer, which will then return
  NotServingRegionException. If the RegionServer is stuck in its startup
  phase, the client could also be stuck in a retry loop even though the
  HBase master may have already detected the issue and moved the region
  to another RegionServer.

- when the HBase master is currently not the active one: in that case,
  retrying the RPC on the same server is almost guaranteed to never
  succeed, and the client will be stuck retrying forever until the
  context is canceled.

Also add MasterStoppedException, which was missing from the list.

Fix #265
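
A sketch of the classification this commit describes: certain exception class names coming back from the server should mark the connection dead (close the client and redo the region or master lookup) rather than be retried in place. The full Java class names below are a best guess at what HBase puts on the wire; treat both the strings and the helper as illustrative, not gohbase's verbatim code.

```go
package main

import "fmt"

// isFatalServerError reports whether an HBase exception should invalidate
// the current connection instead of being retried against the same server.
// Class names are assumptions based on HBase's Java sources.
func isFatalServerError(class string) bool {
	switch class {
	case "org.apache.hadoop.hbase.ipc.ServerNotRunningYetException",
		"org.apache.hadoop.hbase.exceptions.MasterStoppedException":
		return true
	}
	return false
}

func main() {
	// On a fatal error the client would be closed and the lookup redone;
	// here we only demonstrate the classification itself.
	fmt.Println(isFatalServerError("org.apache.hadoop.hbase.ipc.ServerNotRunningYetException")) // true
	fmt.Println(isFatalServerError("org.apache.hadoop.hbase.exceptions.RegionMovedException"))  // false
}
```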
dethi added a commit that referenced this issue Jul 13, 2024
When receiving ServerNotRunningYetException, a client shouldn't retry
sending the request to the same server. Instead, the client should be
closed and the region lookup should happen again.

There are two cases where ServerNotRunningYetException is returned:
- when the RegionServer is listening but not online yet: in that case,
  retrying the RPC on the same server may succeed if the RegionServer
  becomes ready and the region is still assigned to it. But most likely
  the region will have been reassigned to another RegionServer, which
  will then return NotServingRegionException on the following request.
  If the RegionServer is stuck in its startup phase, the client could
  also be stuck in a retry loop even though the HBase master may have
  already detected the issue and moved the region to another
  RegionServer.

- when the HBase master is currently not active: in that case, retrying
  the RPC on the same server is guaranteed to fail until a failover
  occurs; the client will be stuck retrying forever.

Also add MasterStoppedException, which was missing from the list.

Fix #265
dethi added a commit that referenced this issue Jul 13, 2024
When receiving ServerNotRunningYetException, a client shouldn't retry
sending the request to the same server. Instead, the client should be
closed and the region lookup should happen again.

There are two cases where ServerNotRunningYetException is returned:
- when the RegionServer is listening but not online yet: in that case,
  retrying the RPC on the same server may succeed if the RegionServer
  becomes ready and the region is still assigned to it. But most likely
  the region will have been reassigned to another RegionServer, which
  will then return NotServingRegionException on the following request.
  If the RegionServer is stuck in its startup phase, the client could
  also be stuck in a retry loop even though the HBase master may have
  already detected the issue and moved the region to another
  RegionServer.

- when the HBase master is currently not active: in that case, retrying
  the RPC on the same server is guaranteed to fail until a failover
  occurs; the client will be stuck retrying forever.

If we receive multiple ServerErrors for the same RPC, we back off before
retrying, to avoid overwhelming HBase. One scenario where this can
happen is a cluster recovering from a catastrophic failure, with all
HBase masters still trying to start (e.g. still recovering WALs).

Also add MasterStoppedException and PleaseHoldException to the list of
known exceptions that can be returned by HBase.

Fix #265
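
The backoff step is the sleepAndIncreaseBackoff frame visible in the goroutine dump above. A minimal sketch of that behavior, with assumed initial and maximum durations (gohbase's actual constants may differ):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sleepAndIncreaseBackoff waits out the current backoff (or returns early if
// the caller's context is canceled), then grows it for the next attempt so a
// recovering cluster is not hammered with retries.
func sleepAndIncreaseBackoff(ctx context.Context, backoff time.Duration) (time.Duration, error) {
	select {
	case <-time.After(backoff):
	case <-ctx.Done():
		return 0, ctx.Err() // context canceled or deadline exceeded: stop retrying
	}
	const maxBackoff = 30 * time.Second // assumed cap
	if backoff < maxBackoff {
		backoff *= 2
	}
	return backoff, nil
}

func main() {
	// Retry loop bounded by a deadline: 100ms, 200ms, 400ms, ... until the
	// 2s budget runs out, at which point the loop gives up cleanly.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	backoff := 100 * time.Millisecond
	for {
		next, err := sleepAndIncreaseBackoff(ctx, backoff)
		if err != nil {
			fmt.Println("giving up:", err)
			return
		}
		backoff = next
		fmt.Println("retrying, next backoff:", backoff)
	}
}
```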