Issue while scanning large datasets #65

Closed
jonbonazza opened this issue Jul 27, 2017 · 13 comments

Comments

@jonbonazza
Contributor

I am using the new scanner api like so:

func ScanCh(scanner hrpc.Scanner, errCallbacks ...func(row *hrpc.Result, err error)) <-chan *hrpc.Result {
	ch := make(chan *hrpc.Result)
	go func() {
		defer close(ch)
		defer scanner.Close()
		var row *hrpc.Result
		var err error
		for err != io.EOF {
			if row, err = scanner.Next(); err == nil {
				ch <- row
			} else if err != io.EOF {
				for _, f := range errCallbacks {
					f(row, err)
				}
			}
		}
	}()
	return ch
}

After some minutes and several hundred rows being scanned, I am seeing the following error from HBase:

ERRO[0249] failed to close scanner                       err="HBase Java exception org.apache.hadoop.hbase.UnknownScannerException:
org.apache.hadoop.hbase.UnknownScannerException: Name: 1282974, already closed?
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2128)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32205)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
        at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
        at java.lang.Thread.run(Thread.java:745)
" scannerID=1282974

It should also be noted that this error occurs well before all of the rows are scanned, so I never get to scan all rows.

@jonbonazza
Contributor Author

@timoha this also occurs when using the scanner API directly and not using a channel. Any idea what is going on here? For context, for each cell in the scan, I am doing a cockroachdb database insertion. Is it possible that this operation is taking too long and the client is timing out?

@timoha
Collaborator

timoha commented Aug 9, 2017

I'll try to investigate soon; I've been a bit occupied. The error that you are seeing should not affect anything in your code, as it happens in the background. It would be more useful if you provided the error that is returned by the call to Next().

@jonbonazza
Contributor Author

@timoha Interesting... I thought that was the error that was being returned by the Next() call. Unless, maybe, an io.EOF error is being returned. I'll investigate more once I get the chance.

@jonbonazza
Contributor Author

jonbonazza commented Oct 25, 2017

@timoha Sorry for the (very) late reply here.
So basically, this is the error that is being returned from the Next() call, and immediately following that, Next() returns io.EOF.

@jonbonazza
Contributor Author

jonbonazza commented Oct 25, 2017

@timoha the docs for UnknownScannerException seem to suggest the client isn't "checking in" with the server and the server is closing the connection.

Thrown if a region server is passed an unknown scanner id. Usually means the client has take too long between checkins and so the scanner lease on the serverside has expired OR the serverside is closing down and has cancelled all leases.

@jonbonazza
Contributor Author

jonbonazza commented Oct 25, 2017

More information: it looks like when I see this UnknownScannerException, only that one RPC fails. If there are more RPCs to make, they will continue, but I will have lost the data for the failed RPC (save any partial data that is returned). I'm not really sure how to recover from this.

So my statement a couple of comments up is not entirely correct. The immediately following error is not necessarily io.EOF; it just so happened that during that particular scan there were no more RPCs to make, so the next call to Next() did return io.EOF.

@timoha
Collaborator

timoha commented Nov 6, 2017

Yeah, sounds like you either take a long time between calls to Next() or your regionserver died in the process of scanning.

For the first case, we could implement periodic scanner lease renewal: https://github.com/tsuna/gohbase/blob/master/pb/Client.proto#L285

For the second case, we need to spend some time handling scanner error cases better and retrying gracefully when we get this exception. I might have some time soon to take a stab at it.
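
To tell the first case apart from the second, one option is to measure how long the caller spends between one Next() returning and the following Next() being called, and compare that with the scanner lease timeout configured on the region servers. The sketch below is not part of gohbase: `GapTracker`, `BeforeNext`, `AfterNext`, and `steppedClock` are hypothetical names introduced for illustration, and the threshold is whatever value the caller picks.

```go
package main

import "time"

// GapTracker (hypothetical helper) records how long the caller waits between
// one Next() returning and the following Next() being called.
type GapTracker struct {
	Threshold time.Duration    // gaps longer than this are counted
	Now       func() time.Time // injectable clock; nil means time.Now
	Exceeded  int              // number of gaps longer than Threshold
	Max       time.Duration    // longest gap observed so far

	lastReturn time.Time // when the previous Next() returned
}

func (g *GapTracker) now() time.Time {
	if g.Now != nil {
		return g.Now()
	}
	return time.Now()
}

// BeforeNext should be called immediately before scanner.Next().
func (g *GapTracker) BeforeNext() {
	if g.lastReturn.IsZero() {
		return // first call: there is no previous Next() to compare with
	}
	gap := g.now().Sub(g.lastReturn)
	if gap > g.Max {
		g.Max = gap
	}
	if gap > g.Threshold {
		g.Exceeded++
	}
}

// AfterNext should be called immediately after scanner.Next() returns.
func (g *GapTracker) AfterNext() {
	g.lastReturn = g.now()
}

// steppedClock returns a fake clock that advances by step on every call.
// It exists only so the tracker can be exercised without real waiting.
func steppedClock(step time.Duration) func() time.Time {
	t := time.Unix(0, 0)
	return func() time.Time {
		t = t.Add(step)
		return t
	}
}
```

In a scan loop, `BeforeNext()` would go right before the call to Next() and `AfterNext()` right after it returns. If `Max` approaches the lease timeout, slow per-row processing is the likely cause rather than a dying regionserver.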

@jonbonazza
Contributor Author

@timoha I was able to confirm that it is the former. In this event, will it miss data? If so, I am not sure what I can do here, as we have gigs of data being processed concurrently, distributed across nodes. To avoid destroying our memory, we throttle the number of goroutines with a semaphore. If we crank this semaphore up too high, we end up pegging the CPU and stalling the system, so we have to strike a balance.

How difficult would it be to implement lease renewal?

@jonbonazza
Contributor Author

@timoha We are encountering this again. This time we are not doing anything heavy, but we are doing a lot of scanning and inserting. I am worried we are overloading the cluster, and I was wondering: what happens if the HBase cluster is overloaded and a request to the cluster takes a long time? Will the lease still expire because Next() hasn't been called in a bit? If so, what can we do here (aside from throwing more resources at the HBase cluster)?

@jonbonazza
Contributor Author

Also, @timoha, do you think you could give me a rundown of how scanner lease renewal should work? I'd be happy to submit a PR for this, but I fear I don't have the necessary understanding. Is there any documentation on the HBase protocol that I could use to work this out?

@tsuna
Owner

tsuna commented Nov 12, 2021

UnknownScannerException is handled gracefully both in AsyncHBase and in the standard HBase client, so maybe we should do that too here.

@timoha
Collaborator

timoha commented Nov 13, 2021

I haven't checked the AsyncHBase code or the standard client code in a while, but previously neither actually handled the case of partial row scanners. The problem is that if the scanner times out in the middle of a row, there's no way to safely (preserving row atomicity) restart scanning from the middle of that row. The best option has always been for clients to explicitly keep track of the last row they've scanned and restart scanning from the beginning of that row when the exception happens. That way each client can take care of duplicates in its own way.

That being said, maybe there's an API that somehow uses MVCC to properly address this now, but scanners can blow up with OOM if you don't rely on the partial row scanning feature.
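
A minimal sketch of the "remember the last row and restart from its beginning" approach described above. It does not use the gohbase API directly: `Row`, `RowScanner`, `OpenFunc`, `ScanWithRestart`, `memScanner`, and `flakyTable` are hypothetical names, and the in-memory scanner only simulates a lease expiry. With gohbase, `OpenFunc` would presumably build a new scan whose start row is the given key. Because the restart begins at the last row seen (inclusive), that row is delivered again, so the handler has to tolerate duplicates.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"sort"
)

// Row is a hypothetical stand-in for a scan result.
type Row struct{ Key, Value string }

// RowScanner is a hypothetical interface modeled on a Next()/Close() scanner.
type RowScanner interface {
	Next() (Row, error)
	Close() error
}

// OpenFunc opens a scanner starting at startKey (inclusive); "" means the
// beginning of the table.
type OpenFunc func(startKey string) (RowScanner, error)

// ScanWithRestart drains a scan and, on any error other than io.EOF, reopens
// the scanner at the last row key seen. That row is delivered again after a
// restart, so handle must be idempotent or deduplicate by key.
func ScanWithRestart(open OpenFunc, maxRestarts int, handle func(Row)) error {
	lastKey := ""
	for restarts := 0; ; restarts++ {
		sc, err := open(lastKey)
		if err != nil {
			return err
		}
		for {
			var row Row
			row, err = sc.Next()
			if err != nil {
				break
			}
			lastKey = row.Key
			handle(row)
		}
		sc.Close()
		if errors.Is(err, io.EOF) {
			return nil // scan finished normally
		}
		if restarts >= maxRestarts {
			return fmt.Errorf("giving up after %d restarts: %w", restarts, err)
		}
	}
}

var errLeaseExpired = errors.New("scanner lease expired (simulated)")

// memScanner is an in-memory fake used only to exercise the loop above.
type memScanner struct {
	rows      []Row
	failAfter int // rows to serve before failing; negative means never fail
}

func (s *memScanner) Next() (Row, error) {
	if s.failAfter == 0 {
		return Row{}, errLeaseExpired
	}
	if len(s.rows) == 0 {
		return Row{}, io.EOF
	}
	row := s.rows[0]
	s.rows = s.rows[1:]
	s.failAfter--
	return row, nil
}

func (s *memScanner) Close() error { return nil }

// flakyTable returns an OpenFunc over rows sorted by key. Only the first
// scanner it opens fails, after serving failAfter rows.
func flakyTable(rows []Row, failAfter int) OpenFunc {
	opened := 0
	return func(startKey string) (RowScanner, error) {
		i := sort.Search(len(rows), func(i int) bool { return rows[i].Key >= startKey })
		fa := -1
		if opened == 0 {
			fa = failAfter
		}
		opened++
		return &memScanner{rows: rows[i:], failAfter: fa}, nil
	}
}
```

For example, scanning rows `a` through `d` with `flakyTable(rows, 2)` delivers `a` and `b`, hits the simulated failure, then restarts at `b` and delivers `b`, `c`, and `d`. A handler that writes into a map keyed by row, or performs an upsert, absorbs the repeated `b`.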

@dethi
Collaborator

dethi commented Mar 28, 2023

Closing in favour of #91

@dethi closed this as not planned on Mar 28, 2023