Issue while scanning large datasets #65

jonbonazza · 2017-07-27T23:54:56Z

I am using the new scanner API like so:

func ScanCh(scanner hrpc.Scanner, errCallbacks ...func(row *hrpc.Result, err error)) <-chan *hrpc.Result {
	ch := make(chan *hrpc.Result)
	go func() {
		defer close(ch)
		defer scanner.Close()
		var row *hrpc.Result
		var err error
		for err != io.EOF {
			if row, err = scanner.Next(); err == nil {
				ch <- row
			} else if err != io.EOF {
				for _, f := range errCallbacks {
					f(row, err)
				}
			}
		}
	}()
	return ch
}

After some minutes and several hundred rows being scanned, I am seeing the following error from HBase:

ERRO[0249] failed to close scanner                       err="HBase Java exception org.apache.hadoop.hbase.UnknownScannerException:
org.apache.hadoop.hbase.UnknownScannerException: Name: 1282974, already closed?
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2128)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32205)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
        at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
        at java.lang.Thread.run(Thread.java:745)
" scannerID=1282974

It should also be noted that this error occurs well before all of the rows are scanned, so I never get to scan all rows.

jonbonazza · 2017-07-31T18:51:35Z

@timoha this also occurs when using the scanner API directly and not using a channel. Any idea what is going on here? For context, for each cell in the scan, I am doing a cockroachdb database insertion. Is it possible that this operation is taking too long and the client is timing out?

timoha · 2017-08-09T23:59:52Z

I'll try to investigate soon, I've been a bit occupied. The error that you are seeing should not affect anything in your code, as it happens in the background. It would be more useful if you provided the error that is returned by the call to Next().

jonbonazza · 2017-08-17T20:41:15Z

@timoha Interesting... I thought that was the error that was being returned by the Next() call. Unless, maybe, an io.EOF error is being returned. I'll investigate more once I get the chance.

jonbonazza · 2017-10-25T20:37:43Z

@timoha Sorry for the (very) late reply here.
So basically, this is the error that is being returned from the Next() call, and then immediately following that, Next() returns io.EOF.

jonbonazza · 2017-10-25T21:22:19Z

@timoha the docs for UnknownScannerException seem to suggest the client isn't "checking in" with the server and the server is closing the connection.

Thrown if a region server is passed an unknown scanner id. Usually means the client has take too long between checkins and so the scanner lease on the serverside has expired OR the serverside is closing down and has cancelled all leases.

jonbonazza · 2017-10-25T23:19:38Z

More information. Looks like when I see this UnknownScannerException, only that one rpc fails. If there are more RPCs to make, then they will continue on, but I will have lost the data (save some partial data that is returned, if any) for the failed RPC. Not really sure how to recover from this.

So my statement a couple comments up is not entirely correct. The immediately following error is not necessarily io.EOF, it just so happened that during that particular scan, there were no more RPCs to make, so the next call to Next did return io.EOF.

timoha · 2017-11-06T19:17:11Z

Yeah, sounds like either you take a long time between calls to Next() or your regionserver died in the process of scanning.

For the first case, we could implement periodic scanner lease renewal: https://github.com/tsuna/gohbase/blob/master/pb/Client.proto#L285

For the second case, we need to spend some time to better handle error cases for the scanner and retry gracefully when we get this exception. I might have some time soon to take a stab at it.
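
For illustration only, the client-side timing of periodic lease renewal might look like the sketch below. It assumes a hypothetical `renew` callback standing in for whatever request actually renews the lease on the regionserver; nothing in this thread shows gohbase exposing such a call. `keepAlive`, `demo` and the intervals are made-up names and values, and coordinating renewals with in-flight Next() calls is left out.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// keepAlive calls renew every interval until ctx is cancelled or renew fails.
// renew is a hypothetical stand-in for an RPC that renews the scanner lease.
// The returned channel receives the first renewal error, if any, and is then closed.
func keepAlive(ctx context.Context, interval time.Duration, renew func(context.Context) error) <-chan error {
	errc := make(chan error, 1)
	go func() {
		defer close(errc)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				if err := renew(ctx); err != nil {
					errc <- err
					return
				}
			}
		}
	}()
	return errc
}

// demo simulates slow per-row work and counts how many renewals were sent meanwhile.
func demo() int64 {
	var renewals int64
	ctx, cancel := context.WithCancel(context.Background())
	errc := keepAlive(ctx, 5*time.Millisecond, func(context.Context) error {
		atomic.AddInt64(&renewals, 1)
		return nil
	})
	time.Sleep(100 * time.Millisecond) // stands in for slow work between Next() calls
	cancel()
	<-errc // wait for the renewal goroutine to exit
	return atomic.LoadInt64(&renewals)
}

func main() {
	fmt.Println("lease renewals sent:", demo())
}
```

The interval would have to be shorter than the server-side scanner lease timeout for this to help.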

jonbonazza · 2017-11-06T21:47:24Z

@timoha I was able to confirm that it is the former. In this event, will it miss data? If so, I am not sure what I can do here, as we have, like, gigs of data being processed concurrently, distributed across nodes, but in order to not destroy our memory, we throttle the number of goroutines with a semaphore. If we crank this semaphore up too high, we end up pegging the CPU and stalling the system, so we have to strike a balance.
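
For readers following along, the throttling described here is roughly the pattern sketched below. It is a minimal illustration, not the actual production code: `processAll`, `demo`, the row names and the limit of 4 are all made up. The point to notice is the blocking send on `sem`: in a scan loop, that is where the next call to Next() gets delayed while all workers are busy.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// processAll runs work on every row with at most limit goroutines in flight.
// It returns the highest number of workers observed running at once.
func processAll(rows []string, limit int, work func(string)) int64 {
	sem := make(chan struct{}, limit) // buffered channel used as a counting semaphore
	var wg sync.WaitGroup
	var running, peak int64
	for _, row := range rows {
		// Blocks while limit workers are busy. In a real scan loop this
		// delays the next Next() call, which is when a lease can expire.
		sem <- struct{}{}
		wg.Add(1)
		go func(row string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			n := atomic.AddInt64(&running, 1)
			for { // record the peak concurrency seen so far
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			work(row)
			atomic.AddInt64(&running, -1)
		}(row)
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

// demo processes 100 fake rows with a limit of 4 concurrent workers.
func demo() (peak int64, processed int64) {
	rows := make([]string, 100)
	for i := range rows {
		rows[i] = fmt.Sprintf("row-%03d", i)
	}
	peak = processAll(rows, 4, func(string) { atomic.AddInt64(&processed, 1) })
	return peak, atomic.LoadInt64(&processed)
}

func main() {
	peak, processed := demo()
	fmt.Println("peak workers:", peak, "rows processed:", processed)
}
```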

How difficult would it be to implement lease renewal?

jonbonazza · 2018-05-13T18:04:38Z

@timoha We are encountering this again. This time we are not doing anything heavy, but we are doing a lot of scanning and inserting. I am worried we are overloading the cluster, and I was wondering: what happens in the event that the HBase cluster is overloaded and a request to the cluster takes a long time? Will the lease still expire because Next() hasn't been called in a bit? If so, what can we do here? (Aside from throwing more resources at the HBase cluster.)

jonbonazza · 2018-05-15T17:30:16Z

Also, @timoha do you think you could give me a rundown of how scanner lease renewal should work? I'd be happy to submit a PR for this, but I fear I don't have the necessary understanding. Is there any documentation on the HBase protocol that I could use to discern such information?

tsuna · 2021-11-12T14:08:36Z

UnknownScannerException is handled gracefully both in AsyncHBase and in the standard HBase client, so maybe we should do that too here.

timoha · 2021-11-13T02:50:08Z

Haven't checked the AsyncHBase code or the standard client code in a while, but previously neither actually handled the case of partial row scanners. The problem is that if the scanner times out in the middle of a row, there's no way to safely (preserving row atomicity) restart scanning from the middle of the row. The best option has always been for clients to explicitly keep track of the last row they've scanned and restart scanning from the beginning of that row when the exception happens. That way each client can take care of duplicates in its own way.

That being said, maybe there's an API that somehow uses MVCC to properly address this now, but scanners can blow up with OOM if you don't rely on the partial row scanning feature.
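
A minimal sketch of the "track the last row and restart" approach, under simplifying assumptions: the `Scanner` interface below only mirrors the Next/Close shape used in this thread and reduces a row to its key, every result is a complete row (no partial rows), and `open` is a hypothetical factory that starts a new scan at a given row, inclusive. `scanAll`, `flakyScanner` and `demo` are illustrative names, not gohbase API.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"sort"
)

// Scanner mirrors the Next/Close shape used in this thread, with a row
// reduced to its key to keep the sketch self-contained.
type Scanner interface {
	Next() (string, error)
	Close() error
}

// scanAll passes every row key to handle, reopening the scan at the last
// delivered key when Next fails. open is a hypothetical factory returning a
// scanner whose first row is the first key >= startRow ("" means table start).
func scanAll(open func(startRow string) Scanner, maxRestarts int, handle func(key string)) error {
	last, seen, restarts := "", false, 0
	s := open("")
	defer func() { s.Close() }()
	for {
		key, err := s.Next()
		switch {
		case err == io.EOF:
			return nil
		case err != nil:
			restarts++
			if restarts > maxRestarts {
				return fmt.Errorf("giving up after %d restarts: %w", maxRestarts, err)
			}
			s.Close()
			s = open(last) // restart from the beginning of the last row seen
		case seen && key == last:
			// Duplicate of the row we restarted from: drop it.
		default:
			handle(key)
			last, seen = key, true
		}
	}
}

// flakyScanner serves keys from a slice and fails once when pos reaches failAt.
type flakyScanner struct {
	keys   []string
	pos    int
	failAt int // -1 disables the simulated failure
}

func (f *flakyScanner) Next() (string, error) {
	if f.pos == f.failAt {
		f.failAt = -1
		return "", errors.New("UnknownScannerException (simulated)")
	}
	if f.pos >= len(f.keys) {
		return "", io.EOF
	}
	k := f.keys[f.pos]
	f.pos++
	return k, nil
}

func (f *flakyScanner) Close() error { return nil }

// demo scans a five-row fake table whose first scanner fails after three rows.
func demo() []string {
	table := []string{"a", "b", "c", "d", "e"} // sorted, like row keys
	first := true
	open := func(start string) Scanner {
		i := sort.SearchStrings(table, start)
		s := &flakyScanner{keys: table[i:], failAt: -1}
		if first {
			s.failAt = 3 // fail after delivering a, b, c
			first = false
		}
		return s
	}
	var got []string
	if err := scanAll(open, 3, func(k string) { got = append(got, k) }); err != nil {
		panic(err)
	}
	return got
}

func main() {
	fmt.Println(demo()) // prints [a b c d e]
}
```

Dropping the duplicate by key is only one way to deduplicate; a client that has already acted on part of a row would need its own strategy.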

dethi · 2023-03-28T19:10:56Z

Closing in favour of #91

dethi closed this as not planned on Mar 28, 2023
