Incremental Co-operative Rebalancing Support for HDFS Connector (#625) #712
This is a copy of #711, but from my personal email, with which I can sign the CLA.
Problem
See #625.
If `consumer.partition.assignment.strategy` is set to `org.apache.kafka.clients.consumer.CooperativeStickyAssignor` in `config/connect-distributed.properties`, then after a partial partition revocation (say, a new worker joins and takes over some partitions from another worker), the affected task is killed by a `NullPointerException`.
The reason is that during the `onPartitionsRevoked` callback the `DataWriter` currently closes and removes all of its `TopicPartitionWriter`s, probably assuming eager rebalancing.
If the consumer only gets some partitions revoked and no new ones assigned, `onPartitionsAssigned` is called with an empty collection and the `topicPartitionWriters` collection remains empty.
When new data arrives at `HdfsSinkTask#put` (from partitions that the consumer kept ownership of), a `NullPointerException` is thrown when accessing `topicPartitionWriters`.
Solution
Only close and remove from `topicPartitionWriters` those partitions passed to `HdfsSinkTask#close(Collection<TopicPartition>)`.
Note: there is a 9-year-old comment explaining why we should close all of the topic partition writers, which I didn't really understand. This solution simply ignores and removes it. @ewencp, can you please review and comment on whether it is safe to do so?
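To make the difference between the two close strategies concrete, here is a minimal, self-contained sketch. It does not use the connector's real classes: `SketchDataWriter` and `SketchWriter` are hypothetical stand-ins, partitions are plain strings, and the method names are illustrative assumptions rather than the actual `DataWriter` API.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a per-partition writer (not the connector's class). */
final class SketchWriter {
    private boolean closed;

    void write(String record) {
        if (closed) {
            throw new IllegalStateException("writer is closed");
        }
        // A real writer would buffer and flush the record; omitted in this sketch.
    }

    void close() {
        closed = true;
    }
}

/** Hypothetical registry of per-partition writers, keyed by a partition name. */
final class SketchDataWriter {
    private final Map<String, SketchWriter> writers = new HashMap<>();

    /** Called on assignment: creates writers only for partitions not yet tracked. */
    void open(Collection<String> assigned) {
        for (String partition : assigned) {
            writers.putIfAbsent(partition, new SketchWriter());
        }
    }

    /** Eager-style close: drops every writer, whatever was actually revoked. */
    void closeAll() {
        for (SketchWriter writer : writers.values()) {
            writer.close();
        }
        writers.clear();
    }

    /** Cooperative-style close: drops only the writers of the revoked partitions. */
    void close(Collection<String> revoked) {
        for (String partition : revoked) {
            SketchWriter writer = writers.remove(partition);
            if (writer != null) {
                writer.close();
            }
        }
    }

    /** Lookup-then-use: throws NullPointerException if the partition has no writer. */
    void write(String partition, String record) {
        writers.get(partition).write(record);
    }

    boolean hasWriter(String partition) {
        return writers.containsKey(partition);
    }
}
```

In this sketch, opening two partitions, calling `closeAll()` for a revocation of just one of them, and then calling `open` with an empty collection leaves the registry empty, so a later `write` to the retained partition fails with a `NullPointerException`. Replacing `closeAll()` with `close(revoked)` keeps the retained partition's writer in place.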
Does this solution apply anywhere else?
If yes, where?
Test Strategy
Added the `HdfsSinkTaskTest#testPartialRevocation` unit test, which simulates a partial revocation that throws a `NullPointerException` without the fix.
Also tested manually: after adding a second worker (which triggers a partial revocation) and writing some data to the topic, no NPE is thrown.
Testing done:
Release Plan
Please see a similar issue for the Connect framework itself, KAFKA-12487. I tested this change on Kafka `7.7.2-18-ccs`, where the KAFKA-12487 fix is already present, but haven't tested on earlier versions without it. Should I do so?