Some consumers in a group reverted to older offsets shortly after deploy #198
Comments
If your session timeout is 5 minutes, that means that a consumer can theoretically process messages for 5 minutes before figuring out that the group has been re-synchronized. At that point, the consumer cannot commit its offsets, so the messages will be re-processed. What is the offset commit config for your consumers?
Also: by opting to use rolling updates you'll get these sorts of issues. I'm also not quite sure why you'd get more availability – the whole group needs to re-sync every time a member comes or goes, so you'll have a prolonged period of chaos, during which double-processing will be more likely than not.
The odd thing is that the consumer reverted to an offset that it had successfully processed 17 hours earlier. A change that large is well outside the usual re-processing seen during a rebalance. I wonder if the consumer is going back to the beginning of the partition. Is there a bug or flow that could cause this? Could you point me to the relevant rebalancing logic? Note that I don't have a good way of identifying the first offset for this partition at the time of the incident. Technically, our retention policy is 3 days, so if this went back to the beginning, it should have gone even further back.
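For what it's worth, the earliest offset still held by the brokers can be queried with rdkafka-ruby's watermark API (this only reflects the low watermark right now, not what it was at the time of the incident). A minimal sketch, assuming a local broker; the topic name, group id, and partition number are placeholders:

```ruby
require "rdkafka"

# Broker address, group id, topic, and partition are placeholders.
config = Rdkafka::Config.new(
  "bootstrap.servers" => "localhost:9092",
  "group.id"          => "offset-inspector" # throwaway group; nothing is committed
)
consumer = config.consumer

# Returns [low, high]: the earliest retained offset and the next offset to be produced.
low, high = consumer.query_watermark_offsets("my-topic", 3)
puts "partition 3: earliest=#{low}, latest=#{high}"

consumer.close
```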
We don't use a custom offset commit config; we use Racecar's default offset_commit_interval of 10 seconds. Is that the info you were looking for?
I'll remove our override and change this back to the default of 30 seconds. We added this config for Racecar versions < 2.0.0 to avoid cases where the whole batch could not be processed in time.
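For reference, a minimal sketch of what the defaults look like in a Racecar config file (the file path and broker address are placeholders; the keys are the ones discussed above):

```ruby
# config/racecar.rb (assumed location)
Racecar.configure do |config|
  config.brokers = ["localhost:9092"]   # placeholder broker list

  # Back to the defaults: 30 second session timeout,
  # offsets committed every 10 seconds.
  config.session_timeout        = 30
  config.offset_commit_interval = 10
end
```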
We were expecting that having some of the consumers working correctly, committing their offsets, and making slight progress would be worth it. If you think that this makes it more likely to cause bigger issues, then I think it makes sense to switch to Recreate.
Try to simplify your config and see if that doesn't resolve the issue. It's tricky to debug these things, unfortunately. The problem could be in Racecar, rdkafka-ruby, or in librdkafka.
With the rebalancing during a rolling restart, could something like this be the issue: confluentinc/librdkafka#2782? I have not run the rdkafka-based Racecar v2, but from looking at the source it appears that offsets are manually committed as in the steps described in the issue above. The issue above was fixed in librdkafka v1.4.2, but the current rdkafka gem uses v1.4.0.
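One quick way to check which librdkafka a given installation of the rdkafka gem was built against (assuming the gem keeps exposing these constants) is:

```ruby
require "rdkafka"

# rdkafka-ruby pins the librdkafka release it compiles and links against.
puts "rdkafka gem: #{Rdkafka::VERSION}"
puts "librdkafka:  #{Rdkafka::LIBRDKAFKA_VERSION}"
```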
Great find @tjwp! Looking through confluentinc/librdkafka#2782, confluentinc/librdkafka#1767, and confluentinc/confluent-kafka-go#209, this does seem like the same issue. I guess now we wait for rdkafka-ruby to update to a librdkafka version that includes the fix. I have removed the session_timeout override. Thanks for looking into it, and please close this unless you want to keep it open as a reference.
We had an incident yesterday with this, but without any deployment. The code had been running for weeks (with perhaps the occasional dyno restart). A single partition was reset and suddenly reported a lag of 1.6M messages. I checked and we didn't have a spike of messages being produced, so it was definitely the offset for that partition getting somehow reset to a very old value. Given there was no deployment event, I suspect it might indeed be confluentinc/librdkafka#2782 or a similar issue caused by rebalancing. So is there no recourse for this aside from manually resetting the offset back, until we get the upstream bug fix into the rdkafka gem?
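Until the fix is available, one workaround sketch is to commit a known-good offset for the affected partition from a one-off script, with the Racecar processes stopped so the group is empty and the commit isn't overwritten by a rebalance. The group id, topic, partition, and offset below are placeholders, and this is only a rough illustration of rdkafka-ruby's commit API, not a tested recipe:

```ruby
require "rdkafka"

config = Rdkafka::Config.new(
  "bootstrap.servers" => "localhost:9092",    # placeholder
  "group.id"          => "my-consumer-group"  # must match the affected group
)
consumer = config.consumer

# The committed offset is the next offset the group should consume.
tpl = Rdkafka::Consumer::TopicPartitionList.new
tpl.add_topic_and_partitions_with_offsets("my-topic", 3 => 166_981_170)
consumer.commit(tpl, false)  # synchronous commit

consumer.close
```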
@Draiken but racecar already uses an rdkafka version that includes the librdkafka fix.
Oh damn, that was my bad. I looked at the commit that closed the parent issue, saw the version referenced there, and drew the wrong conclusion. TYVM @mensfeld
I made the same mistake and started updating rdkafka to 1.9.2 🙄 The patch for this is already in place in rdkafka, so after the upgrade you should see it go away.
We had an incident where 2 consumers in the same consumer group reprocessed messages with older offsets (they moved back around 500,000 messages each). This happened shortly after a deploy. I am posting about this here in case someone has ideas on why this could have happened, or in case it points to a bug in racecar / rdkafka-ruby. This could be a bug with how rebalancing is handled, but I haven't been able to identify the code path that could cause this.
Our Racecar setup
- Racecar version `2.1.0`
- `config.session_timeout` is `5.minutes` (the default is 30 seconds)
- We use a `rolling update` on deploys instead of `Recreate`. We know that the docs suggest Recreate, but we have opted for higher availability at the cost of double processing during rebalancing. This may be the root cause, but it could also be a red herring.

Details of what happened
Last week, we saw that the consumer lag spiked by 1 million shortly after a deploy. We noticed that consumers working on partition 3 and partition 6 were re-processing old messages.

Partition 3 behaviour
- 7:18:00 AM: Pod A was working on offset 166981170
- Between 7:18:00 AM and 7:18:02 AM: Pod B worked on offsets 166429818-166429915
- 7:18:02 AM: Pod B logged that it was gracefully shutting down

Partition 6 behaviour
- 7:17:49 AM
- Between 7:18:30 AM and 7:18:31 AM: Pod B worked on offsets 167276532-167276679
For partition 3, the offset issue happened when another consumer took over. For partition 6, the offset issue happened after a proper shutdown. We have not noticed any broker issues that would cause this, so the current theory is that this is related to our Kafka consumer. Would love some thoughts / ideas.