
Some consumers in a group reverted to older offsets shortly after deploy #198

Hammadk opened this issue Oct 16, 2020 · 10 comments

Hammadk commented Oct 16, 2020

We had an incident where 2 consumers in the same consumer group reprocessed messages with older offsets (each moved back around 500,000 messages). This happened shortly after a deploy. I am posting about this here in case someone has ideas on why this could have happened, or whether it points to a bug in racecar / rdkafka-ruby. This could be a bug with how rebalancing is handled, but I haven't been able to identify the code path that could cause it.

Our Racecar setup

  • Racecar version 2.1.0
  • config.session_timeout is 5.minutes (default is 30 seconds); see the config sketch after this list
  • We do a rolling update on deploys instead of Recreate. We know that the docs suggest recreating, but we have opted for higher availability at the cost of double processing during rebalancing. This may be the root cause, but it could also be a red herring.
  • We have a topic with 8 partitions, and a Racecar consumer that subscribes only to this topic.
  • We deploy this consumer with Kubernetes, and the consumer has 8 replicas (one per partition).
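
For reference, a minimal sketch of this setup in Racecar terms (class, file, and topic names below are placeholders, not our real ones):

```ruby
# config/racecar.rb
Racecar.configure do |config|
  config.session_timeout = 300 # seconds; our 5-minute override (Racecar's default is 30)
  # offset_commit_interval is left at Racecar's default of 10 seconds
end

# app/consumers/events_consumer.rb - subscribes only to the 8-partition topic
class EventsConsumer < Racecar::Consumer
  subscribes_to "events"

  def process(message)
    # ... handle message.value ...
  end
end
```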

Details of what happened

Last week, we saw that the consumer lag spiked by 1 million shortly after a deploy. We noticed that consumers working on partition 3 and partition 6 were re-processing old messages.

Partition 3 behaviour

  • Pod A was initially working on partition 3
  • A rolling deploy started
  • At 7:18:00 AM Pod A was working on offset 166981170
  • Here Pod B took over, and it started working on a message roughly 500,000 offsets behind
  • Then between 7:18:00 AM and 7:18:02 AM Pod B worked on offsets 166429818-166429915
  • At 7:18:02 AM Pod B logged that it was gracefully shutting down
  • After that, another Pod started working on the partition

Partition 6 behaviour

  • Pod A was initially working on partition 6 on offset 167800011
  • Pod A was shut down at 7:17:49 AM
  • Here Pod B took over, and it started working on a message roughly 500,000 offsets behind
  • Between 7:18:30 AM and 7:18:31 AM Pod B worked on offsets 167276532-167276679

For partition 3, the offset issue happened when another consumer took over. For partition 6, it happened after a graceful shutdown. We have not noticed any broker issues that would explain this, so the current theory is that it is related to our Kafka consumer. Would love some thoughts / ideas.


dasch commented Oct 19, 2020

If your session timeout is 5 minutes, that means that a consumer can theoretically process messages for 5 minutes before figuring out that the group has been re-synchronized. At that point, the consumer cannot commit its offsets, so the messages will be re-processed.

What is the offset commit config for your consumers?


dasch commented Oct 19, 2020

Also: by opting to use rolling updates you'll get these sorts of issues. I'm also not quite sure why you'd get more availability – the whole group needs to re-sync every time a member comes or goes, so you'll have a prolonged period of chaos, during which double-processing will be more likely than not.


Hammadk commented Oct 19, 2020

If your session timeout is 5 minutes that means that a consumer can theoretically process messages for 5 minutes before figuring out that the group has been re-synchronized. At that point, the consumer cannot commit its offsets, so the messages will be re-processed.

The odd thing is that the consumer reverted to an offset it had successfully processed 17 hours earlier. A jump that large is well outside normal re-processing during a rebalance. I wonder if the consumer is going back to the beginning of the partition. Is there a bug or flow that could cause this? Could you point me to the relevant rebalancing logic?

Note that I don't have a good way of identifying what the first offset in this partition was at the time of the incident. Our retention policy is 3 days, though, so if the consumer had gone back to the beginning of the partition, it should have gone even further back.

What is the offset commit config for your consumers?

We don't set a custom offset commit config; we use Racecar's default offset_commit_interval of 10 seconds. Is that the info you were looking for?

session timeout is 5 minutes

I'll remove our override and change this back to the default of 30 seconds. We added this config for Racecar versions < 2.0.0 to avoid cases where the whole batch could not be processed in time.

by opting to use rolling updates you'll get these sorts of issues. I'm also not quite sure why you'd get more availability – the whole group needs to re-sync every time a member comes or goes, so you'll have a prolonged period of chaos, during which double-processing will be more likely than not.

We were expecting that having some of the consumers keep working and committing their offsets, even if progress was slight, would be worth it. If you think this makes bigger issues more likely, then it makes sense to recreate pods instead.


dasch commented Oct 20, 2020

Try to simplify your config and see if that doesn't resolve the issue. It's tricky to debug these things, unfortunately. The problem could be in Racecar, rdkafka-ruby, or in librdkafka.


tjwp commented Oct 20, 2020

With the rebalancing during a rolling restart, could something like this be the issue: confluentinc/librdkafka#2782?

I have not run the rdkafka-based Racecar v2, but from looking at the source it appears that offsets are manually committed as in the steps described in the issue above.

The issue above was fixed in librdkafka v1.4.2, but the current rdkafka gem uses v1.4.0.
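
As a quick sanity check, the bundled librdkafka version can be printed from Ruby using the constants rdkafka-ruby exposes (small sketch):

```ruby
require "rdkafka"

# rdkafka-ruby reports both its own gem version and the librdkafka it links against.
puts "rdkafka gem: #{Rdkafka::VERSION}"
puts "librdkafka:  #{Rdkafka::LIBRDKAFKA_VERSION}" # would print 1.4.0 here, i.e. before the v1.4.2 fix
```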


Hammadk commented Oct 20, 2020

Great find @tjwp! Looking through confluentinc/librdkafka#2782, confluentinc/librdkafka#1767, and confluentinc/confluent-kafka-go#209, this does seem like the same issue. I guess we now wait for rdkafka-ruby to update to librdkafka v1.4.2 before this is resolved in racecar as well.

I have removed the session_timeout override in our config (we haven't needed it since racecar version > 2.0.0).

Thanks for looking into it, and please close this unless you want to keep it open as a reference.


Draiken commented Aug 3, 2022

We had an incident yesterday with this, but without any deployment. The code was running for weeks (with perhaps the occasional dyno restart).

A single partition was reset and suddenly reported a lag of 1.6M messages. I checked, and we didn't have a spike of messages being produced, so it was definitely the committed offset for that partition somehow getting reset to a much older value.

Given there was no deployment event I suspect it might indeed be confluentinc/librdkafka#2782 or a similar issue caused by rebalancing.

So is there no recourse here, aside from manually resetting the offset, until the upstream bug fix makes it into rdkafka-ruby and finally into racecar?
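
For reference, if we do have to repair this by hand in the meantime, the committed offset can be moved forward again either with Kafka's kafka-consumer-groups.sh --reset-offsets tool or by committing an explicit offset through rdkafka-ruby. A rough sketch of the latter, using placeholder broker, group, topic, partition, and offset values, and assuming the consumer group is stopped while it runs:

```ruby
require "rdkafka"

# Placeholder values - substitute your brokers, Racecar group id, topic, partition, and offset.
consumer = Rdkafka::Config.new(
  "bootstrap.servers" => "kafka:9092",
  "group.id"          => "my-racecar-group"
).consumer

# Commit the offset the group should resume from (i.e. the next offset to be read).
offsets = Rdkafka::Consumer::TopicPartitionList.new
offsets.add_topic_and_partitions_with_offsets("my-topic", 3 => 167_000_000)
consumer.commit(offsets)

consumer.close
```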


mensfeld commented Aug 3, 2022

@Draiken but racecar uses librdkafka 1.9.0, which has offset reset capabilities: https://github.com/edenhill/librdkafka/releases/tag/v1.9.0, and as far as I remember there is an error when the offset cannot be stored.


Draiken commented Aug 3, 2022

Oh damn, that was my bad. I looked at the commit that closed the parent issue, saw v1.9.2, and assumed the fix was only in that release...
You are correct that it was released in 1.9.0, so I can update racecar and get this fix.

TYVM @mensfeld


mensfeld commented Aug 3, 2022

Oh damn, that was my bad. I looked at the commit that closed the parent issue, saw v1.9.2, and assumed the fix was only in that release...

I made the same mistake and started updating rdkafka to 1.9.2 🙄

The patch for this is already in place in rdkafka, so after the upgrade you should see it go away.
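
In practice the upgrade is just a Gemfile bump to an rdkafka release that bundles librdkafka >= 1.9.0 (the exact version floor below is an assumption; check the rdkafka-ruby changelog for the release that actually ships it):

```ruby
# Gemfile - constraints shown here are illustrative, not authoritative.
gem "racecar"
gem "rdkafka", ">= 0.12.0" # believed to be the first release bundling librdkafka 1.9.0; verify before pinning
```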
