Producer doesn't handle not_leader_for_partition errors #396

Open
shamilpd opened this issue Jan 21, 2020 · 1 comment
Comments

@shamilpd

If the connection to a partition leader is broken, KafkaEx handles it by triggering a metadata request and updating its cached metadata. However, the leader can also change while the broker connection stays intact. For example, you can manually trigger a leader election or reassign partitions, or a network partition between a broker and ZooKeeper can make the controller think that a partition leader dropped out of the cluster.

If the metadata is not up to date in such cases, KafkaEx will attempt to produce to the old partition leader and will get a not_leader_for_partition error back. But this error is not handled (https://github.com/kafkaex/kafka_ex/blob/master/lib/kafka_ex/server.ex#L481). KafkaEx should trigger a metadata update when this happens and try the produce request again. It wasn't easy, but I did manage to reproduce the bug locally.
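To make the proposed behavior concrete, here is a rough, language-neutral sketch in Python. It is not KafkaEx code: `RetryingProducer`, `fetch_metadata`, `send`, and `max_retries` are all hypothetical names used only for illustration. The idea is simply that a not_leader_for_partition response triggers a metadata refresh and the produce request is then retried against the new leader.

```python
NOT_LEADER = "not_leader_for_partition"


class RetryingProducer:
    """Hypothetical client that caches partition leaders (not the KafkaEx API)."""

    def __init__(self, fetch_metadata, send, max_retries=1):
        self._fetch_metadata = fetch_metadata  # () -> {partition: broker_id}
        self._send = send  # (broker_id, partition, message) -> result code
        self._max_retries = max_retries
        self._leaders = fetch_metadata()  # cached metadata
        self.metadata_refreshes = 0

    def produce(self, partition, message):
        result = self._send(self._leaders[partition], partition, message)
        retries = 0
        # On a stale-leader error, refresh the cached metadata and retry.
        while result == NOT_LEADER and retries < self._max_retries:
            self._leaders = self._fetch_metadata()
            self.metadata_refreshes += 1
            retries += 1
            result = self._send(self._leaders[partition], partition, message)
        return result


# Fake single-partition cluster: partition 0 is initially led by broker 1.
cluster_leaders = {0: 1}


def fetch_metadata():
    return dict(cluster_leaders)


def send(broker_id, partition, message):
    # Only the current leader accepts the produce request.
    return "no_error" if cluster_leaders[partition] == broker_id else NOT_LEADER


producer = RetryingProducer(fetch_metadata, send)
cluster_leaders[0] = 2  # leader changes while the connection stays intact
first = producer.produce(0, "hello")
print(first, producer.metadata_refreshes)  # no_error 1
```

With this approach the stale metadata costs one extra round trip on the first failed request instead of failures until the next periodic refresh. A real fix would also need to bound retries and handle the case where the new leader is not yet elected.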

With the default metadata_update_interval of 30s, produce requests could fail for up to 30s before the cached metadata is refreshed, which is not acceptable for apps with high produce rates.

Once this problem is fixed, we would love a config to be able to disable the periodic metadata updates. If an app only produces messages and at a relatively high rate, these metadata updates don't add any value. Any change in the metadata would be noticed by the first produce request and subsequently updated. What do you think?

I started working on a PR to handle not_leader_for_partition. I have a question about this code: https://github.com/kafkaex/kafka_ex/blob/master/lib/kafka_ex/server.ex#L430-L439
Why does it call retrieve_metadata() and then update_metadata() afterwards? The first call seems redundant.

@joshuawscott
Member

@shamilpd My apologies, I thought I had already responded to this 😞

The fixes you're proposing seem reasonable.

As for the code you mention in server.ex, that code is a bit of a mess, so I'm frankly not surprised there's a redundant call. I don't see any reason for it, although I might be missing something. If the tests pass, then I would feel pretty comfortable with removing that.
