ACK timeout kills connection without getting restarted #106

Open
D4no0 opened this issue Sep 24, 2021 · 4 comments
Labels
Kind:Bug Something isn't working

Comments

@D4no0

D4no0 commented Sep 24, 2021

versions:
broadway: 1.0.0
broadway_rabbitmq: 0.7.0
amqp: 2.1
elixir: 1.12
otp: 24.0.5

I have some long-running tasks that may sometimes exceed RabbitMQ's consumer_timeout, which produces this message:

09:38:26.044 [warn] AMQP channel went down with reason: {:shutdown, {:server_initiated_close, 406, "PRECONDITION_FAILED - delivery acknowledgement on channel 1 timed out. Timeout value used: 7200000 ms. This timeout value can be configured, see consumers doc guide to learn more"}}

The expected behavior would be for Broadway to establish a new connection and kill the timed-out processors, and for RabbitMQ to redeliver the messages.

The current behavior is that the GenServer is killed and Broadway can no longer send messages to RabbitMQ. The only fix is to restart the Broadway process.

@josevalim
Member

Thanks for the report. The channel going down, as reported in that log, immediately causes the client to reconnect, so I am assuming there is something more at play here: https://github.com/dashbitco/broadway_rabbitmq/blob/master/lib/broadway_rabbitmq/producer.ex#L527-L536

@D4no0
Author

D4no0 commented Sep 27, 2021

The error that follows comes from a GenServer call made after the GenServer has gone down:

07:57:38.686 [error] ** (exit) exited in: :gen_server.call(#PID<0.6829.0>, {:call, {:"basic.ack", 6, false}, :none, #PID<0.2649.0>}, 70000)
** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started

@whatyouhide
Collaborator

So, we ack from a different process than the RabbitMQ producer (the processor or batcher acks). I don't think we can "save" the ack if the channel is down. What we can do, however, is have a better error message from Broadway, which is what I did with #122. I think for now that's pretty much it. 😞 Eventually the producer should reconnect.
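For illustration, here is a minimal sketch of the failure mode described above: a call made to a process that has already exited, caught and turned into an error tuple instead of crashing the caller. `AckSketch` and `safe_call/3` are hypothetical names used only for this example; they are not part of Broadway or broadway_rabbitmq, and an `Agent` stands in for the dead AMQP channel process.

```elixir
defmodule AckSketch do
  # Hypothetical helper, NOT part of broadway_rabbitmq.
  # Wraps a GenServer call so that a dead target process yields an
  # error tuple instead of an exit in the calling process.
  def safe_call(pid, request, timeout \\ 5_000) do
    {:ok, GenServer.call(pid, request, timeout)}
  catch
    # Calling a dead pid exits with {:noproc, call_info}.
    :exit, {:noproc, _call_info} -> {:error, :noproc}
    :exit, reason -> {:error, reason}
  end
end

# Stand-in for a channel process that has already gone down.
{:ok, pid} = Agent.start(fn -> :ok end)
:ok = Agent.stop(pid)

# The "ack" now targets a dead process, as in the report above.
result = AckSketch.safe_call(pid, :ack)
IO.inspect(result)
```

Catching the exit only makes the failure visible to the caller; it does not recover the acknowledgement itself.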

@v-anyukov

We ran into a related issue: ack timeout -> channel closed by the RabbitMQ server -> while Broadway reconnects, several log messages report that messages cannot be acked/rejected because the channel is dead -> more ack timeouts, accumulating every 30 minutes (the default RabbitMQ consumer timeout). Eventually we end up with a lot of channel reconnects, but the worst part is that RabbitMQ appears to keep all Mnesia segments containing unacked messages; with a 30-minute timeout and high throughput, this eats disk space quickly. We are going to try a shorter timeout, as our ingestion is intended to be pretty fast.

Regarding the topic: does it make any sense to retry ack/reject several times when the channel is not alive? Another option would be to at least give some control over messages Broadway is unable to ack/reject, something like handle_ack_error or similar.
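As a rough sketch, a shorter timeout could be set in `rabbitmq.conf` along these lines. The 5-minute value is only an illustrative assumption, not a figure from this thread; the setting is expressed in milliseconds and applies to the whole node.

```ini
# rabbitmq.conf
# Delivery acknowledgement timeout, in milliseconds.
# 300000 ms = 5 minutes (illustrative value; pick one that exceeds
# the longest expected processing time for a single message).
consumer_timeout = 300000
```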
