Increase kafka-python resilience in lambda-like environments #2386

petterroea · 2023-08-09T01:58:56Z

I am using kafka-python in AWS Lambda to produce messages to a Kafka cluster. My lambda runs every 5 minutes, and only for a few ms at a time. The execution environment, including background threads, is paused between invocations. AWS recommends keeping database connections and similar resources open between executions so you don't have to re-establish them before each invocation. See https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtime-environment.html#runtimes-lifecycle-shutdown

kafka-python currently doesn't handle this kind of execution environment well. If the connection takes several lambda executions to establish, it will time out unless it is CONNECTED by the end of conn.py's connect() (https://github.com/dpkp/kafka-python/blob/master/kafka/conn.py#L359), even if the connection state is progressing towards being connected. In practice, this function is called in a loop until the connection is established or fails.

The naive fix is to increase request_timeout_ms. However, this configuration variable is shared with other features, such as sending messages. In Lambda, it can be useful to allow kafka-python to spend a long time connecting, but have it fail quickly when sending so that a broken connection is detected early. We flush kafka-python at the end of every lambda execution using a decorator, which should force the library to attempt to establish a connection and then send if needed.
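
A minimal sketch of what such a flush decorator could look like, for illustration only. The name flush_after is hypothetical, and the only thing assumed about the producer is that it has a flush(timeout=...) method, which kafka-python's KafkaProducer provides.

```python
import functools


def flush_after(producer, timeout=None):
    """Build a decorator that flushes `producer` after every handler call.

    `producer` is any object exposing flush(timeout=...), for example a
    kafka-python KafkaProducer created once at module import time.
    """
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            try:
                return handler(event, context)
            finally:
                # Flush even if the handler raised, so buffered messages are
                # pushed out before the execution environment is paused.
                producer.flush(timeout=timeout)
        return wrapper
    return decorator
```

The producer would be created outside the handler so that it survives between invocations, and the handler would be wrapped with @flush_after(producer, timeout=...).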

Suggested fix

I suggest two fixes that together help minimize the problem:

Add a separate configuration variable, connection_timeout_ms, that optionally configures the timeout for establishing connections. If not set, it defaults to request_timeout_ms to preserve backwards compatibility. This lets us be extra lenient with establishing connections when running in Lambda.
Update self.last_attempt multiple times during the connection phase, so that we become more tolerant of the connection taking time to establish as long as progress is made. The variable would probably need to be renamed to reflect the change in meaning.
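
The two ideas above can be sketched in isolation as follows. This is not the code in conn.py or in the PR; ConnectTimer, mark_progress and timed_out are hypothetical names used only to illustrate the fallback to request_timeout_ms and a timeout measured from the last progress rather than from the first attempt.

```python
import time


class ConnectTimer:
    """Illustrative timer: times out only when no progress has been made
    for longer than the configured connection timeout."""

    def __init__(self, config, clock=time.monotonic):
        # Fix 1: use connection_timeout_ms if set, else fall back to
        # request_timeout_ms so existing configurations behave as before.
        timeout_ms = config.get("connection_timeout_ms")
        if timeout_ms is None:
            timeout_ms = config["request_timeout_ms"]
        self.timeout_s = timeout_ms / 1000.0
        self._clock = clock
        self.last_progress = clock()

    def mark_progress(self):
        # Fix 2: call this on every connection state transition, not just
        # when the first attempt starts.
        self.last_progress = self._clock()

    def timed_out(self):
        return self._clock() - self.last_progress > self.timeout_s
```

With this shape, a connection whose state keeps advancing across several short invocations is not failed just because the total elapsed wall-clock time exceeds the timeout.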

These two fixes have been implemented and tested with my lambdas and solve the issue for me. I can create a PR with them, or discuss other solutions.

Thanks!
Liam

petterroea · 2023-08-09T02:01:23Z

In addition to this, the connection is closed from the Kafka side after 10 minutes (connections.max.idle.ms controls this). This is no biggie because the connection can be re-established, but it creates some noise in our logs. The naive solution here is to increase the timeout (this increases the chance that one of the lambda invocations gives the kafka thread enough time to send some packets), but is there a better option? A wait_for_stable_connections function sounds nice, but would probably be problematic in the case of a Kafka outage. Disabling errors on connections closed from the other side?

petterroea linked a pull request Aug 9, 2023 that will close this issue

Add connection_timeout_ms and reset the timeout counter more often #2388

Open

wbarnha linked a pull request Aug 9, 2023 that will close this issue

Add connection_timeout_ms and reset the timeout counter more often #2388

Open

wbarnha added the enhancement label Aug 12, 2023

wbarnha mentioned this issue Mar 8, 2024

Add connection_timeout_ms and reset the timeout counter more often wbarnha/kafka-python-ng#132

Merged
