Attempt of better connection error handling / reconnect #347

ukrbublik · 2019-05-05T20:44:00Z

Related to issues #298, #190

When the Kafka connection is lost, handle the related connection errors gracefully.
Currently the supervision tree tries to recreate itself without any delay, which quickly exhausts max restarts and shuts down the :kafka_ex application completely.
I added a sleep (config sleep_for_reconnect, default 400 ms) in the server's init at the places that throw errors when there is no connection to Kafka.

Tested with docker-compose down && docker-compose up.
The :kafka_ex application does not crash but constantly tries to reconnect (with max_restarts/max_seconds/sleep_for_reconnect configured accordingly, 3/1/500 for example).
Or, if configured with max_restarts/max_seconds/sleep_for_reconnect of 20/10/400 for example, it will try to reconnect to Kafka for ~10s and then shut down.
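
For illustration, the first setting above might look like this in a config file. This is only a sketch: the three key names come from the description, but placing them all under `config :kafka_ex` is an assumption, and the arithmetic in the comments is a rough estimate, not a measured result.

```elixir
# config/config.exs (sketch; older projects would write `use Mix.Config`)
import Config

# "Keep reconnecting" profile (3/1/500 from the description).
# Assumption: each failed init sleeps ~500 ms, so at most about two restarts
# fit into any 1-second window, which stays under max_restarts = 3.
config :kafka_ex,
  max_restarts: 3,
  max_seconds: 1,
  sleep_for_reconnect: 500

# "Give up after a while" profile (20/10/400) would instead be:
#   max_restarts: 20, max_seconds: 10, sleep_for_reconnect: 400
# 20 restarts * 400 ms is about 8 s of sleeping, so the restart limit is
# reached inside the 10-second window and the application shuts down.
```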

Tests pass.

sourcelevel-bot · 2019-05-05T20:44:09Z

Hello, @ukrbublik! This is your first Pull Request that will be reviewed by Ebert, an automatic Code Review service. It will leave comments on this diff with potential issues and style violations found in the code as you push new commits. You can also see all the issues found on this Pull Request on its review page. Please check our documentation for more information.

- When the Kafka connection is lost, handle the related errors gracefully.
- Retry recreating the supervision tree with a timeout. Added config sleep_for_reconnect, default 400.
- Also fixed termination of ConsumerGroup.Manager to guarantee that the worker is stopped.

joshuawscott · 2019-05-08T19:48:36Z

lib/kafka_ex/consumer_group/heartbeat.ex

@@ -79,6 +79,11 @@ defmodule KafkaEx.ConsumerGroup.Heartbeat do
      %HeartbeatResponse{error_code: error_code} ->
        Logger.warn("Heartbeat failed, got error code #{error_code}")
        {:stop, {:shutdown, {:error, error_code}}, state}
+
+      {:error, reason} ->


It's not possible for this value to be returned from KafkaEx.heartbeat/2 according to dialyzer.

ukrbublik · 2019-05-08

I'm sure it's possible; I've checked this on a live application.

KafkaEx.heartbeat is called here:

kafka_ex/lib/kafka_ex/consumer_group/heartbeat.ex, line 69 in a746114:
    case KafkaEx.heartbeat(heartbeat_request, worker_name: worker_name) do

-> KafkaEx.heartbeat
kafka_ex/lib/kafka_ex.ex, line 146 in a746114:
    Server.call(worker_name, {:heartbeat, request, timeout}, opts)

-> KafkaEx.Server.handle_call({:heartbeat...
kafka_ex/lib/kafka_ex/server.ex, line 336 in a746114:
    kafka_server_heartbeat(request, network_timeout, state)

-> KafkaEx.Server0P9P0.kafka_server_heartbeat
kafka_ex/lib/kafka_ex/server_0_p_9_p_0.ex, line 186 in a746114:
    consumer_group_sync_request(

-> KafkaEx.Server0P9P0.consumer_group_sync_request
kafka_ex/lib/kafka_ex/server_0_p_9_p_0.ex, line 228 in a746114:
    {{:error, reason}, state_out}

joshuawscott · 2019-05-08

Heh, that's unfortunate. I suspect we need to fix the @spec for KafkaEx.heartbeat then, so that Dialyzer will pass.
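
To make the point of this thread concrete, here is a minimal, self-contained sketch of a widened return type and a caller that handles both shapes. `HeartbeatSketch`, its nested `Response` struct, and `handle/2` are hypothetical stand-ins, not KafkaEx modules. The body of the new `{:error, reason}` clause is not shown in the hunk above, so stopping with a `:shutdown` reason is an assumption modeled on the existing error-code clause.

```elixir
defmodule HeartbeatSketch do
  # Hypothetical stand-in for the library's heartbeat response struct.
  defmodule Response do
    defstruct error_code: :no_error
    @type t :: %__MODULE__{error_code: atom}
  end

  # Widened result type: a response struct OR an error tuple, so Dialyzer
  # no longer considers the {:error, reason} clause unreachable.
  @type heartbeat_result :: Response.t() | {:error, atom}

  @spec handle(heartbeat_result, term) ::
          {:noreply, term} | {:stop, {:shutdown, {:error, atom}}, term}
  def handle(%Response{error_code: :no_error}, state) do
    {:noreply, state}
  end

  def handle(%Response{error_code: error_code}, state) do
    # Mirrors the existing clause in the diff hunk.
    {:stop, {:shutdown, {:error, error_code}}, state}
  end

  def handle({:error, reason}, state) do
    # Assumed behavior: treat a connection-level error like a failed heartbeat.
    {:stop, {:shutdown, {:error, reason}}, state}
  end
end
```

With a spec like this, a call such as `HeartbeatSketch.handle({:error, :timeout}, state)` is both reachable at runtime and accepted by Dialyzer.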

joshuawscott · 2019-05-08T19:49:22Z

lib/kafka_ex/consumer_group/manager.ex

+      %{error_code: error_code} ->
+        raise "Error joining consumer group #{group_name}: " <>
+                "#{inspect(error_code)}"
+      {:error, reason} ->


Same here - dialyzer says that join_response can never be {:error, _}

ukrbublik · 2019-05-08

I also checked this on a live application: the exception can be raised from line 268.

joshuawscott · 2019-05-08T19:49:51Z

lib/kafka_ex/consumer_group/manager.ex

+          "Received error #{inspect(error_code)}, " <>
+            "consumer group manager will exit regardless."
+        end)
+      {:error, reason} ->


Also here, Dialyzer catches this as not possible

ukrbublik · 2019-05-11T22:19:37Z

@joshuawscott I've updated the specs; Dialyzer passes.

sourcelevel-bot · 2019-05-19T23:21:40Z

Ebert has finished reviewing this Pull Request and has found:

10 fixed issues! 🎉

But beware that this branch is 1 commit behind the kafkaex:master branch, and a review of an up to date branch would produce more accurate results.

You can see more details about this review at https://ebertapp.io/github/kafkaex/kafka_ex/pulls/347.

joshuawscott

This looks good to me. Thanks again!

Attempt of better connection error handling / reconnect

ukrbublik force-pushed the handle_connection_loss branch from a342e3f to a746114 Compare May 5, 2019 20:57

ukrbublik changed the title ~~Attempt of better error handling (#190)~~ Attempt of better connection error handling / reconnect May 5, 2019

joshuawscott reviewed May 8, 2019

Fixed specs in responses where {:error, atom} is possible on Kafka shutdown

34d2b7f

ukrbublik force-pushed the handle_connection_loss branch 2 times, most recently from 0cb335d to aa20483 Compare May 19, 2019 01:28

ukrbublik added 2 commits May 20, 2019 02:20

stutdown manager on unrecoverable error

60ac0a2

restart consumer group together with manager consumer group

ec9299b

ukrbublik force-pushed the handle_connection_loss branch from aa20483 to ec9299b Compare May 19, 2019 23:21

joshuawscott approved these changes May 21, 2019

joshuawscott merged commit 8ff19af into kafkaex:master Jun 3, 2019

andrielfn pushed a commit to TheRealReal/kafka_ex that referenced this pull request Nov 27, 2019

Merge pull request kafkaex#347 from ukrbublik/handle_connection_loss

caeeb07

Attempt of better connection error handling / reconnect

joshuawscott mentioned this pull request Jul 14, 2020

Release 0.11.0 #411

Merged
