Validator client doesn't crash due to unresponsive beacon node. #1743
Conversation
```python
    duty.tick_for_execution.slot,
    duty.committee_index,
)
except OSError as err:
```
I'm not very familiar with the eth2 code but I just want to note that generally catching `OSError` in that way may lead to accidentally catching a whole range of unrelated errors, which can make debugging harder.
If we have to catch the `OSError`, then maybe we just catch it here and re-raise it as a more specific error.
trinity/eth2/validator_client/beacon_node.py, lines 50 to 51 in 32f5231:

```python
except OSError:
    raise
```
Something like `BeaconNodeUnresponsive`, or I actually wonder if that one would even qualify as a `TimeoutError` 🤔
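For concreteness, a re-raise along those lines might look like the sketch below. This is not the code in this PR: the exception name, the `_get_json` helper, and the aiohttp-style session are assumptions used purely for illustration.

```python
# Sketch only: BeaconNodeUnresponsive and _get_json are hypothetical names,
# and an aiohttp-style ClientSession is assumed.
class BeaconNodeUnresponsive(Exception):
    """Raised when the beacon node cannot be reached."""


async def _get_json(session, url):
    try:
        async with session.get(url) as response:
            return await response.json()
    except OSError as err:
        # Narrow the very broad OSError into a domain-specific error so that
        # callers do not need to know which low-level exceptions can occur.
        raise BeaconNodeUnresponsive(f"request to {url} failed") from err
```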
Makes sense. I'm assuming this applies to `_post_json` as well, so I've updated that too.
`OSError` is along the exception chain when the TCP connection cannot be made -- if there is a more specific exception then let's use that.
the context here is that the validator client will try to dial the beacon node host and if the port isn't open (or also i imagine the ip is not reachable) then we get an exception. it is definitely worth looking into a more specific exception for this case
> the context here is that the validator client will try to dial the beacon node host and if the port isn't open (or also i imagine the ip is not reachable) then we get an exception
Correct, which is why something like `BeaconNodeUnreachable` or `TimeoutError` seems more suitable to be used across multiple callsites, because catching `OSError` among many callsites may accidentally catch unrelated errors.
The current state of the PR reflects the idea of turning the `OSError` early into a `TimeoutError`, but I'm not 100% sold on the idea (even though I brought it up). I think creating a custom `BeaconNodeUnreachable` (derived from `BaseTrinityError`) may be the most straightforward.
I suppose `TimeoutError` would be misleading. As Alex noted, the node could be offline or running at a different address and therefore unreachable, which is probably going to be the case most of the time. I think `BeaconNodeUnreachable` is clearer, but might be misleading if the request has actually timed out or returned a server error. Maybe we should go with `BeaconNodeRequestFailed`?
I'm fine with `BeaconNodeRequestFailed` 👍
the general concern i have with this approach is that it leaks details about the beacon node into the rest of the validator client.
we have to now find all the current (and future!) places where we call the `beacon_node` and make sure to catch a possible failure.
this is error prone -- for example if we merge this in we will have to install the same error handler for the block proposal duty in `duty_scheduler.py`.
it also mixes concerns about a particular kind of beacon node implementation (the HTTP one, that could fail) with other possible implementations (like one over a local unix socket or one "in-memory" that should really not fail in this kind of way...).
i could suggest an alternative, it just requires some deeper editing of the beacon node client and i wasn't sure how that would land with the work in #1712
the alternative is to effectively serialize every http request through a "dispatcher" -- in this one place, we can catch the `OSError` (and if possible find a more specific exception for the lack of connection -- if anyone is curious, let's consider use of `socket.connect_ex`) and log a warning or just do nothing...
the callers of this functionality just get empty objects upon this kind of error (e.g. the `attestation` is `None` on line 36 of `duty_scheduler.py` in this PR).
in terms of a "dispatcher", the simplest change (although it is higher volume) is to catch `OSError` within each of the external-facing functions in the beacon node client class -- to minimize the volume of code here, i would recommend piping every request through a central location inside that class so we only have to write the error handling logic once... happy to elaborate if something here didn't make sense.
i'm also honestly not sure if there aren't other places we need to catch this exception if we keep it external.
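A rough sketch of that dispatcher shape, with the same caveat that none of these names come from the PR (the endpoint path, method names, and an aiohttp-style session are all assumptions):

```python
import logging

logger = logging.getLogger("validator_client.beacon_node")


class BeaconNode:
    """Sketch of a client that funnels every HTTP request through a single
    _dispatch method, so connection failures are handled in one place."""

    def __init__(self, session, base_url):
        self._session = session
        self._base_url = base_url

    async def _dispatch(self, method, path, **kwargs):
        url = self._base_url + path
        try:
            async with self._session.request(method, url, **kwargs) as resp:
                return await resp.json()
        except OSError:
            # The beacon node is unreachable (port closed, host down, ...).
            # Log a warning; callers simply receive None.
            logger.warning("beacon node request to %s failed", url)
            return None

    async def fetch_attestation(self, slot, committee_index):
        # If the node is unreachable this returns None, and the duty
        # scheduler can skip the attestation for this slot.
        return await self._dispatch(
            "GET",
            "/validator/attestation",  # hypothetical endpoint path
            params={"slot": str(slot), "committee_index": str(committee_index)},
        )
```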
trinity/exceptions.py (outdated):

```diff
@@ -148,3 +148,10 @@ class MetricsReportingError(BaseTrinityError):
     Raised when there is an error while reporting metrics.
     """
     pass
+
+
+class BeaconNodeRequestFailure(BaseTrinityError):
```
there may be some python conventions i'm missing but i'd think we want to define an exception as close to its possible source of creation as possible w/in the broader module/file hierarchy... if we moved this concern to be entirely inside the scope of the `BeaconNode` then we can even keep this exception private to that file/module
the general reason to do this is that i think it tends towards better maintainability and discoverability of the code base over time -- imagine you are exploring the trinity codebase and you suddenly have N exception types that tie you across not only different parts of the client stack, but different clients! it would quickly overflow my cognitive stack :)
Yeah, I think this makes sense to me for the most part. By dispatcher you mean a class that basically consumes the …
@g-r-a-n-t what do you want to do here? get it to a mergeable state? or just close? port to new trinity-eth2 repo...?
let's just close it
Closes #1737
What was wrong?
The validator client would crash if a request to a beacon node failed.
How was it fixed?
The validator client catches these exceptions and logs a warning instead of crashing.
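In outline, the duty-level handling is along these lines. This is a sketch, not a verbatim excerpt: `_execute_attestation_duty` and any duty attributes beyond those visible in the diff above are assumptions.

```python
import logging

logger = logging.getLogger("validator_client.duty_scheduler")


async def _execute_attestation_duty(beacon_node, duty):
    try:
        attestation = await beacon_node.fetch_attestation(
            duty.tick_for_execution.slot,
            duty.committee_index,
        )
    except OSError as err:
        # Instead of letting the exception bubble up and crash the validator
        # client, log a warning and skip this duty.
        logger.warning("failed to reach beacon node: %s; skipping duty", err)
        return None
    return attestation
```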