Validator client doesn't crash due to unresponsive beacon node. #1743
Conversation
```python
    duty.tick_for_execution.slot,
    duty.committee_index,
)
except OSError as err:
```
I'm not very familiar with the eth2 code but I just want to note that generally catching `OSError` in that way may lead to accidentally catching a whole range of unrelated errors, which can make debugging harder.
If we have to catch the `OSError`, then maybe we just catch it here and re-raise it as a more specific error.
trinity/eth2/validator_client/beacon_node.py, lines 50 to 51 in 32f5231:

```python
except OSError:
    raise
```
Something like `BeaconNodeUnresponsive`, or I actually wonder if that one would even qualify as a `TimeoutError` 🤔
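For concreteness, a re-raise along those lines might look like the sketch below. This is not the code in this PR: the exception name, the `_get_json` helper, and the aiohttp-style session are assumptions used purely for illustration.

```python
# Sketch only: BeaconNodeUnresponsive and _get_json are hypothetical names,
# and an aiohttp-style ClientSession is assumed.
class BeaconNodeUnresponsive(Exception):
    """Raised when the beacon node cannot be reached."""


async def _get_json(session, url):
    try:
        async with session.get(url) as response:
            return await response.json()
    except OSError as err:
        # Narrow the very broad OSError into a domain-specific error so that
        # callers do not need to know which low-level exceptions can occur.
        raise BeaconNodeUnresponsive(f"request to {url} failed") from err
```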
Makes sense. I'm assuming this applies to `_post_json` as well, so I've updated that too.
`OSError` is along the exception chain when the TCP connection cannot be made -- if there is a more specific exception then let's use that.
the context here is that the validator client will try to dial the beacon node host and if the port isn't open (or also i imagine the ip is not reachable) then we get an exception. it is definitely worth looking into a more specific exception for this case
> the context here is that the validator client will try to dial the beacon node host and if the port isn't open (or also i imagine the ip is not reachable) then we get an exception
Correct, which is why something like `BeaconNodeUnreachable` or `TimeoutError` seems more suitable to be used across multiple callsites, because catching `OSError` among many callsites may accidentally catch unrelated errors.
The current state of the PR reflects the idea of turning the `OSError` early into a `TimeoutError`, but I'm not 100% sold on the idea (even though I brought it up). I think creating a custom `BeaconNodeUnreachable` (derived from `BaseTrinityError`) may be the most straightforward.
I suppose `TimeoutError` would be misleading. As Alex noted, the node could be offline or running at a different address and therefore unreachable, which is probably going to be the case most of the time. I think `BeaconNodeUnreachable` is clearer, but might be misleading if the request has actually timed out or returned a server error. Maybe we should go with `BeaconNodeRequestFailed`?
I'm fine with `BeaconNodeRequestFailed` 👍
the general concern i have with this approach is that it leaks details about the beacon node into the rest of the validator client.
we have to now find all the current (and future!) places where we call the `beacon_node` and make sure to catch a possible failure.
this is error prone -- for example if we merge this in we will have to install the same error handler for the block proposal duty in `duty_scheduler.py`.
it also mixes concerns about a particular kind of beacon node implementation (the HTTP one, that could fail) with other possible implementations (like one over a local unix socket or one "in-memory" that should really not fail in this kind of way...).
i could suggest an alternative, it just requires some deeper editing of the beacon node client and i wasn't sure how that would land with the work in #1712
the alternative is to effectively serialize every http request through a "dispatcher" -- in this one place, we can catch the `OSError` (and if possible find a more specific exception for the lack of connection -- if anyone is curious, let's consider use of `socket.connect_ex`) and log a warning or just do nothing...
the callers of this functionality just get empty objects upon this kind of error (e.g. the `attestation` is `None` on line 36 of `duty_scheduler.py` in this PR).
in terms of a "dispatcher", the simplest change (although it is higher volume) is to catch `OSError` within each of the external-facing functions in the beacon node client class -- to minimize the volume of code here, i would recommend piping every request through a central location inside that class so we only have to write the error handling logic once... happy to elaborate if something here didn't make sense.
i'm also honestly not sure if there aren't other places we need to catch this exception if we keep it external.
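A rough sketch of that dispatcher shape, with the same caveat that none of these names come from the PR (the endpoint path, method names, and an aiohttp-style session are all assumptions):

```python
import logging

logger = logging.getLogger("validator_client.beacon_node")


class BeaconNode:
    """Sketch of a client that funnels every HTTP request through a single
    _dispatch method, so connection failures are handled in one place."""

    def __init__(self, session, base_url):
        self._session = session
        self._base_url = base_url

    async def _dispatch(self, method, path, **kwargs):
        url = self._base_url + path
        try:
            async with self._session.request(method, url, **kwargs) as resp:
                return await resp.json()
        except OSError:
            # The beacon node is unreachable (port closed, host down, ...).
            # Log a warning; callers simply receive None.
            logger.warning("beacon node request to %s failed", url)
            return None

    async def fetch_attestation(self, slot, committee_index):
        # If the node is unreachable this returns None, and the duty
        # scheduler can skip the attestation for this slot.
        return await self._dispatch(
            "GET",
            "/validator/attestation",  # hypothetical endpoint path
            params={"slot": str(slot), "committee_index": str(committee_index)},
        )
```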
trinity/exceptions.py (outdated):

```diff
@@ -148,3 +148,10 @@ class MetricsReportingError(BaseTrinityError):
     Raised when there is an error while reporting metrics.
     """
     pass
+
+
+class BeaconNodeRequestFailure(BaseTrinityError):
```
there may be some python conventions i'm missing but i'd think we want to define an exception as close to its possible source of creation as possible w/in the broader module/file hierarchy... if we moved this concern to be entirely inside the scope of the `BeaconNode` then we can even keep this exception private to that file/module
the general reason to do this is that i think it tends towards better maintainability and discoverability of the code base over time -- imagine you are exploring the trinity codebase and you suddenly have N exception types that tie you across not only different parts of the client stack, but different clients! it would quickly overflow my cognitive stack :)
Yeah, I think this makes sense to me for the most part. By dispatcher you mean a class that basically consumes the …
@g-r-a-n-t what do you want to do here? get it to a mergeable state? or just close? port to new trinity-eth2 repo...?
let's just close it
Closes #1737
What was wrong?
The validator client would crash if a request to a beacon node failed.
How was it fixed?
The validator client catches these exceptions and logs a warning instead of crashing.
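In outline, the duty-level handling is along these lines. This is a sketch, not a verbatim excerpt: `_execute_attestation_duty` and any duty attributes beyond those visible in the diff above are assumptions.

```python
import logging

logger = logging.getLogger("validator_client.duty_scheduler")


async def _execute_attestation_duty(beacon_node, duty):
    try:
        attestation = await beacon_node.fetch_attestation(
            duty.tick_for_execution.slot,
            duty.committee_index,
        )
    except OSError as err:
        # Instead of letting the exception bubble up and crash the validator
        # client, log a warning and skip this duty.
        logger.warning("failed to reach beacon node: %s; skipping duty", err)
        return None
    return attestation
```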