This repository has been archived by the owner on Jul 1, 2021. It is now read-only.

Validator client doesn't crash due to unresponsive beacon node. #1743

Closed
g-r-a-n-t wants to merge 5 commits

Conversation

g-r-a-n-t (Member) commented May 25, 2020

Closes #1737

What was wrong?

The validator client would crash if a request to a beacon node failed.

How was it fixed?

The validator client catches these exceptions and logs a warning instead of crashing.

To-Do

  • Clean up commit history

Cute Animal Picture

Commits

  • Cleaner exception handling.
  • removed extra space
  • removed extra space
  • lint
duty.tick_for_execution.slot,
duty.committee_index,
)
except OSError as err:
cburgdorf (Contributor) commented May 25, 2020

I'm not very familiar with the eth2 code but I just want to note that generally catching OSError in that way may lead to accidentally catching a whole range of unrelated errors which can make debugging harder.

If we have to catch the OSError then maybe we just catch it here and re-raise it as a more specific error.

except OSError as err:
    raise BeaconNodeUnresponsive() from err

Something like BeaconNodeUnresponsive or I actually wonder if that one would even qualify as a TimeoutError 🤔

g-r-a-n-t (Member, Author)

Makes sense. I'm assuming this applies to _post_json as well, so I've updated that too.

Member

OSError is along the exception chain when the tcp connection cannot be made -- if there is a more specific exception then let's use that.

Member

the context here is that the validator client will try to dial the beacon node host and if the port isn't open (or also i imagine the ip is not reachable) then we get an exception. it is definitely worth looking into a more specific exception for this case

cburgdorf (Contributor) commented May 25, 2020

the context here is that the validator client will try to dial the beacon node host and if the port isn't open (or also i imagine the ip is not reachable) then we get an exception

Correct, which is why something like BeaconNodeUnreachable or TimeoutError seems more suitable for use across multiple callsites, because catching OSError at many callsites may accidentally catch unrelated errors.

The current state of the PR reflects the idea of turning the OSError early into a TimeoutError, but I'm not 100% sold on the idea (even though I brought it up). I think creating a custom BeaconNodeUnreachable (derived from BaseTrinityError) may be the most straightforward.

g-r-a-n-t (Member, Author)

I suppose TimeoutError would be misleading. As Alex noted, the node could be offline or running at a different address and therefore unreachable, which is probably going to be the case most of the time. I think BeaconNodeUnreachable is clearer, but might be misleading if the request has actually timed out or returned a server error. Maybe we should go with BeaconNodeRequestFailed?

cburgdorf (Contributor)

I'm fine with BeaconNodeRequestFailed 👍

ralexstokes (Member) left a comment

the general concern i have with this approach is that it leaks details about the beacon node into the rest of the validator client.

we have to now find all the current (and future!) places where we call the beacon_node and make sure to catch a possible failure.

this is error prone -- for example if we merge this in we will have to install the same error handler for the block proposal duty in duty_scheduler.py.

it also mixes concerns about a particular kind of beacon node implementation (the HTTP one, that could fail) with other possible implementations (like one over a local unix socket or one "in-memory" that should really not fail in this kind of way...).

i could suggest an alternative; it just requires some deeper editing of the beacon node client and i wasn't sure how that would land with the work in #1712

the alternative is to effectively serialize every http request through a "dispatcher" -- in this one place, we can catch the OSError (and if possible find a more specific exception for the lack of connection -- if anyone is curious, let's consider use of socket.connect_ex) and log a warning or just do nothing...
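The `socket.connect_ex` call mentioned above is a way to probe for the specific "connection cannot be made" case: unlike `connect()`, it returns an errno instead of raising, so reachability can be checked cheaply before (or instead of) catching a broad OSError. A minimal sketch, with an illustrative helper name:

```python
import socket


def beacon_node_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    # connect_ex returns 0 when the TCP dial succeeds, otherwise an errno
    # (e.g. ECONNREFUSED when the port is closed) without raising.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0
```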

the callers of this functionality just get empty objects upon this kind of error (e.g. the attestation is None on line 36 of duty_scheduler.py in this PR).

in terms of a "dispatcher", the simplest change (although it is higher volume) is to catch OSError within each of the external-facing functions in the beacon node client class -- to minimize the volume of code here, i would recommend piping every request through a central location inside that class so we only have to write the error handling logic once... happy to elaborate if something here didn't make sense.
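The dispatcher idea above can be sketched like this: every HTTP call funnels through one central method, so the OSError handling is written exactly once and callers simply receive None on failure. The method names mirror the thread's `_get_json`/`_post_json`, but the class body, host handling, and request path are illustrative, not Trinity's actual code.

```python
import json
import http.client
import logging
from typing import Any, Optional

logger = logging.getLogger("validator_client")


class BeaconNode:
    def __init__(self, host: str) -> None:
        self._host = host

    def _dispatch(self, method: str, path: str,
                  body: Optional[bytes] = None) -> Optional[Any]:
        # Single choke point: the connection-failure handling lives only here.
        try:
            conn = http.client.HTTPConnection(self._host, timeout=2)
            conn.request(method, path, body=body)
            return json.loads(conn.getresponse().read())
        except OSError:
            logger.warning("beacon node at %s unreachable for %s %s",
                           self._host, method, path)
            return None  # callers get an empty result instead of a crash

    def _get_json(self, path: str) -> Optional[Any]:
        return self._dispatch("GET", path)

    def _post_json(self, path: str, payload: Any) -> Optional[Any]:
        return self._dispatch("POST", path, json.dumps(payload).encode())
```

With this shape, adding a new external-facing method never requires new error handling: it just pipes through `_dispatch`.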

ralexstokes (Member)

i'm also honestly not sure if there aren't other places we need to catch this exception if we keep it external to the BeaconNode class. and we can update the API to always return the object (rather than Optional[Object]) if we stick w/ this "exception external to the class" approach.

@@ -148,3 +148,10 @@ class MetricsReportingError(BaseTrinityError):
    """
    Raised when there is an error while reporting metrics.
    """
    pass


class BeaconNodeRequestFailure(BaseTrinityError):
    """
    Raised when a request to the beacon node fails.
    """
    pass
Member

there may be some python conventions i'm missing but i'd think we want to define an exception as close to its possible source of creation as possible w/in the broader module/file hierarchy.... if we moved this concern to be entirely inside the scope of the BeaconNode then we can even keep this exception private to that file/module

Member

the general reason to do this is that i think it tends towards better maintainability and discoverability of the code base over time -- imagine you are exploring the trinity codebase and you suddenly have N exception types that tie you across not only different parts of the client stack, but different clients! it would quickly overflow my cognitive stack :)

g-r-a-n-t (Member, Author)

Yeah, I think this makes sense to me for the most part. By dispatcher you mean a class that basically consumes the _post_json, _get_json and _get_... methods in beacon_node.py, right?

ralexstokes (Member)

@g-r-a-n-t what do you want to do here? get it to a mergeable state? or just close? port to new trinity-eth2 repo...?

g-r-a-n-t (Member, Author)

let's just close it

@g-r-a-n-t g-r-a-n-t closed this Aug 21, 2020