Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: node gateway error handling improvements #124

Merged
merged 6 commits into from
Dec 19, 2023
Merged

Conversation

brianluong
Copy link
Member

@brianluong brianluong commented Dec 19, 2023

Description

This task to essentially fix the response_code tag on node_gateway_router_upstream_rpc_request_errors.count ended up just a tad bit larger. I ended up making some improvements to how we handle errors, and some small misc things.

Type of change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • 😎 New feature (non-breaking change which adds functionality)
  • ⁉️ Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • ⚒️ Refactor (no functional changes)
  • 📖 Documentation (updating or adding docs)

How Has This Been Tested?

Testing local and staging before prod.

@@ -94,7 +94,7 @@ var (
"Batches are deconstructed to single JSON RPC requests for this metric.",
},
// jsonrpc_method is "batch" for batch requests
[]string{"chain_name", "client", "upstream_id", "url", "jsonrpc_method", "response_code", "jsonrpc_error_code"},
[]string{"chain_name", "client", "upstream_id", "url", "jsonrpc_method", "jsonrpc_error_code"},
Copy link
Member Author

@brianluong brianluong Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't feel HTTP code on a JSON RPC metric was necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed!

)

originFunc := func() (*jsonrpc.SingleResponseBody, error) {
var err error

// Any errors will result in respBody and resp being nil.
respBody, resp, err = r.getResponseBody(httpReq, &requestBody, configToRoute) //nolint:bodyclose // linter bug
jsonRPCRespBody, httpResp, err = r.getResponseBody(httpReq, &requestBody, configToRoute) //nolint:bodyclose // linter bug
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lil bit of renaming to jsonRPCRespBody and httpResp in a few places. Thought this naming would be more clear.

func (r *RequestExecutor) routeToConfig(
ctx context.Context,
requestBody jsonrpc.RequestBody,
configToRoute *config.UpstreamConfig,
) (jsonrpc.ResponseBody, *http.Response, bool, error) {
) (jsonrpc.ResponseBody, *HttpResponse, bool, error) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't think it was necessary to plumb the full *http.Response object all the way back to the front. We can close it and pass back some specific pieces of metadata we're interested in (e.g. httpStatusCode).

The impetus for this change was that I was getting confused when I was first looking at the code (I haven't looked at it in months!).

respBodyString := util.ReadAndCopyBackResponseBody(resp)
httpResp := &HttpResponse{resp.StatusCode}

if resp.StatusCode >= http.StatusBadRequest {
Copy link
Member Author

@brianluong brianluong Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We weren't doing a good job of handling HTTP responses w/ 4xx and 5xxs in them.

We were trying to deserialize their bodies and failing (most of the time, unless body was empty).

This caused our error metrics to have the wrong HTTP response code attached to them. Ex: I saw a 429 response in logs but it was a 500 in metrics.

HTTP responses with non 2xx//3xx codes should count as errors IMO.


if err != nil {
r.logger.Warn("Could not deserialize response.", zap.Any("request", requestBody),
zap.String("upstreamID", configToRoute.ID), zap.String("response", fmt.Sprintf("%v", resp)), zap.Error(err))
return nil, nil, &OriginError{err}
return nil, httpResp, err
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we see DecodeError from responses, we should return the response to the customer even though we can't deserialize it.

Passing the DecodeError back all the way back to the front so that we can do that.

As far as I'm aware, this isn't affecting anything right now. Just thought it was something nice to fix while I'm meddling around here.

@@ -155,7 +142,7 @@ func (r *SimpleRouter) Route(
strconv.FormatBool(cached),
).Inc()

if err != nil {
if err != nil || httpResponse.StatusCode > http.StatusBadRequest {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixes an issue where the HTTP response codes aren't added to the node_gateway_router_upstream_rpc_request_errors.count metric
Uploading Screenshot 2023-12-18 at 6.29.57 PM.png…

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Screenshot 2023-12-18 at 6 29 57 PM

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice find, do we want 400 as well? aka >= vs >?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

@brianluong brianluong changed the title node gateway visbility improvements fix: node gateway visbility improvements Dec 19, 2023
@brianluong brianluong changed the title fix: node gateway visbility improvements fix: node gateway error handling improvements Dec 19, 2023
Copy link
Member

@kevin-satsuma kevin-satsuma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good overall, just some small questions

@@ -94,7 +94,7 @@ var (
"Batches are deconstructed to single JSON RPC requests for this metric.",
},
// jsonrpc_method is "batch" for batch requests
[]string{"chain_name", "client", "upstream_id", "url", "jsonrpc_method", "response_code", "jsonrpc_error_code"},
[]string{"chain_name", "client", "upstream_id", "url", "jsonrpc_method", "jsonrpc_error_code"},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed!

@@ -155,7 +142,7 @@ func (r *SimpleRouter) Route(
strconv.FormatBool(cached),
).Inc()

if err != nil {
if err != nil || httpResponse.StatusCode > http.StatusBadRequest {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice find, do we want 400 as well? aka >= vs >?

Comment on lines +43 to +44
type HTTPResponse struct {
StatusCode int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just curious, what was the reason behind using this type instead of http.Response?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh sorry, i missed your comment : #124 (comment)

Copy link
Member Author

@brianluong brianluong Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intuition says that the response body, aka HTTP.Response.Body, should be read and discarded (body.close()) as soon as possible. Otherwise, the connection is held until body.close() is called.

And since http.Response.body is a field on http.Response, it seems to appropriate to apply the "read and discard ASAP" model to the entire object. If we need some data from the object, it's best to read it out into an user-defined structure.

Besides that, the body is a read-once object. How will other functions know if it has been read yet?

Comment on lines +62 to +73
// Still pass the response to client if we're not able to decode response from upstream.
case *jsonrpc.DecodeError:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wonder how this changes the behavior on the indexers. i'm assuming current decode errors get swallowed up into a generic 5XX or something, and the indexer will retry on the request.

once we add this change the indexer will potentially receive a poorly formatted JSON response. hopefully it gracefully handles this scenario and retries. then again this case is probably pretty rare.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm not totally sure of the consequences atm, but you're right, it's probably rare.

kevin-satsuma
kevin-satsuma previously approved these changes Dec 19, 2023

var body *SingleRequestBody

func DecodeRequestBody(requestBodyRawBytes []byte) (RequestBody, error) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving the io.ReadAll call out of this function because errors reading data out of the stream were being conflated with real decoding errors.

Example from logs: "error":"decode error: http2: server sent GOAWAY and closed the connection; LastStreamID=19999, ErrCode=NO_ERROR, debug=\"\",

This is an error from io.ReadAll. It's an error on the connection level, not with the contents of the response.

respBodyBytes, err := util.ReadAndCopyBackResponseBody(resp)
respBodyString := string(respBodyBytes)

if err != nil {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we're handling connection-level errors like
"error":"decode error: http2: server sent GOAWAY and closed the connection; LastStreamID=19999, ErrCode=NO_ERROR, debug=\"\",

@brianluong brianluong merged commit b7f20aa into main Dec 19, 2023
3 checks passed
@brianluong brianluong deleted the misc_ng_improvements branch December 19, 2023 23:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants