Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sanitize Content-Type for multipart/form-data requests #1139

Merged
merged 1 commit into from
Jan 17, 2019

Conversation

i-rinat
Copy link
Contributor

@i-rinat i-rinat commented Dec 18, 2018

Some application servers parse Content-Type header field in a non-standard way, which is used to bypass web application firewalls. For example, PHP checks whenever parameter name contains substring boundary, and that way matches xxboundaryxx and the like.

To make life easier for backend servers, this patch makes Tempesta to validate Content-Type header field format by parsing it in a strict way. More over it makes possible to recompose its value from parsed data, with any other unrelated parameter dropped. Behavior is controlled by http_post_validate configuration parameter; valid in vhost and location context.

(partially addresses #902)

TFW_STR_FROM("Content-Type"),
TFW_STR_FROM(": "),
TFW_STR_FROM("multipart/form-data; boundary="),
req->multipart_boundary_raw,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

req->multipart_boundary_raw can't be compound string according to the usage. This should be asserted in unit tests, but i didn't found such check. But as I see usage of the variable in the http_parser, it actually is compound.

Copy link
Contributor Author

@i-rinat i-rinat Dec 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, req->multipart_boundary_raw can be a compound string, and I totally forgot about that. Fixed by explicitly handling both cases of compound and plain types of req->multipart_boundary_raw.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved back to the code version that expected req->multipart_boundary_raw to be a single-chunk string.

Since this data is then used to recreate Content-Type field of a request header, there is a huge possibility that buffer containing boundary string will overlap with the buffer it is copied into. Handling all the special cases looks infeasible at the moment, and the request parser was changed to make a copy of req->multipart_boundary_raw. So it's always linear at this point.

@i-rinat
Copy link
Contributor Author

i-rinat commented Dec 20, 2018

There was a NULL-dereference due to req->location being NULL. Fixed that by checking for NULL.

Copy link
Contributor

@vankoven vankoven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same work in __req_parse_content_type() is done twice, this is not acceptable for fast parser. Also tfw_http_recreate_content_type_multipart_hdr() seems to be dangerous. I would like to ask you to provide a simple functional test just to test that the header is recreated successfully.

if (req->method == TFW_HTTP_METH_POST
&& (req->location->validate_post_req
|| (req->vhost->loc_dflt
&& req->vhost->loc_dflt->validate_post_req))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This location search is incomplete, need to search in the following order:

  • current location of requests: req->location
  • default location of current vhost: req->vhost->loc_dflt
  • default vhost global configuration: req->vhost->vhost_dflt
    There is a task to simplify the search and place all effective configuration into req->location (a part of Unified location processing #862) but it's not implemented yet.

Such search can look complicated but it because of inheriting of configuration directives from global objects to private sections. E.g. validate_post_req option may be defined globally, so all vhosts will use the settings. Take a look at tfw_http_req_mark_nip() function it's a good example how to search for the effective settings beneath locations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved selection code to tfw_http_should_validate_post_req() to make it readable. It also now checks all three possible places as proposed.

@@ -2155,6 +2176,14 @@ static TfwCfgSpec tfw_vhost_internal_specs[] = {
.allow_repeat = true,
.allow_reconfig = true,
},
{
.name = "http_post_validate",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The option is allowed to be listed inside vhost directive, inside location directive, but not allowed to setup globally (not listed in tfw_vhost_specs[]). Why? Why it's not allowed to be configured globally for all vhosts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot about global configuration at first. Added the option to tfw_vhost_specs[] too. Thanks.

__FSM_STATE(I_ContTypeParamOWS) {
switch (c) {
case ' ':
case '\t':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a macroses for this: IS_WS(c), IS_CRLF(c). I see, why you don't use them, swich-case looks to be a better option to chose execution branch according to the current character, but such approach is not used in the http_parser.c at all. Instead conditions IS_WS(c), IS_CRLF(c) are used one-by-one in order of expectancy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved back to IS_WS(), IS_CRLF() macros across the patch.

__FSM_I_MOVE(I_ContTypeParamOWS);
case '\r':
case '\n':
__FSM_I_MOVE_n(I_ContTypeParamOWS, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why jump to I_ContTypeParamOWS and not to I_EoL directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I lost the exact patch version that this comment was written at. But I believe it is fixed, as I removed at least one such excessive state change. And also none of IS_CRLF() occurrences move to another state in the current patch version. Instead they rush to completion in one way or another.

__FSM_I_MOVE(I_ContTypeMultipart);
case ' ':
case '\t':
__FSM_I_MOVE(I_ContTypeMultipartOWS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, as I understood, I_ContTypeMultipartOWS exists only because TRY_STR_LAMBDA tries only start of the token. With #1128 implemented, probably the FSM would become cleaner here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's exactly why this state is needed. However, I think it will be needed anyway, as a fallback for the case there received buffer ends exactly at /form-data and we have no data beyond that at the moment of parsing. There are similar cases in the main parser where field names are matched: fast and slow paths.

}

__FSM_STATE(I_ContTypeMultipart) {
req->flags |= TFW_HTTP_F_CT_MULTIPART;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible to set the flag in I_ContTypeMaybeMultipart and I_ContTypeMultipartOWS and avoid creation of a new state just for setting the flag. But it's a matter of taste.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This state was eliminated. Required code was moved upwards to the previous states.


TRY_STR_LAMBDA("boundary=", {
if ((req->flags & TFW_HTTP_F_CT_MULTIPART_HAS_BOUNDARY)
&& (req->flags & TFW_HTTP_F_CT_MULTIPART))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Condition req->flags & TFW_HTTP_F_CT_MULTIPART seems to be redundant there. TFW_HTTP_F_CT_MULTIPART_HAS_BOUNDARY flag can't be set without it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's redundant. Removed the check. Thanks.


__FSM_STATE(I_ContTypeParamOther) {
__fsm_n = __data_remain(p);
__fsm_sz = tfw_match_token(p, __fsm_n);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not __FSM_MATCH_MOVE(token, I_ContTypeParamOther);?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to __FSM_I_MATCH_MOVE_fixup(), as it was the most appropriate here. I believe there was something that I wanted to add, but it turned out to unnecessary.

However, in couple of other places I needed to "unpack" __FSM_I_MATCH_MOVE_fixup() to add statements inside. Existing macros were not enough.

replacement.len = c[0].len + c[1].len + c[2].len +
req->multipart_boundary_raw.len;

r = tfw_http_msg_hdr_xfrm_str((TfwHttpMsg *)req, &replacement,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt that it's a safe operation. If original header is longer and it mostly so, we cut off skb segment to comply new data length. Copy operation must be failed, if I'm not mistaken.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, now I see that this is not a safe operation.
I've changed the code to make a copy of data of req->multipart_boundary_raw, so it survives. Now the string data and the buffer it is copied into are guaranteed to be separate, so there should be no issues with memcpy'ing intersecting memory regions.

}

static int
__req_parse_content_type(TfwHttpMsg *hm, unsigned char *data, size_t len)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole function doesn't follow approach used all across http_parser. As a result of this function we got the following:

  • req->h_tbl[TFW_HTTP_HDR_CONTENT_TYPE] -> TfwStr with no special limiting by chunks
  • newly allocated req->multipart_boundary_raw which points to boundary value with its own splitting by chunks
  • newly allocated req->multipart_boundary which points to parts of boundary value with its own splitting by chunks
  • copy operation from req->multipart_boundary_raw to req->multipart_boundary which is done char-by-char. And actually the same work is done twice: first we scan for escapes in fsm, and the second time in unescape_multipart_boundary().

To follow the approach used in the parser module, the function should:

  1. Call __TFW_HTTP_PARSE_SPECHDR_VAL(......, 0) macro. This macro doesn't fill parsed header, instead parsed header is filled manually chunck-by-chunk, each chunk can have very specific flags. In this case there is no need to allocate additional space for req->multipart_boundary_raw or make full symbols scans in unescape_multipart_boundary(). See __req_parse_cookie() as an example.

  2. Parse the header until start of boundary value. Save all parsed chunks into parser->hdr (see __FSM_I_MOVE_fixup() macro)

  3. Next chunk is start of boundary value, set req->multipart_boundary_raw->ptr to TFW_STR_CURR(parser->hdr)

  4. If the currently parsed symbol should appear in unescaped req->multipart_boundary, add it to current chunk and set TFW_STR_VALUE flag for it.

  5. If current character is escape character and should not appear in req->multipart_boundary, put it in it's own chunk without TFW_STR_VALUE flag.

  6. Loop 4-5 until boundary value end.

  7. Once boundary value ended, collect req->multipart_boundary and req->multipart_boundary_raw. Number of chunks for req->multipart_boundary_raw can be easily calculated: n = TFW_STR_CURR(parser->hdr) -req->multipart_boundary_raw->ptr. And then run through req->multipart_boundary_raw->ptrand copy TfwStr headers to req->multipart_boundaryifTFW_STR_VALUE` flag set. This copy operation is faster, since it scans by chanks, not by symbols.

  8. Save the rest of data into parser->hdr

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed instructions!
I've changed the patch to follow them, but rather than using a pointer from TFW_STR_CURR(parser->hdr), used a current chunk number. It turned out, that during parser->hdr growth, parser->hdr.ptr is reallocated multiple times. At some point, allocator may change parser->hdr.ptr.

@vankoven
Copy link
Contributor

As I understood, unescaped req->multipart_boundary will be used to validate/parse multipart payload. We're going to add feature to proxy messages without assembling. This means that first headers are forwarded and only then payload is validated using req->multipart_boundary . As soon as Content-Type header is overwritten before forwarding, req->multipart_boundary may contain garbage data. We have to make a deep copy of boundary value into req->multipart_boundary.

@i-rinat i-rinat force-pushed the ri-post-hdr-sanitize branch 2 times, most recently from 39c466a to 5de5ddd Compare January 14, 2019 13:11
@i-rinat
Copy link
Contributor Author

i-rinat commented Jan 14, 2019

We have to make a deep copy of boundary value into req->multipart_boundary.

The patch now makes deep copies of both req->multipart_boundary_raw and req->multipart_boundary for both reason you mentioned and ability to call tfw_http_msg_hdr_xfrm_str() without being afraid of intersecting buffers in memcpy.


__FSM_STATE(I_ContTypeMaybeMultipart) {
if (c == ';') {
set_bit(TFW_HTTP_F_CT_MULTIPART, req->flags);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incorrect usage of set_bit() function. First parameter must be bit number, not the flag value. Correct expression is to be set_bit(TFW_HTTP_B_CT_MULTIPART, req->flags) and TFW_HTTP_F_CT_MULTIPART macro is not needed at all. You can't see errors, because the issue is happening at every test_bit()/set_bit() operation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed both TFW_HTTP_F_CT_MULTIPART and TFW_HTTP_F_CT_MULTIPART_HAS_BOUNDARY, and changed to TFW_HTTP_B_*.

@i-rinat i-rinat force-pushed the ri-post-hdr-sanitize branch 2 times, most recently from 15f6b41 to add646e Compare January 15, 2019 21:38
Copy link
Contributor

@vankoven vankoven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good but a small cleanup is required.

@@ -2114,6 +2114,45 @@ tfw_http_set_loc_hdrs(TfwHttpMsg *hm, TfwHttpReq *req)
return 0;
}

static int
tfw_http_recreate_content_type_multipart_hdr(TfwHttpReq *req)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR has a good comment, why the header is recreated from parsed value. Let's keep it with the function, since the backend compatibility issue is not a trivial case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment to the function. Text is mostly from the pull request descriptions, with references to the pull request removed.

@@ -2378,6 +2414,14 @@ static TfwCfgSpec tfw_vhost_specs[] = {
.allow_repeat = true,
.allow_reconfig = true,
},
{
.name = "http_post_validate",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, add the new directive to tempesta_fw.conf, note, that it should be also listed inside vhost and location directives.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

__FSM_STATE(I_ContType) {
if (req->method != TFW_HTTP_METH_POST)
__FSM_I_JMP(I_EoL);
__FSM_I_JMP(I_ContTypeMediaType);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a question, do we need to strictly analyse the header if directive http_post_validate is not set? Or wee can simply skip it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand, we are selecting vhost and location after all request is parsed completely. So when we get here, there is no enough information to decide yet.
There may be enough information, if we would've seen Host field when we get here. We still need to move vhost and location selection earlier to make this possible though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, forgot about per-vhost directive scope.

__FSM_I_MOVE_fixup(I_ContTypeParamOWS, 1, 0);
}
if (IS_WS(c))
__FSM_I_MOVE_fixup(I_ContTypeMultipartOWS, 1, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible to stay in the same state and remove extra I_ContTypeMultipartOWS state.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's not possible.

There are minor differences in how I_ContTypeMaybeMultipart and I_ContTypeMultipartOWS behave when none of the checks are true. In other words, for anything not ;, whitespace, CR, or LF. If we merge these two states, we must either fail (CSTR_NEQ) or move to I_ContTypeOtherSubtype as a default action.
In the first case (CSTR_NEQ) we'll fail to process Content-Type: multipart/form-data2 which is correct according to the description from the RFC. In the latter case we'll accept Content-Type: multipart/form-data aaa; boundary=123 which is not correct according to the description from the RFC.

* Requests with multipart/form-data payload should have
* only one boundary parameter.
*/
if (test_bit(TFW_HTTP_B_CT_MULTIPART_HAS_BOUNDARY,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One single function can be used here:

if (__test_and_set_bit(TFW_HTTP_B_CT_MULTIPART_HAS_BOUNDARY, req->flags))
    return CSTR_NEQ;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to use __test_and_set_bit.

if (test_bit(TFW_HTTP_B_CT_MULTIPART_HAS_BOUNDARY,
req->flags))
return CSTR_NEQ;
set_bit(TFW_HTTP_B_CT_MULTIPART_HAS_BOUNDARY,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The request is currently parsed and no multiple access may happen, atomic operation shoud be replaced with non-atomic variant __set_bit().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced to __set_bit() everywhere in the patch.

* Content-Type field composing. So to prevent memcpy'ing
* intersecting buffers, we have to make a separate copy.
*/
__strdup_multipart_boundaries(req);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function can return non-NULL exit code, which is ignored.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a check for the return value.

}
}

BUG_ON(ptr_raw != data_raw + req->multipart_boundary_raw.len);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to use TFW_WARN here and return error code to the caller. If __req_parse_content_type() has a bug, the currently parsed request will be dropped, bu the server will remain alive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, changed to use TFW_WARN. I haven't added any information about lengths, as it seem to be insufficient for debugging anyway.

# 'nonidempotent' or 'hdr_add' directives (see the corresponding directives'
# description).
# 'nonidempotent', 'hdr_add' or 'http_post_validate' directives (see the
# corresponding directives' description).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra asterisk after directives

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that was a "possessive 's". Since "directives" is plural, "'s" reduces to a sole apostrophe.

@i-rinat i-rinat merged commit bdf5843 into master Jan 17, 2019
@i-rinat i-rinat deleted the ri-post-hdr-sanitize branch January 17, 2019 14:01
return true;

if (!req->vhost)
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic of this function (as well as the whole tfw_http_adjust_req() procedure) implies that req->vhost should be already known at the moment the function is used. If req->vhost == NULL it means that the function is used in wrong place. So, it should be better to use

if (WARN_ON_ONCE(!req->vhost))
	return false;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, if req->vhost is NULL, this function is not called at all. Added WARN_ON_ONCE in #1154.

@@ -229,6 +229,10 @@ enum {
TFW_HTTP_B_CONN_EXTRA,
/* Chunked transfer encoding. */
TFW_HTTP_B_CHUNKED,
/* Media type is multipart/form-data. */
TFW_HTTP_B_CT_MULTIPART,
/* Multipart/form-data request have boundary parameter. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have -> has

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in #1154. Thanks!

.deflt = NULL,
.handler = tfw_cfgop_out_http_post_validate,
.allow_none = true,
.allow_repeat = true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why allow_repeat = true here, while in vhost and location specs (above) - repeat is not allowed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a copy-paste error. There is no sense in allowing multiple instances of the statement, so the flag was changed to false in #1154.

i-rinat added a commit that referenced this pull request Jan 18, 2019
* req->vhost is expected to be non-NULL, as it was checked before;
* disallow multiple instances of http_post_validate at topmost level;
* a typo.
i-rinat added a commit that referenced this pull request Jan 18, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants