HTTP message buffering and streaming #498

krizhanovsky · 2016-05-23T12:50:42Z

Probably good to be done with #1902

General requirements

Tempesta must support two modes of operation: HTTP messages buffering as now and streaming. In current mode of operation all HTTP messages are buffered, i.e. we deliver a proxied request or response only when we fully receive it. As opposite each received skb must be forwarded to client or server immediately.

Full buffering and streaming are two edge modes and intermediate mode must be supported as well - partial message buffering based on TCP receive buffer.

HTTP headers must always be buffered - we need the header to decide what we should do with the message and how to forward/cache/whatever to do with it.

Configuration

The behavior must be controlled by new configuration options. Since Linux doesn't support per-socket sysctls available for a user, we have to introduce our own memory limits for server and client sockets separately.

client_mem <soft_limit> <hard_limit> - controls haw many memory is used to store unanswered client requests and requests with linked responses which can not be forwarded to a client. If soft_limit is zero, then streaming mode is used, i.e. each received skb is immediately forwarded. Otherwise the message is buffered, but not more than for soft_bytes bytes. hard_limit is 2 * soft_limit by default, see description of in security operations section.
client_msg_buffering N - controls message buffering. Buffer only first N bytes of requests when forwarding requests to backend servers. Longer messages are forwarded part-by-part and never fully assembled in Tempesta. If request headers are longer than N bytes, then they are still buffered, since full set of headers is required to correctly serve and forward the request. The limit is applied per message and must not overcome per-connecton client_conn_bufferng limit.
server_msg_buffering N - same as client_msg_buffering but for the server connections.

Previous attempts to implement the issue use client_rmem very similar to tcp_rmem sysctl, however the current client_mem is very different because it accounts linked server responses.

TCP interactions

All ingress skbs are immediately evicted and ACKed from the TCP receive queue, so actually we don't use TCP receive buffer for now. With the new enhancement we must account all HTTP data kept in Tempesta memory as residing in the TCP receive buffer of the socket which we received the data on, so TCP will send lower TCP receive windows. See ss_rcv_space_adjust() and tfw_cli_rmem_{reserve,release}() in #1183.

client_mem <soft_limit> <hard_limit> determines TCP receive window, but it also account responses. A simplified example for client_mem 10 20:

0. initial rwnd=10
1. receive 5B request:  client_mem=5, announce rwnd=5
2. receive 10B response:  client_mem=15, announce rwnd=0
3. forward response & free request: announce rwnd=10
4. receive 3 pipelined reqs of 10B size in total: client_mem=10, rwnd=0
5. receive a resp of 5B for 2nd req: client_mem=15 (we have to keep the resp)
6. receive a resp of 5B for 3rd req: client_mem=20=hard_limit -> drop the connection

In proxy mode we have to slow down fetching data from the server TCP receive queue if we read a response for a slow client, which can't read it with the same speed. Otherwise we can overrun our RAM in many clients, just somewhat slower than the servers (but they not necessary be really slow!). This might lead to HoL problem when a response pipelined by the server after the problem response will stay in the queue for a very long time and a good and quick client will experience significant delays. To cope with the issue server_queue_size and larger conns_n should be used for the server group (please add this to Wiki!). Dynamically allocated server connections from #710 and server HTTP/2 #1125 are more robust solutions to the problem.

Following performance counters must be implemented for traceability of the feature (e.g. to debug the problem above):

current buffers size for client and server connections
TBD

HTTP streams

HTTP/2 (#309) and HTTP/2 (QUIC, #724) introduce flow control which can efficiently throttle clients, so it seems the TCP window adjustments make sense only for HTTP/1.1 and the issue highly depends on QUIC and HTTP/2. RFC 7540 5.2.2 begins right from the issue of this task - memory constraints and too fast clients which must be limited whereby WINDOW_UPDATE.

The security aspect of the issue is that clients can request quite large resources and announce very small windows (see RFC 7540 10.5) leading to memory exhaustion on our side (they can do the same with TCP & HTTP/1.1 for now).

At least following things must be done in the issue:

some streaming in context of HTTP QoS for asymmetric DDoS mitigation #488 : we should not keep in memory more data from the server response than a client announced in it's window, i.e. we should announce smaller TCP window for server connection. This point is good to do in generic way: we should handle the window from TCP layer, HTTP/2 and HTTP/3 in future. Actually, this is a tradeoff for buffering and streaming modes, which must be decided by an administrator. HTTP QoS for asymmetric DDoS mitigation #488 can determine malicius/slow clients and mitigate their impact thought.
honour the client announced window and do not send more data than it was specified
announce real HTTP/2 window according to the configured buffer size.

X-Accel-Buffering header processing must be implemented to let a client manage the buffering (e.g. Dropbox does this).

If we receive RST_STREAM frame in streaming mode, then we should reset our stream with the upstream as well and it store only the head of the transferred response in the cache.

Security operation

#995 makes an example how a client can exhaust memory by the very first blocking request and many pipelined requests with large responses. So client_mem must account the whole memory spent for a client. If client reaches the soft limit a zero receive window is sent. However, server responses for already processed requests may continue to arrive and if the hard_limit is reached, then the client connection must be dropped (we have no chance to send a normal error response in this case).

A malicious client may send byte by byte in streaming mode to overload a backend. This scenario must be addressed by the implementation, e.g. to configure minimum buffer size - only if an administrator allows 1 byte buffering or so, then only in this case pass through so small stream chunks. The other opportunity is DDoS QoS reduction in sense of automatic classification in #488.

Several implementation notes

These notes must be mirrored in the Wiki.

A streamed message consumes a server connection and we can not scheduler other requests to the connection, so using small buffers isn't desired.
Streamed requests can not be resent on server connection failures.

Tricky cases

From #1183 :

Streamed messages can be buffered. This happen, when the receiver is not ready to receive a stream. E.g. client requested two pipelined uris: first is very slow to complete, second is fast, but it can be a full BD image. It's not possible to stream the BD image until the first request is responsed. We can't put server connection on hold.

In general, our design must be as simple as possible. Say both the requests go to different connections. Both of them can be streamed or a first request may just require heavy operations on server side, but the 2nd request can be streamed immediately. As we receive server_msg_buffering of response data, we link the data with TfwHttpResp (just as previously - received skbs are linked to the structure), the response is marked as incomplete and stays in the client seq_queue. Yes, the server connection is getting on hold. If the server processing the first request is stuck, then failovering process takes place and both the requests will be freed and both the server connections must be reestablised. We also have #710 addressing the problem of hold connections.

Response can appear at any time.

We need to forward the responses immediately to a client, just mimicing the server. Probably we also should close connection. While the client request is not finished, TfwHttpReq should be sitting in server forward queue and we should forward new skbs of the request to the server connection and response skbs to the client connection. Existing TfwHttpReq and TfwHttpResp descriptos should be used for the skb buffered forwarding.

Target connections to stream a message can dissappear at any time.
Client disconnected when streamed message wasn't received in full

A connection dropping or skbs dropping can be good there. It's good to see how HAproxy, Nginx or Tengine behave in the same situations.

Relating issues

This implementation must use TCP receive buffer size to control how much data can be buffered (i.e. be on-the-fly between receive and send sockets). Meantime #488 adjusts TCP receive buffer size to change QoS. So this issue is foundation for #488.

Appropraite testing issue tempesta-tech/tempesta-test#87

See branch https://github.com/tempesta-tech/tempesta/tree/ik-streaming-failed and discussions in #1183

TEST

try test case from Deproxy errors on TCP segmentation tests tempesta-test#512

The text was updated successfully, but these errors were encountered:

vankoven · 2018-03-26T07:50:35Z

I have a bunch of questions on the task:

What is the right behaviour for streaming: can we unconditionally buffer message headers? Some headers affects how TempestaFW processes the message. E.g., cache module and cache control headers.
I assume that only messages that fit buffer size can be cached.
Long polling is another question that bothers me. Imagine we have 32k connections to backend servers, and have 32k clients that use long polling In this case server connections are depleted and we can't serve new client connections.

krizhanovsky · 2018-03-26T08:30:26Z

What is the right behaviour for streaming: can we unconditionally buffer message headers? Some headers affects how TempestaFW processes the message. E.g., cache module and cache control headers.

Basically, yes, we can do some caching unconditionally. To protect memory exhausting attacks we provide Frang limits. Let's discuss the details in chat.

I assume that only messages that fit buffer size can be cached.

Not at all. The process is orthogonal to caching. For the cache we should behavior the same way as now for skb chunks of data.

Long polling is another question that bothers me. Imagine we have 32k connections to backend servers, and have 32k clients that use long polling In this case server connections are depleted and we can't serve new client connections.

Yes, good point. This is the subject for #710 . I'll add the note to the issue.

UPD. The important note from the chat discussion is that proxy_buffering should be implemented on top of TCP socket receive buffer since we can and we should use tight integration with TCP/IP stack. Also consider related points in #391 and #488.

b3b · 2022-10-11T12:28:18Z

Visualization of the difference in packet processing between Tempesta and Nginx.
Backend -> Proxy packets are in red, Proxy -> Client are in green.

Tempesta

Nginx default settings (buffering is on)

Nginx, buffering is off

b3b · 2022-10-11T12:30:37Z

Nginx log wnen backend closes the connection in the middle of a transfer:

2022/10/10 16:15:12 [error] 387031#387031: *1 upstream prematurely closed connection while reading upstream,
client: 127.0.0.1, server: tempesta-tech.com, request: "GET /1 HTTP/1.1",
upstream: "http://127.0.0.1:8000/1", host: "127.0.0.1"

Client (curl) reaction:

> Host: 127.0.0.1
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: nginx/1.18.0 (Ubuntu)
< Date: Mon, 10 Oct 2022 16:15:10 GMT
< Content-Type: text/html
< Content-Length: 1073741824
< Connection: keep-alive
< Last-Modified: Mon, 10 Oct 2022 16:15:00 GMT
< ETag: "63444504-40000000"
< X-Upstream-Id: 1
< Accept-Ranges: bytes
< 
{ [12016 bytes data]
 41 1024M   41  424M    0     0   446M      0  0:00:02 --:--:--  0:00:02  445M* transfer closed with 138765652 bytes remaining to read
 87 1024M   87  891M    0     0   480M      0  0:00:02  0:00:01  0:00:01  480M
* Closing connection 0
curl: (18) transfer closed with 138765652 bytes remaining to read

kingluo · 2024-04-26T12:21:14Z

It's better to refer to Nginx implementation, which is very mature in production. Simply put, the only difference is that we have to implement buffering at the sk/skb level.

References:
https://www.getpagespeed.com/server-setup/nginx/tuning-proxy_buffer_size-in-nginx
http://luajit.io/posts/openresty-lua-request-time/

kingluo · 2024-05-03T07:22:20Z

Let me give more clarification about the nginx proxy module.

The buffers are shared by the upstream and the downstream, i.e. when the buffer is written to the downstream successfully, it will be used to receive data from the upstream in turn.
As I said in Correct handling on trailer headers in h1 -> h2 proxying #1902 (comment), the cache reserves the original format of the response, i.e. Content-Length: 100 or Transfer-Encoding: chunked, it saves the whole response, so if the response is not fully received, the blocks are saved in buffers or temporary files (slice module is an exception of course). When the response comes from the cache next time, it just replays the response to the downstream as is.
Temepesta saves the whole response in the memory, which is in OOM risk. In comparison, Nginx will only receive bytes with a maximum size of proxy_buffer_size + proxy_buffers (saved in memory) + proxy_max_temp_file_size (saved in file).

Directive	Description
`proxy_buffer_size`	the size received one time.
`proxy_buffers`	the total size could be used to receive from the upstream or to the downstream, besides `proxy_buffer_size`.
`proxy_busy_buffers_size`	the size to balance when sending response blocks to the downstream.

Let me explain some critical steps in the diagram above.

If proxy_buffering is on.

In step 20, it tries to read the response into the buffers specified by proxy_buffer_size and proxy_buffers, or, if the buffers are full, write the response to temporary files, or, if the files are full (determined by proxy_max_temp_file_size), stop reading the upstream.
In step 23, if the downstream is writeable, then write response blocks with size of proxy_busy_buffers_size to the downstream. If successful and if the upstream is blocked because of no free buffers, it turns to read the upstream.

If proxy_buffering is off, then check steps 25 and 27, these steps will forward the response bytes with a maximum size of proxy_buffer_size. Note that in this mode, the cache is disabled.

krizhanovsky · 2024-06-25T12:03:12Z

A relevant discussion https://github.com/tempesta-tech/linux-5.10.35-tfw/pull/17/files#r1637865281 and the subject of the discussion #2108 , which I think should be fixed in this issue: we need to push for transmission not more (or not significantly more) data to a client connection and h2 stream than the connection or stream allows. Since we proxy data, we should backpresure upstream connections and limit the most aggressive clients with #488.

krizhanovsky · 2024-07-18T21:12:13Z

sysctl_tcp_auto_rbuf_rtt_thresh_us automatically adjust TCP rcv buffer per socket depending on the current conditions. Probably mostly affect #488.

krizhanovsky added the enhancement label May 23, 2016

krizhanovsky added this to the 0.6 OS milestone May 23, 2016

krizhanovsky mentioned this issue May 23, 2016

[RFC7233] Range Requests & partial responses #499

Open

krizhanovsky mentioned this issue Nov 17, 2016

Web-server: send large resources in chunked mode #534

Closed

krizhanovsky modified the milestones: 0.5.0 Web Server, 0.6 OS Nov 17, 2016

krizhanovsky modified the milestones: 0.6 WebOS, 0.5.0 Web Server Feb 12, 2017

krizhanovsky mentioned this issue Aug 23, 2017

Chunked Transfer Coding - chunk-size decoding issue #768

Closed

krizhanovsky mentioned this issue Nov 9, 2017

Test for Comet (server push) #854

Closed

krizhanovsky added the crucial label Jan 9, 2018

krizhanovsky modified the milestones: backlog, 0.6 KTLS Jan 9, 2018

krizhanovsky mentioned this issue Feb 4, 2018

HTTP normalization #2

Open

krizhanovsky assigned vankoven Feb 18, 2018

krizhanovsky mentioned this issue Feb 18, 2018

Missing Content-Lenght header: Tempesta does not indicate message end for client #639

Closed

krizhanovsky modified the milestones: 0.6 KTLS, 0.7 HTTP/2 Mar 22, 2018

krizhanovsky mentioned this issue Mar 22, 2018

POST body processing #902

Open

krizhanovsky mentioned this issue Mar 26, 2018

Dynamically allocate and shrink server connections #710

Open

vankoven mentioned this issue May 7, 2018

Add message streaming mode #1012

Closed

This was referenced Jun 20, 2018

HTTP QoS for asymmetric DDoS mitigation #488

Open

Limit pipelined requests from client to avoid memory exhaustion #995

Closed

Redesign of TCP synchronous sending and data caching #391

Open

krizhanovsky modified the milestones: 0.6 Tempesta TLS, 0.7 TempestaTLS v0.3 & HTTP/2 Feb 11, 2019

krizhanovsky mentioned this issue Jun 23, 2022

Flow contol and pull-model for message forwarding #1394

Closed

7 tasks

krizhanovsky modified the milestones: 1.2 TBD, 0.8 - Beta: TDBv0.2 & web cache eviction Oct 11, 2022

krizhanovsky added the performance label Oct 11, 2022

This was referenced Oct 11, 2022

Large responses support #1714

Open

Slow read DoS prevention #1715

Open

This was referenced Oct 12, 2022

Optimizer for HTTP messages adjustment #1103

Open

Remove chunked encoding in http/1 responses #1418

Merged

krizhanovsky added the bug label Nov 17, 2022

krizhanovsky modified the milestones: 1.x: TBD, 1.0 - GA Jul 5, 2023

krizhanovsky mentioned this issue Oct 11, 2023

Correct handling on trailer headers in h1 -> h2 proxying #1902

Open

krizhanovsky removed their assignment Oct 11, 2023

krizhanovsky modified the milestones: 1.0 - GA, 0.9 - LA Oct 31, 2023

krizhanovsky assigned EvgeniiMekhanik Nov 7, 2023

const-t mentioned this issue Jan 18, 2024

Tests for server_forward_timeout/retries tempesta-tech/tempesta-test#566

Open

RomanBelozerov mentioned this issue Mar 5, 2024

Deproxy errors on TCP segmentation tests tempesta-tech/tempesta-test#512

Closed

kingluo mentioned this issue Apr 17, 2024

HTTTP/2 (D)DoS prevention #1346

Open

2 tasks

kingluo mentioned this issue Apr 26, 2024

test OOM by slow read attack tempesta-tech/tempesta-test#618

Open

krizhanovsky mentioned this issue May 1, 2024

fix(1902): correct the skb len #2095

Merged

kingluo mentioned this issue Jun 24, 2024

Tls errors under ping flood #2117

Closed

This was referenced Jun 25, 2024

Mekhanik evgenii/1196 tempesta-tech/linux-5.10.35-tfw#17

Merged

fix(1346): fix control frames flood #2108

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HTTP message buffering and streaming #498

HTTP message buffering and streaming #498

krizhanovsky commented May 23, 2016 •

edited

Loading

vankoven commented Mar 26, 2018

krizhanovsky commented Mar 26, 2018 •

edited

Loading

b3b commented Oct 11, 2022

b3b commented Oct 11, 2022

kingluo commented Apr 26, 2024 •

edited

Loading

kingluo commented May 3, 2024 •

edited

Loading

krizhanovsky commented Jun 25, 2024

krizhanovsky commented Jul 18, 2024

HTTP message buffering and streaming #498

HTTP message buffering and streaming #498

Comments

krizhanovsky commented May 23, 2016 • edited Loading

General requirements

Configuration

TCP interactions

HTTP streams

Security operation

Several implementation notes

Tricky cases

Relating issues

TEST

vankoven commented Mar 26, 2018

krizhanovsky commented Mar 26, 2018 • edited Loading

b3b commented Oct 11, 2022

Tempesta

Nginx default settings (buffering is on)

Nginx, buffering is off

b3b commented Oct 11, 2022

kingluo commented Apr 26, 2024 • edited Loading

kingluo commented May 3, 2024 • edited Loading

krizhanovsky commented Jun 25, 2024

krizhanovsky commented Jul 18, 2024

krizhanovsky commented May 23, 2016 •

edited

Loading

krizhanovsky commented Mar 26, 2018 •

edited

Loading

kingluo commented Apr 26, 2024 •

edited

Loading

kingluo commented May 3, 2024 •

edited

Loading