v2.6.4 php-fpm gets stuck in read state #5420

Closed
Jimbolino opened this issue Mar 6, 2023 · 26 comments
Labels
bug 🐞 Something isn't working

Comments

@Jimbolino

Since we upgraded from v2.6.3 to v2.6.4, we are experiencing some problems with our php-fpm backend

The problem is that the php-fpm process gets stuck in a "read" state (and never goes back to the "accept" state).
Eventually PHP's pm.max_children limit is reached and the entire server is killed by EC2's health checker.

I've done some debugging with strace but haven't found a pattern: the server gets lots of traffic, but the problem only occurs every once in a while (maybe once in 1000 requests).

root@frontend:~# strace -p 39614
strace: Process 39614 attached
read(5, 

root@frontend:~# nsenter -t 39614 --net lsof -n -p 39614 | grep 5u
php-fpm 39614 www-data    5u  IPv6 287757      0t0     TCP 172.18.0.3:9000->172.18.0.4:44492 (ESTABLISHED)

172.18.0.3 = app_php_1.app_default
172.18.0.4 = app_caddy_1.app_default

I'm afraid that #5367 might need to be revisited.
I've reverted back to v2.6.3 and the problem is gone.

@mholt
Member

mholt commented Mar 6, 2023

Thanks for the report. What is your config?

@Jimbolino
Author

Caddyfile:

{
    debug
    skip_install_trust
    auto_https disable_redirects
    on_demand_tls {
        ask {$CADDY_ON_DEMAND_ASK_URL}
    }
    admin caddy:8080 {
        origins {$CADDY_ADMIN_ALLOWED_ORIGINS}
    }
    storage redis {
        host {$CADDY_CLUSTERING_REDIS_HOST}
        port {$CADDY_CLUSTERING_REDIS_PORT}
        db {$CADDY_CLUSTERING_REDIS_DB}
        tls_enabled {$CADDY_CLUSTERING_REDIS_TLS}
    }
}

(global) {
    header server maglr
}

(frontend) {
    root * /app/docs

    route {
        # If the requested file does not exist, try index files
        @indexFiles file {
            try_files {path} {path}/index.php index.php
            split_path .php
        }
        rewrite @indexFiles {http.matchers.file.relative}

        # Proxy PHP files to the FastCGI responder
        @phpFiles path *.php
        reverse_proxy @phpFiles php:9000 {
            transport fastcgi {
                split .php
            }
        }
    }

    encode zstd gzip
    file_server {
        hide .htaccess
    }
    header -x-powered-by
}

(security) {
    header referrer-policy strict-origin-when-cross-origin
    header strict-transport-security max-age=31536000
    header x-content-type-options nosniff
}

*.maglr.localhost, *.localhost, https://127.0.0.1 {
    tls internal
    import frontend
    import global
}

https:// {
    tls {
        on_demand
    }

    import security
    import frontend
    import global
}


http:// {
    handle /health {
        respond "OK" 200
    }
    @notHealth {
        not path /health.php
    }
    handle {
        redir @notHealth https://{host}{uri}
    }
    import frontend
    import global
}

@francislavoie
Member

francislavoie commented Mar 6, 2023

That's odd. Was it broken on v2.6.2 as well?

Would you be able to set up a minimally reproducible example? i.e. a simple PHP script that exhibits the issue, plus a Caddyfile that only has the bits that are relevant to the breakage.

@francislavoie francislavoie added the needs info 📭 Requires more information label Mar 6, 2023
@Jimbolino
Author

Nope, v2.6.2 ran on production just fine.

I'm having trouble reproducing it locally (or on our staging environment).

Maybe it's got something to do with the AWS Network Load Balancer, or maybe bad bots that send malformed headers (a wrong Content-Length, etc.).

@francislavoie
Member

v2.6.4 should be functionally equivalent to v2.6.2 in terms of the proxy buffering functionality. I don't understand what's going on here.

@mholt
Member

mholt commented Mar 7, 2023

Can you try enabling access logs and maybe see if we can pin down which requests exactly are hanging?

@Jimbolino
Author

We have CloudWatch enabled for our access logs.
I think the default timeout is 3 seconds, because we see the following entries when searching for { $.duration > 1 }:

{
    "level": "error",
    "ts": 1678042875.2110817,
    "logger": "http.log.error",
    "msg": "dialing backend: dial tcp 172.18.0.3:9000: i/o timeout",
    "request": {
        "remote_ip": "172.31.10.238",
        "remote_port": "26109",
        "proto": "HTTP/1.1",
        "method": "GET",
        "host": "172.31.10.238:443",
        "uri": "/health.php",
        "headers": {
            "User-Agent": [
                "ELB-HealthChecker/2.0"
            ],
            "Connection": [
                "close"
            ],
            "Accept": [
                "*/*"
            ],
            "Accept-Encoding": [
                "*"
            ]
        }
    },
    "duration": 3.001065249,
    "status": 502,
    "err_id": "qd6cryvhi",
    "err_trace": "reverseproxy.statusError (reverseproxy.go:1299)"
}

From this ts (2023-03-05T20:01:15.211+01:00) I went back in the logs to see what was happening.
The previous 5-6 mins are only calls to health.php, and before that there is some normal traffic.

@trea
Contributor

trea commented Mar 9, 2023

Is your health.php making any external calls that might not be returned to PHP before Caddy gives up on the request?

I worked on a project where we had reverse proxying to PHP on Docker and couldn't figure out why we would repeatedly get 5xx errors and no running PHP.

It turned out that we were making requests with Curl to a flaky service and someone had inadvertently set CURLOPT_TIMEOUT to 0. Any request to that flaky service that wasn't handled before the reverse proxy gave up and returned to the client left open a PHP worker sitting around waiting for the flaky service to respond until there weren't any workers left.

@Jimbolino
Author

nope:

<?php
const MAX_LOAD = 25;
$load = sys_getloadavg();
if ($load[0] > MAX_LOAD) {
    http_response_code(503);
    exit('503');
}
if (!isset($_SERVER['APP_ENV'])) {
    $root = __DIR__.'/..';
    if (!file_exists($root.'/.env')) {
        http_response_code(500);
        exit('unknown environment');
    }
    require_once $root.'/vendor/autoload.php';
    Dotenv\Dotenv::createImmutable($root)->load();
}
exit('ok');

And if I understand strace correctly, it's PHP that is stuck reading on the incoming connection from Caddy,
so maybe Caddy did not close the connection or send an EOF or something?

@trea
Contributor

trea commented Mar 10, 2023

Thanks, the health.php file you provided looks fine.

I stand by my theory, though, that you have something else that's causing PHP to not return quickly enough for your call to /health.php. I did create a reproduction using Docker Compose and put it in this repo: https://github.com/trea/php-fpm-issue

If you follow the README (you may have to run the healthcheck a few times), you end up with requests that take too long. The healthcheck fails, then autoheal restarts PHP, and Caddy can't connect to it until it has finished restarting:

"error": "dialing backend: dial tcp 192.168.48.6:9000: connect: connection refused"

Log Output
php-fpm-exit-test-caddy-1     | {
  "level": "debug",
  "ts": 1678475769.7431428,
  "logger": "http.handlers.reverse_proxy",
  "msg": "upstream roundtrip",
  "upstream": "php:9000",
  "duration": 0.000587583,
  "request": {
    "remote_ip": "192.168.48.4",
    "remote_port": "54200",
    "proto": "HTTP/1.1",
    "method": "GET",
    "host": "caddy",
    "uri": "/slow.php",
    "headers": {
      "X-Forwarded-Proto": [
        "http"
      ],
      "X-Forwarded-Host": [
        "caddy"
      ],
      "User-Agent": [
        "fortio.org/fortio-1.53.1"
      ],
      "X-Forwarded-For": [
        "192.168.48.4"
      ]
    }
  },
  "error": "dialing backend: dial tcp 192.168.48.6:9000: connect: connection refused"
}

I would recommend setting up a slowlog for FPM and ensuring that the ptrace capability is available; that way you should get a better idea of what is causing FPM to hang.
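For reference, the slowlog is enabled per FPM pool; a minimal sketch (the paths and threshold here are illustrative, both directives are standard php-fpm pool options):

```ini
; www.conf (pool configuration)
; Log a backtrace for any request running longer than the timeout.
request_slowlog_timeout = 2s
slowlog = /var/log/php-fpm/www-slow.log
```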

In this test case the output points to curl_exec:

php-fpm-exit-test-php-1       | [10-Mar-2023 19:33:48] WARNING: [pool www] child 8, script '/var/www/html/slow.php' (request: "GET /slow.php") executing too slow (2.210546 sec), logging
php-fpm-exit-test-php-1       | [10-Mar-2023 19:33:48] WARNING: [pool www] child 7, script '/var/www/html/slow.php' (request: "GET /slow.php") executing too slow (2.220195 sec), logging
php-fpm-exit-test-php-1       | [10-Mar-2023 19:33:48] NOTICE: child 8 stopped for tracing
php-fpm-exit-test-php-1       | [10-Mar-2023 19:33:48] NOTICE: about to trace 8
php-fpm-exit-test-php-1       |
php-fpm-exit-test-php-1       | [10-Mar-2023 19:33:48]  [pool www] pid 8
php-fpm-exit-test-php-1       | script_filename = /var/www/html/slow.php
php-fpm-exit-test-php-1       | [0x0000ffff9f2130c0] curl_exec() /var/www/html/slow.php:6

@Jimbolino
Author

There are indeed some calls that occasionally take too long.
It still doesn't explain why previous versions of Caddy don't cause the same issue.

Maybe #5431 is related

@trea
Contributor

trea commented Mar 13, 2023

Actually, after looking into this more, I think you were right when you suggested #5367 was related, and I do think #5431 is related as well. I think they specifically go back to this change:

https://github.com/caddyserver/caddy/pull/5367/files#diff-a248c9a1ec018edea8b377d155bc1df1a642bf79d00ababb5cdacc6b86c5733dL627

The FastCGI specification states that Responders read up to a total of CONTENT_LENGTH bytes, but as a result of #5367, that parameter is now empty for requests with Transfer-Encoding: chunked.

I have updated my repo to include both versions of Caddy and two more services in the Docker Compose file, chunked and chunked264. When I run docker-compose run chunked264, PHP actually dies and is respawned immediately. You can run docker-compose run chunked all you want and it works fine.

I understand the reverting of the buffer settings, but I think Caddy should guarantee that CONTENT_LENGTH is set to an int (0 is fine and even desirable) at least as far as the FastCGI transport is concerned. I'd be happy to submit a PR if that'd be welcome.

I think changing this to guard against empty strings and send a 0 instead of an empty string for CONTENT_LENGTH would do it:
https://github.com/caddyserver/caddy/blob/master/modules/caddyhttp/reverseproxy/fastcgi/fastcgi.go#L299

In fact, it's done for the FastCGI protocol elsewhere 🤔 :
https://github.com/caddyserver/caddy/blob/master/modules/caddyhttp/reverseproxy/fastcgi/fastcgi.go#L186
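A guard along those lines could be as small as the following sketch (not Caddy's actual code; the function name is illustrative). The idea is that FastCGI's CONTENT_LENGTH param must always be a numeric string, so an unknown length (e.g. a chunked request, where Go's http.Request.ContentLength is -1) is sent as "0" rather than an empty string:

```go
package main

import (
	"fmt"
	"strconv"
)

// buildContentLength formats the CONTENT_LENGTH FastCGI param.
// Negative means "unknown" (chunked request), which must become "0"
// instead of an empty string so the responder doesn't block reading.
func buildContentLength(contentLength int64) string {
	if contentLength < 0 {
		return "0"
	}
	return strconv.FormatInt(contentLength, 10)
}

func main() {
	fmt.Println(buildContentLength(-1))  // chunked: length unknown -> "0"
	fmt.Println(buildContentLength(512)) // known length -> "512"
}
```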

@francislavoie
Member

Ah, interesting. That does make sense @trea. PR is definitely welcome :)

@francislavoie francislavoie added bug 🐞 Something isn't working and removed needs info 📭 Requires more information labels Mar 14, 2023
trea added a commit to trea/caddy that referenced this issue Mar 15, 2023
@trea
Contributor

trea commented Mar 15, 2023

This turned out to be more complicated than I expected (weird! 😅)

Setting the Content-Length does prevent the requests from hanging, as I've done in #5435. I wrote up a custom integration suite that verifies that outcome.

However, I suspect this only prevents those requests from hanging and doesn't do anything about the underlying issue.

I discovered that FastCGI doesn't allow for transfer encodings, and indeed Caddy's reverse proxying rightly categorizes the Transfer-Encoding header as a "hop-by-hop" header that shouldn't be forwarded to any upstream, including FastCGI:

https://github.com/caddyserver/caddy/blob/master/modules/caddyhttp/reverseproxy/reverseproxy.go#L1375

Caddy accounts for this in its HTTP transport for reverse proxying and allows for buffering by default. The issue seems to be that it doesn't do so for FastCGI.

For a point of comparison, nginx has FastCGI buffering on by default: 4/8 KB in memory, then spilling to a temp file up to 1 GB, both configurable limits.

As a recap:

  • Prior to Caddy 2.6.3: requests to PHP backends with Transfer-Encoding: chunked were hanging (#5236)
  • Caddy 2.6.3: included #5289, which added an exception to the request-buffering check to account for Transfer-Encoding: chunked, at the risk of OOMing Caddy
  • Caddy 2.6.4: reverts #5289 via #5367, which means those requests go back to hanging
  • #5435: prevents the requests from hanging, but only if buffering is enabled
  • Future?: Caddy implements buffering in the FastCGI transport, negating the need to enable buffering by default when using FastCGI
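Until then, the interim mitigation is to enable request buffering on the FastCGI route. A Caddyfile sketch using the reverse_proxy request_buffers option (the address and the 4k size are just examples; note the buffer size caps how much request body is read ahead):

```caddyfile
php_fastcgi localhost:9000 {
    request_buffers 4k
}
```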

@mholt
Member

mholt commented Mar 16, 2023

@trea Excellent, thanks for confirming.

We'll make note of the need for FastCGI to buffer. @WeidiDeng is currently our FastCGI expert -- I'm not sure we're ready to implement temp file buffers though, but maybe we are? I dunno.

I'm sure we'd also accept a PR for that too if you're interested 😉

Thank you for the carefully-researched contribution and the effort you have made for an elegant fix!

@trea
Contributor

trea commented Mar 16, 2023

You're very welcome! Glad I could dig into it!

I might have a look at buffering for the transport at some point 😄

I think a few intermediary steps could be taken:

  • Add a note to the docs that calls out the use of FastCGI and recommends enabling buffers and setting a limit
  • Consider enabling buffers by default for now when the php_fastcgi directive is used
  • Write to logs and metrics when chunked requests destined for FastCGI are received

My concern is mostly that, yes, Caddy will now not OOM, but PHP upstreams will not work correctly for chunked requests, and I suspect that'll cause some headaches. 😅

@ahmetsoguksu

ahmetsoguksu commented Feb 27, 2024

Hi @mholt ;

You marked this problem as solved in 2.7.6, but we continue to experience the same problem with requests made with --header "Transfer-Encoding: chunked".

I encounter the same problem when I try to compile from the master branch.

03f703a00e006a1e53a2eaf5f4e33a1b3a5e4237 (25 Feb 24 21:13 UTC)

For our customers who need to use this header, we are forced to revert to version 2.6.3.

$ curl http://127.0.0.1 --header "Pragma: no-cache" -X POST -d file=test --header "Transfer-Encoding: chunked" -v

    php_fastcgi 127.0.0.1:9000 {
        index index.php
        try_files {path} {path}/index.php /customerror.php
    }
    encode gzip zstd
    file_server
}

I might be missing something.

Thanks.

@mholt
Member

mholt commented Mar 1, 2024

@ahmetsoguksu Can you post a minimally reproducible test case?

For example, a Caddyfile config (ideally one that is self-contained, as setting up backend apps and such is time-consuming), and a curl -v request that exhibits the behavior?

@ahmetsoguksu

Hi @mholt,

Since we are experiencing the problem with php-fpm, I have to include it in the test scenario as well.

Let's start with version 2.7.6:

$ caddy version
v2.7.6 h1:w0NymbG2m9PcvKWsrXO6EEkY9Ru4FJK8uQbYcev1p3A=

We can test it with a simple php code.

$ cat wwwroot/index.php
<?php
echo "Hello";

I used a container for php-fpm; you can use the same Compose file just by changing the path of index.php.

$ cat docker-compose.yml
version: "3.7"

services:
   php-fpm:
     image: php:8.1-fpm
     ports:
       - "9000:9000"
     volumes:
       - /home/ahmetsoguksu/projects/php-fpm/wwwroot:/home/ahmetsoguksu/projects/php-fpm/wwwroot

The Caddyfile is like this:

$ cat /etc/caddy/Caddyfile

:80 {
  root * /home/ahmetsoguksu/projects/php-fpm/wwwroot
  php_fastcgi localhost:9000 
  encode gzip zstd
  file_server
}

I see that "Hello" is returned in a normal request:

$ curl 127.0.0.1
Hello

When I send a request with the "Transfer-Encoding: chunked" header, I do not receive a response:

$ curl 127.0.0.1 --header "Pragma: no-cache" -X POST -d file=test --header "Transfer-Encoding:chunked" -vvv                       
Note: Unnecessary use of -X or --request, POST is already inferred.
*   Trying 127.0.0.1:80...
* Connected to 127.0.0.1 (127.0.0.1) port 80 (#0)
> POST / HTTP/1.1
> Host: 127.0.0.1
> User-Agent: curl/7.81.0
> Accept: */*
> Pragma: no-cache
> Transfer-Encoding:chunked
> Content-Type: application/x-www-form-urlencoded
> 


I don't experience any problems when I try it with version 2.6.3.

$ wget https://github.com/caddyserver/caddy/releases/download/v2.6.3/caddy_2.6.3_linux_amd64.deb                                  
$ sudo dpkg -i caddy_2.6.3_linux_amd64.deb 
$ caddy version
v2.6.3 h1:QRVBNIqfpqZ1eJacY44I6eUC1OcxQ8D04EKImzpj7S8=
$ sudo systemctl restart caddy  
$ curl 127.0.0.1 --header "Pragma: no-cache" -X POST -d file=test --header "Transfer-Encoding:chunked" -vvv                       
Note: Unnecessary use of -X or --request, POST is already inferred.
*   Trying 127.0.0.1:80...
* Connected to 127.0.0.1 (127.0.0.1) port 80 (#0)
> POST / HTTP/1.1
> Host: 127.0.0.1
> User-Agent: curl/7.81.0
> Accept: */*
> Pragma: no-cache
> Transfer-Encoding:chunked
> Content-Type: application/x-www-form-urlencoded
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Type: text/html; charset=UTF-8
< Server: Caddy
< X-Powered-By: PHP/8.1.27
< Date: Thu, 07 Mar 2024 13:30:34 GMT
< Content-Length: 5
< 
* Connection #0 to host 127.0.0.1 left intact
Hello     

Thanks.

@vedatkamer

@mholt do you have any comments about @ahmetsoguksu's latest comment? Thank you.

@mholt
Member

mholt commented Mar 20, 2024

What are the contents of the file named test?

One thought I have is that we/Go recently fixed a bug related to concurrently responding to requests while still reading from them (i.e. before the payload is finished). Does test end with two \r\ns? If not, the server is correctly waiting for the request to finish being sent. If you are declaring a chunked encoding then you need to send chunks.
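To illustrate that point: a chunked request body is only complete once the terminating zero-length chunk arrives. A small Go sketch of the wire format, using the standard library's chunked writer (the helper name is ours, for illustration):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http/httputil"
	"strings"
)

// chunkedBody shows what a chunked-encoded body looks like on the wire:
// each chunk is prefixed with its size in hex, and the stream must end
// with a zero-length chunk followed by a final CRLF.
func chunkedBody(payload string) string {
	var buf bytes.Buffer
	w := httputil.NewChunkedWriter(&buf)
	io.Copy(w, strings.NewReader(payload))
	w.Close()               // emits the terminating "0\r\n"
	buf.WriteString("\r\n") // final CRLF ending the body
	return buf.String()
}

func main() {
	fmt.Printf("%q\n", chunkedBody("file=test"))
	// A client that declares Transfer-Encoding: chunked but never sends
	// the zero-length chunk leaves the server waiting on the body.
}
```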

@ahmetsoguksu

Based on this (#5420 (comment)), we upgraded Caddy from 2.6.4 to 2.7.6 and enabled FastCGI buffering like this:

php_fastcgi localhost:9000 {
    request_buffers 4k
    response_buffers 4k
}

Now it doesn't hang.
Thanks a lot.

@akindemirci

@mholt Could you help us implement FastCGI buffering like nginx's: "buffering for fastcgi on by default, 4/8k in memory and then to tmpfile up to 1gb, both configurable limits"?

With this setting in the Caddyfile (request_buffers 4k), POST requests exceeding 4 KB are not received.

Thanks.

@mholt
Member

mholt commented Apr 10, 2024

@akindemirci That's definitely something I could prioritize with a sufficient sponsorship (send me an email if you have interest in that -- see Caddy website for an email address)!

@shyim
Contributor

shyim commented Oct 16, 2024

Could we get this issue reopened? It still persists in the latest version, and it's pretty evil that you can kill a PHP application behind Caddy with simple curl commands. Honestly, I would have rated this as a security issue, since an attacker can easily deny access to websites.

Here is a simple reproducer:

FROM php:fpm-alpine

COPY --from=caddy /usr/bin/caddy /usr/bin/caddy

COPY <<EOF /etc/caddy/Caddyfile
:8000

root * /var/www/html/public
php_fastcgi localhost:9000 {
    request_buffers 4k
    response_buffers 4k
}
EOF

COPY <<EOF /var/www/html/public/index.php
<?php
phpinfo();
EOF

COPY <<EOF /startup
#!/usr/bin/env sh

caddy run --config /etc/caddy/Caddyfile &
php-fpm &
wait
EOF

ENTRYPOINT ["/bin/sh", "/startup"]
docker build -t caddy-issue .
docker run --rm -p 8000:8000 caddy-issue

http://localhost:8000 serves a phpinfo page.

Run this multiple times:

curl -X POST http://127.0.0.1:8000 --header "Transfer-Encoding: chunked" --header "Content-Type: application/json" --http1.1 --data '{}' &

and the page does not respond anymore, as the php-fpm processes are blocked.

@WeidiDeng
Member

@shyim This issue is for Caddy v2.6.4, but your comment is about the latest version of Caddy. Either way, it's tracked in the new issue.
