v2.6.4 php-fpm gets stuck in read state #5420

Closed
Jimbolino opened this issue Mar 6, 2023 · 26 comments
Labels
bug 🐞 Something isn't working

Comments

@Jimbolino

Since we upgraded from v2.6.3 to v2.6.4, we are experiencing some problems with our php-fpm backend

The problem is that the php-fpm process gets stuck in a "read" state (and never goes back to the "accept" state).
Eventually PHP's pm.max_children limit is reached and the entire server is killed by EC2's health checker.

I've done some debugging with strace but haven't found a pattern: the server gets lots of traffic, but the problem only occurs every once in a while (maybe once in 1000 requests).

root@frontend:~# strace -p 39614
strace: Process 39614 attached
read(5, 

root@frontend:~# nsenter -t 39614 --net lsof -n -p 39614 | grep 5u
php-fpm 39614 www-data    5u  IPv6 287757      0t0     TCP 172.18.0.3:9000->172.18.0.4:44492 (ESTABLISHED)

172.18.0.3 = app_php_1.app_default
172.18.0.4 = app_caddy_1.app_default

I'm afraid that #5367 might need to be revisited.
I've reverted back to v2.6.3 and the problem is gone.

@mholt
Member

mholt commented Mar 6, 2023

Thanks for the report. What is your config?

@Jimbolino
Author

Caddyfile:

{
    debug
    skip_install_trust
    auto_https disable_redirects
    on_demand_tls {
        ask {$CADDY_ON_DEMAND_ASK_URL}
    }
    admin caddy:8080 {
        origins {$CADDY_ADMIN_ALLOWED_ORIGINS}
    }
    storage redis {
        host {$CADDY_CLUSTERING_REDIS_HOST}
        port {$CADDY_CLUSTERING_REDIS_PORT}
        db {$CADDY_CLUSTERING_REDIS_DB}
        tls_enabled {$CADDY_CLUSTERING_REDIS_TLS}
    }
}

(global) {
    header server maglr
}

(frontend) {
    root * /app/docs

    route {
        # If the requested file does not exist, try index files
        @indexFiles file {
            try_files {path} {path}/index.php index.php
            split_path .php
        }
        rewrite @indexFiles {http.matchers.file.relative}

        # Proxy PHP files to the FastCGI responder
        @phpFiles path *.php
        reverse_proxy @phpFiles php:9000 {
            transport fastcgi {
                split .php
            }
        }
    }

    encode zstd gzip
    file_server {
        hide .htaccess
    }
    header -x-powered-by
}

(security) {
    header referrer-policy strict-origin-when-cross-origin
    header strict-transport-security max-age=31536000
    header x-content-type-options nosniff
}

*.maglr.localhost, *.localhost, https://127.0.0.1 {
    tls internal
    import frontend
    import global
}

https:// {
    tls {
        on_demand
    }

    import security
    import frontend
    import global
}


http:// {
    handle /health {
        respond "OK" 200
    }
    @notHealth {
        not path /health.php
    }
    handle {
        redir @notHealth https://{host}{uri}
    }
    import frontend
    import global
}

@francislavoie
Member

francislavoie commented Mar 6, 2023

That's odd. Was it broken on v2.6.2 as well?

Would you be able to set up a minimally reproducible example? i.e. a simple PHP script that exhibits the issue, plus a Caddyfile that only has the bits that are relevant to the breakage.

@francislavoie francislavoie added the needs info 📭 Requires more information label Mar 6, 2023
@Jimbolino
Author

Nope, v2.6.2 ran on production just fine.

I'm having trouble reproducing it locally (or on our staging environment).

Maybe it's got something to do with the AWS Network Load Balancer, or maybe bad bots that send malformed headers (a wrong Content-Length, etc.).

@francislavoie
Member

v2.6.4 should be functionally equivalent to v2.6.2 in terms of the proxy buffering functionality. I don't understand what's going on here.

@mholt
Member

mholt commented Mar 7, 2023

Can you try enabling access logs and maybe see if we can pin down which requests exactly are hanging?

@Jimbolino
Author

We have CloudWatch enabled for our access logs.
I think the default timeout is 3 seconds, because we see the following entries when searching for { $.duration > 1 }:

{
    "level": "error",
    "ts": 1678042875.2110817,
    "logger": "http.log.error",
    "msg": "dialing backend: dial tcp 172.18.0.3:9000: i/o timeout",
    "request": {
        "remote_ip": "172.31.10.238",
        "remote_port": "26109",
        "proto": "HTTP/1.1",
        "method": "GET",
        "host": "172.31.10.238:443",
        "uri": "/health.php",
        "headers": {
            "User-Agent": [
                "ELB-HealthChecker/2.0"
            ],
            "Connection": [
                "close"
            ],
            "Accept": [
                "*/*"
            ],
            "Accept-Encoding": [
                "*"
            ]
        }
    },
    "duration": 3.001065249,
    "status": 502,
    "err_id": "qd6cryvhi",
    "err_trace": "reverseproxy.statusError (reverseproxy.go:1299)"
}

From this ts (2023-03-05T20:01:15.211+01:00) I went back in the logs to see what was happening.
The previous 5-6 mins are only calls to health.php, and before that there is some normal traffic.

@trea
Contributor

trea commented Mar 9, 2023

Is your health.php making any external calls that might not be returned to PHP before Caddy gives up on the request?

I worked on a project where we had reverse proxying to PHP on Docker and couldn't figure out why we would repeatedly get 5xx errors and no running PHP.

It turned out that we were making requests with Curl to a flaky service and someone had inadvertently set CURLOPT_TIMEOUT to 0. Any request to that flaky service that wasn't handled before the reverse proxy gave up and returned to the client left open a PHP worker sitting around waiting for the flaky service to respond until there weren't any workers left.

@Jimbolino
Author

nope:

<?php
const MAX_LOAD = 25;
$load = sys_getloadavg();
if ($load[0] > MAX_LOAD) {
    http_response_code(503);
    exit('503');
}
if (!isset($_SERVER['APP_ENV'])) {
    $root = __DIR__.'/..';
    if (!file_exists($root.'/.env')) {
        http_response_code(500);
        exit('unknown environment');
    }
    require_once $root.'/vendor/autoload.php';
    Dotenv\Dotenv::createImmutable($root)->load();
}
exit('ok');

And if I understand strace correctly, it's PHP that is stuck reading on the incoming connection from Caddy,
so maybe Caddy did not close the connection or send an EOF or something?

@trea
Contributor

trea commented Mar 10, 2023

Thanks, the health.php file you provided looks fine.

I stand by my theory, though, that you have something else that's causing PHP to not return quickly enough for your call to /health.php. I did create a reproduction using Docker Compose and put it in this repo: https://github.com/trea/php-fpm-issue

If you follow the README (you may have to run the healthcheck a few times), you end up with requests that take too long. The healthcheck fails, then autoheal restarts PHP, and Caddy can't connect to it until it has finished restarting:

"error": "dialing backend: dial tcp 192.168.48.6:9000: connect: connection refused"

Log Output
php-fpm-exit-test-caddy-1     | {
  "level": "debug",
  "ts": 1678475769.7431428,
  "logger": "http.handlers.reverse_proxy",
  "msg": "upstream roundtrip",
  "upstream": "php:9000",
  "duration": 0.000587583,
  "request": {
    "remote_ip": "192.168.48.4",
    "remote_port": "54200",
    "proto": "HTTP/1.1",
    "method": "GET",
    "host": "caddy",
    "uri": "/slow.php",
    "headers": {
      "X-Forwarded-Proto": [
        "http"
      ],
      "X-Forwarded-Host": [
        "caddy"
      ],
      "User-Agent": [
        "fortio.org/fortio-1.53.1"
      ],
      "X-Forwarded-For": [
        "192.168.48.4"
      ]
    }
  },
  "error": "dialing backend: dial tcp 192.168.48.6:9000: connect: connection refused"
}

I would recommend setting up a slowlog for FPM and ensuring that the ptrace capability is available; that way you should get a better idea of what is causing FPM to hang.
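For reference, the slowlog is enabled per FPM pool; a minimal sketch (the paths and threshold here are illustrative, both directives are standard php-fpm pool options):

```ini
; www.conf (pool configuration)
; Log a backtrace for any request running longer than the timeout.
request_slowlog_timeout = 2s
slowlog = /var/log/php-fpm/www-slow.log
```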

In this test case the output points to curl_exec:

php-fpm-exit-test-php-1       | [10-Mar-2023 19:33:48] WARNING: [pool www] child 8, script '/var/www/html/slow.php' (request: "GET /slow.php") executing too slow (2.210546 sec), logging
php-fpm-exit-test-php-1       | [10-Mar-2023 19:33:48] WARNING: [pool www] child 7, script '/var/www/html/slow.php' (request: "GET /slow.php") executing too slow (2.220195 sec), logging
php-fpm-exit-test-php-1       | [10-Mar-2023 19:33:48] NOTICE: child 8 stopped for tracing
php-fpm-exit-test-php-1       | [10-Mar-2023 19:33:48] NOTICE: about to trace 8
php-fpm-exit-test-php-1       |
php-fpm-exit-test-php-1       | [10-Mar-2023 19:33:48]  [pool www] pid 8
php-fpm-exit-test-php-1       | script_filename = /var/www/html/slow.php
php-fpm-exit-test-php-1       | [0x0000ffff9f2130c0] curl_exec() /var/www/html/slow.php:6

@Jimbolino
Author

There are indeed some calls that occasionally take too long.
It still doesn't explain why previous versions of Caddy don't cause the same issue.

Maybe #5431 is related

@trea
Contributor

trea commented Mar 13, 2023

Actually, after looking into this more, I think you were right when you suggested #5367 was related, and I do think #5431 is related as well. I think they specifically go back to this change:

https://github.com/caddyserver/caddy/pull/5367/files#diff-a248c9a1ec018edea8b377d155bc1df1a642bf79d00ababb5cdacc6b86c5733dL627

The FastCGI specification states that Responders read up to a total of CONTENT_LENGTH bytes, but as a result of #5367, that parameter is now empty for requests with Transfer-Encoding: chunked.

I have updated my repo to include both versions of Caddy and two more services in the Docker Compose file, chunked and chunked264. When I run docker-compose run chunked264, PHP actually dies and is respawned immediately. You can run docker-compose run chunked all you want and it works fine.

I understand the reverting of the buffer settings, but I think Caddy should guarantee that CONTENT_LENGTH is set to an int (0 is fine and even desirable) at least as far as the FastCGI transport is concerned. I'd be happy to submit a PR if that'd be welcome.

I think changing this to guard against empty strings and send a 0 instead of an empty string for CONTENT_LENGTH would do it:
https://github.com/caddyserver/caddy/blob/master/modules/caddyhttp/reverseproxy/fastcgi/fastcgi.go#L299

In fact, it's done for the FastCGI protocol elsewhere 🤔 :
https://github.com/caddyserver/caddy/blob/master/modules/caddyhttp/reverseproxy/fastcgi/fastcgi.go#L186
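A guard along those lines could be as small as the following sketch (not Caddy's actual code; the function name is illustrative). The idea is that FastCGI's CONTENT_LENGTH param must always be a numeric string, so an unknown length (e.g. a chunked request, where Go's http.Request.ContentLength is -1) is sent as "0" rather than an empty string:

```go
package main

import (
	"fmt"
	"strconv"
)

// buildContentLength formats the CONTENT_LENGTH FastCGI param.
// Negative means "unknown" (chunked request), which must become "0"
// instead of an empty string so the responder doesn't block reading.
func buildContentLength(contentLength int64) string {
	if contentLength < 0 {
		return "0"
	}
	return strconv.FormatInt(contentLength, 10)
}

func main() {
	fmt.Println(buildContentLength(-1))  // chunked: length unknown -> "0"
	fmt.Println(buildContentLength(512)) // known length -> "512"
}
```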

@francislavoie
Member

Ah, interesting. That does make sense @trea. PR is definitely welcome :)

@francislavoie francislavoie added bug 🐞 Something isn't working and removed needs info 📭 Requires more information labels Mar 14, 2023
trea added a commit to trea/caddy that referenced this issue Mar 15, 2023
@trea
Contributor

trea commented Mar 15, 2023

This turned out to be more complicated than I expected (weird! 😅)

Setting the Content-Length does prevent the requests from hanging, as I've done in #5435. I wrote up a custom integration suite that verifies that outcome.

However, I suspect this only prevents those requests from hanging and doesn't do anything about the underlying issue.

I discovered that FastCGI doesn't allow for transfer encodings, and indeed Caddy's reverse proxying rightly categorizes the Transfer-Encoding header as a "hop-by-hop" header that shouldn't be forwarded to any upstream, including FastCGI:

https://github.com/caddyserver/caddy/blob/master/modules/caddyhttp/reverseproxy/reverseproxy.go#L1375

Caddy accounts for this in its HTTP transport for reverse proxying and allows for buffering by default. The issue seems to be that it doesn't do so for FastCGI.

For a point of comparison, nginx has FastCGI buffering on by default: 4/8 KB in memory, then spilling to a temp file up to 1 GB, both configurable limits.

As a recap:

  • Prior to Caddy 2.6.3: requests to PHP backends with Transfer-Encoding: chunked were hanging (#5236)
  • Caddy 2.6.3: included #5289, which added an exception to the request-buffering check to account for Transfer-Encoding: chunked, at the risk of OOMing Caddy
  • Caddy 2.6.4: reverts #5289 via #5367, which means those requests go back to hanging
  • #5435: prevents the requests from hanging, but only if buffering is enabled
  • Future?: Caddy implements buffering in the FastCGI transport, negating the need to enable buffering by default when using FastCGI
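Until then, the interim mitigation is to enable request buffering on the FastCGI route. A Caddyfile sketch using the reverse_proxy request_buffers option (the address and the 4k size are just examples; note the buffer size caps how much request body is read ahead):

```caddyfile
php_fastcgi localhost:9000 {
    request_buffers 4k
}
```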

@mholt
Member

mholt commented Mar 16, 2023

@trea Excellent, thanks for confirming.

We'll make note of the need for FastCGI to buffer. @WeidiDeng is currently our FastCGI expert -- I'm not sure we're ready to implement temp file buffers though, but maybe we are? I dunno.

I'm sure we'd also accept a PR for that too if you're interested 😉

Thank you for the carefully-researched contribution and the effort you have made for an elegant fix!

@trea
Contributor

trea commented Mar 16, 2023

You're very welcome! Glad I could dig into it!

I might have a look at buffering for the transport at some point 😄

I think a few intermediary steps could be taken:

  • Add a note to the docs that calls out the use of FastCGI and recommends enabling buffers and setting a limit
  • Consider enabling buffers by default for now when the php_fastcgi directive is used
  • Write to logs and metrics when chunked requests destined for FastCGI are received

My concern is mostly that, yes, Caddy will now not OOM, but PHP upstreams will not work correctly for chunked requests, and I suspect that'll cause some headaches. 😅

@ahmetsoguksu

ahmetsoguksu commented Feb 27, 2024

Hi @mholt ;

You marked this problem as solved in 2.7.6, but we continue to experience the same problem with requests made with --header "Transfer-Encoding: chunked".

I encounter the same problem when I try to compile from the master branch.

03f703a00e006a1e53a2eaf5f4e33a1b3a5e4237 (25 Feb 24 21:13 UTC)

For our customers who need to use this header, we are forced to revert to version 2.6.3.

$ curl http://127.0.0.1 --header "Pragma: no-cache" -X POST -d file=test --header "Transfer-Encoding: chunked" -v

    php_fastcgi 127.0.0.1:9000 {
        index index.php
        try_files {path} {path}/index.php /customerror.php
    }
    encode gzip zstd
    file_server
}

I might be missing something.

Thanks.

@mholt
Member

mholt commented Mar 1, 2024

@ahmetsoguksu Can you post a minimally reproducible test case?

For example, a Caddyfile config (ideally one that is self-contained, as setting up backend apps and such is time-consuming), and a curl -v request that exhibits the behavior?

@ahmetsoguksu

Hi @mholt,

Since we are experiencing the problem with php-fpm, I have to include it in the test scenario as well.

Let's start with version 2.7.6:

$ caddy version
v2.7.6 h1:w0NymbG2m9PcvKWsrXO6EEkY9Ru4FJK8uQbYcev1p3A=

We can test it with a simple php code.

$ cat wwwroot/index.php
<?php
echo "Hello";

I used a container for php-fpm; you can use the same Compose file just by changing the path of index.php.

$ cat docker-compose.yml
version: "3.7"

services:
   php-fpm:
     image: php:8.1-fpm
     ports:
       - "9000:9000"
     volumes:
       - /home/ahmetsoguksu/projects/php-fpm/wwwroot:/home/ahmetsoguksu/projects/php-fpm/wwwroot

The Caddyfile is like this:

$ cat /etc/caddy/Caddyfile

:80 {
  root * /home/ahmetsoguksu/projects/php-fpm/wwwroot
  php_fastcgi localhost:9000 
  encode gzip zstd
  file_server
}

I see that "Hello" is returned in a normal request:

$ curl 127.0.0.1
Hello

When I send a request with the "Transfer-Encoding: chunked" header, I do not receive a response:

$ curl 127.0.0.1 --header "Pragma: no-cache" -X POST -d file=test --header "Transfer-Encoding:chunked" -vvv                       
Note: Unnecessary use of -X or --request, POST is already inferred.
*   Trying 127.0.0.1:80...
* Connected to 127.0.0.1 (127.0.0.1) port 80 (#0)
> POST / HTTP/1.1
> Host: 127.0.0.1
> User-Agent: curl/7.81.0
> Accept: */*
> Pragma: no-cache
> Transfer-Encoding:chunked
> Content-Type: application/x-www-form-urlencoded
> 


I don't experience any problems when I try it with version 2.6.3.

$ wget https://github.com/caddyserver/caddy/releases/download/v2.6.3/caddy_2.6.3_linux_amd64.deb                                  
$ sudo dpkg -i caddy_2.6.3_linux_amd64.deb 
$ caddy version
v2.6.3 h1:QRVBNIqfpqZ1eJacY44I6eUC1OcxQ8D04EKImzpj7S8=
$ sudo systemctl restart caddy  
$ curl 127.0.0.1 --header "Pragma: no-cache" -X POST -d file=test --header "Transfer-Encoding:chunked" -vvv                       
Note: Unnecessary use of -X or --request, POST is already inferred.
*   Trying 127.0.0.1:80...
* Connected to 127.0.0.1 (127.0.0.1) port 80 (#0)
> POST / HTTP/1.1
> Host: 127.0.0.1
> User-Agent: curl/7.81.0
> Accept: */*
> Pragma: no-cache
> Transfer-Encoding:chunked
> Content-Type: application/x-www-form-urlencoded
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Type: text/html; charset=UTF-8
< Server: Caddy
< X-Powered-By: PHP/8.1.27
< Date: Thu, 07 Mar 2024 13:30:34 GMT
< Content-Length: 5
< 
* Connection #0 to host 127.0.0.1 left intact
Hello     

Thanks.

@vedatkamer

@mholt do you have any comments about @ahmetsoguksu's latest comment? Thank you.

@mholt
Member

mholt commented Mar 20, 2024

What are the contents of the file named test?

One thought I have is that we/Go recently fixed a bug related to concurrently responding to requests while still reading from them (i.e. before the payload is finished). Does test end with two \r\ns? If not, the server is correctly waiting for the request to finish being sent. If you are declaring a chunked encoding then you need to send chunks.
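To illustrate that point: a chunked request body is only complete once the terminating zero-length chunk arrives. A small Go sketch of the wire format, using the standard library's chunked writer (the helper name is ours, for illustration):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http/httputil"
	"strings"
)

// chunkedBody shows what a chunked-encoded body looks like on the wire:
// each chunk is prefixed with its size in hex, and the stream must end
// with a zero-length chunk followed by a final CRLF.
func chunkedBody(payload string) string {
	var buf bytes.Buffer
	w := httputil.NewChunkedWriter(&buf)
	io.Copy(w, strings.NewReader(payload))
	w.Close()               // emits the terminating "0\r\n"
	buf.WriteString("\r\n") // final CRLF ending the body
	return buf.String()
}

func main() {
	fmt.Printf("%q\n", chunkedBody("file=test"))
	// A client that declares Transfer-Encoding: chunked but never sends
	// the zero-length chunk leaves the server waiting on the body.
}
```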

@ahmetsoguksu

Based on this (#5420 (comment)), we upgraded Caddy from 2.6.4 to 2.7.6 and enabled FastCGI buffering like this:

php_fastcgi localhost:9000 {
    request_buffers 4k
    response_buffers 4k
}

Now it doesn't hang.
Thanks a lot.

@akindemirci

@mholt Could you help us implement FastCGI buffering like nginx's: "buffering for fastcgi on by default, 4/8k in memory and then to tmpfile up to 1gb, both configurable limits"?

With this setting in the Caddyfile (request_buffers 4k), POST requests exceeding 4 KB are not received.

Thanks.

@mholt
Member

mholt commented Apr 10, 2024

@akindemirci That's definitely something I could prioritize with a sufficient sponsorship (send me an email if you have interest in that -- see Caddy website for an email address)!

@shyim
Contributor

shyim commented Oct 16, 2024

Could we get this issue reopened? It still persists in the latest version, and it's pretty evil that you can kill a PHP application behind Caddy with simple curl commands. Honestly, I would have rated this as a security issue, since an attacker can easily deny access to websites.

Here is a simple reproducer:

FROM php:fpm-alpine

COPY --from=caddy /usr/bin/caddy /usr/bin/caddy

COPY <<EOF /etc/caddy/Caddyfile
:8000

root * /var/www/html/public
php_fastcgi localhost:9000 {
    request_buffers 4k
    response_buffers 4k
}
EOF

COPY <<EOF /var/www/html/public/index.php
<?php
phpinfo();
EOF

COPY <<EOF /startup
#!/usr/bin/env sh

caddy run --config /etc/caddy/Caddyfile &
php-fpm &
wait
EOF

ENTRYPOINT ["/bin/sh", "/startup"]
docker build -t caddy-issue .
docker run --rm -p 8000:8000 caddy-issue

http://localhost:8000 serves a phpinfo page.

Run this multiple times:

curl -X POST http://127.0.0.1:8000 --header "Transfer-Encoding: chunked" --header "Content-Type: application/json" --http1.1 --data '{}' &

and the page does not respond anymore, as the php-fpm processes are blocked.

@WeidiDeng
Member

@shyim This issue is for Caddy v2.6.4, but your comment is about the latest version of Caddy. Either way, it's tracked in the new issue.
