
Redis: timeout #279

Open
liukaiyuliukaiyu opened this issue Aug 6, 2024 · 9 comments

@liukaiyuliukaiyu

Does the redis client block the nginx event loop in its implementation, or is the read timeout caused by Lua's own coroutine scheduling? I found that the redis server responds quickly, yet the redis client still reports a timeout because it cannot read the response within 50ms.
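For context, a minimal sketch (assuming lua-resty-redis; the key name and timings here are illustrative) of how such a 50ms read timeout is usually applied. receive() on the cosocket yields the Lua coroutine instead of blocking the worker, so the measured latency also includes however long the worker spends on other events before the coroutine is resumed:

local redis = require "resty.redis"

local red = redis:new()
red:set_timeouts(50, 50, 50)  -- connect/send/read timeouts, in ms

local ok, err = red:connect("127.0.0.1", 6379)
if not ok then
    ngx.log(ngx.ERR, "connect failed: ", err)
    return
end

ngx.update_time()
local t0 = ngx.now()

-- One EVAL round trip; the cosocket yields here and the worker keeps
-- serving other events until the reply (or the 50ms read timer) arrives.
local res, err = red:eval("return redis.call('INCRBY', KEYS[1], 1)", 1, "ratelimit:test")

ngx.update_time()
local elapsed_ms = (ngx.now() - t0) * 1000

if not res then
    -- "timeout" means the reply was not read within 50ms of wall-clock time,
    -- which includes any delay in resuming the coroutine, not just network time.
    ngx.log(ngx.ERR, "eval failed after ", string.format("%.0f", elapsed_ms), "ms: ", err)
else
    red:set_keepalive(10000, 100)
end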

@liukaiyuliukaiyu (Author)

The error log:

2024/08/06 15:49:01 [debug] 384947#384947: *171134 [lua] base_plugin.lua:24: access(): executing plugin "rate-limiting": access
2024/08/06 15:49:01 [error] 384947#384947: *171134 [lua] redis.lua:60: new(): ==========new==========, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp socket pool get keepalive peer
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp socket get keepalive peer: using connection 00007FBE83FEB1F8, fd:79
2024/08/06 15:49:01 [debug] 384947#384947: *171134 event timer del: 79: 3647014359
2024/08/06 15:49:01 [debug] 384947#384947: *171134 http cleanup add: 0000000001424E68
2024/08/06 15:49:01 [error] 384947#384947: *171134 [lua] redis.lua:220: _gen_req(): ======nargs======5, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/06 15:49:01 [error] 384947#384947: *171134 [lua] redis.lua:231: _gen_req(): ======_gen_req=====eval=====i=====1, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/06 15:49:01 [error] 384947#384947: *171134 [lua] redis.lua:231: _gen_req(): ======_gen_req===== local cache_key, expiration = KEYS[1], ARGV[1]
2024/08/06 15:49:01 [error] 384947#384947: *171134 [lua] redis.lua:231: _gen_req(): ======_gen_req=====1=====i=====3, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/06 15:49:01 [error] 384947#384947: *171134 [lua] redis.lua:231: _gen_req(): ======_gen_req=====ratelimit:d7af6063-037c-47d0-934a-1f19c412ea94:d7af6063-037c-47d0-934a-1f19c412ea94:1722930541000:second:0aa0512166b1d56d3617dfb30304e602=====i=====4, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/06 15:49:01 [error] 384947#384947: *171134 [lua] redis.lua:231: _gen_req(): ======_gen_req=====1=====i=====5, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp socket send timeout: 50
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua allocate new chainlink and new buf of size 431, cl:0000000001424EB0
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp socket send data
2024/08/06 15:49:01 [debug] 384947#384947: *171134 send: fd:79 431 of 431
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp socket sent all the data
2024/08/06 15:49:01 [error] 384947#384947: *171134 [lua] redis.lua:268: eval(): =====req======26, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/06 15:49:01 [error] 384947#384947: *171134 [lua] redis.lua:279: eval(): ===req===1:*5
2024/08/06 15:49:01 [error] 384947#384947: *171134 [lua] redis.lua:129: eval(): before_receive_time timestamp: 1722930541450, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp socket calling receive() method
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp socket read timeout: 50
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua allocate new chainlink and new buf of size 4096, cl:00000000014250C0
2024/08/06 15:49:01 [debug] 384947#384947: *171134 malloc: 0000000001425200:4096
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp socket read timeout: 50
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp socket read data: wait:0
2024/08/06 15:49:01 [debug] 384947#384947: *171134 event timer add: 79: 50:3647004499
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua resume returned 1
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua thread yielded
2024/08/06 15:49:01 [debug] 384947#384947: *171134 http run request: "/test_kylin_upstream_timeout?"
2024/08/06 15:49:01 [debug] 384947#384947: *171134 access phase: 14
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua access handler, uri:"/test_kylin_upstream_timeout" c:1
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua run write event handler: timedout:0, ready:1, writing_raw_req_socket:0
2024/08/06 15:49:01 [debug] 384947#384947: *171134 useless lua write event handler
2024/08/06 15:49:01 [debug] 384947#384947: *171134 event timer del: 79: 3647004499
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp socket handler for "/test_kylin_upstream_timeout?", wev 0
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp socket read handler
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp socket handle read error
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp socket waking up the current request
2024/08/06 15:49:01 [debug] 384947#384947: *171134 access phase: 14
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua access handler, uri:"/test_kylin_upstream_timeout" c:1
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp operation done, resuming lua thread
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp socket calling prepare retvals handler 0000000000663B16, u:00007FBE8318A408
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp socket receive return value handler
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua run thread, top:0 c:1
2024/08/06 15:49:01 [error] 384947#384947: *171134 [lua] redis.lua:133: eval(): =========receive time===== 53ms, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/06 15:49:01 [error] 384947#384947: *171134 [lua] redis.lua:134: eval(): =======line=======nil=====err=====timeout=====partial=====, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua finalize socket
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua http cleanup free: 0000000001424E68
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua close socket connection

@zhuizhuhaomeng (Contributor)

Just give us a minimal example that reproduces this problem; that would help us understand it.

You can also use OpenResty XRay to see if there are any off-CPU events.
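For reference, a hypothetical skeleton of what such a minimal repro could look like (not the reporter's actual setup; the port, key, and paths are placeholders), driven with something like wrk -t4 -c64 -d30s http://127.0.0.1:8000/repro:

worker_processes 1;
error_log logs/error.log notice;   # use the debug level with a debug build to get cosocket traces like the ones above

events { worker_connections 1024; }

http {
    server {
        listen 8000;

        location = /repro {
            # Mimic a rate-limiting plugin: one Redis round trip in the access
            # phase with a 50ms timeout, against a local Redis instance.
            access_by_lua_block {
                local redis = require "resty.redis"
                local red = redis:new()
                red:set_timeouts(50, 50, 50)

                local ok, err = red:connect("127.0.0.1", 6379)
                if not ok then
                    ngx.log(ngx.ERR, "connect failed: ", err)
                    return ngx.exit(500)
                end

                local res, err = red:incrby("ratelimit:repro", 1)
                if not res then
                    -- this is where the sporadic "timeout" errors would show up
                    ngx.log(ngx.ERR, "incrby failed: ", err)
                end

                red:set_keepalive(10000, 100)
            }

            content_by_lua_block { ngx.say("ok") }
        }
    }
}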

@liukaiyuliukaiyu (Author)

The Kong gateway uses Redis for rate limiting, and the gateway and Redis services are deployed in the same data center. The timeout configured on the gateway (as the Redis client) is 50ms.
The problem is that the QPS from the gateway to Redis is not high, yet the Redis client on the gateway side reports many timeouts. A tcpdump capture shows that the Redis server responds within 5ms, but a timeout is still reported after more than 50ms.
The machine spec is 4C8G. The container environment carries only test traffic, with no other requests, and wrk is used to send requests in parallel. With rate limiting enabled, each request triggers nginx to talk to the Redis server: it looks up a key in Redis and INCRBY increments its value, as shown in the source code: https://github.com/Kong/kong/blob/master/kong/plugins/rate-limiting/policies/init.lua
The CPU idle rate is more than 85%.
The captured packets are shown below; the Redis server responded within 1.5ms.

[screenshot of the tcpdump capture]
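For readers without the screenshots, a simplified sketch of that per-request round trip (the real script lives in the linked policies/init.lua; the key name and window here are illustrative) looks roughly like this:

local redis = require "resty.redis"
local red = redis:new()
red:set_timeouts(50, 50, 50)
assert(red:connect("127.0.0.1", 6379))

local script = [[
    local cache_key, expiration = KEYS[1], ARGV[1]
    -- count this request in the current window
    local count = redis.call("INCRBY", cache_key, 1)
    if tonumber(count) == 1 then
        -- first hit in the window: make the counter expire with the window
        redis.call("EXPIRE", cache_key, expiration)
    end
    return count
]]

-- one EVAL per request: EVAL, script, numkeys, key, expiration, which would
-- line up with the "_gen_req(): ======nargs======5" lines in the debug log above
local count, err = red:eval(script, 1, "ratelimit:<identifier>:<period>", 1)
if not count then
    ngx.log(ngx.ERR, "rate-limit eval failed: ", err)
end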

@liukaiyuliukaiyu (Author)

@zhuizhuhaomeng Hello, do you have any thoughts on this problem? Any guidance would be appreciated. Thank you.

@zhuizhuhaomeng (Contributor)

@liukaiyuliukaiyu
Did you run tcpdump on the gateway or on the Redis server?
You could also use perf sched record to check the maximum scheduling latency of the process.

@flowdopip commented Sep 13, 2024

I'm having the same problem that @liukaiyuliukaiyu mentioned. I created a custom plugin in Kong with this library and get several timeouts when connecting and authenticating through Redis. My setup is Kong DB-less with Redis Cluster, both in Kubernetes and sometimes on the same nodes.

A few error/debug messages from my setup:
[proxy-cache-redis] REDIS: redis-check failed to authenticate: timeout
fetch_slots(): failed to fetch slots: failed to fetch cluster info: timeout
redis.lua:200 failed to commit the cache value to Redis: timeout

I already tried increasing the connection timeout to larger values, and the problem remains.
I already ran tcpdump and see exactly the same thing as @liukaiyuliukaiyu.
I already checked file descriptors; they are limited to the maximum defined by nginx and OpenResty Lua (16384).

Any ideas or guidance on how to track down this issue?

My Redis connection settings:

"redis_hosts": "api-redis.trf-public.svc.cluster.local", "redis_password": xxx, "redis_port": "6379", "redis_timeout": "150", "redis_database": "0", "redis_ssl": "false", "redis_ssl_verify": "false", "pool_size": "256", "backlog": "16384", "max_idle_timeout": "5000", "redis_headless": "api-redis-headless.trf-public.svc.cluster.local"

@liukaiyuliukaiyu (Author)

Any ideas or guidance on how to track down this issue?

See Kong/kong#13473.
With that in mind, our conclusion so far is that the nginx user-mode event loop is taking a long time.
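One way to put numbers on that theory (a hypothetical watchdog, not something shipped with Kong; the 100ms interval and 10ms threshold are arbitrary) is a per-worker timer in nginx.conf (or a custom plugin's init_worker handler) that logs whenever it is resumed noticeably later than scheduled:

init_worker_by_lua_block {
    local interval = 0.1   -- seconds between watchdog wakeups
    local last

    ngx.timer.every(interval, function(premature)
        if premature then return end
        ngx.update_time()
        local now = ngx.now()
        if last then
            -- how much later than scheduled this callback actually ran
            local lag_ms = (now - last - interval) * 1000
            if lag_ms > 10 then
                ngx.log(ngx.WARN, "worker ", ngx.worker.pid(),
                        ": event loop lag ~", string.format("%.1f", lag_ms), "ms")
            end
        end
        last = now
    end)
}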

@flowdopip commented Sep 13, 2024

@liukaiyuliukaiyu did you already apply any mitigation or workaround for this issue?
I'm using Azure Fsv16 machines and only 8 Kong workers per pod, so that a set of resources is reserved for the k8s system and the machines are not exhausted.

@liukaiyuliukaiyu (Author)

@liukaiyuliukaiyu did you already apply any mitigation or workaround for this issue? I'm using Azure Fsv16 machines and only 8 Kong workers per pod, so that a set of resources is reserved for the k8s system and the machines are not exhausted.

We plan to upgrade the machine specs or split the traffic to work around this problem, but neither has been implemented yet. If you make any new progress or find a solution, please leave a message so we can keep each other up to date. Thank you!
