
Redis: timeout #279

Open
liukaiyuliukaiyu opened this issue Aug 6, 2024 · 9 comments

@liukaiyuliukaiyu

Does the redis client block the nginx event loop in its implementation, or is the read timeout caused by Lua's own coroutine scheduling? I found that the redis server responds quickly, yet the redis client still reports a timeout because it cannot read the response within 50ms.
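For context, a minimal sketch (assuming lua-resty-redis; the key name and timings here are illustrative) of how such a 50ms read timeout is usually applied. receive() on the cosocket yields the Lua coroutine instead of blocking the worker, so the measured latency also includes however long the worker spends on other events before the coroutine is resumed:

local redis = require "resty.redis"

local red = redis:new()
red:set_timeouts(50, 50, 50)  -- connect/send/read timeouts, in ms

local ok, err = red:connect("127.0.0.1", 6379)
if not ok then
    ngx.log(ngx.ERR, "connect failed: ", err)
    return
end

ngx.update_time()
local t0 = ngx.now()

-- One EVAL round trip; the cosocket yields here and the worker keeps
-- serving other events until the reply (or the 50ms read timer) arrives.
local res, err = red:eval("return redis.call('INCRBY', KEYS[1], 1)", 1, "ratelimit:test")

ngx.update_time()
local elapsed_ms = (ngx.now() - t0) * 1000

if not res then
    -- "timeout" means the reply was not read within 50ms of wall-clock time,
    -- which includes any delay in resuming the coroutine, not just network time.
    ngx.log(ngx.ERR, "eval failed after ", string.format("%.0f", elapsed_ms), "ms: ", err)
else
    red:set_keepalive(10000, 100)
end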

@liukaiyuliukaiyu (Author)

The error log:

2024/08/06 15:49:01 [debug] 384947#384947: *171134 [lua] base_plugin.lua:24: access(): executing plugin "rate-limiting": access
2024/08/06 15:49:01 [error] 384947#384947: *171134 [lua] redis.lua:60: new(): ==========new==========, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp socket pool get keepalive peer
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp socket get keepalive peer: using connection 00007FBE83FEB1F8, fd:79
2024/08/06 15:49:01 [debug] 384947#384947: *171134 event timer del: 79: 3647014359
2024/08/06 15:49:01 [debug] 384947#384947: *171134 http cleanup add: 0000000001424E68
2024/08/06 15:49:01 [error] 384947#384947: *171134 [lua] redis.lua:220: _gen_req(): ======nargs======5, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/06 15:49:01 [error] 384947#384947: *171134 [lua] redis.lua:231: _gen_req(): ======_gen_req=====eval=====i=====1, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/06 15:49:01 [error] 384947#384947: *171134 [lua] redis.lua:231: _gen_req(): ======_gen_req===== local cache_key, expiration = KEYS[1], ARGV[1]
2024/08/06 15:49:01 [error] 384947#384947: *171134 [lua] redis.lua:231: _gen_req(): ======_gen_req=====1=====i=====3, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/06 15:49:01 [error] 384947#384947: *171134 [lua] redis.lua:231: _gen_req(): ======_gen_req=====ratelimit:d7af6063-037c-47d0-934a-1f19c412ea94:d7af6063-037c-47d0-934a-1f19c412ea94:1722930541000:second:0aa0512166b1d56d3617dfb30304e602=====i=====4, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/06 15:49:01 [error] 384947#384947: *171134 [lua] redis.lua:231: _gen_req(): ======_gen_req=====1=====i=====5, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp socket send timeout: 50
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua allocate new chainlink and new buf of size 431, cl:0000000001424EB0
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp socket send data
2024/08/06 15:49:01 [debug] 384947#384947: *171134 send: fd:79 431 of 431
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp socket sent all the data
2024/08/06 15:49:01 [error] 384947#384947: *171134 [lua] redis.lua:268: eval(): =====req======26, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/06 15:49:01 [error] 384947#384947: *171134 [lua] redis.lua:279: eval(): ===req===1:*5
2024/08/06 15:49:01 [error] 384947#384947: *171134 [lua] redis.lua:129: eval(): before_receive_time timestamp: 1722930541450, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp socket calling receive() method
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp socket read timeout: 50
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua allocate new chainlink and new buf of size 4096, cl:00000000014250C0
2024/08/06 15:49:01 [debug] 384947#384947: *171134 malloc: 0000000001425200:4096
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp socket read timeout: 50
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp socket read data: wait:0
2024/08/06 15:49:01 [debug] 384947#384947: *171134 event timer add: 79: 50:3647004499
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua resume returned 1
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua thread yielded
2024/08/06 15:49:01 [debug] 384947#384947: *171134 http run request: "/test_kylin_upstream_timeout?"
2024/08/06 15:49:01 [debug] 384947#384947: *171134 access phase: 14
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua access handler, uri:"/test_kylin_upstream_timeout" c:1
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua run write event handler: timedout:0, ready:1, writing_raw_req_socket:0
2024/08/06 15:49:01 [debug] 384947#384947: *171134 useless lua write event handler
2024/08/06 15:49:01 [debug] 384947#384947: *171134 event timer del: 79: 3647004499
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp socket handler for "/test_kylin_upstream_timeout?", wev 0
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp socket read handler
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp socket handle read error
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp socket waking up the current request
2024/08/06 15:49:01 [debug] 384947#384947: *171134 access phase: 14
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua access handler, uri:"/test_kylin_upstream_timeout" c:1
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp operation done, resuming lua thread
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp socket calling prepare retvals handler 0000000000663B16, u:00007FBE8318A408
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua tcp socket receive return value handler
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua run thread, top:0 c:1
2024/08/06 15:49:01 [error] 384947#384947: *171134 [lua] redis.lua:133: eval(): =========receive time===== 53ms, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/06 15:49:01 [error] 384947#384947: *171134 [lua] redis.lua:134: eval(): =======line=======nil=====err=====timeout=====partial=====, client: 127.0.0.1, server: kylin, request: "GET /test_kylin_upstream_timeout HTTP/1.1", host: "127.0.0.1:8000"
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua finalize socket
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua http cleanup free: 0000000001424E68
2024/08/06 15:49:01 [debug] 384947#384947: *171134 lua close socket connection

@zhuizhuhaomeng (Contributor)

Just give us a minimal example that reproduces this problem; that would help us understand it.

You can also use OpenResty XRay to see if there are any off-CPU events.
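For reference, a hypothetical skeleton of what such a minimal repro could look like (not the reporter's actual setup; the port, key, and paths are placeholders), driven with something like wrk -t4 -c64 -d30s http://127.0.0.1:8000/repro:

worker_processes 1;
error_log logs/error.log notice;   # use the debug level with a debug build to get cosocket traces like the ones above

events { worker_connections 1024; }

http {
    server {
        listen 8000;

        location = /repro {
            # Mimic a rate-limiting plugin: one Redis round trip in the access
            # phase with a 50ms timeout, against a local Redis instance.
            access_by_lua_block {
                local redis = require "resty.redis"
                local red = redis:new()
                red:set_timeouts(50, 50, 50)

                local ok, err = red:connect("127.0.0.1", 6379)
                if not ok then
                    ngx.log(ngx.ERR, "connect failed: ", err)
                    return ngx.exit(500)
                end

                local res, err = red:incrby("ratelimit:repro", 1)
                if not res then
                    -- this is where the sporadic "timeout" errors would show up
                    ngx.log(ngx.ERR, "incrby failed: ", err)
                end

                red:set_keepalive(10000, 100)
            }

            content_by_lua_block { ngx.say("ok") }
        }
    }
}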

@liukaiyuliukaiyu (Author)

The Kong gateway uses Redis for rate limiting, and the gateway and Redis services are deployed in the same data center. The timeout configured on the gateway (as the Redis client) is 50ms.
The problem is that the QPS from the gateway to Redis is not high, yet the Redis client on the gateway side reports many timeouts. A tcpdump capture shows that the Redis server responds within 5ms, but a timeout is still reported after more than 50ms.
The machine spec is 4C8G. The container environment carries only test traffic, with no other requests, and wrk is used to send requests in parallel. With rate limiting enabled, each request triggers nginx to talk to the Redis server: it looks up a key in Redis and INCRBY increments its value, as shown in the source code: https://github.com/Kong/kong/blob/master/kong/plugins/rate-limiting/policies/init.lua
The CPU idle rate is more than 85%.
The captured packets are shown below; the Redis server responded within 1.5ms.

[screenshot of the tcpdump capture]
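For readers without the screenshots, a simplified sketch of that per-request round trip (the real script lives in the linked policies/init.lua; the key name and window here are illustrative) looks roughly like this:

local redis = require "resty.redis"
local red = redis:new()
red:set_timeouts(50, 50, 50)
assert(red:connect("127.0.0.1", 6379))

local script = [[
    local cache_key, expiration = KEYS[1], ARGV[1]
    -- count this request in the current window
    local count = redis.call("INCRBY", cache_key, 1)
    if tonumber(count) == 1 then
        -- first hit in the window: make the counter expire with the window
        redis.call("EXPIRE", cache_key, expiration)
    end
    return count
]]

-- one EVAL per request: EVAL, script, numkeys, key, expiration, which would
-- line up with the "_gen_req(): ======nargs======5" lines in the debug log above
local count, err = red:eval(script, 1, "ratelimit:<identifier>:<period>", 1)
if not count then
    ngx.log(ngx.ERR, "rate-limit eval failed: ", err)
end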

@liukaiyuliukaiyu (Author)

@zhuizhuhaomeng Hello, do you have any thoughts on this problem? Any guidance would be appreciated. Thank you.

@zhuizhuhaomeng (Contributor)

@liukaiyuliukaiyu
Did you run tcpdump on the gateway or on the Redis server?
You could also use perf sched record to check the maximum scheduling latency of the process.

@flowdopip commented Sep 13, 2024

I'm having the same problem that @liukaiyuliukaiyu mentioned. I created a custom plugin in Kong with this library and get several timeouts when connecting and authenticating through Redis. My setup is Kong DB-less with Redis Cluster, both in Kubernetes and sometimes on the same nodes.

A few error/debug messages from my setup:
[proxy-cache-redis] REDIS: redis-check failed to authenticate: timeout
fetch_slots(): failed to fetch slots: failed to fetch cluster info: timeout
redis.lua:200 failed to commit the cache value to Redis: timeout

I already tried increasing the connection timeout to larger values, and the problem remains.
I already ran tcpdump and see exactly the same thing as @liukaiyuliukaiyu.
I already checked file descriptors; they are limited to the maximum defined by nginx and OpenResty Lua (16384).

Any ideas or guidance on how to track down this issue?

My Redis connection settings:

"redis_hosts": "api-redis.trf-public.svc.cluster.local", "redis_password": xxx, "redis_port": "6379", "redis_timeout": "150", "redis_database": "0", "redis_ssl": "false", "redis_ssl_verify": "false", "pool_size": "256", "backlog": "16384", "max_idle_timeout": "5000", "redis_headless": "api-redis-headless.trf-public.svc.cluster.local"

@liukaiyuliukaiyu (Author)

Any ideas or guidance on how to track down this issue?

See Kong/kong#13473.
With that in mind, our conclusion so far is that the nginx user-mode event loop is taking a long time.
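One way to put numbers on that theory (a hypothetical watchdog, not something shipped with Kong; the 100ms interval and 10ms threshold are arbitrary) is a per-worker timer in nginx.conf (or a custom plugin's init_worker handler) that logs whenever it is resumed noticeably later than scheduled:

init_worker_by_lua_block {
    local interval = 0.1   -- seconds between watchdog wakeups
    local last

    ngx.timer.every(interval, function(premature)
        if premature then return end
        ngx.update_time()
        local now = ngx.now()
        if last then
            -- how much later than scheduled this callback actually ran
            local lag_ms = (now - last - interval) * 1000
            if lag_ms > 10 then
                ngx.log(ngx.WARN, "worker ", ngx.worker.pid(),
                        ": event loop lag ~", string.format("%.1f", lag_ms), "ms")
            end
        end
        last = now
    end)
}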

@flowdopip commented Sep 13, 2024

@liukaiyuliukaiyu did you already apply any mitigation or workaround for this issue?
I'm using Azure Fsv16 machines and only 8 Kong workers per pod, so that a set of resources is reserved for the k8s system and the machines are not exhausted.

@liukaiyuliukaiyu (Author)

@liukaiyuliukaiyu did you already apply any mitigation or workaround for this issue? I'm using Azure Fsv16 machines and only 8 Kong workers per pod, so that a set of resources is reserved for the k8s system and the machines are not exhausted.

We plan to upgrade the machine specs or split the traffic to work around this problem, but neither has been implemented yet. If you make any new progress or find a solution, please leave a message so we can keep each other up to date. Thank you!
