Investigate context deadline exceeded with DragonflyDB #389

Closed
FZambia opened this issue Jun 11, 2024 · 8 comments

Comments

@FZambia
Member

FZambia commented Jun 11, 2024

There was a report in the community Telegram group:

Hello. We have a problem with CF. We ran CF v5.1.1 with a Redis cluster on k8s, and that configuration worked fine. We switched from the Redis cluster to DragonflyDB on staging and it worked fine too. After that we switched from the Redis cluster to DragonflyDB on prod, and now we get a lot of "error updating presence for channel" and "error adding presence" errors. We updated CF to v5.4.0 but the errors continue. How can we fix it?


docker.dragonflydb.io/dragonflydb/dragonfly:v1.19.0
Centrifugo v5.4.0

The current assumption is that benchmarks should help to reproduce the problem.

@romange

romange commented Jun 14, 2024

Please let us know how it goes

@FZambia
Member Author

FZambia commented Jun 22, 2024

I was unable to reproduce this error with benchmarks. I tried on macOS (Docker) and on Ubuntu 24.04 (with and without Docker).

Some other findings:

Redis outperforms DragonflyDB by up to an order of magnitude in Centrifuge benchmarks. My explanation is that Centrifuge uses pipelining over a single connection, and I guess DF batches calls over io_uring, collecting them from different connections. That means requests coming through a single connection simply wait longer to be executed. This is actually not bad, since in practice we have many Centrifuge nodes, so a higher overall throughput could potentially be achieved.

I quickly tried running 10 instances of the benchmark in parallel and saw that DF can achieve a higher throughput in this case, which to me supports the theory above. It was still far from Redis throughput, though, and DF used about 450% CPU compared to 100% for Redis. 100% is the limit for Redis, but it's clear that Redis delivers the best throughput it can on a single core.

Also, latencies are very unstable with DF when using several pipelining connections: running the same bench may result in 20k rps, then 100k rps, then 20k again. With Redis, latencies are stable and the benchmark rps is consistent.

So far I've run presence benchmarks; Centrifuge uses Lua in such requests. It's not very convenient to experiment at this point, as I had to do many manual tweaks, so it would be nice to automate the various bench conditions.

@romange

romange commented Jun 22, 2024

@FZambia, thank you for performing these tests. Can you instruct me on how to run a Centrifuge benchmark?
Pipelining indeed has an inherent delay in Dragonfly, because each request is dispatched to possibly another thread, and the connection then waits for it to finish before dispatching the next one. Having said that, I would like to see whether we have some unexpected bottleneck in this use case.

@FZambia
Member Author

FZambia commented Jun 23, 2024

Yep - let me prepare something suitable for reproducing various scenarios in a convenient way; right now it's not trivial.

But a benchmark that just uses a single connection with pipelining can be run like this after cloning the repository:

docker compose up redis dragonflydb

Redis bench (uses 6379 port):

go test -run xxx -bench BenchmarkRedisAddPresence_ManyCh/rd_single -benchmem -tags integration

Same but with Dragonfly (uses different port - 7379):

go test -run xxx -bench BenchmarkRedisAddPresence_ManyCh/df_single -benchmem -tags integration

Go 1.21 or higher should be installed. These benches run many ops in parallel; all operations are then collected into a single pipeline.

I'll try to find and implement a simple way to run benches that use several pipelines instead of one, to quickly experiment with how it scales when adding more connections.
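
As a rough illustration of the "several pipelines instead of one" idea, the Go sketch below spreads operations from many goroutines over a small pool in round-robin order. It is not the Centrifuge implementation: `pipeline`, `pool`, `newPool` and `runParallel` are hypothetical names, and `pipeline.do` only counts operations where a real client would write to its own connection.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// pipeline is a hypothetical stand-in for one connection with its own
// command pipeline. Here it only counts the operations it receives.
type pipeline struct {
	mu  sync.Mutex
	ops int
}

func (p *pipeline) do(op string) {
	p.mu.Lock()
	p.ops++ // a real client would queue op on its connection here
	p.mu.Unlock()
}

// pool spreads operations over several pipelines in round-robin order.
type pool struct {
	pipes []*pipeline
	next  atomic.Uint64
}

func newPool(n int) *pool {
	p := &pool{pipes: make([]*pipeline, n)}
	for i := range p.pipes {
		p.pipes[i] = &pipeline{}
	}
	return p
}

// do picks the next pipeline using an atomic counter, so concurrent
// callers never need a shared lock just to choose a connection.
func (p *pool) do(op string) {
	i := p.next.Add(1) - 1
	p.pipes[i%uint64(len(p.pipes))].do(op)
}

// counts returns how many operations each pipeline has received.
func (p *pool) counts() []int {
	out := make([]int, len(p.pipes))
	for i, pl := range p.pipes {
		pl.mu.Lock()
		out[i] = pl.ops
		pl.mu.Unlock()
	}
	return out
}

// runParallel issues opsPerWorker operations from each of workers goroutines.
func runParallel(p *pool, workers, opsPerWorker int) {
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < opsPerWorker; i++ {
				p.do("presence_add")
			}
		}()
	}
	wg.Wait()
}

func main() {
	p := newPool(4)
	runParallel(p, 8, 100)
	fmt.Println("ops per pipeline:", p.counts())
}
```

Varying the argument to `newPool` is the knob such a benchmark would turn to see how throughput changes with the number of connections.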

@romange

romange commented Jun 23, 2024

I will check it out, thanks!

Does Centrifuge usually open a single upstream connection from a single Centrifuge process?
DF becomes more efficient when multiple connections "talk" to it.

@FZambia
Member Author

FZambia commented Jun 23, 2024

Usually yes, it uses a single pipeline. But that's what I was talking about in the comments above: I tried to run multiple connections with separate pipelines and so far could not achieve good results. I only had super hacky bash scripts to experiment with, so I will try to write a cleaner Go bench with more connections.

@FZambia
Member Author

FZambia commented Dec 21, 2024

This seems related to centrifugal/centrifugo#925. It has nothing to do with DragonflyDB and is caused by a deadlock in Centrifugo code which blocks the Redis client read loop.

@FZambia
Member Author

FZambia commented Dec 25, 2024

The fix was shipped in Centrifuge v0.33.5. I am 99% sure the cause is the same here, so I'm closing this. It may be re-opened if more cases appear in newer versions.

FZambia closed this as completed Dec 25, 2024

2 participants