Query performance drops as one loki backend spins out of control #14312
Comments
@ashwanthgoli do you have an idea what might be happening here? Just want to see if we can unstick our Grafana Champion. Many thanks in advance!
@Starefossen we fixed an out of memory issue with a fix that was released with Loki 3.2. Have you tried upgrading? Wondering if this is still causing problems for you?
Thanks for the comment @JStickler, we upgraded to Loki 3.2.0 a week ago and we still have about the same frequency of problems where the backend needs to be restarted. Here is the memory usage for the last 7 days:
@JStickler is it correct that only one backend is processing at any given time? The same goes for the CPU graphs: there is only ever one backend consuming CPU, the other ones look more or less idle (distributed mode).
+1 for version 3.0.0 (Helm chart 6.6.4)
Our current solution is to set a lower memory limit and have Kubernetes restart the backend automatically when it goes over. This seems to fix the query performance issues, but it is not a particularly satisfying solution to this issue.
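Roughly, that workaround looks like the following (a sketch only; the `backend.resources` keys and the memory figures are assumptions based on the grafana/loki chart defaults and should be adjusted to your own sizing):

```bash
# Sketch of the workaround: cap the backend's memory so the kubelet
# restarts the pod (OOMKilled) before query performance degrades.
# Release/namespace names and the 2Gi/4Gi figures are illustrative.
helm upgrade loki grafana/loki \
  --namespace loki \
  --reuse-values \
  --set backend.resources.requests.memory=2Gi \
  --set backend.resources.limits.memory=4Gi
```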
Hi @Starefossen, I spoke with the Loki team again on this today. @chaudum mentioned it would be helpful if we could pull a profile to better understand what might be happening during a memory spike.
I have upgraded most of our Loki clusters to v3.3.0 and will report back if that makes any difference to the problem. If not, I will upload a profile to better understand what is happening.
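For reference, a heap profile can usually be pulled from a backend pod through Loki's built-in Go pprof endpoint, roughly like the sketch below (namespace, pod name, and port 3100 are assumptions matching a default chart install):

```bash
# Sketch: grab a heap profile from one backend pod via the pprof endpoint
# served on Loki's HTTP port, then inspect it or attach it to the issue.
kubectl -n loki port-forward pod/loki-backend-0 3100:3100 &
curl -s http://localhost:3100/debug/pprof/heap -o backend-heap.pprof
go tool pprof -top backend-heap.pprof   # inspect locally, or upload the file
```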
Describe the bug
Sometimes (I have not found a clear interval; it can happen in the middle of the night even with a low volume of traffic) we get paged that `loki_api_v1_query_range` is experiencing an abnormal p99 latency of several seconds. From the reader's point of view we can see a lot of errors related to `reached tail max duration limit` and connection timeouts. From the backend's point of view there are no logs, and only one backend seems to be doing all the work; I am not really sure what it is doing.
Restarting the backend (`kubectl rollout restart sts/loki-backend`) has a clear effect on this issue, bringing memory usage way down and CPU usage back to the levels we see when things are working normally, and the alert clears out.
Zooming out over a seven-day period we can see a repeating trend in memory usage for the Loki backend, which leads us to suspect that this might be a memory leak of some kind:
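A quick way to watch which replica keeps growing is a sketch like this (namespace and label selector are assumptions matching the default chart labels):

```bash
# Sketch: watch per-pod memory to see which backend replica is growing.
watch -n 60 "kubectl -n loki top pod -l app.kubernetes.io/component=backend --containers"
```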
To Reproduce
Steps to reproduce the behavior:
Install Loki v3.1.0 with the following helm values:
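As a minimal sketch of that step (assuming the standard grafana/loki chart; `values.yaml` stands in for the Helm values referenced above):

```bash
# Sketch of the install step; values.yaml holds the chart configuration
# (simple scalable mode with a separate backend target).
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install loki grafana/loki \
  --namespace loki --create-namespace \
  -f values.yaml
```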
Expected behavior
Query performance should be tied more to usage and less to how long Loki has been running.
Environment:
Loki config