Loki on leopard is a mess #3813
Comments
It appears that the issue described above, which has been lasting for more than 2 weeks, is now irrelevant and the
In any case, there seems to be some kind of network issue in this installation that is not Kubernetes dependent.
Here is what we discussed during alerts review on Tuesday:
I investigated the following 3 error logs:
=> 2 similar logs in the past 24h, all from loki.
=> 134 similar logs in the past 24h. All of them from loki-backend, mostly compactor (1 for tsdb-shipper-cache). Looks related to the
=> 3 similar logs over the past 24h, all from loki. I didn't find any mimir-related logs showing similar errors.
However, this query shows lots of mimir errors:
Here are some example errors:
Well, one really good reason why promtail was not working properly is that it has been replaced by alloy but not properly deleted (as per https://intranet.giantswarm.io/docs/support-and-ops/ops-recipes/promtail/#check-that-it-does-not-conflict-with-alloy-logs). I deleted the promtail chart for the MC (
I also did some cleanup in loki's user-values:
Let's see how it goes from here.
We've had some alerts during the weekend (#inc-2025-01-18-expired-heartbeat-for-installation-leopard-affecting-observa), so let's tune it a bit tighter. loki-write scales up and down quite a lot, so I made 2 changes to it (via
Let's see if it improves the situation.
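The exact changes aren't captured above; as a rough, hypothetical illustration (assuming the upstream grafana/loki Helm chart's write.autoscaling values, which may differ from this installation's app), stabilising loki-write scaling could look something like:

```yaml
# Hypothetical sketch only; key names assume the upstream grafana/loki Helm chart.
write:
  autoscaling:
    enabled: true
    minReplicas: 3                        # raise the floor so scale-down doesn't go too deep
    maxReplicas: 6                        # cap the churn on load spikes
    targetCPUUtilizationPercentage: 80    # scale out less aggressively
```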
Loki on leopard is having big issues on several topics. There seems to be a storage issue related to the NFS storage system but, more importantly, the loki-write pods are unable to enter the Ready state and are logging the following messages:
In an attempt to solve this, I tried increasing the heartbeat timeout as well as enabling the autoforget feature for unhealthy ring instances in the config:
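The exact snippet isn't preserved here; as a rough sketch of those two settings (assuming Loki's standard ingester configuration keys), it would look something like:

```yaml
# Rough sketch, not the exact config used on leopard.
ingester:
  autoforget_unhealthy: true     # forget ring members that stay unhealthy past the heartbeat timeout
  lifecycler:
    ring:
      heartbeat_timeout: 10m     # raised from the default to give slow instances more leeway
```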
Unfortunately, this didn't change anything. I also made sure this wasn't a network issue.
For more details concerning this, head over to the related incident: #inc-2024-12-18-leopard-promtailrequestserrors.