Loki on leopard is a mess #3813

Open
QuantumEnigmaa opened this issue Jan 6, 2025 · 7 comments

QuantumEnigmaa commented Jan 6, 2025

Loki on leopard is having major issues on several fronts. There seems to be a storage issue related to the NFS storage system, but more importantly, the loki-write pods are unable to enter the Ready state and are logging the following messages:

level=warn ts=2024-12-21T17:36:42.356823343Z caller=lifecycler.go:295 component=pattern-ingester msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=pattern-ingester err="instance 100.96.1.164:9095 past heartbeat timeout"

In an attempt to solve this, I tried increasing the heartbeat timeout as well as enabling the autoforget feature for unhealthy ring instances in the config:

    loki:
      loki:
        ingester:
          autoforget_unhealthy: true
          lifecycler:
            ring:
              heartbeat_timeout: 10m
            heartbeat_timeout: 10m

Unfortunately, this didn't change anything. I also made sure this wasn't a network issue.

For more details concerning this, head over to the related incident: #inc-2024-12-18-leopard-promtailrequestserrors.

@QuantumEnigmaa

It appears that the issue described above, which had been going on for more than 2 weeks, is now gone and the loki-write pods are entering the Ready state. However, it has been brought to our attention that the pods are still failing from time to time with logs such as:

level=warn ts=2025-01-06T20:43:41.662248739Z caller=logging.go:128 orgID=production msg="POST /loki/api/v1/push (500) 5.00113882s Response: \"rpc error: code = DeadlineExceeded desc = context deadline exceeded\\n\" ws: false; Accept-Encoding: gzip; Connection: close; Content-Length: 392; Content-Type: application/x-protobuf; User-Agent: promtail/2.9.3; X-Forwarded-Host: loki-gateway.loki.svc.cluster.local; X-Forwarded-Port: 443; X-Forwarded-Scheme: https; X-Real-Ip: 100.96.1.76; X-Request-Id: 315cf677a53b09875002d98dd70c8479; X-Scheme: https; X-Scope-Orgid: production; "

In any case, there seems to be some kind of network issue in this installation that is not Kubernetes-dependent.
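
As a side note, the ~5s duration in the 500 responses above lines up with Loki's ingester_client remote_timeout, which defaults to 5s, so the write path may be hitting that gRPC deadline rather than a pure network failure. A minimal sketch of raising it, assuming the same nesting as the config snippet earlier in this issue (the 10s value is just an example):

    loki:
      loki:
        ingester_client:
          # gRPC deadline for distributor -> ingester push requests.
          # Upstream default is 5s, which matches the ~5.001s durations
          # in the 500 responses above. 10s is an arbitrary example value.
          remote_timeout: 10s

Raising the deadline would only mask slow ingesters, so it is mainly useful to confirm whether that timeout is the limiting factor.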

@hervenicol

Here is what we discussed during alerts review on Tuesday:

  • it looks like infrastructure issues (S3 timeouts, NFS timeouts, network timeouts…)
  • we should check if mimir has similar issues
  • then set a meeting with rocket to discuss it

I investigated the following 3 error logs:

level=error ts=2024-12-18T09:41:29.304362565Z caller=retention.go:311 msg="error deleting chunk" chunkID=production/9f1da3a40ad5fbb1/193774f9de5:193774fd052:fbe57f0a err="RequestCanceled: request context canceled\ncaused by: context canceled"

=> 2 similar logs in the past 24h, all from loki.

level=error ts=2024-12-18T05:36:23.606728197Z caller=table.go:140 table-name=loki_index_20074 msg="failed to remove working directory /var/loki/compactor/loki_index_20074" err="unlinkat /var/loki/compactor/loki_index_20074/testing/.nfs0000000037f1015700000d41: device or resource busy"

=> 134 similar logs in the past 24h. All of them from loki-backend, mostly compactor (1 for tsdb-shipper-cache).

This looks related to the .nfs files, not to overall block storage availability, so it is probably irrelevant (apart from the fact that it may prevent the compactor from deleting some directories).

level=error ts=2024-12-18T01:44:22.107179082Z caller=retention.go:311 msg="error deleting chunk" chunkID=production/965798fe412674d8/19389b5bfb2:19389b60041:43653dc8 err="RequestError: send request failed\ncaused by: Delete \"https://satwo-prod-m2m-objectstore.telekom.de/panamax-satwo-leopard-loki/production/965798fe412674d8/19389b5bfb2%3A19389b60041%3A43653dc8\": EOF"

=> 3 similar logs over the past 24h, all from loki.

I didn't find any mimir-related logs showing similar errors.

@hervenicol

However, this query shows lots of mimir errors:

{cluster_id="leopard", scrape_job="kubernetes-pods", instance=~"mimir.*"} | detected_level!=`info` != "failed to fetch items from memcached" !~ "200.*POST .*api/v1/" != `200 "GET`

Here are some example errors:

  • context deadline exceeded (distributors)
  • Connection reset by peer (memcached)
  • HTTP 500 (gateway)
  • dial tcp 100.96.2.31:7946: i/o timeout
  • closing ingester client stream failed

@hervenicol

Well, one really good reason why promtail was not working properly is that it had been replaced by alloy but not properly deleted (as per https://intranet.giantswarm.io/docs/support-and-ops/ops-recipes/promtail/#check-that-it-does-not-conflict-with-alloy-logs).

I deleted the promtail chart for the MC (leopard) and the production WC. It was already deleted for the other WCs.

@hervenicol

I also did some cleanup in loki's user-values:

  • deleted some config that was already set in shared-configs and panamax-configs (auto-forget indexers and data retention).
  • limited the write HPA to 10 replicas max (because there aren't enough resources for more, and the nominal situation works with 8 replicas).

Let's see how it goes from here.

@hervenicol

We've had some alerts during the weekend (#inc-2025-01-18-expired-heartbeat-for-installation-leopard-affecting-observa), so let's tune it a bit tighter.

loki-write scales up and down quite a lot:
Image

So I made 2 changes to it (via loki-user-values for now):

  • set targetCPUUtilizationPercentage to 90% instead of 60%, so it scales up less easily
  • set maxReplicas down to 7, as when it reaches high values it prevents other pods (like mimir) from starting, causing alerts.

Let's see if it improves the situation.
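
For reference, a minimal sketch of what these two overrides could look like in loki-user-values, assuming the upstream grafana/loki chart's write.autoscaling keys and the same top-level nesting as the config earlier in this issue:

    loki:
      write:
        autoscaling:
          # Scale up less eagerly than with the previous 60% CPU target.
          targetCPUUtilizationPercentage: 90
          # Cap replicas so loki-write cannot starve other workloads (like mimir) of resources.
          maxReplicas: 7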

@hervenicol hervenicol self-assigned this Jan 20, 2025
@QuantumEnigmaa QuantumEnigmaa removed their assignment Jan 21, 2025

hervenicol commented Jan 22, 2025

Alerts firing last night

LogForwardingErrors

The alert query shows a clear burst of errors for around 45min
Image

There actually is an increase in error rate, from all services / clusters (sum by (service,pod,cluster_id) (irate(loki_write_request_duration_seconds_count{status_code!~"2.."}[5m:])))
Image

...but there is also a decrease in log volume at the same time. The more stable lines (green and yellow) are grafana-agent logs (sum by (service,pod,cluster_id) (irate(loki_write_request_duration_seconds_count{}[5m:])))
Image

Heartbeat

This did not happen at the same time as the LogForwardingErrors, so it is probably not directly related.
Some Mimir components restarted and did not manage to reschedule, but I don't know which ones or why they were rescheduled.

I don't know if it's HPA or VPA, so I looked at changes in CPU and memory requests summed by container:

  • memory (sum(cluster:namespace:pod_memory:active:kube_pod_container_resource_requests{cluster_id="leopard", namespace="mimir"}) by (container))
    Image
    => we can see the incident as there's a gap
    => the ingesters got scaled down before the incident, and scaled up after, but seem to remain at the same size around the incident.
    => the distributor increased at the moment of the incident. This one scales up horizontally via HPA. It increased from 21.5 to 26.8GB (around +25%)

  • cpu (sum(cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests{cluster_id="leopard", namespace="mimir", pod=~"mimir.*"}) by (container))
    Image
    => distributor (HPA) increased from 1.6 to 2 CPU cores (+25%)
    => querier decreased briefly

Action

The distributor was scaling up from 8 to 10 replicas.
I'll limit it to 8 replicas, which should hopefully be enough. Then let's see.
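
How the distributor autoscaling is wired up on this installation isn't shown here, but in plain Kubernetes terms the intended change amounts to capping the HPA at 8 replicas, roughly like the sketch below (resource names and the CPU target are hypothetical):

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: mimir-distributor      # hypothetical name
      namespace: mimir
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: mimir-distributor    # hypothetical target
      # Cap at 8 replicas instead of letting the distributor grow to 10.
      maxReplicas: 8
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 60   # placeholder; keep whatever target is currently configured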
