Loki on leopard is a mess #3813

Open
QuantumEnigmaa opened this issue Jan 6, 2025 · 7 comments

QuantumEnigmaa commented Jan 6, 2025

Loki on leopard is having major issues on several fronts. There seems to be a storage issue related to the NFS storage system, but more importantly, the loki-write pods are unable to enter the Ready state and are logging the following messages:

level=warn ts=2024-12-21T17:36:42.356823343Z caller=lifecycler.go:295 component=pattern-ingester msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=pattern-ingester err="instance 100.96.1.164:9095 past heartbeat timeout"

In an attempt to solve this, I tried increasing the heartbeat timeout as well as enabling the autoforget feature for unhealthy ring instances in the config:

    loki:
      loki:
        ingester:
          autoforget_unhealthy: true
          lifecycler:
            ring:
              heartbeat_timeout: 10m
            heartbeat_timeout: 10m

Unfortunately, this didn't change anything. I also made sure this wasn't a network issue.

For more details concerning this, head over to the related incident: #inc-2024-12-18-leopard-promtailrequestserrors.

@QuantumEnigmaa

It appears that the issue described above, which had been going on for more than 2 weeks, is now gone and the loki-write pods are entering the Ready state. However, it has been brought to our attention that the pods are still failing from time to time with logs such as:

level=warn ts=2025-01-06T20:43:41.662248739Z caller=logging.go:128 orgID=production msg="POST /loki/api/v1/push (500) 5.00113882s Response: \"rpc error: code = DeadlineExceeded desc = context deadline exceeded\\n\" ws: false; Accept-Encoding: gzip; Connection: close; Content-Length: 392; Content-Type: application/x-protobuf; User-Agent: promtail/2.9.3; X-Forwarded-Host: loki-gateway.loki.svc.cluster.local; X-Forwarded-Port: 443; X-Forwarded-Scheme: https; X-Real-Ip: 100.96.1.76; X-Request-Id: 315cf677a53b09875002d98dd70c8479; X-Scheme: https; X-Scope-Orgid: production; "

In any case, there seems to be some kind of network issue in this installation that is not Kubernetes-dependent.
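
As a side note, the ~5s duration in the 500 responses above lines up with Loki's ingester_client remote_timeout, which defaults to 5s, so the write path may be hitting that gRPC deadline rather than a pure network failure. A minimal sketch of raising it, assuming the same nesting as the config snippet earlier in this issue (the 10s value is just an example):

    loki:
      loki:
        ingester_client:
          # gRPC deadline for distributor -> ingester push requests.
          # Upstream default is 5s, which matches the ~5.001s durations
          # in the 500 responses above. 10s is an arbitrary example value.
          remote_timeout: 10s

Raising the deadline would only mask slow ingesters, so it is mainly useful to confirm whether that timeout is the limiting factor.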

@hervenicol

Here is what we discussed during alerts review on Tuesday:

  • it looks like infrastructure issues (S3 timeouts, NFS timeouts, network timeouts…)
  • we should check if mimir has similar issues
  • then set a meeting with rocket to discuss it

I investigated the following 3 error logs:

level=error ts=2024-12-18T09:41:29.304362565Z caller=retention.go:311 msg="error deleting chunk" chunkID=production/9f1da3a40ad5fbb1/193774f9de5:193774fd052:fbe57f0a err="RequestCanceled: request context canceled\ncaused by: context canceled"

=> 2 similar logs in the past 24h, all from loki.

level=error ts=2024-12-18T05:36:23.606728197Z caller=table.go:140 table-name=loki_index_20074 msg="failed to remove working directory /var/loki/compactor/loki_index_20074" err="unlinkat /var/loki/compactor/loki_index_20074/testing/.nfs0000000037f1015700000d41: device or resource busy"

=> 134 similar logs in the past 24h. All of them from loki-backend, mostly compactor (1 for tsdb-shipper-cache).

This looks related to the .nfs files, not to overall block storage availability, so it is probably irrelevant (apart from the fact that it may prevent the compactor from deleting some directories).

level=error ts=2024-12-18T01:44:22.107179082Z caller=retention.go:311 msg="error deleting chunk" chunkID=production/965798fe412674d8/19389b5bfb2:19389b60041:43653dc8 err="RequestError: send request failed\ncaused by: Delete \"https://satwo-prod-m2m-objectstore.telekom.de/panamax-satwo-leopard-loki/production/965798fe412674d8/19389b5bfb2%3A19389b60041%3A43653dc8\": EOF"

=> 3 similar logs over the past 24h, all from loki.

I didn't find any mimir-related logs showing similar errors.

@hervenicol

However, this query shows lots of mimir errors:

{cluster_id="leopard", scrape_job="kubernetes-pods", instance=~"mimir.*"} | detected_level!=`info` != "failed to fetch items from memcached" !~ "200.*POST .*api/v1/" != `200 "GET`

Here are some example errors:

  • context deadline exceeded (distributors)
  • Connection reset by peer (memcached)
  • HTTP 500 (gateway)
  • dial tcp 100.96.2.31:7946: i/o timeout
  • closing ingester client stream failed

@hervenicol

Well, one really good reason why promtail was not working properly is that it had been replaced by alloy but not properly deleted (as per https://intranet.giantswarm.io/docs/support-and-ops/ops-recipes/promtail/#check-that-it-does-not-conflict-with-alloy-logs).

I deleted the promtail chart for the MC (leopard) and the production WC. It was already deleted for the other WCs.

@hervenicol

I also did some cleanup in loki's user-values:

  • deleted some config that was already set in shared-configs and panamax-configs (auto-forget indexers and data retention).
  • limited the write HPA to 10 replicas max (because there aren't enough resources for more, and the nominal situation works with 8 replicas).

Let's see how it goes from here.

@hervenicol

We've had some alerts during the weekend (#inc-2025-01-18-expired-heartbeat-for-installation-leopard-affecting-observa), so let's tune it a bit tighter.

loki-write scales up and down quite a lot:
Image

So I made 2 changes to it (via loki-user-values for now):

  • set targetCPUUtilizationPercentage to 90% instead of 60%, so it scales up less easily
  • set maxReplicas down to 7, as when it reaches high values it prevents other pods (like mimir) from starting, causing alerts.

Let's see if it improves the situation.
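
For reference, a minimal sketch of what these two overrides could look like in loki-user-values, assuming the upstream grafana/loki chart's write.autoscaling keys and the same top-level nesting as the config earlier in this issue:

    loki:
      write:
        autoscaling:
          # Scale up less eagerly than with the previous 60% CPU target.
          targetCPUUtilizationPercentage: 90
          # Cap replicas so loki-write cannot starve other workloads (like mimir) of resources.
          maxReplicas: 7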

@hervenicol hervenicol self-assigned this Jan 20, 2025
@QuantumEnigmaa QuantumEnigmaa removed their assignment Jan 21, 2025

hervenicol commented Jan 22, 2025

Alerts firing last night

LogForwardingErrors

The alert query shows a clear burst of errors for around 45min
Image

There actually is an increase in error rate, from all services / clusters (sum by (service,pod,cluster_id) (irate(loki_write_request_duration_seconds_count{status_code!~"2.."}[5m:])))
Image

...but there is also a decrease in log volume at the same time. The more stable lines (green and yellow) are grafana-agent logs (sum by (service,pod,cluster_id) (irate(loki_write_request_duration_seconds_count{}[5m:])))
Image

Heartbeat

This did not happen at the same time as the LogForwardingErrors, so it is probably not directly related.
Some Mimir components restarted and did not manage to reschedule, but I don't know which ones or why they were rescheduled.

I don't know if it's HPA or VPA, so I looked at changes in CPU and memory requests summed by container:

  • memory (sum(cluster:namespace:pod_memory:active:kube_pod_container_resource_requests{cluster_id="leopard", namespace="mimir"}) by (container))
    Image
    => we can see the incident as there's a gap
    => the ingesters got scaled down before the incident, and scaled up after, but seem to remain at the same size around the incident.
    => the distributor increased at the moment of the incident. This one scales up horizontally via HPA. It increased from 21.5 to 26.8GB (around +25%)

  • cpu (sum(cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests{cluster_id="leopard", namespace="mimir", pod=~"mimir.*"}) by (container))
    Image
    => distributor (HPA) increased from 1.6 to 2 CPU cores (+25%)
    => querier decreased briefly

Action

The distributor was scaling up from 8 to 10 replicas.
I'll limit it to 8 replicas, which should hopefully be enough. Then let's see.
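
How the distributor autoscaling is wired up on this installation isn't shown here, but in plain Kubernetes terms the intended change amounts to capping the HPA at 8 replicas, roughly like the sketch below (resource names and the CPU target are hypothetical):

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: mimir-distributor      # hypothetical name
      namespace: mimir
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: mimir-distributor    # hypothetical target
      # Cap at 8 replicas instead of letting the distributor grow to 10.
      maxReplicas: 8
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 60   # placeholder; keep whatever target is currently configured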
