Query performance drops as one loki backend spins out of control #14312
Comments
@ashwanthgoli do you have an idea what might be happening here? Just want to see if we can unstick our Grafana Champion. Many thanks in advance!
@Starefossen we fixed an out of memory issue with a fix that was released with Loki 3.2. Have you tried upgrading? Wondering if this is still causing problems for you?
Thanks for the comment @JStickler, we upgraded to Loki 3.2.0 a week ago and we still have about the same frequency of problems where the backend needs to be restarted. Here is the memory usage for the last 7 days:
@JStickler is it correct that only one backend is processing at any given time? The same goes for the CPU graphs: there is only ever one backend consuming CPU, the other ones look more or less idle (distributed mode).
+1 for version 3.0.0 (Helm chart 6.6.4)
Our current solution is to set a lower memory limit and have Kubernetes restart the backend automatically when it goes over. This seems to fix the query performance issues, but it is not a particularly satisfying solution to this issue.
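Roughly, that workaround looks like the following (a sketch only; the `backend.resources` keys and the memory figures are assumptions based on the grafana/loki chart defaults and should be adjusted to your own sizing):

```bash
# Sketch of the workaround: cap the backend's memory so the kubelet
# restarts the pod (OOMKilled) before query performance degrades.
# Release/namespace names and the 2Gi/4Gi figures are illustrative.
helm upgrade loki grafana/loki \
  --namespace loki \
  --reuse-values \
  --set backend.resources.requests.memory=2Gi \
  --set backend.resources.limits.memory=4Gi
```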
Hi @Starefossen, I spoke with the Loki team again on this today. @chaudum mentioned it would be helpful if we could pull a profile to better understand what might be happening during a memory spike.
I have upgraded most of our Loki clusters to v3.3.0 and will report back if that makes any difference to the problem. If not, I will upload a profile to better understand what is happening.
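For reference, a heap profile can usually be pulled from a backend pod through Loki's built-in Go pprof endpoint, roughly like the sketch below (namespace, pod name, and port 3100 are assumptions matching a default chart install):

```bash
# Sketch: grab a heap profile from one backend pod via the pprof endpoint
# served on Loki's HTTP port, then inspect it or attach it to the issue.
kubectl -n loki port-forward pod/loki-backend-0 3100:3100 &
curl -s http://localhost:3100/debug/pprof/heap -o backend-heap.pprof
go tool pprof -top backend-heap.pprof   # inspect locally, or upload the file
```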
Describe the bug
Sometimes (I have not found a clear interval; it can happen in the middle of the night even with a low volume of traffic) we get paged that `loki_api_v1_query_range` is experiencing an abnormal p99 latency of several seconds. From the reader's point of view we can see a lot of errors related to `reached tail max duration limit` and connection timeouts. From the backend's point of view there are no logs, and only one backend seems to be doing all the work; I am not really sure what it is doing.
Restarting the backend (`kubectl rollout restart sts/loki-backend`) has a clear effect on this issue, bringing memory usage way down and CPU usage back to the levels we see when things are working normally, and the alert clears out.
Zooming out over a seven-day period we can see a repeating trend in memory usage for the Loki backend, which leads us to suspect that this might be a memory leak of some kind:
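A quick way to watch which replica keeps growing is a sketch like this (namespace and label selector are assumptions matching the default chart labels):

```bash
# Sketch: watch per-pod memory to see which backend replica is growing.
watch -n 60 "kubectl -n loki top pod -l app.kubernetes.io/component=backend --containers"
```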
To Reproduce
Steps to reproduce the behavior:
Install Loki v3.1.0 with the following helm values:
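As a minimal sketch of that step (assuming the standard grafana/loki chart; `values.yaml` stands in for the Helm values referenced above):

```bash
# Sketch of the install step; values.yaml holds the chart configuration
# (simple scalable mode with a separate backend target).
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install loki grafana/loki \
  --namespace loki --create-namespace \
  -f values.yaml
```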
Expected behavior
Query performance should be tied more to usage and less to how long Loki has been running.
Environment:
Loki config