Query performance drops as one loki backend spins out of control #14312

Open
Starefossen opened this issue Sep 30, 2024 · 9 comments
Labels
3.0 type/bug Somehing is not working as expected

Comments

@Starefossen
Contributor

Starefossen commented Sep 30, 2024

Describe the bug

Sometimes, at no clear interval and even in the middle of the night when traffic volume is low, we get paged that loki_api_v1_query_range is experiencing an abnormally high p99 latency of several seconds.

example/loki-read loki_api_v1_query_range is experiencing 20.07s 95th percentile latency.

From the reader's point of view, we see a lot of errors about reaching the tail max duration limit, along with connection timeouts.

Image

From the backend's point of view there are no logs, and only one backend seems to be doing all the work. We are not really sure what it is doing.

Image

Restarting the backend (kubectl rollout restart sts/loki-backend) has a clear effect on this issue: memory usage drops sharply and CPU usage returns to the levels we see when things are working normally:

Image

The alert then clears.

Zooming out to a seven-day period, we can see a repeating trend in memory usage for the loki backend, which leads us to suspect a memory leak of some kind:

Image
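
One way to put a number on the trend described above is to fit a straight line to the backend's memory samples and look at the slope. The following is only a minimal sketch: the function names, the sample data, and the threshold are hypothetical and not taken from this issue. Steady growth that is independent of traffic would show up as a positive slope.

```python
from statistics import mean


def growth_rate(samples):
    """Least-squares slope of (timestamp_seconds, memory_bytes) samples, in bytes/second."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    t_mean = mean(t for t, _ in samples)
    m_mean = mean(m for _, m in samples)
    num = sum((t - t_mean) * (m - m_mean) for t, m in samples)
    den = sum((t - t_mean) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("timestamps must not all be equal")
    return num / den


def looks_like_leak(samples, min_bytes_per_hour):
    """True if memory grows by at least min_bytes_per_hour on average (assumed threshold)."""
    return growth_rate(samples) * 3600 >= min_bytes_per_hour


# Hypothetical data: one sample per hour, growing by 100 MiB each hour.
hourly = [(h * 3600, 10**9 + h * 100 * 2**20) for h in range(24)]
print(looks_like_leak(hourly, min_bytes_per_hour=50 * 2**20))
```

The samples would normally come from whatever metrics store backs the graphs above. A slope alone cannot distinguish a leak from a slowly filling cache, so it is a hint rather than a diagnosis.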

To Reproduce
Steps to reproduce the behavior:

Install Loki v3.1.0 with the following helm values:

loki:
  backend:
    resources:
      limits:
        memory: 3Gi
      requests:
        cpu: "2"
        memory: 3Gi
  chunksCache:
    allocatedMemory: 32768
  ingress:
    hosts:
    - loki.example.com
  loki:
    storage:
      bucketNames:
        chunks: loki-storage-example
        ruler: loki-ruler-example
  read:
    resources:
      limits:
        memory: 30Gi
      requests:
        cpu: "6"
        memory: 30Gi
  serviceAccount:
    annotations:
      iam.gke.io/gcp-service-account: [email protected]
  write:
    resources:
      limits:
        memory: 6Gi
      requests:
        cpu: "2"
        memory: 6Gi

Expected behavior

We expect query performance to be tied to usage rather than to how long Loki has been running.

Environment:

  • Infrastructure: gke
  • Deployment tool: helm

Loki config

apiVersion: v1
data:
  config.yaml: |2

    auth_enabled: false
    bloom_build:
      builder:
        planner_address: loki-backend-headless.example.svc.cluster.local:9095
      enabled: false
    bloom_gateway:
      client:
        addresses: dnssrvnoa+_grpc._tcp.loki-backend-headless.example.svc.cluster.local
      enabled: false
    chunk_store_config:
      chunk_cache_config:
        background:
          writeback_buffer: 500000
          writeback_goroutines: 1
          writeback_size_limit: 500MB
        default_validity: 0s
        memcached:
          batch_size: 4
          parallelism: 5
        memcached_client:
          addresses: dnssrvnoa+_memcached-client._tcp.loki-chunks-cache.example.svc
          consistent_hash: true
          max_idle_conns: 72
          timeout: 2000ms
    common:
      compactor_address: 'http://loki-backend:3100'
      path_prefix: /var/loki
      replication_factor: 3
      storage:
        gcs:
          bucket_name: loki-storage-nav-903dd5c3
          chunk_buffer_size: 0
          enable_http2: true
          request_timeout: 0s
    compactor:
      compaction_interval: 1m
      delete_request_cancel_period: 5m
      delete_request_store: gcs
      retention_delete_delay: 30m
      retention_delete_worker_count: 500
      retention_enabled: true
    frontend:
      scheduler_address: ""
      tail_proxy_url: ""
    frontend_worker:
      scheduler_address: ""
    index_gateway:
      mode: simple
    limits_config:
      allow_structured_metadata: false
      cardinality_limit: 200000
      deletion_mode: filter-and-delete
      ingestion_burst_size_mb: 1000
      ingestion_rate_mb: 10000
      max_cache_freshness_per_query: 10m
      max_concurrent_tail_requests: 100
      max_entries_limit_per_query: 1000000
      max_label_name_length: 10240
      max_label_names_per_series: 300
      max_label_value_length: 20480
      max_query_parallelism: 100
      max_query_series: 5000
      per_stream_rate_limit: 512M
      per_stream_rate_limit_burst: 1024M
      query_timeout: 300s
      reject_old_samples: true
      reject_old_samples_max_age: 168h
      split_queries_by_interval: 15m
      volume_enabled: true
    memberlist:
      join_members:
      - loki-memberlist
    pattern_ingester:
      enabled: true
    querier:
      max_concurrent: 256
    query_range:
      align_queries_with_step: true
      cache_results: true
      results_cache:
        cache:
          background:
            writeback_buffer: 500000
            writeback_goroutines: 1
            writeback_size_limit: 500MB
          default_validity: 12h
          memcached_client:
            addresses: dnssrvnoa+_memcached-client._tcp.loki-results-cache.ny-namespace.svc
            consistent_hash: true
            timeout: 500ms
            update_interval: 1m
    query_scheduler:
      max_outstanding_requests_per_tenant: 200000
    ruler:
      storage:
        gcs:
          bucket_name: loki-ruler-nav-903dd5c3
          chunk_buffer_size: 0
          enable_http2: true
          request_timeout: 0s
        type: gcs
    runtime_config:
      file: /etc/loki/runtime-config/runtime-config.yaml
    schema_config:
      configs:
      - from: "2024-01-01"
        index:
          period: 24h
          prefix: index_
        object_store: gcs
        schema: v12
        store: tsdb
      - from: "2024-09-12"
        index:
          period: 24h
          prefix: index_
        object_store: gcs
        schema: v13
        store: tsdb
    server:
      grpc_listen_port: 9095
      grpc_server_max_recv_msg_size: 8388608
      grpc_server_max_send_msg_size: 8388608
      http_listen_port: 3100
      http_server_read_timeout: 600s
      http_server_write_timeout: 600s
      log_level: warn
    storage_config:
      bloom_shipper:
        working_directory: /var/loki/data/bloomshipper
      boltdb_shipper:
        index_gateway_client:
          server_address: dns+loki-backend-headless.example.svc.cluster.local:9095
      hedging:
        at: 250ms
        max_per_second: 20
        up_to: 3
      tsdb_shipper:
        index_gateway_client:
          server_address: dns+loki-backend-headless.example.svc.cluster.local:9095
    table_manager:
      retention_deletes_enabled: true
      retention_period: 90d
    tracing:
      enabled: false

@Jayclifford345
Contributor

@ashwanthgoli do you have an idea what might be happening here? Just want to see if we can unstick our Grafana Champion. Many thanks in advance!

@JStickler
Contributor

@Starefossen we fixed an out-of-memory issue, and the fix was released with Loki 3.2. Have you tried upgrading? I'm wondering whether this is still causing problems for you.

@JStickler JStickler added the type/bug Somehing is not working as expected label Oct 30, 2024
@Starefossen
Contributor Author

Starefossen commented Nov 4, 2024

Thanks for the comment @JStickler. We upgraded to Loki 3.2.0 a week ago and we still have about the same number of problems that require restarting the backend. Here is the memory usage for the last 7 days:

Image

@Starefossen
Contributor Author

@JStickler is it correct that only one backend is processing at any given time? The same goes for the CPU graphs: only one backend ever consumes CPU, and the others look more or less idle (distributed mode).

@seanocca

seanocca commented Nov 4, 2024

+1 for version 3.0.0 (Helm chart 6.6.4)
I have autoscaling enabled for the backend and it is definitely scaling on CPU/MEM, but all the load is on 2 pods.

@Starefossen
Contributor Author

Image

Still have to restart the backend on a regular basis due to this issue.

@Starefossen
Contributor Author

Our current workaround has been to set a lower memory limit and let Kubernetes restart the backend automatically when it goes over. This seems to fix the query performance issues, but it is not a particularly satisfying solution to this issue.

@Jayclifford345
Contributor

Hi @Starefossen, I spoke with the Loki team again on this today. @chaudum mentioned it would be helpful if we could pull a Profile to better understand what might be happening during a memory spike.
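
For anyone following along, pulling a heap profile could look roughly like the sketch below. It assumes the backend serves Go's standard pprof endpoints under /debug/pprof on the HTTP port (3100 in the config above) and that the pod is reachable, for example through a port-forward. The helper names are hypothetical, and the endpoint should be confirmed against your own deployment.

```python
import urllib.request


def heap_profile_url(host, port=3100, path="/debug/pprof/heap"):
    """Build the URL of the (assumed) Go pprof heap endpoint on a Loki pod."""
    return f"http://{host}:{port}{path}"


def save_heap_profile(host, out_path, port=3100, timeout=30.0):
    """Download the heap profile and write it to out_path; returns the byte count."""
    with urllib.request.urlopen(heap_profile_url(host, port), timeout=timeout) as resp:
        data = resp.read()
    with open(out_path, "wb") as f:
        f.write(data)
    return len(data)
```

With a port-forward to the busy backend pod in place, something like save_heap_profile("localhost", "backend-heap.pprof") would write a file that can be inspected with go tool pprof. Capturing one profile shortly after a restart and another during a memory spike makes the two easier to compare.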

@Starefossen
Contributor Author

I have upgraded most of our Loki clusters to v3.3.0 and will report back on whether that makes any difference to the problem. If not, I will upload a profile so we can better understand what is happening.
