Bloom Builder is seemingly stuck unable to finish any tasks #14283

Open
zarbis opened this issue Sep 26, 2024 · 12 comments

zarbis commented Sep 26, 2024

Describe the bug
I've upgraded Loki to v3.2.0 in hopes that the new planner/builder architecture would finally make blooms work for me.
I can see that blooms somewhat work by observing activity in the loki_blooms_created_total and loki_bloom_gateway_querier_chunks_filtered_total metrics, confirming the write and read paths respectively.

But my problem is: builders never finish crunching... something.
Some graphs to illustrate my point:
I ran a day-long experiment.
I had an HPA with maxReplicas=100 for bloom-builder that was maxed out the whole time until I gave up.
Image

Despite a constant number of BUSY builder replicas, the number of created blooms dropped significantly after an hour or so.
Image

This correlates quite well with the number of inflight tasks from the planner's point of view: it quickly crunches through most of the backlog, but then gets stuck on the last couple percent of tasks.
Image

This is also reflected in the planner logs. Here is a particular line I've noticed:

level=error ts=2024-09-25T09:29:09.612662713Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=1024 tasksSucceed=986 tasksFailed=38
level=error ts=2024-09-25T09:29:09.612673171Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=1024 tasksSucceed=982 tasksFailed=42
level=error ts=2024-09-25T11:07:15.61504996Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=42 tasksSucceed=15 tasksFailed=27
level=error ts=2024-09-25T11:07:15.615066562Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=38 tasksSucceed=12 tasksFailed=26
level=error ts=2024-09-25T12:27:26.153443515Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=20 tasksSucceed=0 tasksFailed=20
level=error ts=2024-09-25T12:27:26.153484978Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=21 tasksSucceed=0 tasksFailed=21
level=error ts=2024-09-25T13:34:14.48146035Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=20 tasksSucceed=0 tasksFailed=20
level=error ts=2024-09-25T13:34:14.481488344Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=16 tasksSucceed=0 tasksFailed=16
level=error ts=2024-09-25T16:38:09.108673794Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=17 tasksSucceed=2 tasksFailed=15
level=error ts=2024-09-25T16:38:09.108700267Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=16 tasksSucceed=1 tasksFailed=15
level=error ts=2024-09-25T17:53:09.826561108Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=14 tasksSucceed=0 tasksFailed=14
level=error ts=2024-09-25T17:53:09.826607695Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=13 tasksSucceed=0 tasksFailed=13
level=error ts=2024-09-25T18:00:15.54654397Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=14 tasksSucceed=0 tasksFailed=14
level=error ts=2024-09-25T18:00:15.546538932Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=13 tasksSucceed=0 tasksFailed=13
level=error ts=2024-09-25T19:06:15.889067817Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=13 tasksSucceed=0 tasksFailed=13
level=error ts=2024-09-25T19:06:15.889094442Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=14 tasksSucceed=0 tasksFailed=14

At some point tasks just cannot succeed anymore and are constantly re-queued:
Image

On the builder side I notice this log line that corresponds to the planner's:

level=error ts=2024-09-25T08:19:50.762194822Z caller=builder.go:147 component=bloom-builder builder_id=7c20b7dd-abf8-404f-a880-9457ee5c7f1d msg="failed to connect and build. Retrying" err="builder loop failed: failed to notify task completion to planner: failed to acknowledge task completion to planner: EOF"

So there seems to be some miscommunication between builder and planner:

  1. planner creates the task
  2. builder processes it, keeps CPU busy and HPA peaking
  3. builder fails to report task to planner
  4. planner re-queues task
  5. I'm in an infinite loop of wasted resources

To Reproduce
If needed, I can provide complete pod logs and Loki configuration.

Relevant part of Helm values.yaml:

loki:
  deploymentMode: Distributed
  loki:
    ### FOR BLOOM BUILDER ###
    containerSecurityContext:
      readOnlyRootFilesystem: false
    image:
      tag: 3.2.0
    ### FOR BLOOM BUILDER ###
    limits_config:
      retention_period: 8760h # 365 days
      max_query_lookback: 8760h # 365 days
      query_timeout: 10m
      ingestion_rate_mb: 30
      ingestion_burst_size_mb: 100
      max_global_streams_per_user: 60000
      split_queries_by_interval: 30m # default = 30m
      tsdb_max_query_parallelism: 1024 # default = 512
      bloom_creation_enabled: true
      bloom_split_series_keyspace_by: 1024
      bloom_gateway_enable_filtering: true
      allow_structured_metadata: true
    storage_config:
      tsdb_shipper:
        active_index_directory: /var/loki/tsdb-index
        cache_location: /var/loki/tsdb-cache
    distributor:
      otlp_config:
        default_resource_attributes_as_index_labels:
        - service.name
        # - service.namespace
        # - service.instance.id
        # - deployment.environment
        # - cloud.region
        # - cloud.availability_zone
        - k8s.cluster.name
        - k8s.namespace.name
        # - k8s.pod.name
        # - k8s.container.name
        # - container.name
        # - k8s.replicaset.name
        - k8s.deployment.name
        - k8s.statefulset.name
        - k8s.daemonset.name
        - k8s.cronjob.name
        # - k8s.job.name
    compactor:
      retention_enabled: true
      delete_request_store: s3
    query_scheduler:
      max_outstanding_requests_per_tenant: 32768 # default = 100
    querier:
      max_concurrent: 16 # default = 10
    server:
      grpc_server_max_recv_msg_size: 16777216 # default = 4194304
      grpc_server_max_send_msg_size: 16777216 # default = 4194304
    # Things that are not yet in main config template
    structuredConfig:
      bloom_build:
        enabled: true
        planner:
          planning_interval: 6h
        builder:
          planner_address: loki-bloom-planner-headless.<namespace>.svc.cluster.local.:9095
      bloom_gateway:
        enabled: true
        client:
          addresses: dnssrvnoa+_grpc._tcp.loki-bloom-gateway-headless.<namespace>.svc.cluster.local.

  bloomPlanner:
    extraArgs:
    - -log.level=debug
    replicas: 1
    resources:
      requests:
        cpu: 100m
        memory: 1Gi
      limits:
        memory: 1Gi
    nodeSelector:
      zone: us-east-1a
    tolerations:
    - <<: *tolerate-arm64

  bloomBuilder:
    extraArgs:
    - -log.level=debug
    autoscaling:
      enabled: true
      minReplicas: 1
      maxReplicas: 10
    resources:
      requests:
        cpu: 2
        memory: 2Gi
      limits:
        memory: 2Gi
    extraVolumes:
      - name: blooms
        emptyDir: {}
    extraVolumeMounts:
      - name: blooms
        mountPath: /var/loki/blooms
    nodeSelector:
      zone: us-east-1a
    tolerations:
    - <<: *tolerate-arm64

  bloomGateway:
    extraArgs:
    - -log.level=debug
    replicas: 2
    resources:
      requests:
        cpu: 1
        memory: 1Gi
      limits:
        memory: 1Gi
    nodeSelector:
      zone: us-east-1a
    tolerations:
    - <<: *tolerate-arm64

Expected behavior
After the initial bloom build, builders stabilize at much lower resource consumption.

Environment:

  • Infrastructure: EKS 1.29
  • Deployment tool: Helm 6.12.0

Screenshots, Promtail config, or terminal output
If applicable, add any output to help explain your problem.


chaudum commented Oct 17, 2024

Hey @zarbis thanks for trying out the experimental bloom filter feature and reporting the issue. Did you see any restarts of the bloom-planner component while the builder processed the task?

@salvacorts Do you wanna take a look?


zarbis commented Oct 21, 2024

@chaudum checked back that day - zero restarts for planner.

salvacorts commented

@zarbis besides unexpected restarts due to errors or OOMs, what about restarts due to k8s scheduling?

Can you verify the pod was up (and on the same node) all the way through the planning process?


zarbis commented Oct 30, 2024

@salvacorts sorry for the late reply. No, the pod was changing nodes during the day. I run the monitoring cluster fully on spot instances, and up until now the LGTM stack has had zero problems with that.

This is a graph of the bloom-planner pod changing nodes throughout the build process (almost a full day).
The value is node age in minutes.
Image
Hope it helps.


salvacorts commented Oct 30, 2024

Those node changes are what's making those tasks fail. The way I have in mind to resolve this case is:

  • Store the task queue state in storage, so that on restarts we restore it from there.
  • Backoff-retry in the builders when reporting task completion while the planner is down.

Unfortunately I don't think we will be implementing this behaviour in the short-term unless we see this problem more often.

I run the monitoring cluster fully on spot instances

Consider running the planner on non-spot nodes to mitigate this issue (running other Loki components, e.g. ingesters, on spot nodes is not ideal either).
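
For example, something along these lines in the Helm values could pin the planner to on-demand capacity (a sketch; the eks.amazonaws.com/capacityType label assumes EKS managed node groups, so adjust it to whatever label distinguishes on-demand nodes in your cluster):

loki:
  bloomPlanner:
    # keep the planner off spot nodes so it is not rescheduled mid-plan
    nodeSelector:
      eks.amazonaws.com/capacityType: ON_DEMAND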


zarbis commented Oct 31, 2024

Thank you, will try that.

But I wonder how that is a problem, especially for ingesters, which have persistence AND redundancy in the form of replication factor=3?


zarbis commented Oct 31, 2024

Hey @salvacorts, I'm back with the results.

So I ran the bloom-planner on an on-demand node and made sure it didn't restart during the test.

Metrics

This is the overview screenshot of the process:
Image
Different graphs don't quite fit in a single panel, so I will go over them one by one, but at least you can see the exact queries.

I started with 10 builders and bumped it to 50 to finish the test today:
Image

Completion rate grew initially with additional replicas but soon fizzled out:
Image

Requeue rate has two spikes: after the initial start and after I added fresh builders:
Image

Interestingly loki_blooms_created_total is zero the entire time (maybe it's outdated and I should look elsewhere?):
Image

Eventually (note the time range zoomed to 15 minutes) the number of inflight tasks gets stuck roughly around the number of builder replicas, which keeps them perfectly occupied with... something:
Image

Logs

I've enabled debug logs for both the planner and the builders, but there is just too much stuff, so I will start with error-level logs.

Planner

There are only two types of errors I see on the planner side:

caller=planner.go:1007 msg="error forwarding task to builder, Task requeued" retries=2 err="error receiving response from builder (7b778cb5-ac3e-4df9-bc07-0521d6e7e45d): rpc error: code = Canceled desc = context canceled"
caller=dns.go:78 msg="failed to resolve server addresses" err="lookup IP addresses \"loki-index-gateway-headless.centralized-monitoring.svc.cluster.local\": lookup loki-index-gateway-headless.centralized-monitoring.svc.cluster.local: i/o timeout"

Builder

There is more variety in builders' logs:

Some generic errors:

caller=builder.go:147 msg="failed to connect and build. Retrying" err="builder loop failed: failed to notify task completion to planner: failed to acknowledge task completion to planner: EOF"
caller=builder.go:445 msg="failed to generate bloom" err="failed to build bloom block: iterating store: context canceled"

Notice how this error resurfaces even when I'm super confident that the planner was always available.

Some consistency errors:

caller=memchunk.go:492 msg="block offset does not match expected one, will try reading with expected offset" actual=204708 expected=155229
caller=memchunk.go:489 msg="invalid block found" err="invalid chunk checksum"

Some network errors:

caller=memcached_client_selector.go:98 msg="could not resolve TCP address of server" server=loki-chunks-cache-0.loki-chunks-cache.centralized-monitoring.svc.cluster.local.:11211 err="lookup loki-chunks-cache-0.loki-chunks-cache.centralized-monitoring.svc.cluster.local. on 10.100.0.10:53: no such host"
caller=resolver.go:100 msg="failed to lookup SRV records" host=_memcached-client._tcp.loki-chunks-cache.centralized-monitoring.svc err="lookup _memcached-client._tcp.loki-chunks-cache.centralized-monitoring.svc on 10.100.0.10:53: no such host"
caller=dns.go:78 msg="failed to resolve server addresses" err="lookup IP addresses \"loki-index-gateway-headless.centralized-monitoring.svc.cluster.local\": lookup loki-index-gateway-headless.centralized-monitoring.svc.cluster.local: i/o timeout"

So the end result is currently the same: no matter how many builders I throw at the task, they find some infinite loop to work on.

Update

I left this setup running overnight and with enough retries it managed to process all work and scale down naturally:
Image


zarbis commented Nov 18, 2024

@salvacorts did you get a chance to look into this? Maybe I can provide more info?

In general, I've run bloom generation for several days, and every time I observe those extremely long tails, where the remaining 1% of tasks takes ~80% of the total time to complete. I've stopped the process after several days since it's a huge waste of resources for me at the moment, but I hope I can resume using this long-awaited feature.

salvacorts commented

@zarbis We recently merged #14988, which I think should help here. You'll need to configure bloom_build_builder_response_timeout for the tenant. You may end up with some failing tasks, though.

In addition to the above, we introduced a new planning strategy that should yield more evenly spread tasks in terms of the amount of data they need to process. You'll need to set bloom_planning_strategy: 'split_by_series_chunks_size' for your tenant.

These changes are available in main and will be available in the next Loki release.


zarbis commented Nov 19, 2024

@salvacorts can I set both values globally in loki.limits_config? And what is a good value for the timeout? This is the time I expect a single task to take, so when the planner generates 4k tasks and most of them finish in 4 hours (so 1k/hour), should I set the timeout to the average rate of 3600/1000 seconds times some margin, like 30 seconds?

salvacorts commented

can I set both values globally in loki.limits_config

Yes

so when the planner generates 4k tasks and most of them finish in 4 hours (so 1k/hour), should I set the timeout to the average rate of 3600/1000 seconds times some margin, like 30 seconds?

No, some tasks may take longer than others regardless of the average time. I'd recommend setting whatever feels long enough to be worth cancelling and retrying, e.g. 15 minutes.
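
For example, globally in the Helm values (a sketch mirroring the limits_config above; tune the timeout to your workload):

loki:
  limits_config:
    # how long the planner waits for a builder to report back before re-queueing the task
    bloom_build_builder_response_timeout: 15m
    bloom_planning_strategy: split_by_series_chunks_size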


zarbis commented Nov 19, 2024

I was able to complete a bloom-building run in a kinda linear fashion, without the extremely long tail, this time:
Image

with the following settings:

loki:
  image:
    tag: main-479b44e
  limits_config:
    bloom_build_builder_response_timeout: 15m
    bloom_planning_strategy: split_by_series_chunks_size

Although not all tasks succeeded:

caller=planner.go:477 component=bloom-planner table=loki_index_20046 tenant=<TENANT> msg="all tasks completed for tenant table" tasks=49 tasksSucceed=41 originalMetas=24 newMetas=41
caller=planner.go:540 component=bloom-planner table=loki_index_20046 tenant=<TENANT> phase=building msg="found outdated metas" outdated=14 upToDate=51
caller=planner.go:597 component=bloom-planner table=loki_index_20046 tenant=<TENANT> phase=building msg="deleted outdated metas and blocks" metas=14 blocks=15
caller=planner.go:338 component=bloom-planner table=loki_index_20046 tenant=<TENANT> msg="not all tasks succeeded for tenant table" tasks=49 tasksSucceed=41 tasksFailed=8
caller=planner.go:357 component=bloom-planner msg="bloom build iteration completed" tasks=80 tasksSucceed=53 duration=6907.993171493
