Bloom Builder is seemingly stuck unable to finish any tasks #14283
Comments
Hey @zarbis, thanks for trying out the experimental bloom filter feature and reporting the issue. Did you see any restarts of the bloom-planner component while the builder processed the task? @salvacorts do you wanna take a look?
@chaudum I checked back that day - zero restarts for the planner.
@zarbis besides unexpected restarts due to errors or OOMs, what about restarts due to k8s scheduling? Can you verify the pod was up (and on the same node) all the way through the planning process?
@salvacorts sorry for the late reply. No, the pod was changing nodes during the day. I run the monitoring cluster fully on spot nodes, and up until now the LGTM stack has had zero problems with that. This is a graph of the bloom-planner pod changing nodes throughout the build process (almost a full day).
Those node changes are what's making those tasks fail. The way I have in mind to resolve this case is:
Unfortunately I don't think we will be implementing this behaviour in the short-term unless we see this problem more often.
Consider running the planner on non-spot nodes to mitigate this issue (running other Loki components, e.g. ingesters, on spot nodes is not ideal either).
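As an illustration of that suggestion, a minimal sketch of pinning the planner to on-demand capacity via node affinity. The `bloomPlanner.affinity` Helm values path and the `karpenter.sh/capacity-type` label are assumptions; use the values path of your chart version and the label your node provisioner sets for on-demand nodes:

```yaml
# Sketch: keep the bloom planner off spot nodes.
# The label key/value below is an assumption (e.g. Karpenter's capacity-type
# label); substitute whatever your cluster uses to mark on-demand capacity.
bloomPlanner:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: karpenter.sh/capacity-type
                operator: In
                values:
                  - on-demand
```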
Thank you, will try that. But I wonder how it is a problem, especially for ingesters, which have persistence AND redundancy in the form of replication factor=3?
Hey @salvacorts, I'm back with the results. So I've run the bloom-planner on an on-demand node and made sure it didn't restart during the test.

Metrics

This is the overview screenshot of the process:

I've started with 10 builders and bumped it to 50 to finish the test today:

Completion rate grew initially with the additional replicas but soon fizzled out:

Requeue rate has two spikes: after the initial start and after I added fresh builders:

Interestingly, eventually (notice the time range zoomed to 15 minutes) the amount of inflight tasks is stuck at roughly the number of builder replicas, which keeps them perfectly occupied with... something:

Logs

I've enabled debug logs for both planner and builders, but there is just too much stuff, so I will start with the planner.

Planner

There are only two types of errors I see on the planner side:
Builder

There is more variety in the builders' logs. Some generic errors:
Notice how this error re-surfaces even when I'm super confident that the planner was always available. Some consistency errors:
Some network errors:
So the end result currently is the same: no matter how many builders I throw at the task, they find some infinite loop to work on.

Update

I left this setup running overnight, and with enough retries it managed to process all the work and scale down naturally:
@salvacorts did you get a chance to look into this? Maybe I can provide more info? In general, I've run bloom generation for several days, and every time I observe those extremely long tails where the remaining 1% of tasks take ~80% of the total time to complete. I've stopped the process after several days since it's a huge resource waste for me at the moment, but I hope I can resume using this long-awaited feature.
@zarbis We recently merged #14988, which I think should help here. You'll need to configure the

In addition to the above, we introduced a new planning strategy that should yield more evenly spread tasks in terms of the amount of data they need to process. You'll need to set

These changes are available in
@salvacorts can I set both values globally in
Yes
No, some tasks may take longer than others regardless of the average time. I'd recommend setting whatever feels long enough to be worth cancelling and retrying, e.g. 15 minutes.
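As a rough illustration of the advice above, a `limits_config` sketch applying both settings globally with a generous per-task timeout. The exact limit names were trimmed from the comments, so the keys below are assumptions; check the limits_config reference for your Loki release before copying them:

```yaml
# Sketch only: both key names are assumptions and may not match the real
# Loki limits; consult the limits_config reference for your release.
limits_config:
  # Planning strategy that splits tasks by the amount of chunk data rather
  # than by keyspace, so tasks come out more evenly sized.
  bloom_planning_strategy: split_by_series_chunks_size
  # Per-task timeout after which the planner gives up on a builder and
  # requeues the task; 15m matches the suggestion above.
  bloom_build_task_timeout: 15m
```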
Describe the bug
I've upgraded Loki to `v3.2.0` in hopes that the new planner/builder architecture will finally make blooms work for me. I can see that blooms somewhat work by observing activity in the `loki_blooms_created_total` and `loki_bloom_gateway_querier_chunks_filtered_total` metrics, confirming the write and read path respectively. But my problem is: builders never finish crunching... something.
Some graphs to illustrate my point:
I ran a day-long experiment. I had an HPA with `maxReplicas=100` for `bloom-builder` that was peaking the whole time until I gave up. Despite a constant amount of BUSY builder replicas, the amount of created blooms dropped significantly after an hour or so.
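For reference, a minimal sketch of the kind of HPA described above, assuming the builder runs as a Deployment named `loki-bloom-builder` and scales on CPU utilization (the target name and metric are illustrative assumptions, not the exact manifest from this cluster):

```yaml
# Illustrative HPA for the bloom-builder deployment; the target name and
# CPU-based metric are assumptions, not the reporter's actual configuration.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: loki-bloom-builder
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: loki-bloom-builder
  minReplicas: 1
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75
```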
The drop in created blooms correlates quite well with the amount of inflight tasks from the planner's point of view: it quickly crunches through most of the backlog, but then gets stuck on the last couple percent of tasks.
And it's reflected in the planner logs. This is a particular line I've noticed:
At some point tasks just cannot succeed anymore and are constantly re-queued:
From the builder side I notice this log line that corresponds to the planner's:
So there seems to be some miscommunication between builder and planner:
To Reproduce
If needed, I can provide complete pod logs and Loki configuration.
Relevant part of Helm `values.yaml`:

Expected behavior
After the initial bloom building, builders stabilize at a much lower resource consumption.
Environment:
Screenshots, Promtail config, or terminal output
If applicable, add any output to help explain your problem.