feat: add concurrency and adjust queue logic #3808
Description
It seems the ARC controller is struggling to scale effectively under higher workloads. To avoid delving too deep into the technical details, here's a brief overview of the issues I've identified:
- **Lack of Concurrency Handling:** The controller does not reconcile events concurrently, so performance degrades as event volume increases; the filtering process becomes less efficient and eventually turns into a bottleneck (see the sketch below).
- **Suboptimal Queue/Requeue Logic:** The current queueing mechanism appears to be built on assumptions about system behavior that lead to excessive requeueing. As volume grows, events are requeued multiple times and aren't fully processed until they have cycled through the entire queue several times, significantly delaying completion.
- **Inaccurate Deletion Logic:** The controller inaccurately tracks "running" runners. Some runners that have already completed are still counted as running because the controller hasn't processed all of their events, creating both logical and metric inconsistencies.
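For reference, here is a minimal sketch of what enabling concurrent reconciles looks like in controller-runtime terms (which ARC is built on). The type name, the watched resource, and the worker count are placeholders for illustration, not the actual ARC code:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller"
)

// RunnerReconciler is a stand-in for the ARC reconciler; the real type and
// the resource it watches live in the actions-runner-controller repository.
type RunnerReconciler struct {
	client.Client
}

func (r *RunnerReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... existing reconcile logic ...
	return ctrl.Result{}, nil
}

// SetupWithManager registers the reconciler with more than one worker so the
// work queue is drained by several goroutines instead of a single one.
func (r *RunnerReconciler) SetupWithManager(mgr ctrl.Manager, maxConcurrent int) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Pod{}). // placeholder; the real controller watches ARC's runner resources
		WithOptions(controller.Options{
			// MaxConcurrentReconciles defaults to 1; raising it allows
			// events to be reconciled in parallel.
			MaxConcurrentReconciles: maxConcurrent,
		}).
		Complete(r)
}
```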
While the controller works well under lower loads, once we reach the range of 250-500 runners, performance degrades and runners are consumed faster than they can be created. The only immediate workaround has been to raise the `minRunners` threshold to 1000+, which buffers some demand but is far from optimal and is also quite costly.

I've attached a log trace for a single runner from start to finish. As you can see, there are significant delays in event processing (labeled as "delta") that align directly with the issues above. These delays indicate a drag in the controller's performance as it processes each event:
Even ignoring the deletion process (which may contribute to the queue depth), there are significant delays, often 5+ minutes between events such as pod creation, and this is particularly exacerbated during peak load hours.
As you can see, while increasing `minRunners` provides some relief, it's not a long-term solution. We need to optimize the controller's concurrency, queue handling, and deletion logic to address these issues more effectively.

Within our PR we enabled concurrency and saw immediate improvements in performance and scalability. I still believe the requeue and queued-event completion logic is seriously flawed, but this change alone gets us out of the position we currently find ourselves in (see the requeue sketch below).
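To make the requeue concern concrete, here is a hedged illustration, not the controller's actual code, contrasting an unconditional-requeue pattern with an event-driven one in controller-runtime terms. The helper function is hypothetical:

```go
package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// stillWaitingOnExternalState is a hypothetical helper standing in for any
// condition that genuinely requires polling (e.g. external state that emits
// no watch event).
func stillWaitingOnExternalState() bool { return false }

// Anti-pattern: every pass unconditionally schedules another pass, so each
// object cycles through the work queue repeatedly before it is finished.
func reconcileEager(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... do the actual reconciliation work ...
	return ctrl.Result{Requeue: true}, nil // always straight back into the queue
}

// Event-driven alternative: return an empty Result and let watch events
// (pod phase changes, deletions, etc.) trigger the next reconcile; only fall
// back to RequeueAfter when polling is genuinely required.
func reconcileEventDriven(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... do the actual reconciliation work ...
	if stillWaitingOnExternalState() {
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	return ctrl.Result{}, nil
}
```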
This was tested in our PROD environment: at approximately 16:14 we deployed our patched version, and within about five minutes the P99 job latency metrics began to come down to acceptable levels.