[RFC] Isolating CPU for specific workloads(search or others) #12483
Comments
Can you expand on these reasons? It would probably be helpful context. Naively, if all search tasks were 100% CPU-bound then you could accomplish "User can say that they want to allocate 70% of CPU to search workloads" by defining the search thread pool size as allocated processors * 0.7.
Sure, I can try. Usually when we have systems like OpenSearch which do a lot of I/O-bound tasks (network, file I/O), we can have a situation where a lot of threads are not doing much apart from waiting, in which case other threads (which need to do some computation) can be scheduled (internally by the operating system) to use the CPU instead. This achieves much better CPU utilization. If we had a service that was only meant for heavy CPU work (like computations), then it would have made sense to keep the number of threads equal to the number of cores. So I don't think we should solve this by reducing the thread pool size to allocated processors * 0.7; it will degrade performance compared to the approach I have mentioned.
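The sizing tradeoff described in this comment is the classic wait/compute rule of thumb for thread pools. A minimal sketch, with illustrative names that are not OpenSearch code:

```java
// Sketch (not OpenSearch code): why pool sizing differs for CPU-bound vs
// I/O-bound work. A common rule of thumb is
//   threads = cores * (1 + waitTime/computeTime)
// so a pool doing mostly I/O waits is sized well above the core count,
// while a purely CPU-bound pool stays at ~1 thread per core.
public class PoolSizing {
    /** Rule-of-thumb pool size given a task's wait/compute ratio. */
    static int poolSize(int cores, double waitToComputeRatio) {
        return (int) (cores * (1 + waitToComputeRatio));
    }

    public static void main(String[] args) {
        int cores = 8;
        // Purely CPU-bound: no waiting, one thread per core is enough.
        System.out.println("cpu-bound: " + poolSize(cores, 0.0)); // 8
        // I/O-heavy: threads wait ~4x as long as they compute, so the OS
        // can schedule other runnable threads onto the idle cores.
        System.out.println("i/o-bound: " + poolSize(cores, 4.0)); // 40
    }
}
```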
Thanks @sgup432. The question behind my question is whether there is an opportunity to structure the code such that we aren't doing blocking operations on "search" threads, which would make your scaling decisions about that thread pool much simpler and avoid needing to over-provision relative to the physical compute resources.
@andrross - The blocking operations on "search" threads are for reading data from Lucene segment files on disk. Hence IMO, unless we fundamentally change Lucene or the way OpenSearch works, we cannot avoid blocking these threads.
In addition to what @jainankitk said, we can never avoid a situation where a thread won't be interrupted or preempted in favor of another unless we have an ideal system. This stackoverflow thread probably explains it better.
There are some interesting options here that GraalVM explores; please see https://medium.com/graalvm/isolates-and-compressed-references-more-flexible-and-efficient-memory-management-for-graalvm-a044cc50b67e
So this picture is a bit simplified and does not reflect the state we are in now, since we have concurrent search in GA (as of 2.12): there is a shared thread pool to manage segment traversals in parallel. In the happy path, search threads do not do any I/O anymore, but index searcher threads do.
We cannot limit CPUs/heap on a thread pool efficiently (at least I do think so, and would appreciate correction here). We could certainly try to estimate the pool's CPU/heap usage, or limit the number of active threads to some % of cores, but I think none of those would be very effective: we would end up with either under-utilization (e.g. many blocked threads would suggest the pool is doing nothing) or over-utilization (once those blocked threads become runnable at the same time).
Thanks for the reference, I will take a look.
According to my limited knowledge, the library I mentioned above provides a way to create a specific thread factory which is pinned to certain CPU cores. I am guessing we could use that (maybe with some tweaks) to achieve this, as the search threadpool is essentially created by initializing a thread factory which is passed to an ExecutorService (correct me if I am wrong here). I still need to do a deeper dive on the library, but first I wanted to see if the idea makes any sense and initiate some discussion around it.
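The wiring described here (a ThreadFactory handed to an ExecutorService, which is the natural hook point for any per-thread customization such as pinning) can be sketched as follows; the names are illustrative, not actual OpenSearch classes:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: a thread pool is built from a ThreadFactory, so a factory that
// customizes thread creation (naming here; CPU pinning in the
// affinity-library case) is where such logic would plug in.
public class FactoryWiring {
    static ThreadFactory named(String prefix) {
        AtomicInteger n = new AtomicInteger();
        return r -> {
            Thread t = new Thread(r, prefix + "-" + n.getAndIncrement());
            // A pinning factory would additionally bind t to a CPU set here.
            return t;
        };
    }

    public static void main(String[] args) throws Exception {
        ExecutorService search = Executors.newFixedThreadPool(4, named("search"));
        Future<String> f = search.submit(() -> Thread.currentThread().getName());
        System.out.println(f.get()); // e.g. "search-0"
        search.shutdown();
    }
}
```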
Yeah, I agree that we might end up either over-utilizing or under-utilizing the CPU/heap. But this RFC accepts that caveat and is more focused on providing resiliency by isolating the underlying resources for search (or indexing) workloads.
Pinning is not the same as limiting CPU usage, right? Pinning helps with data locality / context switches; you can pin a thread to a core and the thread can still just block (btw, the library actually lives under OpenHFT, https://github.com/OpenHFT/Java-Thread-Affinity). It may work in some systems, but OpenSearch has many, many layers (Apache Lucene) and I am 100% sure that pinning will not be helpful (we have ~20 different thread pools).
This caveat may not be practical; it's like asking people to use the serializable isolation level on their databases all the time - very good isolation, horrible performance.
Agreed, this caveat may not be practical. Also, I believe we can achieve added resiliency via under-utilization today by overriding the default search thread pool size to a smaller value.
Yeah, I used "pinning" but that was not exactly the intention. Java affinity at least tries to solve the data locality problem, but I was more looking to use it in other ways, like to isolate X% of CPU cores for some Y threads. Again, that was given as an example, but I am open to other ways as well.
Again, my intention was not to enable it for everyone but only when needed. And I don't think we can generalize to "horrible" performance, as the user is controlling that setting.
As mentioned above, if we have the below two cases, I guess we still might perform better in the 1st case, as in the 2nd case the CPU is going to be way under-utilized. IMO if we want to solve it, we shouldn't do it via the 2nd case.
@reta @andrross - Thank you for providing your inputs. While I agree that significant (I know this is vague) under-utilization of resources might not be an acceptable tradeoff, maybe limiting the cores for the search threadpool to 80% or something prevents the CPU from spiking to 100% and bringing the whole cluster to a standstill. @sgup432 - I doubt that this can be supported as a dynamically configurable parameter. Are you thinking of this as a static configuration with the default being the same as what we have today?
@reta - I am a bit unclear on how concurrent segment search prevents/reduces search threads from getting blocked. Can you please elaborate more on this?
@jainankitk The search task (managed by the search thread pool) will be waiting for numerous index searcher tasks (managed by the index searcher pool) to be completed; in other words, it won't be doing any useful work, and limiting the search pool will not help - the real work is done by the other pool instead.
@sgup432 Any user could do this right now - every thread pool is configurable, and the limit could be set to whatever the user thinks is reasonable; no additional implementation is required.
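For reference, the override mentioned here is a static node setting; a sketch of what it might look like in `opensearch.yml` (exact keys and defaults should be checked against the thread pool documentation for your version):

```yaml
# opensearch.yml - statically cap the search pool below the default
# ((allocatedProcessors * 3) / 2) + 1 threads.
thread_pool.search.size: 8
thread_pool.search.queue_size: 1000
```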
As @reta mentioned, CPU pinning is generally a low-level technique to provide data locality / reduce context switching. The problem here (search operations can overwhelm the system) is probably better solved by limiting concurrency at the application level. This is indeed a hard problem, but in my opinion improving search backpressure / cancellation is probably the way to go, as it can adapt to the workload in real time. If you want to put an artificial cap on a class of operations, you can do so today by overriding the default thread pool size to something smaller. I'm skeptical that limiting a thread pool to a subset of CPU cores would be significantly different in practice from just tuning the thread pool size.
@sgup432 Feel free to build a prototype and prove me wrong though! |
Yeah it will be a static setting and by default would function the way we have it today. |
@reta - If most of the work is being done by index searcher tasks going forward, maybe putting a limit on that pool makes sense!?
I am wondering if limiting concurrency at the application level can lead to under-utilization of resources.
We are working on that as well, but it will be some time before we solve the problem for good using that approach, since cancellation is a hard problem in itself.
@andrross - Yes, it should be different, unless my understanding is incorrect (@sgup432 - please correct me if I am wrong). Limiting the cores is like restricting the lanes on a freeway while still allowing traffic (non-search operations) to flow smoothly through the carpool/express lane. Limiting the number of threads is like reducing the number of cars on the freeway, and it stays reduced even when there is no congestion.
While I am not sure about the proving-wrong part! ;) It did get me wondering what it would take to build the prototype. Will it be quick, or is it going to be a few weeks of effort?
@jainankitk I'm still skeptical :) If your problem was that search operations had filled the freeway up with cars, and then you limit them to a subset of lanes, then that will reduce the total throughput for search operations. Regardless of whether there is congestion or not the total possible throughput for search operations has been reduced. This sounds the same as if you just limited the number of search "cars" allowed on the freeway with no particular lane restrictions. |
@jainankitk To be clear, OpenSearch does this today. This is why there are like 20 different thread pools defined of all different sizes and characteristics. If we didn't want to limit concurrency of different functions, then we'd just have one giant thread pool and let everything fight over it. The problem here is that we have a heuristic (i.e. "150% of CPU cores + 1 gives the right balance for search operation concurrency") and this heuristic works well generally, but in some cases those search operations end up being more CPU-heavy than the average workload leading to over-utilization of the CPU. If the workload is predictable and known ahead of time then the search thread pool can be tuned to a different size. If the workload is not known ahead of time or is unpredictable, then I think you need to solve this with something that is adaptive like backpressure. |
Maybe something along the lines of thread priority can help solve the stability problem: make the ping threads and health-check threads the highest priority, indexing and search threads medium, and background task threadpools low priority.
@sgup432 - During discussion at the Lucene meetup, @sohami asked if we can use priority for this purpose. Do you think assigning lower priority to search, indexing, and other intensive thread pools would achieve a similar purpose as isolating CPU? - https://www.baeldung.com/java-thread-priority
@sohami @jainankitk |
I guess our objective is to ensure that critical tasks are not starved of CPU due to compute intensive search threads. Isolation is one of the ways, while priority achieves the same without introducing additional constraints.
With prioritization, the search thread would yield the CPU immediately whenever a higher-priority thread is waiting. So I don't think the node would go down due to CPU starvation. Maybe we can try out this approach?
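The priority idea could be prototyped through the same ThreadFactory hook; a minimal sketch with illustrative names (note the caveat that Java priorities are only hints to the OS scheduler, and on Linux they are often ignored without special privileges, so the yielding behavior is best-effort rather than guaranteed):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;

// Sketch: give "critical" pools (pings, health checks) MAX_PRIORITY and
// heavy pools (search) MIN_PRIORITY via their ThreadFactory.
public class PriorityPools {
    static ThreadFactory withPriority(String name, int priority) {
        return r -> {
            Thread t = new Thread(r, name);
            t.setPriority(priority); // a scheduling hint only, not a guarantee
            return t;
        };
    }

    public static void main(String[] args) throws Exception {
        ExecutorService critical =
            Executors.newFixedThreadPool(2, withPriority("ping", Thread.MAX_PRIORITY));
        ExecutorService search =
            Executors.newFixedThreadPool(8, withPriority("search", Thread.MIN_PRIORITY));
        Future<Integer> p = search.submit(() -> Thread.currentThread().getPriority());
        System.out.println(p.get()); // MIN_PRIORITY == 1
        critical.shutdown();
        search.shutdown();
    }
}
```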
@jainankitk |
Is your feature request related to a problem? Please describe
I want to propose this idea and am looking for some opinions from the community. Hopefully it doesn't sound terrible.
Context:
As of now, we have search and indexing workloads (and others as well) running on the same node under the same process, each with their own threadpool defined. A lot of times, certain "expensive" search queries cause CPU to spike to 100% on some/all nodes in a cluster and cause cluster instability. This can happen when all search threads occupy the majority/all of the CPU cores, also considering that we define the number of search threads to be greater than the available CPU cores on a node for performance reasons (formula: `((allocatedProcessors * 3) / 2) + 1`). We don't have a nice way today for users to isolate a search workload (for example) so that it does not take up more than 70% (let's say) of CPU, thereby providing resiliency and avoiding node drops.
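The default sizing formula above, written out as code (the formula is from the issue text; the class and method names here are illustrative):

```java
// The default search pool sizing quoted in this RFC:
//   ((allocatedProcessors * 3) / 2) + 1
public class SearchPoolDefaults {
    static int defaultSearchPoolSize(int allocatedProcessors) {
        return ((allocatedProcessors * 3) / 2) + 1;
    }

    public static void main(String[] args) {
        // On a 16-core node the search pool defaults to 25 threads,
        // i.e. ~150% of the cores - enough that CPU-heavy queries can
        // saturate every core on their own.
        System.out.println(defaultSearchPoolSize(16)); // 25
        System.out.println(defaultSearchPoolSize(8));  // 13
    }
}
```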
We have had discussions around defining "search"-only nodes, but that is complex and might be expensive for users to maintain as a different node type. The existing search backpressure feature also doesn't solve this holistically, and it relies on a cancellation mechanism (for search tasks after CPU goes beyond X%) which has its own limitations and may not necessarily work.
Unfortunately there is no way in Java to isolate heap for a specific group of threads, otherwise this could have been extended to the JVM heap as well.
Use case:
User can say that they want to allocate 70% of CPU to search workloads via some cluster setting. And we will accordingly map search threadpool to X% CPU cores or something like that.
Describe the solution you'd like
In Java, a concept called Java affinity exists where you can map a group of threads to certain CPU cores. It is mostly used in cases where the OS's thread scheduling does not provide optimal performance, but we can use it to isolate a group of threads so they only run on X CPU cores of a Y-core node (for example).
There also exists a library around this which provides a way to assign a group of threads to certain CPU cores (by reading `/proc/cpuinfo`) using a ThreadFactory. I haven't done a POC yet, as I am not sure of the feasibility.
Related component
Search:Resiliency
Describe alternatives you've considered
Have different processes for search/indexing, but this seems way more complex.
Additional context
No response