[RFC][QSB] Approached to enforcement of system resource limits #11846

kaushalmahi12 · 2024-01-10T22:39:58Z

Is your feature request related to a problem? Please describe

This is a meta request for the QSB feature request, opened to get community feedback on the possible approaches.

Describe the solution you'd like

Approaches to Enforce Resource Limits

There are basically two ways to enforce resource consumption limits that I can think of. The first focuses on allocating a fixed share of a resource to each sandbox, while the second is more flexible and aims to make optimum use of the available resources.

Reserved - With this approach we assign a fixed percentage of a resource to each sandbox. The limits of all sandboxes together must not exceed 100. With this approach a sandbox triggers cancellations as soon as it hits its own limit, even when other sandboxes are underutilised.
- Pros
  1. Cancellation becomes a bit easier, as we only need to cancel when a sandbox exceeds its limit.
  2. We are freed from the pain of tracking sandbox resource usage cumulatively, so we need not employ a hierarchical topology for sandboxes.
  3. Sandboxes need not have a priority.
  4. Overall efficacy of system resources is higher, since #rejections > #cancellations.
- Cons
  1. This can lead to underutilisation of resources system-wide.
  2. Additional overhead of validating the individual sandbox resource limits each time a user creates a sandbox.
Constrained - With this approach we assign each sandbox a limit which we will always honor. To make optimum use of the available resources, however, the limits across all sandboxes are not required to sum up to the system-level duress limit (their sum may exceed it). This creates the problem of deciding which sandbox's queries should be cancelled. To solve it, each sandbox has a priority that is used when contention happens.
- Pros
  1. Optimum use of available system resources.
- Cons
  1. It can cause more cancellations than rejections if not configured properly (free-flowing limits, e.g. every sandbox configured with the max limit).
  2. It is complex to maintain the tree topology and priority-based cancellation in case of contention.
  3. Efficacy of system resources is lower, as #cancellations > #rejections in the case where none of the sandboxes hit their configured limits but cumulatively they put the node under duress. Cancelling a task wastes the resources already spent on its progress so far.

Let's understand them with the help of some examples. For the sake of simplicity I am using only a single value per resource limit, but there will be two limits for each system resource: low and high.

Constrained

Let's say we have 3 sandboxes in the system

Sandbox1 - { ResourceLimit: 60, Priority: 1}
Sandbox2 - { ResourceLimit: 20, Priority: 3}
Sandbox3 - { ResourceLimit: 40, Priority: 2}

System wide resource limit: 90

Let's capture the current resource usage of the sandboxes at different times

Cancellation Case: sandbox limit breached

Time	Sandbox1	Sandbox2	Sandbox3
T1	40	10	30
T2	50	25	10

Sandbox2 will start rejecting new requests and cancel some running ones, since its usage at T2 (25) exceeds its limit (20).

Cancellation Case: system limit breached

Time	Sandbox1	Sandbox2	Sandbox3
T1	40	10	30
T2	50	15	30

Here the cumulative usage at T2 (50 + 15 + 30 = 95) breaches the system-wide limit of 90, so cancellation is triggered. This means that Sandbox2 will face cancellation even though no sandbox-level limit is breached.

Rejection Case:

Time	Sandbox1	Sandbox2	Sandbox3
T1	40	10	30
T2	35	22	30

In this case Sandbox2 will face rejections, as its sandbox-level limit is breached (22 > 20).
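
The three Constrained cases above can be checked with a small sketch. This is only an illustration, not a proposed implementation: the `Sandbox` class and the `evaluate` and `snapshot` helpers are hypothetical names, and it assumes that a larger priority number means a lower priority (Sandbox2 with Priority 3 is the one that gets cancelled). It deliberately does not decide between rejection and cancellation on a sandbox-level breach, since that depends on the low and high limits that the examples leave out.

```python
from dataclasses import dataclass


@dataclass
class Sandbox:
    name: str
    limit: int      # sandbox-level resource limit (percent)
    priority: int   # assumption: larger number = lower priority
    usage: int = 0  # current resource usage (percent)


def evaluate(sandboxes, system_limit):
    """Return (names of sandboxes over their own limit, system-level victim or None)."""
    over_limit = [s.name for s in sandboxes if s.usage > s.limit]
    total = sum(s.usage for s in sandboxes)
    victim = None
    if total > system_limit:
        # System-wide breach: pick the lowest-priority sandbox that is using resources.
        active = [s for s in sandboxes if s.usage > 0]
        victim = max(active, key=lambda s: s.priority).name
    return over_limit, victim


def snapshot(u1, u2, u3):
    """Build the three example sandboxes with the given usage values."""
    return [
        Sandbox("Sandbox1", 60, 1, u1),
        Sandbox("Sandbox2", 20, 3, u2),
        Sandbox("Sandbox3", 40, 2, u3),
    ]


print(evaluate(snapshot(50, 25, 10), 90))  # sandbox limit breached at T2
print(evaluate(snapshot(50, 15, 30), 90))  # system limit breached at T2
print(evaluate(snapshot(35, 22, 30), 90))  # rejection case at T2
```

Only the second snapshot reports a system-level victim, because it is the only one where the cumulative usage (95) exceeds the system-wide limit of 90.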

Reserved

Let's say we have 3 sandboxes in the system

Sandbox1 - { ResourceLimit: 50, Priority: 1}
Sandbox2 - { ResourceLimit: 20, Priority: 3}
Sandbox3 - { ResourceLimit: 30, Priority: 2}

The sandbox limits for this example are chosen so that they sum up to 100, as is inherent in the approach.

Cancellation Case: sandbox limit breached

Time	Sandbox1	Sandbox2	Sandbox3
T1	40	10	30
T2	40	20(1)	20
T3	45	25(2)	10

(1) At this point Sandbox2 will start rejecting new incoming requests.
(2) At this point we will also start cancelling running requests from Sandbox2 due to the sandbox-level resource limit breach.
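
The two footnotes can be read as a simple rule: reject new requests once usage reaches the limit, and additionally cancel running requests once usage exceeds it. A minimal sketch of that reading follows. The function name `reserved_actions` and the action labels are hypothetical, and with the separate low and high limits mentioned earlier the two thresholds would presumably differ.

```python
def reserved_actions(usage, limit):
    """Return the enforcement actions for one sandbox under the Reserved approach."""
    actions = []
    if usage >= limit:
        # Footnote (1): the sandbox has reached its limit, stop admitting new requests.
        actions.append("reject_new")
    if usage > limit:
        # Footnote (2): the limit is breached, also cancel running requests.
        actions.append("cancel_running")
    return actions


# Sandbox2 (limit 20) at T1, T2 and T3 from the table above.
for time, usage in [("T1", 10), ("T2", 20), ("T3", 25)]:
    print(time, reserved_actions(usage, limit=20))
```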

Cancellation Case: system level limit breached

Time	Sandbox1	Sandbox2	Sandbox3
T1	40	10	30
T2	50	15	25
T3	50	18	30

In this case Sandbox2 will start cancelling requests because it is the lowest-priority sandbox.

Decision-driving factors for selecting one of the above approaches

1. We want to improve the overall efficacy of system resources, which means we should avoid wasting resources on tasks that can potentially shoot beyond the enforced limits. Basically, we will favor rejections over cancellations.
2. Our system should try its best to honor the user-assigned limits for these sandboxes, even though this can lead to underutilisation in the system. For example, let's say there are 3 sandboxes in the system with limits of 60, 20 and 10 respectively. There might be a time when only sandbox 2 has traffic and is inundated with it, so it will start rejecting requests even though the system is still not under duress.
3. At any point in time the limit assigned to a sandbox should be honored. For example, a sandbox should not face cancellation or rejection until its defined limit is breached.

Personal Verdict

We will go ahead with the reserved approach for enforcing resource limits, considering the above points.

Problems with the selected approach to enforce sys resource limits and possible solutions

The only ambiguity with this approach is how to keep the cumulative resource limit at 100 or below, since the user can supply any arbitrary value for new sandboxes.

To understand this with an example, let's say that at some point in time we have 3 sandboxes in the system

sandbox1: { limit: 40 }
sandbox2: { limit: 30 }
sandbox3: { limit: 20 }

Now let's say a user wants to create a new sandbox with a resource limit of 30. The new cumulative sum becomes 120 (> 100). This warrants either readjusting the existing sandbox limits or creating the new sandbox with a limit of 10.

How do we resolve this conflict? There are two ways I can think of:

1. We readjust the resource limits of the existing sandboxes in the same proportion on the user's behalf. E.g., in the above scenario we can let the new sandbox be created with a limit of 30 * 10/12 and readjust the other sandbox limits to 40 * 10/12, 30 * 10/12 and 20 * 10/12.
2. We error out the request to create the new sandbox and ask the user to readjust the limits of the existing sandboxes to accommodate the new one.
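
To make the arithmetic of the two options concrete, here is a minimal sketch under the assumptions of the example (limits are percentages and the cap is 100). The function names `rescale` and `create_or_reject` are hypothetical and not part of any existing API.

```python
from fractions import Fraction

TOTAL = 100  # cumulative cap on sandbox limits, in percent


def rescale(existing, requested):
    """Option 1: shrink all limits (existing and new) by the same factor to fit the cap."""
    limits = existing + [requested]
    total = sum(limits)
    if total <= TOTAL:
        return limits
    factor = Fraction(TOTAL, total)  # 100/120 = 10/12 in the example
    return [limit * factor for limit in limits]


def create_or_reject(existing, requested):
    """Option 2: refuse the request when it does not fit into the remaining headroom."""
    headroom = TOTAL - sum(existing)
    if requested > headroom:
        raise ValueError(
            f"requested limit {requested} exceeds remaining headroom {headroom}"
        )
    return existing + [requested]


adjusted = rescale([40, 30, 20], 30)
print([round(float(x), 2) for x in adjusted])  # limits after proportional readjustment

try:
    create_or_reject([40, 30, 20], 30)
except ValueError as err:
    print("rejected:", err)
```

With option 1 the existing limits change silently (40 becomes roughly 33.33), whereas option 2 leaves every existing limit untouched and makes the conflict explicit to the user.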

Personally I think the 2nd option provides a better user experience, but I am looking forward to hearing from folks on this.

I am using the keyword Sandbox because we started envisioning this feature with it, but it is not the final name for the construct to be used in the implementation.

Main Issues

RFC
Proposal

Related component

Search:Resiliency

Describe alternatives you've considered

No response

Additional context

No response

kaushalmahi12 · 2024-01-11T00:14:25Z

@backslasht @Bukhtawar @msfroh
Can you provide your feedback on this?

peternied · 2024-01-24T16:39:47Z

[Triage - attendees 1 2]
@kaushalmahi12 Thanks for writing up this great discussion / proposal

kaushalmahi12 added enhancement Enhancement or improvement to existing feature or request untriaged labels Jan 10, 2024

github-actions bot added the Search:Resiliency label Jan 10, 2024

kaushalmahi12 changed the title ~~[Meta Isuue][QSB] Approached to enforcement of system resource limits~~ [Meta Issue][QSB] Approached to enforcement of system resource limits Jan 11, 2024

peternied added RFC Issues requesting major changes and removed untriaged labels Jan 24, 2024

peternied added the discuss Issues intended to help drive brainstorming and decision making label Jan 24, 2024

peternied changed the title ~~[Meta Issue][QSB] Approached to enforcement of system resource limits~~ [QSB] Approached to enforcement of system resource limits Jan 24, 2024

peternied changed the title ~~[QSB] Approached to enforcement of system resource limits~~ [RFC][QSB] Approached to enforcement of system resource limits Jan 24, 2024

andrross added the Roadmap:Stability/Availability/Resiliency Project-wide roadmap label label May 31, 2024

github-project-automation bot added this to Search Project Board May 31, 2024

github-project-automation bot moved this to 🆕 New in Search Project Board May 31, 2024

getsaurabh02 added this to OpenSearch Roadmap May 31, 2024

github-project-automation bot moved this to Planned work items in OpenSearch Roadmap May 31, 2024

getsaurabh02 moved this from 🆕 New to Later (6 months plus) in Search Project Board Aug 15, 2024

kkhatua assigned kaushalmahi12 Aug 30, 2024
