
[Feature] Add force to worker group suspend API #2744

Open · 1 of 2 tasks

kevin85421 (Member) opened this issue Jan 13, 2025 · 7 comments

kevin85421 commented Jan 13, 2025

Search before asking

  • I have searched the issues and found no similar feature request.

Description

#2663 adds an API to suspend a worker group, and @rueian is working on #2666 to avoid Autoscaler scaling up the suspending or suspended worker groups unexpectedly.

For RayJob, it's fine for KubeRay to kill worker Pods directly. However, we want to expose the suspend API of worker groups to users so that they can also terminate a worker group using the API.

In that case, it seems better to delegate responsibility for shutting down the Ray Pods to the Ray Autoscaler, since the Autoscaler can drain them before termination.

This issue proposes an API change:

suspendOptions:
  force: bool
  timeout: int # only effective when `force` is false
  • Case 1: force: true

    • When force is set to true, KubeRay kills the Pods directly, and the Ray Autoscaler simply avoids scaling the worker group up or down. RayJob should fall into this category.
    • The case where the Ray Autoscaler doesn’t handle suspend is a bit tricky, because the Autoscaler will still try to update the RayCluster CR after suspend is set to true. The RayCluster controller should not create any Ray Pod for the worker group while suspend is true.
  • Case 2: force: false, timeout is not set.

    • Case 2-1: Autoscaler is enabled
      • Autoscaler versions that handle suspend: the Autoscaler drains the Ray Pods and then scales them down.
      • Old Ray versions: Undefined behavior. Users should avoid it.
    • Case 2-2: Autoscaler is not enabled
      • This is not supported. RayCluster spec validation should detect it and create a Kubernetes event.
  • Case 3: force: false, timeout is set.

    • Case 3-1: Autoscaler is enabled
      • Autoscaler versions that handle suspend: the Autoscaler drains the Ray Pods and then scales them down. If the Autoscaler doesn't scale down within the timeout, KubeRay deletes the Pods.
      • Old Ray versions: Undefined behavior. Users should avoid it.
    • Case 3-2: Autoscaler is not enabled
      • (1) This is not supported. RayCluster spec validation should detect it and create a Kubernetes event.
      • (2) KubeRay deletes these Pods after timeout.
      • Both (1) and (2) are fine for me.
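For concreteness, here is a rough sketch of how the proposed fields might sit on a RayCluster worker group spec. Note that suspendOptions is only the proposal from this issue (not an implemented API), the timeout unit is assumed to be seconds, and the group name, sizes, version, and image tag are illustrative:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-sample
spec:
  rayVersion: "2.41.0"            # illustrative; could also be used to gate the feature
  workerGroupSpecs:
    - groupName: small-group
      replicas: 2
      minReplicas: 0
      maxReplicas: 5
      suspend: true               # field added by #2663
      suspendOptions:             # proposed in this issue, not implemented yet
        force: false
        timeout: 300              # assumed seconds; only effective when force is false
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.41.0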

Another option is to rely on rayVersion to check whether the Ray Autoscaler supports suspend, and to treat it as a CR spec validation error if suspend is set to true while the Ray Autoscaler doesn't support it. Maybe this is easier for us, as we won't need to maintain behavior that is hard to define.

Use case

No response

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
kevin85421 added the enhancement and triage labels Jan 13, 2025
rueian commented Jan 13, 2025

I guess we just need an optional gracePeriod field.

The idea is that KubeRay will kill the suspended group after the grace period anyway if the Ray Autoscaler is enabled. That way, we can still stop the suspended group even with old Ray versions that don't read the suspend field from the CR.

If the Ray Autoscaler is not enabled, then it is KubeRay’s responsibility to kill the suspended group immediately.
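A minimal sketch of that shape, assuming the field sits under the suspendOptions block proposed above and that the unit is seconds (neither is an implemented API yet):

suspend: true
suspendOptions:
  gracePeriod: 300   # assumed seconds; KubeRay deletes the group's Pods after this delay regardless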

rueian commented Jan 13, 2025

@kevin85421

After reconsidering all the above proposals, I am afraid we must forbid suspend on worker groups for old Ray versions with the Ray Autoscaler enabled, because even with the timeout or gracePeriod above, old Ray Autoscalers can't know that groups have been suspended and will keep making wrong scaling decisions.

kevin85421 (Member Author) commented

@rueian Oops, I was interrupted by other tasks while writing the issue. I’ll finish it and ping you for feedback later.

kevin85421 (Member Author) commented

cc @rueian @andrewsykim: I need some input on this proposal. We can also discuss it in tomorrow's community sync.

  1. Should we support worker group suspend with a Ray Autoscaler that doesn't handle suspend? If not, we can use rayVersion to reject the combination.

  2. Which option in "Case 3-2: Autoscaler is not enabled" do you prefer?

rueian commented Jan 13, 2025

2. Which option in "Case 3-2: Autoscaler is not enabled" do you prefer?

I prefer (1), because the KubeRay operator should kill the group immediately after it is marked with suspend=true when the Autoscaler is not enabled.

rueian commented Jan 15, 2025

To summarize all the details related to the ongoing feature for suspending a specific worker group:

For KubeRay 1.3, we will introduce the suspend field on the worker group spec. This is done in #2663:

suspend: true

When a user sets suspend: true on a worker group of an Autoscaler-disabled RayCluster, KubeRay will immediately delete all the Pods in the worker group to suspend it.
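As a sketch, on an Autoscaler-disabled RayCluster this could look like the following (the group name and sizes are illustrative):

spec:
  enableInTreeAutoscaling: false
  workerGroupSpecs:
    - groupName: cpu-group
      replicas: 3
      minReplicas: 0
      maxReplicas: 5
      suspend: true   # KubeRay deletes all Pods in this worker group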

However, the field will have NO effect on an Autoscaler-enabled RayCluster. That is because the Ray Autoscaler will malfunction if KubeRay kills the worker group, for example by:

  • Always trying to scale the suspended worker group back up to satisfy its min_workers requirement.
  • Always trying to scale the suspended worker group up for currently queued Ray tasks.

This is being handled in #2748.

To avoid this malfunction, we have a patch for Ray's Autoscaler that makes it aware of suspend: true on the RayCluster CR (ray-project/ray#49768). Once the patch has been merged and released, we can allow users to suspend worker groups on future Autoscaler-enabled Ray versions by checking the rayVersion on the RayCluster CR.

But for now, in #2748, setting suspend: true on a worker group of an Autoscaler-enabled RayCluster is considered invalid, and KubeRay will produce an InvalidRayClusterSpec warning event. We will also add the check to the validation webhook to reject the invalid CR as early as possible.
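For illustration, the combination that would currently be rejected looks roughly like this (assuming the in-tree Autoscaler is enabled via enableInTreeAutoscaling; the group name is illustrative):

spec:
  enableInTreeAutoscaling: true
  workerGroupSpecs:
    - groupName: gpu-group
      suspend: true   # invalid until the Ray-side patch is released; triggers InvalidRayClusterSpec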

For KubeRay 1.4, besides allowing worker group suspension on future Ray versions with the Autoscaler, we plan to extend the API to:

suspend: true
suspendOptions:
  gracePeriod: 10

The new gracePeriod allows users to specify a delay before the worker group is deleted. This should work for both Autoscaler-disabled and Autoscaler-enabled RayClusters.
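Put together, a rough sketch of a worker group using the planned 1.4 fields (the gracePeriod unit is not specified above, so seconds is an assumption here, and the group name and sizes are illustrative):

workerGroupSpecs:
  - groupName: small-group
    replicas: 2
    minReplicas: 0
    maxReplicas: 5
    suspend: true
    suspendOptions:
      gracePeriod: 10   # planned for KubeRay 1.4; assumed seconds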

kevin85421 commented Jan 17, 2025

  1. Add a feature gate for suspend. Add comments saying that this is not user-facing for now.
  2. Add validation to avoid using the Autoscaler with suspend in the RayJob / RayCluster controllers for now.
  3. Update the Ray Autoscaler.
