
[Feature] Add force to worker group suspend API #2744

Open · 1 of 2 tasks

kevin85421 (Member) opened this issue Jan 13, 2025 · 7 comments

kevin85421 commented Jan 13, 2025

Search before asking

  • I have searched the issues and found no similar feature request.

Description

#2663 adds an API to suspend a worker group, and @rueian is working on #2666 to avoid Autoscaler scaling up the suspending or suspended worker groups unexpectedly.

For RayJob, it's fine for KubeRay to kill worker Pods directly. However, we want to expose the suspend API of worker groups to users so that they can also terminate a worker group using the API.

In that case, it seems better to delegate responsibility for shutting down the Ray Pods to the Ray Autoscaler, since the Autoscaler can drain them before termination.

This issue proposes an API change:

suspendOptions:
  force: bool
  timeout: int # only effective when `force` is false
  • Case 1: force: true

    • When force is set to true, KubeRay kills the Pods directly, and the Ray Autoscaler simply avoids scaling the worker group up or down. RayJob should fall into this category.
    • The case where the Ray Autoscaler doesn’t handle suspend is a bit tricky, because the Autoscaler will still try to update the RayCluster CR after suspend is set to true. The RayCluster controller should not create any Ray Pod for the worker group while suspend is true.
  • Case 2: force: false, timeout is not set.

    • Case 2-1: Autoscaler is enabled
      • Autoscaler versions that handle suspend: the Autoscaler drains the Ray Pods and then scales them down.
      • Old Ray versions: Undefined behavior. Users should avoid it.
    • Case 2-2: Autoscaler is not enabled
      • This is not supported. RayCluster spec validation should detect it and create a Kubernetes event.
  • Case 3: force: false, timeout is set.

    • Case 3-1: Autoscaler is enabled
      • Autoscaler versions that handle suspend: the Autoscaler drains the Ray Pods and then scales them down. If the Autoscaler doesn't scale down within the timeout, KubeRay deletes the Pods.
      • Old Ray versions: Undefined behavior. Users should avoid it.
    • Case 3-2: Autoscaler is not enabled
      • (1) This is not supported. RayCluster spec validation should detect it and create a Kubernetes event.
      • (2) KubeRay deletes these Pods after timeout.
      • Both (1) and (2) are fine for me.
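For concreteness, here is a rough sketch of how the proposed fields might sit on a RayCluster worker group spec. Note that suspendOptions is only the proposal from this issue (not an implemented API), the timeout unit is assumed to be seconds, and the group name, sizes, version, and image tag are illustrative:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-sample
spec:
  rayVersion: "2.41.0"            # illustrative; could also be used to gate the feature
  workerGroupSpecs:
    - groupName: small-group
      replicas: 2
      minReplicas: 0
      maxReplicas: 5
      suspend: true               # field added by #2663
      suspendOptions:             # proposed in this issue, not implemented yet
        force: false
        timeout: 300              # assumed seconds; only effective when force is false
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.41.0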

Another option is to rely on rayVersion to check whether the Ray Autoscaler supports suspend, and to treat it as a CR spec validation error if suspend is set to true while the Ray Autoscaler doesn't support it. Maybe this is easier for us, as we won't need to maintain behavior that is hard to define.

Use case

No response

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
kevin85421 added the enhancement and triage labels Jan 13, 2025
rueian commented Jan 13, 2025

I guess we just need an optional gracePeriod field.

The idea is that KubeRay will kill the suspended group after the grace period anyway if the Ray Autoscaler is enabled. That way, we can still stop the suspended group even with old Ray versions that don't read the suspend field from the CR.

If the Ray Autoscaler is not enabled, then it is KubeRay’s responsibility to kill the suspended group immediately.
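A minimal sketch of that shape, assuming the field sits under the suspendOptions block proposed above and that the unit is seconds (neither is an implemented API yet):

suspend: true
suspendOptions:
  gracePeriod: 300   # assumed seconds; KubeRay deletes the group's Pods after this delay regardless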

rueian commented Jan 13, 2025

@kevin85421

After reconsidering all the above proposals, I am afraid we must forbid suspend on worker groups for old Ray versions with the Ray Autoscaler enabled, because even with the timeout or gracePeriod above, old Ray Autoscalers can't know that groups have been suspended and will keep making wrong scaling decisions.

kevin85421 (Member Author) commented

@rueian Oops, I was interrupted by other tasks while writing the issue. I’ll finish it and ping you for feedback later.

kevin85421 (Member Author) commented

cc @rueian @andrewsykim: I need some input on this proposal. We can also discuss it in tomorrow's community sync.

  1. Should we support worker group suspend with a Ray Autoscaler that doesn't handle suspend? If not, we can use rayVersion to reject the combination.

  2. Which option in "Case 3-2: Autoscaler is not enabled" do you prefer?

rueian commented Jan 13, 2025

2. Which option in "Case 3-2: Autoscaler is not enabled" do you prefer?

I prefer (1), because the KubeRay operator should kill the group immediately after it is marked with suspend=true when the Autoscaler is not enabled.

rueian commented Jan 15, 2025

To summarize all the details related to the ongoing feature for suspending a specific worker group:

For KubeRay 1.3, we will introduce the suspend field on the worker group spec. This is done in #2663:

suspend: true

When a user sets suspend: true on a worker group of an Autoscaler-disabled RayCluster, KubeRay will immediately delete all the Pods in the worker group to suspend it.
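As a sketch, on an Autoscaler-disabled RayCluster this could look like the following (the group name and sizes are illustrative):

spec:
  enableInTreeAutoscaling: false
  workerGroupSpecs:
    - groupName: cpu-group
      replicas: 3
      minReplicas: 0
      maxReplicas: 5
      suspend: true   # KubeRay deletes all Pods in this worker group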

However, the field will have NO effect on an Autoscaler-enabled RayCluster. That is because the Ray Autoscaler will malfunction if KubeRay kills the worker group, for example by:

  • Always trying to scale the suspended worker group back up to satisfy its min_workers requirement.
  • Always trying to scale the suspended worker group up for currently queued Ray tasks.

This is being handled in #2748.

To avoid this malfunction, we have a patch for Ray's Autoscaler that makes it aware of suspend: true on the RayCluster CR (ray-project/ray#49768). Once the patch has been merged and released, we can allow users to suspend worker groups on future Autoscaler-enabled Ray versions by checking the rayVersion on the RayCluster CR.

But for now, in #2748, setting suspend: true on a worker group of an Autoscaler-enabled RayCluster is considered invalid, and KubeRay will produce an InvalidRayClusterSpec warning event. We will also add the check to the validation webhook to reject the invalid CR as early as possible.
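For illustration, the combination that would currently be rejected looks roughly like this (assuming the in-tree Autoscaler is enabled via enableInTreeAutoscaling; the group name is illustrative):

spec:
  enableInTreeAutoscaling: true
  workerGroupSpecs:
    - groupName: gpu-group
      suspend: true   # invalid until the Ray-side patch is released; triggers InvalidRayClusterSpec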

For KubeRay 1.4, besides allowing worker group suspension on future Ray versions with the Autoscaler, we plan to extend the API to:

suspend: true
suspendOptions:
  gracePeriod: 10

The new gracePeriod allows users to specify a delay before the worker group is deleted. This should work for both Autoscaler-disabled and Autoscaler-enabled RayClusters.
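Put together, a rough sketch of a worker group using the planned 1.4 fields (the gracePeriod unit is not specified above, so seconds is an assumption here, and the group name and sizes are illustrative):

workerGroupSpecs:
  - groupName: small-group
    replicas: 2
    minReplicas: 0
    maxReplicas: 5
    suspend: true
    suspendOptions:
      gracePeriod: 10   # planned for KubeRay 1.4; assumed seconds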

kevin85421 commented Jan 17, 2025

  1. Add a feature gate for suspend. Add comments saying that this is not user-facing for now.
  2. Add validation to avoid using the Autoscaler with suspend in the RayJob / RayCluster controllers for now.
  3. Update the Ray Autoscaler.
