
Decouple Ray Resources: Construct ray k8spods from Resources #2943

Open · wants to merge 14 commits into master

Conversation

fiedlerNr9 (Contributor)

Tracking issue

Related to flyteorg/flyte#5666

Why are the changes needed?

These changes update the flytekit ray plugin to let the user specify Resources for the Ray head and worker nodes instead of specifying the whole K8sPod object.

What changes were proposed in this pull request?

  • removing the exposed K8sPod object
  • exposing requests (Resources) and limits (Resources) for WorkerNodeConfig & HeadNodeConfig
  • adding construct_k8s_pod_spec_from_resources() to construct a K8sPod pod definition from Resources

How was this patch tested?

  • adjusted unit tests
  • Running this example:
import typing

import ray
from flytekit import ImageSpec, Resources, task, workflow
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig

flytekit_hash = "2778db206bbea478908c4e529dcb63cd438b6065"
flytekitplugins_ray = f"git+https://github.com/flyteorg/flytekit.git@{flytekit_hash}#subdirectory=plugins/flytekit-ray"
new_flytekit = f"git+https://github.com/flyteorg/flytekit.git@{flytekit_hash}"

container_image = ImageSpec(
    name="ray-union-demo",
    python_version="3.11.9",
    apt_packages=["wget", "gdb", "git"],
    packages=[
        new_flytekit,
        flytekitplugins_ray,
        "kubernetes",
    ],
    registry="ghcr.io/fiedlerNr9",
)
ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(
        ray_start_params={"num-cpus": "0", "log-color": "true"},
        requests=Resources(cpu="1", mem="3Gi"),
    ),
    worker_node_config=[
        WorkerNodeConfig(
            group_name="ray-group",
            replicas=0,
            min_replicas=0,
            max_replicas=2,
        )
    ],
    shutdown_after_job_finishes=True,
    ttl_seconds_after_finished=120,
    enable_autoscaling=True,
)


@ray.remote
def f(x):
    return x * x


@task(
    task_config=ray_config,
    requests=Resources(mem="2Gi", cpu="3000m"),
    container_image=container_image,
)
def ray_task(n: int) -> typing.List[int]:
    futures = [f.remote(i) for i in range(n)]
    return ray.get(futures)


@workflow
def wf(n: int = 50):
    ray_task(n=n)

Ray Head node Resources

    Limits:
      cpu:     1
      memory:  3Gi
    Requests:
      cpu:     1
      memory:  3Gi

Ray controller Resources

    Limits:
      cpu:     2
      memory:  3Gi
    Requests:
      cpu:     2
      memory:  3Gi

Ray worker Resources

    Limits:
      cpu:     2
      memory:  3Gi
    Requests:
      cpu:     2
      memory:  3Gi

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

@fiedlerNr9 fiedlerNr9 changed the title Construct ray k8spods Decouple Ray Resources: Construct ray k8spods from Resources Nov 20, 2024

codecov bot commented Nov 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.49%. Comparing base (faee3da) to head (7676af6).
Report is 2 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #2943       +/-   ##
===========================================
+ Coverage   79.32%   91.49%   +12.16%     
===========================================
  Files         199       92      -107     
  Lines       20870     3997    -16873     
  Branches     2684        0     -2684     
===========================================
- Hits        16555     3657    -12898     
+ Misses       3566      340     -3226     
+ Partials      749        0      -749     


@eapolinario (Collaborator) left a comment:
Just a few comments, nothing major.

resources_map = {
    "cpu": "cpu",
    "mem": "memory",
    "gpu": "nvidia.com/gpu",
Collaborator:

can you expose this as a parameter? Set its default value to "nvidia.com/gpu"

fiedlerNr9 (Contributor, Author):

Just the gpu value of resource_map?

Resolved (outdated) threads on flytekit/core/resources.py and plugins/flytekit-ray/flytekitplugins/ray/models.py.
@eapolinario (Collaborator) left a comment:

Sorry for flip-flopping on this, I only realized that we were removing the pod spec template. We should instead build helper functions around that, but still produce pod specs that get passed to the ray idl objects.

    requests: Optional[Resources],
    limits: Optional[Resources],
) -> dict[str, Any]:
def _construct_k8s_pods_resources(resources: Optional[Resources], k8s_gpu_resource_key: str = "nvidia.com/gpu"):
Collaborator:

Using other gpus is going to be hard, even if we push this parameter to the outer function (i.e. construct_k8s_pod_spec_from_resources).

):
    self._group_name = group_name
    self._replicas = replicas
    self._max_replicas = max(replicas, max_replicas) if max_replicas is not None else replicas
    self._min_replicas = min(replicas, min_replicas) if min_replicas is not None else replicas
    self._ray_start_params = ray_start_params
    self._k8s_pod = k8s_pod
Collaborator:

We should keep this as part of the interface and build helper functions that construct valid pod specs instead (as mentioned in the original flyte PR). This is going to help with the other problem we're having with passing the gpu resource name around (in other words, gpu can be an argument of one of the helper functions that build pod specs).

fiedlerNr9 (Contributor, Author):

I get what you are saying. So we want users to construct the pod specs themselves, e.g. by calling construct_k8s_pod_spec_from_resources() or by specifying pod templates in user code?

Contributor:

I would keep the method name simple, maybe pod_from_resources.

Comment on lines +116 to +117
    requests: typing.Optional[Resources] = None,
    limits: typing.Optional[Resources] = None,
Collaborator:

ditto.

@@ -14,14 +15,22 @@ def __init__(
    min_replicas: typing.Optional[int] = None,
    max_replicas: typing.Optional[int] = None,
    ray_start_params: typing.Optional[typing.Dict[str, str]] = None,
    k8s_pod: typing.Optional[K8sPod] = None,
Member:

Yup, we should keep it. If someone specifies both k8s_pod and requests, then we should merge them.
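
A sketch of one way such a merge could behave, assuming the pod spec is represented as a plain dict of containers (the real K8sPod model differs; the function name here is hypothetical):

```python
from typing import Any, Dict, Optional


def merge_resources_into_pod_spec(
    pod_spec: Dict[str, Any],
    requests: Optional[Dict[str, str]] = None,
    limits: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    # Overlay explicitly passed requests/limits onto every container in the
    # pod spec; on conflict, the explicitly passed values win. The input
    # pod_spec is not mutated.
    merged = {**pod_spec, "containers": []}
    for container in pod_spec.get("containers", []):
        resources = dict(container.get("resources", {}))
        if requests:
            resources["requests"] = {**resources.get("requests", {}), **requests}
        if limits:
            resources["limits"] = {**resources.get("limits", {}), **limits}
        merged["containers"].append({**container, "resources": resources})
    return merged
```

Letting the explicit requests/limits win keeps the behavior predictable when a k8s_pod template and per-node Resources are both supplied.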

fiedlerNr9 (Contributor, Author):

Talked with Eduardo about this today and agreed to only expose k8s_pod and let the user construct the k8s_pod via helper functions such as construct_k8s_pod_spec_from_resources() from this PR.
