
Decouple Ray Resources: Construct ray k8spods from Resources #2943

Open · wants to merge 14 commits into master

Conversation

fiedlerNr9 (Contributor)

Tracking issue

Related to flyteorg/flyte#5666

Why are the changes needed?

These changes update the flytekit ray plugin to let the user specify Resources for the Ray head and worker nodes instead of specifying the whole K8sPod object.

What changes were proposed in this pull request?

  • removing the exposed K8sPod object
  • exposing requests (Resources) and limits (Resources) for WorkerNodeConfig & HeadNodeConfig
  • adding construct_k8s_pod_spec_from_resources() to construct a K8sPod pod definition from Resources

How was this patch tested?

  • adjusted unit tests
  • Running this example:
import typing

import ray
from flytekit import ImageSpec, Resources, task, workflow
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig

flytekit_hash = "2778db206bbea478908c4e529dcb63cd438b6065"
flytekitplugins_ray = f"git+https://github.com/flyteorg/flytekit.git@{flytekit_hash}#subdirectory=plugins/flytekit-ray"
new_flytekit = f"git+https://github.com/flyteorg/flytekit.git@{flytekit_hash}"

container_image = ImageSpec(
    name="ray-union-demo",
    python_version="3.11.9",
    apt_packages=["wget", "gdb", "git"],
    packages=[
        new_flytekit,
        flytekitplugins_ray,
        "kubernetes",
    ],
    registry="ghcr.io/fiedlerNr9",
)
ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(
        ray_start_params={"num-cpus": "0", "log-color": "true"},
        requests=Resources(cpu="1", mem="3Gi"),
    ),
    worker_node_config=[
        WorkerNodeConfig(
            group_name="ray-group",
            replicas=0,
            min_replicas=0,
            max_replicas=2,
        )
    ],
    shutdown_after_job_finishes=True,
    ttl_seconds_after_finished=120,
    enable_autoscaling=True,
)


@ray.remote
def f(x):
    return x * x


@task(
    task_config=ray_config,
    requests=Resources(mem="2Gi", cpu="3000m"),
    container_image=container_image,
)
def ray_task(n: int) -> typing.List[int]:
    futures = [f.remote(i) for i in range(n)]
    return ray.get(futures)


@workflow
def wf(n: int = 50):
    ray_task(n=n)

Ray Head node Resources

    Limits:
      cpu:     1
      memory:  3Gi
    Requests:
      cpu:     1
      memory:  3Gi

Ray controller Resources

    Limits:
      cpu:     2
      memory:  3Gi
    Requests:
      cpu:     2
      memory:  3Gi

Ray worker Resources

    Limits:
      cpu:     2
      memory:  3Gi
    Requests:
      cpu:     2
      memory:  3Gi

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

@fiedlerNr9 fiedlerNr9 changed the title Construct ray k8spods Decouple Ray Resources: Construct ray k8spods from Resources Nov 20, 2024

codecov bot commented Nov 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.49%. Comparing base (faee3da) to head (7676af6).
Report is 2 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #2943       +/-   ##
===========================================
+ Coverage   79.32%   91.49%   +12.16%     
===========================================
  Files         199       92      -107     
  Lines       20870     3997    -16873     
  Branches     2684        0     -2684     
===========================================
- Hits        16555     3657    -12898     
+ Misses       3566      340     -3226     
+ Partials      749        0      -749     


@eapolinario (Collaborator) left a comment:
Just a few comments, nothing major.

resources_map = {
    "cpu": "cpu",
    "mem": "memory",
    "gpu": "nvidia.com/gpu",
Collaborator:

can you expose this as a parameter? Set its default value to "nvidia.com/gpu"

fiedlerNr9 (Contributor, Author):

Just the gpu value of resource_map?

Resolved (outdated) threads on flytekit/core/resources.py and plugins/flytekit-ray/flytekitplugins/ray/models.py.
@eapolinario (Collaborator) left a comment:

Sorry for flip-flopping on this, I only realized that we were removing the pod spec template. We should instead build helper functions around that, but still produce pod specs that get passed to the ray idl objects.

    requests: Optional[Resources],
    limits: Optional[Resources],
) -> dict[str, Any]:
def _construct_k8s_pods_resources(resources: Optional[Resources], k8s_gpu_resource_key: str = "nvidia.com/gpu"):
Collaborator:

Using other gpus is going to be hard, even if we push this parameter to the outer function (i.e. construct_k8s_pod_spec_from_resources).

):
    self._group_name = group_name
    self._replicas = replicas
    self._max_replicas = max(replicas, max_replicas) if max_replicas is not None else replicas
    self._min_replicas = min(replicas, min_replicas) if min_replicas is not None else replicas
    self._ray_start_params = ray_start_params
    self._k8s_pod = k8s_pod
Collaborator:

We should keep this as part of the interface and build helper functions that construct valid pod specs instead (as mentioned in the original flyte PR). This is going to help with the other problem we're having with passing the gpu resource name around (in other words, gpu can be an argument of one of the helper functions that build pod specs).

fiedlerNr9 (Contributor, Author):

I get what you are saying. So we want users to construct the pod specs themselves, e.g. by calling construct_k8s_pod_spec_from_resources() or by specifying pod templates in user code?

Contributor:

I would keep the method name simple, maybe pod_from_resources.

Comment on lines +116 to +117
    requests: typing.Optional[Resources] = None,
    limits: typing.Optional[Resources] = None,
Collaborator:

ditto.

@@ -14,14 +15,22 @@ def __init__(
    min_replicas: typing.Optional[int] = None,
    max_replicas: typing.Optional[int] = None,
    ray_start_params: typing.Optional[typing.Dict[str, str]] = None,
    k8s_pod: typing.Optional[K8sPod] = None,
Member:

Yup, we should keep it. If someone specifies both k8s_pod and requests, then we should merge them.
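
A sketch of one way such a merge could behave, assuming the pod spec is represented as a plain dict of containers (the real K8sPod model differs; the function name here is hypothetical):

```python
from typing import Any, Dict, Optional


def merge_resources_into_pod_spec(
    pod_spec: Dict[str, Any],
    requests: Optional[Dict[str, str]] = None,
    limits: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    # Overlay explicitly passed requests/limits onto every container in the
    # pod spec; on conflict, the explicitly passed values win. The input
    # pod_spec is not mutated.
    merged = {**pod_spec, "containers": []}
    for container in pod_spec.get("containers", []):
        resources = dict(container.get("resources", {}))
        if requests:
            resources["requests"] = {**resources.get("requests", {}), **requests}
        if limits:
            resources["limits"] = {**resources.get("limits", {}), **limits}
        merged["containers"].append({**container, "resources": resources})
    return merged
```

Letting the explicit requests/limits win keeps the behavior predictable when a k8s_pod template and per-node Resources are both supplied.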

fiedlerNr9 (Contributor, Author):

Talked with Eduardo about this today and agreed to only expose k8s_pod and let the user construct the k8s_pod via helper functions such as construct_k8s_pod_spec_from_resources() from this PR.
