Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RayCluster][Feature] add GcsFaultToleranceOptions to the RayCluster CRD #2712

Closed
wants to merge 9 commits into from

Conversation

rueian
Copy link
Contributor

@rueian rueian commented Jan 3, 2025

Why are these changes needed?

Currently, we need users to set the ray.io/ft-enabled annotation, their redis address in envs and their redis password in both envs and RayStartParams in their RayCluster CR to enable GCS FT. Too many places to set up.

This PR adds GcsFaultToleranceOptions to the RayCluster CRD, unifying all the GCS FT settings in one field to provide a better user experience and all required annotations, envs, and ray start parameters will be set according to this new GcsFaultToleranceOptions.

With the new GcsFaultToleranceOptions, a GCS FT enabled RayCluster CR will look like this:

---
apiVersion: v1
kind: Secret
metadata:
  name: redis-password-secret
type: Opaque
data:
  # echo -n "5241590000000000" | base64
  password: NTI0MTU5MDAwMDAwMDAwMA==
---
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-external-redis
spec:
  rayVersion: '2.9.0'
  gcsFaultToleranceOptions:
    redisAddress: redis:6379
    redisPassword:
      valueFrom:
        secretKeyRef:
          name: redis-password-secret
          key: password
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0-aarch64
            resources:
              limits:
                cpu: "1"
              requests:
                cpu: "1"
            ports:
              - containerPort: 6379
                name: redis
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client

Related issue number

Resolves #2695

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@rueian rueian changed the title Gcs ft crd [RayCluster][Feature] add GcsFaultToleranceOptions to the RayCluster CRD Jan 3, 2025
@@ -96,6 +96,9 @@ func initTemplateAnnotations(instance rayv1.RayCluster, podTemplate *corev1.PodT
podTemplate.Annotations[utils.RayExternalStorageNSAnnotationKey] = v
}
}
if options := instance.Spec.GcsFaultToleranceOptions; options != nil && options.ExternalStorageNamespace != "" {
podTemplate.Annotations[utils.RayExternalStorageNSAnnotationKey] = options.ExternalStorageNamespace
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Override the ray.io/external-storage-namespace annotation if the new ExternalStorageNamespace is specified. We rely on this annotation to set the RAY_external_storage_namespace env in the later setContainerEnvVars function.

@@ -322,7 +325,7 @@ func initLivenessAndReadinessProbe(rayContainer *corev1.Container, rayNodeType r
}

// BuildPod a pod config
func BuildPod(ctx context.Context, podTemplateSpec corev1.PodTemplateSpec, rayNodeType rayv1.RayNodeType, rayStartParams map[string]string, headPort string, enableRayAutoscaler *bool, creatorCRDType utils.CRDType, fqdnRayIP string) (aPod corev1.Pod) {
func BuildPod(ctx context.Context, podTemplateSpec corev1.PodTemplateSpec, rayNodeType rayv1.RayNodeType, gcsOptions *rayv1.GcsFaultToleranceOptions, rayStartParams map[string]string, headPort string, enableRayAutoscaler *bool, creatorCRDType utils.CRDType, fqdnRayIP string) (aPod corev1.Pod) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just add a new gcsOptions parameter and pass it to the later setContainerEnvVars function.

Comment on lines +371 to +378
if rayNodeType == rayv1.HeadNode && gcsOptions != nil {
if gcsOptions.RedisUsername != nil {
rayStartParams["redis-username"] = "$REDIS_USERNAME"
}
if gcsOptions.RedisPassword != nil {
rayStartParams["redis-password"] = "$REDIS_PASSWORD"
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the new gcsOptions is specified with redis username and password, we add --redis-username and --redis-password flags to the ray start command.

The $REDIS_USERNAME and $REDIS_PASSWORD will be replaced with env vars by the shell.

Comment on lines +669 to +674
if !utils.EnvVarExists(utils.REDIS_USERNAME, container.Env) {
// setting the REDIS_USERNAME env var from the params
if value, ok := rayStartParams["redis-username"]; ok {
container.Env = append(container.Env, corev1.EnvVar{Name: utils.REDIS_USERNAME, Value: value})
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although the support for Redis usernames in the Ray core has been merged but hasn't been released yet, I think we can add related code in KubeRay 1.3.

Comment on lines +1168 to +1171
if utils.EnvVarExists(utils.REDIS_USERNAME, pod.Spec.Containers[utils.RayContainerIndex].Env) {
pod.Spec.Containers[utils.RayContainerIndex].Args[0] += "sys.exit(1) if not cleanup_redis_storage(host=parsed.hostname, port=parsed.port, username=os.getenv('REDIS_USERNAME', parsed.username), password=os.getenv('REDIS_PASSWORD', parsed.password), use_ssl=parsed.scheme=='rediss', storage_namespace=os.getenv('RAY_external_storage_namespace')) else None\""
} else {
pod.Spec.Containers[utils.RayContainerIndex].Args[0] += "sys.exit(1) if not cleanup_redis_storage(host=parsed.hostname, port=parsed.port, password=os.getenv('REDIS_PASSWORD', parsed.password), use_ssl=parsed.scheme=='rediss', storage_namespace=os.getenv('RAY_external_storage_namespace')) else None\""
Copy link
Contributor Author

@rueian rueian Jan 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We add the username=os.getenv('REDIS_USERNAME', parsed.username) positional argument only when the REDIS_USERNAME env is present because the cleanup_redis_storage function in old Ray doesn't support the argument.

@rueian rueian marked this pull request as ready for review January 4, 2025 18:58
@kevin85421
Copy link
Member

Would you mind separating this into several smaller PRs? Perhaps open a PR to add the fields in the CRD as the first step.

@rueian
Copy link
Contributor Author

rueian commented Jan 5, 2025

Would you mind separating this into several smaller PRs? Perhaps open a PR to add the fields in the CRD as the first step.

@kevin85421, no problem! Here is the first step which only contains the addition to the CRD. #2715

@rueian rueian mentioned this pull request Jan 7, 2025
2 tasks
@rueian
Copy link
Contributor Author

rueian commented Jan 8, 2025

This will be split into even smaller PRs.

@rueian rueian closed this Jan 8, 2025
@Moonquakes
Copy link

Hi @rueian, I had previously inquired whether turning on GcsFaultTolerance was effective for Ray Cluster, i.e. whether the data corresponding to the Ray Dashboard could be recovered after the head node reboot, and the answer I got was no.

I would like to ask if this feature is designed to support this?

@rueian
Copy link
Contributor Author

rueian commented Jan 9, 2025

Hi @Moonquakes, The answer is still no. Ray only stores its cluster metadata in the Global Control Service (GCS); other data, such as metrics and logs, are not stored there. You will need to prepare additional external log storage (e.g., NFS) and metrics storage (e.g., Prometheus) for them if you want them to be recovered from failures.

@Moonquakes
Copy link

@rueian Thank you for your answer! However, GCS should also store task-related information. I don’t need to obtain logs and indicators. I just want to restore the original ray job and task information.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[GCS FT] Move GCS FT API from annotations to RayCluster CRD
3 participants