
Add validators to check head node instance type and shared storage type w.r.t cluster size #6623

Open
hanwen-cluster wants to merge 1 commit into develop

Conversation

hanwen-cluster

The requirements set in these validators are minimums; users should leave additional safety margin depending on their workloads.

Tests

  • Manual testing with cluster configurations that do and do not trigger the validators was successful

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@hanwen-cluster hanwen-cluster requested review from a team as code owners December 13, 2024 17:10
@hanwen-cluster hanwen-cluster added the skip-changelog-update label (disables the check that enforces changelog updates in PRs) on Dec 13, 2024
f"Head node instance type {head_node_instance_type} has {head_node_memory} GB of memory. "
f"Please choose a head node instance type with at least {required_memory} GB of memory"
f" to manage {total_max_compute_nodes} compute nodes.",
FailureLevel.ERROR,
Contributor

Why use FailureLevel.ERROR here rather than a warning? I think customers have the right to decide what instance type they pay for, but we have a responsibility to warn them.

Author

Because the requirement is very relaxed.
For example, the validator allows t3.micro to manage 10 nodes, t3.medium to manage 85 nodes, and t3.xlarge to manage an unlimited number of nodes. If the requirement is violated, it is very likely that scaling up would fail.
Also, the customer can suppress the validation anyway.
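
For illustration, the kind of check being discussed might look like the sketch below. This is not the code from the diff: the base and per-node memory constants are back-derived from the t3 examples above (roughly 1 GB for 10 nodes, 4 GB for 85 nodes) and are assumptions, not values taken from the PR.

# Illustrative sketch only; constants are inferred from the discussion above,
# not copied from the PR diff.
ASSUMED_BASE_MEMORY_GB = 0.6       # assumed fixed overhead for OS and head node daemons
ASSUMED_MEMORY_PER_NODE_GB = 0.04  # assumed additional overhead per compute node

def required_head_node_memory_gb(total_max_compute_nodes: int) -> float:
    """Minimum head node memory (GB) assumed for a given maximum cluster size."""
    return ASSUMED_BASE_MEMORY_GB + ASSUMED_MEMORY_PER_NODE_GB * total_max_compute_nodes

def head_node_memory_is_sufficient(head_node_memory_gb: float, total_max_compute_nodes: int) -> bool:
    """Return True if the head node memory meets the assumed minimum for the cluster size."""
    return head_node_memory_gb >= required_head_node_memory_gb(total_max_compute_nodes)

If the check does fail, users who still want a smaller head node should be able to suppress it, for example via the pcluster CLI's --suppress-validators option (e.g. --suppress-validators type:HeadNodeInstanceTypeValidator, assuming the validator keeps that name).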

Contributor

That convinces me; I agree.

@hanwen-cluster hanwen-cluster changed the base branch from release-3.12 to develop December 13, 2024 17:23
…pe w.r.t cluster size

The requirements set in these validators are minimum. Users should leave more safety margin considering their workloads.

Signed-off-by: Hanwen <[email protected]>
@@ -1311,6 +1311,41 @@ def _validate(self, imds_secured: bool, scheduler: str):
)


class HeadNodeInstanceTypeValidator(Validator):
Contributor

This should probably be more specific than just the instance type and mention memory in the name, so folks know what it's checking.
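
For example (the name below is purely illustrative and not proposed in the PR), something along these lines:

class HeadNodeMemoryClusterSizeValidator(Validator):
    """Validate that the head node has enough memory to manage the configured maximum number of compute nodes."""

would make the purpose of the check clear from the name alone.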

)


class SharedEbsClusterSizeValidator(Validator):
Contributor

Same naming comment as above: we should be able to understand what it does from its name.

if total_max_compute_nodes > 100:
    self._add_failure(
        "EBS shared storage is mounted on the head node and shared to the compute nodes. "
        "This is a performance bottleneck if the compute nodes rely on this shared storage. "
Contributor

It's not clear why this is a performance bottleneck. Performance of what components?
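
For context, a shared EBS volume in ParallelCluster is attached to the head node and exported to the compute nodes over NFS, so the head node's NFS server and network bandwidth are what become the bottleneck as the fleet grows. Based on the fragment above, the validator presumably looks something like the sketch below; the message continuation, the _validate signature, and the failure level are assumptions, not the actual PR code.

class SharedEbsClusterSizeValidator(Validator):
    """Sketch: flag shared EBS storage when it backs a large compute fleet."""

    def _validate(self, total_max_compute_nodes: int):  # signature assumed
        if total_max_compute_nodes > 100:
            self._add_failure(
                "EBS shared storage is mounted on the head node and shared to the compute nodes. "
                "This is a performance bottleneck if the compute nodes rely on this shared storage. "
                "Please consider a dedicated shared storage service such as FSx for Lustre or EFS "
                "for clusters of this size.",  # assumed continuation of the message
                FailureLevel.WARNING,  # assumed; the PR may use a different level
            )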
