-
Notifications
You must be signed in to change notification settings - Fork 312
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add validators to check head node instance type and shared storage type w.r.t cluster size #6623
base: develop
Are you sure you want to change the base?
Conversation
f"Head node instance type {head_node_instance_type} has {head_node_memory} GB of memory. " | ||
f"Please choose a head node instance type with at least {required_memory} GB of memory" | ||
f" to manage {total_max_compute_nodes} compute nodes.", | ||
FailureLevel.ERROR, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why decide to use FailureLevel.Error here but not a Warning? I think customers have the right to decide what instance type they pay for, but we have a responsibility to warn them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the requirement is very relaxed.
For example, the validator allows t3.micro to manage 10 nodes, t3.medium to manage 85 nodes, t3.xlarge to manage infinite nodes. If the requirement is violated, it is very likely the scaling up would fail.
Also, the customer could suppress the validation anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Convinced me, I agree.
…pe w.r.t cluster size The requirements set in these validators are minimum. Users should leave more safety margin considering their workloads. Signed-off-by: Hanwen <[email protected]>
e8d4b13
to
ec5c7db
Compare
@@ -1311,6 +1311,41 @@ def _validate(self, imds_secured: bool, scheduler: str): | |||
) | |||
|
|||
|
|||
class HeadNodeInstanceTypeValidator(Validator): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be more specific that just instance type
and mention memory in the name so folks know what its checking
) | ||
|
||
|
||
class SharedEbsClusterSizeValidator(Validator): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same naming comment as above, we should be able to understand what it does by it's name
if total_max_compute_nodes > 100: | ||
self._add_failure( | ||
"EBS shared storage is mounted on the head node and shared to the compute nodes. " | ||
"This is a performance bottle neck if the compute nodes rely on this shared storage. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear why this is a performance bottleneck. Performance of what components?
The requirements set in these validators are minimum. Users should leave more safety margin considering their workloads.
Tests
Please review the guidelines for contributing and Pull Request Instructions.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.