-
Notifications
You must be signed in to change notification settings - Fork 312
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add validators to check head node instance type and shared storage type w.r.t cluster size #6623
base: develop
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1311,6 +1311,41 @@ def _validate(self, imds_secured: bool, scheduler: str): | |
) | ||
|
||
|
||
class HeadNodeInstanceTypeValidator(Validator): | ||
""" | ||
Head Node Instance Type Validator. | ||
Verify if the Head Node has enough memory to manage compute nodes. | ||
""" | ||
|
||
def _validate(self, head_node_instance_type: str, total_max_compute_nodes: int): | ||
head_node_memory = ( | ||
AWSApi.instance().ec2.get_instance_type_info(head_node_instance_type).ec2memory_size_in_mib() / 1024 | ||
) | ||
# Assume OS takes up 0.6GB memory. Only check upto 16GB memory to prevent usage of small instance types. | ||
required_memory = min(total_max_compute_nodes / 25 + 0.6, 16) | ||
if head_node_memory < required_memory: | ||
self._add_failure( | ||
f"Head node instance type {head_node_instance_type} has {head_node_memory} GB of memory. " | ||
f"Please choose a head node instance type with at least {required_memory} GB of memory" | ||
f" to manage {total_max_compute_nodes} compute nodes.", | ||
FailureLevel.ERROR, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why decide to use FailureLevel.Error here but not a Warning? I think customers have the right to decide what instance type they pay for, but we have a responsibility to warn them. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Because the requirement is very relaxed. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Convinced me, I agree. |
||
) | ||
|
||
|
||
class SharedEbsClusterSizeValidator(Validator): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Same naming comment as above, we should be able to understand what it does by it's name |
||
"""Warn potential performance bottleneck of using Shared EBS.""" | ||
|
||
def _validate(self, total_max_compute_nodes: int): | ||
if total_max_compute_nodes > 100: | ||
self._add_failure( | ||
"EBS shared storage is mounted on the head node and shared to the compute nodes. " | ||
"This is a performance bottle neck if the compute nodes rely on this shared storage. " | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's not clear why this is a performance bottleneck. Performance of what components? |
||
"Please use FSx and EFS for better performance.", | ||
FailureLevel.WARNING, | ||
) | ||
|
||
|
||
class ComputeResourceLaunchTemplateValidator(_LaunchTemplateValidator): | ||
"""Try to launch the requested instances (in dry-run mode) to verify configuration parameters.""" | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be more specific that just
instance type
and mention memory in the name so folks know what its checking