[indexer_autoscaler_lambda] pause indexing if the available storage drops below a threshold #14
Conversation
force-pushed from d6a04f1 to 303f029
force-pushed from f703aff to 48f6a76
force-pushed from 48f6a76 to e446554
force-pushed from 303f029 to e032ddc
Resolved review threads (several now outdated) on:
elasticgraph-indexer_autoscaler_lambda/elasticgraph-indexer_autoscaler_lambda.gemspec
...-indexer_autoscaler_lambda/lib/elastic_graph/indexer_autoscaler_lambda/concurrency_scaler.rb (4 threads)
...raph-indexer_autoscaler_lambda/lib/elastic_graph/indexer_autoscaler_lambda/details_logger.rb
elasticgraph-indexer_autoscaler_lambda/sig/elastic_graph/cloudwatch_client.rbs
...toscaler_lambda/spec/unit/elastic_graph/indexer_autoscaler_lambda/concurrency_scaler_spec.rb
force-pushed from 3ee57c3 to 4f002b7
Almost ready to merge!
@@ -30,32 +30,43 @@ def initialize(
      }
    end

-   def log_increase(cpu_utilization:, current_concurrency:, new_concurrency:)
+   def log_increase(cpu_utilization:, min_free_storage_in_mb:, current_concurrency:, new_concurrency:)
I still think that `min` is overloaded (since it can refer either to the threshold at which we pause indexing or to the value we compare against that threshold), and it's not very clear which this is. Can you include both (they both seem useful to log) and name them appropriately, similar to the renaming you did in the lambda logic? e.g. `lowest_node_free_storage_in_mb` and `required_free_storage_in_mb`.
(And please apply that throughout this file.)
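For illustration, logging both values side by side could look roughly like this. This is a hedged sketch only: it uses a plain `Logger` rather than the actual `DetailsLogger` API from this PR, and the `"ConcurrencyIncreased"` message type is an assumption.

```ruby
require "json"
require "logger"

# Hypothetical sketch, not the DetailsLogger implementation from this PR:
# it just shows both storage values being logged under unambiguous names.
LOGGER = Logger.new($stdout)

def log_increase(cpu_utilization:, lowest_node_free_storage_in_mb:,
                 required_free_storage_in_mb:, current_concurrency:, new_concurrency:)
  LOGGER.info(JSON.generate(
    "message_type" => "ConcurrencyIncreased", # assumed message type
    "cpu_utilization" => cpu_utilization,
    "lowest_node_free_storage_in_mb" => lowest_node_free_storage_in_mb,
    "required_free_storage_in_mb" => required_free_storage_in_mb,
    "current_concurrency" => current_concurrency,
    "new_concurrency" => new_concurrency
  ))
end

log_increase(
  cpu_utilization: 42.5,
  lowest_node_free_storage_in_mb: 15_000,
  required_free_storage_in_mb: 10_000,
  current_concurrency: 8,
  new_concurrency: 16
)
```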
@@ -26,7 +26,9 @@ def handle_request(event:, context:)
      min_cpu_target: event.fetch("min_cpu_target"),
      max_cpu_target: event.fetch("max_cpu_target"),
      maximum_concurrency: event.fetch("maximum_concurrency"),
      indexer_function_name: event.fetch("indexer_function_name")
+     required_free_storage_in_mb: event.fetch("min_free_storage_in_mb"),
/nit it feels inconsistent to call it `required_free_storage_in_mb` internally but call it `min_free_storage_in_mb` in the event payload. Can you call it `required_free_storage_in_mb` in both spots? Every other `event` key gets passed through as-is instead of getting renamed...
ah yeah good point, that should match to be consistent
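For illustration, the suggested rename means the event payload key and the internal keyword argument share one name, so nothing gets renamed in transit. This is a sketch with made-up values; the real event may carry other keys.

```ruby
# Hypothetical event payload after the rename; all values are illustrative.
event = {
  "min_cpu_target" => 70,
  "max_cpu_target" => 80,
  "maximum_concurrency" => 1000,
  "indexer_function_name" => "my-indexer-lambda",   # assumed function name
  "required_free_storage_in_mb" => 10_000           # assumed threshold
}

# The key now passes straight through without being renamed.
required_free_storage_in_mb = event.fetch("required_free_storage_in_mb")
puts required_free_storage_in_mb # => 10000
```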
force-pushed from 4f002b7 to 8451f65
👍
Free storage declines much faster during a backfill than during live indexing, and if a node runs out of space it will crash and lose data unless the problem is caught in time. This adds an extra layer of protection during backfills: if the node with the least free storage drops below a client-set threshold, concurrency is set to the minimum, which pauses indexing.
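As a rough sketch of that rule (the method, variable, and constant names here are illustrative, not necessarily those used in `concurrency_scaler.rb`):

```ruby
# Hypothetical sketch of the pause rule: if the most storage-constrained node
# is below the required threshold, drop to the concurrency floor.
MINIMUM_CONCURRENCY = 2 # SQS event source mappings cannot scale below 2

def target_concurrency(desired_concurrency, lowest_node_free_storage_in_mb:, required_free_storage_in_mb:)
  if lowest_node_free_storage_in_mb < required_free_storage_in_mb
    MINIMUM_CONCURRENCY # effectively pauses indexing until storage recovers
  else
    desired_concurrency
  end
end

target_concurrency(500, lowest_node_free_storage_in_mb: 8_000, required_free_storage_in_mb: 10_000)
# => 2
```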
This approach retrieves the free storage metric via the CloudWatch client rather than directly from Elasticsearch. When I compared the Elasticsearch disk space metrics to CloudWatch, Elasticsearch reported roughly 10,000 MB more free space, a gap large enough to prefer CloudWatch. CloudWatch metrics can be delayed by ~10-15 minutes, but that is acceptable for a backfill. A value fetched directly from OpenSearch also wouldn't be a minimum over time, so it could be prone to spikes.
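For example, the minimum free storage across nodes could be pulled from CloudWatch along these lines. This is a hedged sketch, not necessarily the exact query this gem issues; the domain name and client (account) id are placeholders.

```ruby
require "aws-sdk-cloudwatch"

# Sketch: "AWS/ES" / "FreeStorageSpace" is the per-node metric the managed
# OpenSearch/Elasticsearch service publishes (in megabytes); the "Minimum"
# statistic reflects the node with the least free storage.
cloudwatch = Aws::CloudWatch::Client.new

response = cloudwatch.get_metric_statistics(
  namespace: "AWS/ES",
  metric_name: "FreeStorageSpace",
  dimensions: [
    {name: "DomainName", value: "my-opensearch-domain"}, # placeholder domain
    {name: "ClientId", value: "123456789012"}            # placeholder account id
  ],
  start_time: Time.now - 900, # look back 15 minutes to tolerate CloudWatch delay
  end_time: Time.now,
  period: 300,
  statistics: ["Minimum"]
)

lowest_node_free_storage_in_mb = response.datapoints.map(&:minimum).min
```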
How tested?