[indexer_autoscaler_lambda] pause indexing if the available storage drops below a threshold #14

akumar1214 · 2024-10-31T17:35:42Z

Free space declines much faster during a backfill than during live indexing. If a node runs out of space and this is not caught in time, the node will crash and lose data. This PR adds an extra layer of protection during backfills: if the node with the least free storage drops below a client-set threshold, the autoscaler sets the concurrency to the minimum to pause indexing.
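As a rough illustration of the guard described above (a sketch, not the code from this PR), the decision could look like the following. The method name `target_concurrency`, the `proposed_concurrency` parameter, and the `MINIMUM_CONCURRENCY` value are hypothetical; the two storage parameter names follow the naming suggested in review.

```ruby
# Hypothetical floor used for illustration; the real autoscaler defines its own minimum.
MINIMUM_CONCURRENCY = 2

# Returns the concurrency to apply. When the node with the least free
# storage is below the client-set threshold, indexing is paused by
# dropping to the minimum; otherwise the CPU-based proposal is kept.
def target_concurrency(lowest_node_free_storage_in_mb:, required_free_storage_in_mb:, proposed_concurrency:)
  if lowest_node_free_storage_in_mb < required_free_storage_in_mb
    MINIMUM_CONCURRENCY
  else
    proposed_concurrency
  end
end
```

Keeping the storage check separate from the CPU-based tuning means the pause takes priority regardless of what the CPU metrics suggest.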

This approach retrieves the free storage metrics through the CloudWatch client instead of directly from Elasticsearch. I compared the Elasticsearch disk space metrics to CloudWatch and they were ~10,000 MB higher, which seemed like a large enough difference to prefer CloudWatch. CloudWatch metrics can be delayed ~10-15 minutes, but that is acceptable for a backfill. Getting the metric directly from OpenSearch also wouldn't be a minimum over time and could be prone to spikes.
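For readers unfamiliar with this kind of lookup, here is a minimal sketch of querying CloudWatch for the lowest per-node free storage. It is not the code from this PR: the helper names, the 20-minute lookback, and the search expression are all assumptions chosen for illustration (the expression actually used by the autoscaler is not shown in this description). The only real API assumed is `get_metric_data`, as provided by `Aws::CloudWatch::Client`.

```ruby
# Assumed example expression (not taken from this PR): the minimum
# FreeStorageSpace across the per-node metrics of one domain.
def free_storage_expression(domain_name)
  search = %({AWS/ES,ClientId,DomainName,NodeId} MetricName="FreeStorageSpace" DomainName="#{domain_name}")
  "MIN(SEARCH('#{search}', 'Minimum', 300))"
end

# `cloudwatch_client` is any object that responds to `get_metric_data`
# the way Aws::CloudWatch::Client does. Returns the smallest datapoint
# in the lookback window, or nil when no datapoints are available yet.
def lowest_node_free_storage_in_mb(cloudwatch_client, domain_name:, now: Time.now, lookback_seconds: 1200)
  response = cloudwatch_client.get_metric_data(
    start_time: now - lookback_seconds,
    end_time: now,
    metric_data_queries: [
      {id: "min_free_storage", expression: free_storage_expression(domain_name), return_data: true}
    ]
  )
  response.metric_data_results.flat_map(&:values).min
end
```

The lookback window is deliberately longer than the metric delay mentioned above so that at least one datapoint is likely to be present; callers would still need to decide what to do when the result is `nil`.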

How tested?

CloudWatch per node metrics at the same time
CloudWatch metric queried with the same expression used in the autoscaler


elasticgraph-indexer_autoscaler_lambda/elasticgraph-indexer_autoscaler_lambda.gemspec

...-indexer_autoscaler_lambda/lib/elastic_graph/indexer_autoscaler_lambda/concurrency_scaler.rb

...raph-indexer_autoscaler_lambda/lib/elastic_graph/indexer_autoscaler_lambda/details_logger.rb

elasticgraph-indexer_autoscaler_lambda/sig/elastic_graph/cloudwatch_client.rbs

...toscaler_lambda/spec/unit/elastic_graph/indexer_autoscaler_lambda/concurrency_scaler_spec.rb

myronmarston

Almost ready to merge!

myronmarston · 2024-11-07T23:30:09Z

...raph-indexer_autoscaler_lambda/lib/elastic_graph/indexer_autoscaler_lambda/details_logger.rb

@@ -30,32 +30,43 @@ def initialize(
        }
      end

-      def log_increase(cpu_utilization:, current_concurrency:, new_concurrency:)
+      def log_increase(cpu_utilization:, min_free_storage_in_mb:, current_concurrency:, new_concurrency:)


I still think that min is overloaded (since it can refer both to the threshold at which we pause indexing or the value we compare to that threshold). It's not very clear which this is. Can you include both (they both seem useful to log) and name them appropriately, similar to the renaming you did in the lambda logic? e.g. lowest_node_free_storage_in_mb and required_free_storage_in_mb.

(And please apply that throughout this file.)
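To make the suggestion concrete, a logging helper that carries both values might look roughly like this. It is only a sketch: the method name `increase_log_fields` and the shape of the returned hash are hypothetical and do not reflect the actual `details_logger.rb` internals.

```ruby
# Hypothetical helper: builds the fields for an "increase" log entry,
# keeping the observed value and the threshold under distinct names.
def increase_log_fields(cpu_utilization:, lowest_node_free_storage_in_mb:, required_free_storage_in_mb:, current_concurrency:, new_concurrency:)
  {
    "action" => "increase",
    "cpu_utilization" => cpu_utilization,
    # Observed: free storage on the node with the least space.
    "lowest_node_free_storage_in_mb" => lowest_node_free_storage_in_mb,
    # Threshold: below this, indexing is paused.
    "required_free_storage_in_mb" => required_free_storage_in_mb,
    "current_concurrency" => current_concurrency,
    "new_concurrency" => new_concurrency
  }
end
```

With both fields present, a log reader can tell at a glance how close the cluster was to the pause threshold when the decision was made.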

myronmarston · 2024-11-07T23:33:55Z

...aph-indexer_autoscaler_lambda/lib/elastic_graph/indexer_autoscaler_lambda/lambda_function.rb

@@ -26,7 +26,9 @@ def handle_request(event:, context:)
          min_cpu_target: event.fetch("min_cpu_target"),
          max_cpu_target: event.fetch("max_cpu_target"),
          maximum_concurrency: event.fetch("maximum_concurrency"),
-          indexer_function_name: event.fetch("indexer_function_name")
+          required_free_storage_in_mb: event.fetch("min_free_storage_in_mb"),


/nit it feels inconsistent to call it required_free_storage_in_mb internally but call it min_free_storage_in_mb in the event payload. Can you call it required_free_storage_in_mb in both spots? Every other event key gets passed through as-is instead of getting renamed...

ah yeah good point, that should match to be consistent

myronmarston

👍

akumar1214 force-pushed the akumar/autoscaler-tune-lambda-concurrency branch from d6a04f1 to 303f029 Compare October 31, 2024 18:19

akumar1214 force-pushed the akumar/autoscaler-min-free-storage branch 11 times, most recently from f703aff to 48f6a76 Compare November 5, 2024 20:55

akumar1214 marked this pull request as ready for review November 6, 2024 00:28

akumar1214 requested review from myronmarston and BrianSigafoos-SQ as code owners November 6, 2024 00:28

akumar1214 changed the title ~~[indexer_autoscaler_lambda] incorporate free storage metrics into autoscaler tuning~~ [indexer_autoscaler_lambda] pause indexing if the available storage drops below a threshold Nov 6, 2024

akumar1214 added 5 commits November 5, 2024 16:45

incorporate free storage metrics into autoscaler tuning

9464bd8

data nodes are implied from FreeStorageSpace and not MasterFreeStorag…

7214b7e

…eSpace

fix steep errors

3eb0cf9

update expression

008fb13

add domain name to search expression

97aa51e

akumar1214 force-pushed the akumar/autoscaler-min-free-storage branch from 48f6a76 to e446554 Compare November 6, 2024 00:46

akumar1214 force-pushed the akumar/autoscaler-tune-lambda-concurrency branch from 303f029 to e032ddc Compare November 6, 2024 00:46

result is actually in MB not bytes

0f465f2

myronmarston requested changes Nov 6, 2024


akumar1214 force-pushed the akumar/autoscaler-min-free-storage branch 5 times, most recently from 3ee57c3 to 4f002b7 Compare November 6, 2024 19:40

myronmarston requested changes Nov 7, 2024


PR feedback

8451f65

akumar1214 force-pushed the akumar/autoscaler-min-free-storage branch from 4f002b7 to 8451f65 Compare November 9, 2024 00:54

myronmarston approved these changes Nov 12, 2024


Base automatically changed from akumar/autoscaler-tune-lambda-concurrency to main November 12, 2024 19:49

Merge branch 'main' into akumar/autoscaler-min-free-storage

580d65f

akumar1214 merged commit 342335d into main Nov 12, 2024
10 checks passed

akumar1214 deleted the akumar/autoscaler-min-free-storage branch November 12, 2024 20:14
