
Create EBS CSI Driver scale-test tool #2292

Open · wants to merge 1 commit into master

Conversation

AndrewSirenko
Contributor

What type of PR is this?

/kind documentation

What is this PR about? / Why do we need it?

Validate that the EBS CSI Driver is capable of managing volumes at the scale of your largest EKS clusters with our scale-test tool. See /hack/ebs-scale-test/README.md for more info.

A Karpenter cluster type, additional scalability test types, and more scalability-test observability features are coming soon.

Scope of this PR: a one-step way to run pre-allocated `scale-sts` scalability tests.

Running `./scale-test create && ./scale-test run && ./scale-test clean` with no extra environment variables set will:

  1. Create an EKS cluster
  2. Install EBS CSI Driver with metrics enabled
  3. Deploy a StorageClass and StatefulSet, then scale that StatefulSet to 1000 replicas (which will provision and publish 1000 volumes)
  4. Delete the StatefulSet (unpublishing and deleting the volumes)
  5. Export metrics, ebs-plugin logs, driver deployment + daemonset yaml, and other test artifacts to a local directory and an S3 bucket
  6. Clean up all scalability test resources from your AWS account
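
For example, a smaller run can be sketched as follows. This is only an illustration; `REPLICAS` is one of the environment variables `scale-test` defaults (shown later in this PR), and the commands mirror the sequence above:

```bash
# Illustrative sketch: run a smaller scale test by overriding REPLICAS (default 1000).
export REPLICAS=100
./scale-test create && ./scale-test run && ./scale-test clean
```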

How was this change tested?

Please follow the README.md to run these scale tests yourself.

Timing:

Cluster Setup: ~18 min
Run: ~8 min for 1000 replica test
Cluster Cleanup: ~10 min

❯ ./scale-test setup && ./scale-test run && ./scale-test clean
Deploying EKS cluster. See configuration in /tmp/tmp.FEdux6Eoal/cluster-config.yaml
2025-01-16 14:17:05 [ℹ]  eksctl version 0.194.0-dev+02ef28ee3.2024-10-22T18:42:34Z
...
Applying /workplace/andsirey/aws-ebs-csi-driver/hack/ebs-scale-test/helpers/scale-test/scale-sts-test/scale-sts.yaml. Exported to /tmp/tmp.FGNUXsiFTA/scale-manifest.yaml
...
partitioned roll out complete: 101 new pods have been updated...
Deleting StatefulSet
statefulset.apps "ebs-scale-test" deleted
storageclass.storage.k8s.io "ebs-scale-test" deleted
Waiting for all PVs to be deleted
101 PVs still exist, waiting...
...
Exporting everything in /tmp/tmp.FGNUXsiFTA to S3
Metrics exported to s3://ebs-scale-tests/ebs-scale-pre-allocated-scale-sts-101-2025-01-16T14:35UTC/
2025-01-16 14:44:57 [ℹ]  deleting EKS cluster "ebs-scale-pre-allocated"
2025-01-16 14:52:04 [✔]  all cluster resources were deleted

Does this PR introduce a user-facing change?

Add `scale-test` tool for running EBS CSI Driver scalability tests

@k8s-ci-robot k8s-ci-robot added the kind/documentation Categorizes issue or PR as related to documentation. label Jan 16, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andrewsirenko. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 16, 2025

EBS uses EBS CSI Driver scalability tests to validate that each release of our driver can manage EBS volume lifecycle for large-scale clusters.

Set up and run an EBS CSI Driver scalability test with our `scale-test` tool:
Contributor Author

Shall I add wording here about results being exported to a local dir + S3 bucket? Or is that implicit?

Contributor

I would prefer us to be explicit.

Also, should we add a note about the permissions one needs to successfully run this test?


## Pre-requisites

REVIEWER NOTE: I'm open to relying on `make tools` /bin dependencies. But that might be confusing to those just wanting to run scale tests.
Contributor Author

Discuss!

Member

I think it might be best to rely on these so that we can ensure the smallest level of variance based on personal setups. It makes it easier to help someone if they are having trouble running the tests as we know what dependencies they are using.


- `CLUSTER_TYPE` dictates what type of scalability cluster `scale-test` creates and which nodes are used during a scalability test run. Options include:
- 'pre-allocated': Additional worker nodes are created during cluster setup. By default, we pre-allocate 1 `m7a.48xlarge` EC2 instance for every 100 StatefulSet replicas.

Contributor Author

Extra newline will fix in later revision


Code Coverage Diff

This PR does not change the code coverage

ebsCSIController: true

managedNodeGroups:
{{- if eq ( getenv "CLUSTER_TYPE" ) "pre-allocated" }}
Contributor Author

We have to use templating here because for Karpenter cluster-type, we'll need a different nodegroup.
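
For illustration, rendering might look roughly like this; the template path is hypothetical, but gomplate does read `CLUSTER_TYPE` from the environment via `getenv`:

```bash
# Sketch: render the eksctl cluster config template with gomplate.
# Only the nodegroup block matching CLUSTER_TYPE ends up in the output.
export CLUSTER_TYPE="pre-allocated"
gomplate -f helpers/cluster-setup/cluster-config.yaml -o "$EXPORT_DIR/cluster-config.yaml"
```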

whenDeleted: Delete
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
Contributor Author

Thoughts on adding "ebs-scale-test: $CLUSTER_NAME" tags to each volume? Then, when we clean up resources, we can check for any leaked volumes (which doesn't happen in my testing, but better safe than sorry).

Member

@ElijahQuinones ElijahQuinones Jan 21, 2025

+1. If there was for some reason an interruption in the test (like someone forgot to use nohup and disconnected from the network :) ), having the tags would also allow for easy manual deletion of leaked resources.

TEST_TYPE=${TEST_TYPE:="scale-sts"}
REPLICAS=${REPLICAS:=1000}
DRIVER_VALUES_FILEPATH=${DRIVER_VALUES_FILEPATH:="$BASE_DIR/helpers/cluster-setup/scale-driver-values.yaml"}
# TODO Q: Does anyone need an $OVERRIDE_KUBECONFIG? Let's discuss.
Contributor Author

Discuss! I'd prefer a wrapper script to run simultaneous scalability tests in different regions.
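
A rough sketch of what such a wrapper could look like, purely illustrative and assuming `scale-test` honors `AWS_REGION`:

```bash
# Hypothetical wrapper (not part of this PR): run scalability tests in two regions concurrently.
for region in us-east-1 us-west-2; do
  (
    export AWS_REGION="$region" CLUSTER_NAME="ebs-scale-$region"
    ./scale-test setup && ./scale-test run && ./scale-test clean
  ) &
done
wait
```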

Member

We don't need it now, but if it is not too much effort on your end, it may be worth doing in case we need it in the future for any reason.


# Functions sourced from helpers/scale-test/...
run-scale() {
# TODO Q: Is it worth restarting EBS CSI Controller pod to ensure clean metrics/logs? Or move EBS CSI Driver install/cleanup to 'run' instead of 'setup'? Let's discuss.
Contributor Author

Discuss!

Member

This is a good point. I think moving driver installation to `run` makes more sense if guaranteeing clean metrics/logs for each test run is a requirement and running multiple scalability tests is a supported use case. Otherwise, let's document the gotcha.

Comment on lines +46 to +49
{{- if eq ( getenv "CLUSTER_TYPE" ) "karpenter" }}
nodeSelector:
karpenter.sh/nodepool: ebs-scale-test
{{- end }}
Contributor Author

Example of why gomplate is useful for manifests. In an alternative version I was relying on Kustomize, but this approach was cleaner.

# TODO Q: Is it worth restarting EBS CSI Controller pod to ensure clean metrics/logs? Or move EBS CSI Driver install/cleanup to 'run' instead of 'setup'? Let's discuss.
aws eks update-kubeconfig --name "$CLUSTER_NAME"
sts_scale_test
export_to_s3
Contributor Author

I'm considering adding a separate helper script that parses the exported scalability test data and auto-generates what we would use for our scalability regression tracking / dashboard.

But I realize that might be best to add when I do the next release that we scalability test, to see what is actually needed.

Member

Yeah, I'm okay with that coming at that time instead of in this PR.


# Names
CLUSTER_NAME # Base name used by `eksctl` to create AWS resources.
EXPORT_DIR # Where to export scale test metrics/logs locally.
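
An illustrative usage sketch (values are placeholders, not defaults):

```bash
# Export results to a persistent directory instead of a mktemp dir,
# and name the cluster explicitly before running a test.
export CLUSTER_NAME="ebs-scale-demo"
export EXPORT_DIR="$HOME/ebs-scale-results"
mkdir -p "$EXPORT_DIR"
./scale-test run
```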
Contributor Author

Should add a .gitignore entry for the default export_dir path.

Member

IMO yes, it will make it easier to make changes to the scale tests after running them.

Comment on lines +40 to +42
addons:
- name: eks-pod-identity-agent
- name: snapshot-controller
Contributor Author

@AndrewSirenko AndrewSirenko Jan 16, 2025

I'm considering pinning these dependencies to a specific version... We'll eventually have snapshot scale tests.

And adding that as a step to our dependency upgrade runbook.


`scale-test` parses arguments and wraps scripts and configuration files in the `helpers` directory. These helper scripts manage the scalability cluster and test runs.

We rely on [gomplate](https://github.com/hairyhenderson/gomplate) to render configuration files based on environment variables.
Member

np: Can remove this snippet and include the relevant context directly in the Pre-requisites section above, i.e.:

## Pre-requisites

Install the following commandline tools:
- [gomplate](https://github.com/hairyhenderson/gomplate) - used to render configuration files based on environment variables. 

Comment on lines +35 to +39
while true; do
curl "http://localhost:${port}/metrics" >>"$EXPORT_DIR/metrics.txt" && break
echo "Failed to collect metrics from port ${port}, retrying..."
sleep 5
done
Member

Can we get stuck in an infinite loop here? If so, we should time out at some point.
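
For example, a bounded retry could look roughly like this (a sketch, not a prescriptive implementation):

```bash
# Give up after ~1 minute instead of retrying forever.
for attempt in $(seq 1 12); do
  if curl -sf "http://localhost:${port}/metrics" >>"$EXPORT_DIR/metrics.txt"; then
    break
  fi
  echo "Failed to collect metrics from port ${port}, retrying (${attempt}/12)..."
  sleep 5
done
```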


pullPolicy: Always
controller:
logLevel: 7
replicaCount: 1
Member

This is a very pragmatic solution for the issue of different controller pods having different leaders for the sidecars. Can we add a comment explaining that? It's not immediately obvious.

CLUSTER_NAME=${CLUSTER_NAME:="ebs-scale-$CLUSTER_TYPE"}
EXPORT_DIR=${EXPORT_DIR:=$(mktemp -d)}
S3_BUCKET=${S3_BUCKET:="ebs-scale-tests"}
SCALABILITY_TEST_RUN_NAME=${SCALABILITY_TEST_RUN_NAME:="$CLUSTER_NAME-$TEST_TYPE-$REPLICAS-$(date -u +%Y-%m-%dT%H:%M%Z)"}
Contributor

Is it worth adding a timestamp even when provided with the test name? As it stands, wouldn't running the test multiple times with the same name overwrite the bucket contents of the previous runs?

Contributor Author

@AndrewSirenko AndrewSirenko Jan 21, 2025

Hmm, I was thinking that those who override the test name would want complete control over it (and could add their own preferred timestamp format or timezone themselves).

But you bring up a good point about users accidentally overwriting repeat runs. I will make the change.

Contributor Author

@AndrewSirenko AndrewSirenko Jan 21, 2025

Thinking a bit more on this, would a better change be to error out if the run_name already exists within the S3 bucket? That would solve your root concern of overwriting bucket contents (and may warn users of a mistake before they wait 30+ minutes for the test to complete).

Or would you still prefer force-appending our default timestamp format on each run? @rdpsin

Contributor

Both sound like decent options to me. Whatever the team decides. :)

Member

Could we print a warning message, something like "Warning: run_name already exists; if you continue you will overwrite the contents of run_name. Are you sure you want to continue?", followed by a y/n prompt?

Contributor Author

I prefer avoiding y/n prompts in scripts that may be used in automated processes (like a wrapper script that manages scalability clusters/runs simultaneously), because then the caller has to rely on `yes` or `expect` to work around that forced interaction.

Because I foresee us wrapping ebs-scale-test, I prefer that we make a consistent choice on behalf of the caller (fast-failing would be my preference).
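
A rough sketch of the fast-fail option, using the variables defined above (illustrative only; assumes the AWS CLI is configured):

```bash
# Fail fast if this run name already has exported artifacts in the bucket.
if [ -n "$(aws s3 ls "s3://${S3_BUCKET}/${SCALABILITY_TEST_RUN_NAME}/" 2>/dev/null)" ]; then
  echo "Error: s3://${S3_BUCKET}/${SCALABILITY_TEST_RUN_NAME}/ already exists" >&2
  exit 1
fi
```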


CLUSTER_NAME=${CLUSTER_NAME:="ebs-scale-$CLUSTER_TYPE"}
EXPORT_DIR=${EXPORT_DIR:=$(mktemp -d)}
S3_BUCKET=${S3_BUCKET:="ebs-scale-tests"}
Contributor

@rdpsin rdpsin Jan 17, 2025

Should we remove the default here and throw an error if not provided? Considering S3 bucket names are globally unique.

Contributor Author

@AndrewSirenko AndrewSirenko Jan 21, 2025

I prefer setting a reasonable default instead of putting the burden of setting S3_BUCKET for each run on every user. Though I'm open to making S3_BUCKET a required parameter if other reviewers agree with you.

Would choosing a rarer default bucket name satisfy your concern? Or clearer documentation? Something like ebs-scale-tests-default-bucket-name is even less likely to already be a bucket in the user's account.

Contributor

Yes, but now another customer using the same script won't be able to upload the data to S3 because whatever name we choose is already taken by some other customer.

Member

What about ebs-csi-driver-scale-test-bucket, in case there are other "ebs-scale-tests" buckets out there?

Contributor Author

@AndrewSirenko AndrewSirenko Jan 21, 2025

@rdpsin I finally realized your point. Of course we can't hard-code ebs-scale-tests as a default 🤦‍♂️

Alternatively we could append something like a hash of the account ID (ebs-scale-tests-208...), but you may be right, it's safest just to error out.

@ElijahQuinones there can only be one bucket named ebs-csi-driver-scale-test-bucket across all AWS accounts ("S3 bucket names are globally unique" even across account IDs). Unless we generate a unique name by default on behalf of the user, they will get a BucketAlreadyExists error when we attempt to create the bucket. You can try this for yourself by trying to create a bucket named rdspin from your AWS account (aws s3api create-bucket --bucket rdspin --region us-east-1). You can't, because I just took the name for myself ;)

Member

I think a pragmatic solution here would be following what we do for the kOps bucket used in CI:

KOPS_BUCKET=${KOPS_BUCKET:-${AWS_ACCOUNT_ID}-ebs-csi-e2e-kops}

Prefixing the bucket name with an account ID practically guarantees that it is unique.
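
Applied here, that could look roughly like the following sketch (assumes AWS CLI credentials are available to look up the account ID):

```bash
# Derive an account-scoped default bucket name, mirroring the kOps pattern above.
AWS_ACCOUNT_ID=${AWS_ACCOUNT_ID:-$(aws sts get-caller-identity --query Account --output text)}
S3_BUCKET=${S3_BUCKET:="${AWS_ACCOUNT_ID}-ebs-scale-tests"}
```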

Contributor Author

@AndrewSirenko AndrewSirenko Jan 21, 2025

Good point that we also already prepend for kOps clusters @torredil

Alternatively we can append something like hash of account-id

@rdpsin did you have any concern with this alternative that we aren't thinking about? The only thing I can think of is that the account ID is sensitive data and a customer could accidentally make the bucket public? In that case maybe a hash is safest, but that might be overkill.

Contributor

Sounds good, either of the options.



- `TEST_TYPE` dictates what type of scalability test we want to run. Options include:
- 'scale-sts': Scales a StatefulSet to `$REPLICAS`. Waits for all pods to be ready. Delete Sts. Waits for all PVs to be deleted. Exercises
Contributor

'Exercises '?

Contributor Author

Great catch. I was thinking about adding the sentence: "Exercises the complete dynamic provisioning lifecycle for block volumes."

We rely on [gomplate](https://github.com/hairyhenderson/gomplate) to render configuration files based on environment variables.

The `helpers` directory includes:
- `/helpers/cluster-setup`: Holds scripts and configuration for cluster and add-on setup/cleanup.
Member

@ElijahQuinones ElijahQuinones Jan 21, 2025

np: "add-on" here suggests that it is the EKS add-on version of the ebs-csi-driver (along with the other add-ons) that we are installing, but looking at manage-cluster.sh below, we are installing via Helm.

Contributor Author

"Add-on" refers to the eks-pod-identity and snapshot-controller add-ons.

I can remove "and add-on" from the sentence, though, if that context is not needed.
