Create EBS CSI Driver scale-test tool #2292
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: (none). The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
EBS uses EBS CSI Driver scalability tests to validate that each release of our driver can manage the EBS volume lifecycle for large-scale clusters.
Set up and run an EBS CSI Driver scalability test with our `scale-test` tool:
Shall I add wording about results being exported to a local dir + S3 bucket here? Or is that implicit?
I would prefer us to be explicit.
Also, should we add a note about the permissions one needs to successfully run this test?
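For reference, a minimal sketch of what an end-to-end invocation could look like, pieced together from the quoted README lines and the command sequence in the PR description (values are illustrative, and the defaults come from the tool itself):

```bash
# Illustrative only; every variable below already has a default in scale-test.
export CLUSTER_TYPE=pre-allocated   # cluster flavor used for test nodes
export TEST_TYPE=scale-sts          # scalability test to run
export REPLICAS=1000                # StatefulSet replicas to scale to

./scale-test create   # create the scalability cluster
./scale-test run      # run the test; results go to $EXPORT_DIR and the S3 bucket
./scale-test clean    # tear everything down
```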
## Pre-requisites
REVIEWER NOTE: I'm open to relying on `make tools` `/bin` dependencies. But that might be confusing to those just wanting to run scale tests.
Discuss!
I think it might be best to rely on these so that we minimize variance across personal setups. It also makes it easier to help someone who is having trouble running the tests, since we know what dependencies they are using.
- `CLUSTER_TYPE` dictates what type of scalability cluster `scale-test` creates and which nodes are used during a scalability test run. Options include:
  - 'pre-allocated': Additional worker nodes are created during cluster setup. By default, we pre-allocate 1 `m7a.48xlarge` EC2 instance for every 100 StatefulSet replicas.
Extra newline; will fix in a later revision.
Code Coverage Diff: This PR does not change the code coverage.
ebsCSIController: true
managedNodeGroups:
{{- if eq ( getenv "CLUSTER_TYPE" ) "pre-allocated" }}
We have to use templating here because the Karpenter cluster-type will need a different nodegroup.
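For illustration, a hedged sketch of how that branch might look in the rendered eksctl config; the nodegroup names, sizes, and the Karpenter branch are assumptions, not the template in this PR:

```yaml
managedNodeGroups:
{{- if eq ( getenv "CLUSTER_TYPE" ) "pre-allocated" }}
  # Hypothetical: pre-allocated workers sized for the StatefulSet scale test.
  - name: ebs-scale-test-workers
    instanceType: m7a.48xlarge
    desiredCapacity: 10
{{- else }}
  # Hypothetical: a Karpenter cluster-type would only need a small static
  # nodegroup for system pods; test nodes come from a Karpenter NodePool.
  - name: ebs-scale-test-system
    instanceType: m5.large
    desiredCapacity: 2
{{- end }}
```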
whenDeleted: Delete
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
Thoughts on adding "ebs-scale-test: $CLUSTER_NAME" tags to each volume? Then when we clean up resources we can check for any leaked volumes (which hasn't happened in my testing, but better safe than sorry).
+1. If for some reason the test was interrupted (say, someone forgot to use nohup and disconnected from the network :)), having the tags would also allow for easy manual deletion of leaked resources.
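A rough sketch of the leak check being discussed, assuming the tag is applied through the driver's `tagSpecification_N` StorageClass parameters and uses the key proposed above:

```bash
# Hypothetical post-cleanup check: list any EBS volumes still carrying the
# scale-test tag for this cluster.
leaked=$(aws ec2 describe-volumes \
  --filters "Name=tag:ebs-scale-test,Values=$CLUSTER_NAME" \
  --query 'Volumes[].VolumeId' --output text)
if [ -n "$leaked" ]; then
  echo "Leaked EBS volumes detected: $leaked" >&2
fi
```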
TEST_TYPE=${TEST_TYPE:="scale-sts"}
REPLICAS=${REPLICAS:=1000}
DRIVER_VALUES_FILEPATH=${DRIVER_VALUES_FILEPATH:="$BASE_DIR/helpers/cluster-setup/scale-driver-values.yaml"}
# TODO Q: Does anyone need an $OVERRIDE_KUBECONFIG? Let's discuss.
Discuss! I'd prefer a wrapper script to run simultaneous scalability tests in different regions.
We don't need it now, but if it is not too much effort on your end, it may be worth doing in case we need it in the future.
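If it does get added, the opt-in override could be as small as the sketch below (variable name taken from the TODO; everything else is an assumption):

```bash
# Hypothetical: only touch KUBECONFIG when the caller explicitly overrides it,
# so default behavior is unchanged.
if [[ -n "${OVERRIDE_KUBECONFIG:-}" ]]; then
  export KUBECONFIG="$OVERRIDE_KUBECONFIG"
fi
```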
# Functions sourced from helpers/scale-test/...
run-scale() {
# TODO Q: Is it worth restarting EBS CSI Controller pod to ensure clean metrics/logs? Or move EBS CSI Driver install/cleanup to 'run' instead of 'setup'? Let's discuss.
Discuss!
This is a good point. I think moving driver installation to `run` makes more sense if guaranteeing clean metrics/logs for each test run is a requirement and running multiple scalability tests is a supported use case. Otherwise, let's document the gotcha.
{{- if eq ( getenv "CLUSTER_TYPE" ) "karpenter" }}
nodeSelector:
  karpenter.sh/nodepool: ebs-scale-test
{{- end }}
Example of why gomplate is useful for manifests. In an alternative version I was relying on Kustomize, but this approach was cleaner.
# TODO Q: Is it worth restarting EBS CSI Controller pod to ensure clean metrics/logs? Or move EBS CSI Driver install/cleanup to 'run' instead of 'setup'? Let's discuss.
aws eks update-kubeconfig --name "$CLUSTER_NAME"
sts_scale_test
export_to_s3
I'm considering adding a separate helper script that parses the exported scalability test data and auto-generates what we would use for our scalability regression tracking / dashboard. But I realize that might be best to add when I do the next release where we run scalability tests, so we can see what is actually needed.
Yeah, I'm okay with that coming at that time instead of in this PR.
# Names
CLUSTER_NAME # Base name used by `eksctl` to create AWS resources.
EXPORT_DIR   # Where to export scale test metrics/logs locally.
We should add a `.gitignore` entry for the default `EXPORT_DIR` path.
IMO yes, it will make it easier to modify the scale tests after running them.
addons:
- name: eks-pod-identity-agent
- name: snapshot-controller
I'm considering pinning these dependencies to a specific version... We'll eventually have snapshot scale tests. We'd also add this as a step to our dependency upgrade runbook.
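A hedged sketch of what pinning could look like in the eksctl config; eksctl accepts a per-addon `version` field, and the version strings below are placeholders, not recommendations:

```yaml
addons:
- name: eks-pod-identity-agent
  version: v1.3.0-eksbuild.1   # placeholder; pin to the validated version
- name: snapshot-controller
  version: v8.0.0-eksbuild.1   # placeholder; pin to the validated version
```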
`scale-test` parses arguments and wraps scripts and configuration files in the `helpers` directory. These helper scripts manage the scalability cluster and test runs.
We rely on [gomplate](https://github.com/hairyhenderson/gomplate) to render configuration files based on environment variables.
np: Can remove this snippet and include the relevant context directly in the Pre-requisites section above, i.e.:
## Pre-requisites
Install the following command-line tools:
- [gomplate](https://github.com/hairyhenderson/gomplate) - used to render configuration files based on environment variables.
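For anyone unfamiliar with gomplate, the rendering step looks roughly like this; the template path is hypothetical, and the helper scripts in this PR invoke gomplate themselves:

```bash
# Render an eksctl cluster config template using environment variables.
CLUSTER_TYPE=pre-allocated gomplate \
  -f helpers/cluster-setup/cluster-config.yaml \
  -o /tmp/cluster-config.rendered.yaml
```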
while true; do
  curl "http://localhost:${port}/metrics" >>"$EXPORT_DIR/metrics.txt" && break
  echo "Failed to collect metrics from port ${port}, retrying..."
  sleep 5
done
Can we get stuck in an infinite loop here? If so, we should time out at some point.
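One possible bounded variant, assuming the same `$port` and `$EXPORT_DIR` as the snippet above; the retry count and what to do on timeout are judgment calls:

```bash
retries=0
max_retries=12   # ~1 minute with the existing 5s sleep
until curl -sf "http://localhost:${port}/metrics" >>"$EXPORT_DIR/metrics.txt"; do
  retries=$((retries + 1))
  if [ "$retries" -ge "$max_retries" ]; then
    echo "Timed out collecting metrics from port ${port}" >&2
    break   # or 'exit 1' if the metrics are required for the run
  fi
  echo "Failed to collect metrics from port ${port}, retrying..."
  sleep 5
done
```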
pullPolicy: Always
controller:
  logLevel: 7
  replicaCount: 1
This is a very pragmatic solution to the issue of different controller pods having different leaders for the sidecars. Can we add a comment explaining that? It's not immediately obvious.
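Something along the lines of the sketch below in the values file would capture it (suggested comment wording only; the keys mirror the quoted snippet):

```yaml
controller:
  logLevel: 7
  # Keep a single controller replica so every sidecar elects its leader in the
  # same pod; with multiple replicas, leadership (and therefore metrics and
  # logs) can be split across pods, which complicates scale-test analysis.
  replicaCount: 1
```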
CLUSTER_NAME=${CLUSTER_NAME:="ebs-scale-$CLUSTER_TYPE"}
EXPORT_DIR=${EXPORT_DIR:=$(mktemp -d)}
S3_BUCKET=${S3_BUCKET:="ebs-scale-tests"}
SCALABILITY_TEST_RUN_NAME=${SCALABILITY_TEST_RUN_NAME:="$CLUSTER_NAME-$TEST_TYPE-$REPLICAS-$(date -u +%Y-%m-%dT%H:%M%Z)"}
Is it worth adding a timestamp even when provided with the test name? As it stands, wouldn't running the test multiple times with the same name overwrite the bucket contents of the previous runs?
Hmm, I was thinking that those who override the test name would want complete control over it (and could add their own preferred timestamp format or timezone themselves). But you bring up a good point about users accidentally overwriting repeat runs. I will make the change.
Thinking a bit more on this, would a better change be erroring if the run_name already exists within the S3 bucket? That would solve your root concern of overwriting bucket contents (and may warn users of a mistake before they wait 30+ minutes for the test to complete).
Or would you still prefer force-appending our default timestamp format on each run? @rdpsin
Both sound like decent options to me. Whatever the team decides. :)
Could we print a warning message along the lines of "warning: run_name already exists; if you continue, you will overwrite the contents of run_name; are you sure you want to continue?", followed by a y/n prompt?
I prefer avoiding y/n prompts in scripts that may be used in automated processes (like a wrapper script that manages scalability clusters/runs simultaneously), because then the caller has to rely on `yes` or `expect` to work around the forced interaction. Because I foresee us wrapping `ebs-scale-test`, I prefer that we make a consistent choice on behalf of the caller (fast-failing would be my preference).
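If fast-failing is the direction, a rough sketch of the check; it assumes results are uploaded under `s3://$S3_BUCKET/$SCALABILITY_TEST_RUN_NAME/`, which is an assumption about this PR's layout:

```bash
# Hypothetical fast-fail: abort if the run name already has objects in S3.
existing=$(aws s3api list-objects-v2 --bucket "$S3_BUCKET" \
  --prefix "$SCALABILITY_TEST_RUN_NAME/" --max-items 1 \
  --query 'Contents[0].Key' --output text 2>/dev/null)
if [ -n "$existing" ] && [ "$existing" != "None" ]; then
  echo "Error: run '$SCALABILITY_TEST_RUN_NAME' already exists in s3://$S3_BUCKET" >&2
  exit 1
fi
```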
CLUSTER_NAME=${CLUSTER_NAME:="ebs-scale-$CLUSTER_TYPE"}
EXPORT_DIR=${EXPORT_DIR:=$(mktemp -d)}
S3_BUCKET=${S3_BUCKET:="ebs-scale-tests"}
Should we remove the default here and throw an error if it's not provided? Considering S3 bucket names are globally unique.
I prefer setting a reasonable default instead of putting the burden of setting `S3_BUCKET` for each run on every user. Though I'm open to making `S3_BUCKET` a required parameter if other reviewers agree with you. Would choosing a rarer default bucket name satisfy your concern? Or clearer documentation? Something like `ebs-scale-tests-default-bucket-name` is even less likely to already be a bucket in the user's account.
Yes, but now another customer using the same script won't be able to upload the data to S3 because whatever name we choose is already taken by some other customer.
What about `ebs-csi-driver-scale-test-bucket`, in case there are other "ebs-scale-tests" buckets out there?
@rdpsin I finally realized your point. Of course we can't hard-code `ebs-scale-tests` as a default 🤦♂️
Alternatively we can append something like a hash of the account-id (`ebs-scale-tests-208...`), but you may be right, it's safest just to error out.
@ElijahQuinones there can only be one bucket named `ebs-csi-driver-scale-test-bucket` across all AWS accounts ("S3 bucket names are globally unique" even across account IDs). Unless we generate a unique name by default on behalf of the user, they will get a `BucketAlreadyExists` error when we attempt to create the bucket. You can try this for yourself by trying to create a bucket named rdspin from your AWS account (`aws s3api create-bucket --bucket rdspin --region us-east-1`). You can't, because I just took the name for myself ;)
I think a pragmatic solution here would be following what we do for the kOps bucket used in CI (`hack/e2e/config.sh`, line 45 at b28e70d):
`KOPS_BUCKET=${KOPS_BUCKET:-${AWS_ACCOUNT_ID}-ebs-csi-e2e-kops}`
Prefixing the bucket name with an account ID practically guarantees that it is unique.
Good point that we also already prepend the account ID for kOps clusters @torredil.
> Alternatively we can append something like a hash of the account-id
@rdpsin did you have any concern with this alternative that we aren't thinking about? The only thing I can think of is that the account-id is sensitive data and a customer could accidentally make the bucket public? In that case maybe a hash is safest, but that might be overkill.
Sounds good, either of the options.
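For concreteness, the kOps-style default being discussed could look roughly like the sketch below; it requires `sts:GetCallerIdentity`, and the bucket suffix is a placeholder:

```bash
# Derive a per-account default so the bucket name is effectively unique.
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
S3_BUCKET=${S3_BUCKET:="${AWS_ACCOUNT_ID}-ebs-scale-tests"}
```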
- `TEST_TYPE` dictates what type of scalability test we want to run. Options include:
  - 'scale-sts': Scales a StatefulSet to `$REPLICAS`. Waits for all pods to be ready. Delete Sts. Waits for all PVs to be deleted. Exercises
'Exercises '?
Great catch. Was thinking about adding the sentence: "Exercises the complete dynamic provisioning lifecycle for block volumes."
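To make "the complete dynamic provisioning lifecycle" concrete, the scale-sts flow described in that bullet boils down to roughly the following; the StatefulSet name and manifest path are hypothetical, and the real logic lives under `helpers/scale-test`:

```bash
kubectl apply -f scale-sts.yaml                                  # hypothetical manifest with a volumeClaimTemplate
kubectl scale statefulset ebs-scale-test --replicas="$REPLICAS"  # fan out PVC/volume creation
kubectl rollout status statefulset ebs-scale-test --timeout=60m  # wait until every pod (and attached volume) is ready
kubectl delete statefulset ebs-scale-test                        # PVCs are removed via whenDeleted: Delete
kubectl wait --for=delete pv --all --timeout=60m                 # wait for volume deletion to finish
```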
We rely on [gomplate](https://github.com/hairyhenderson/gomplate) to render configuration files based on environment variables.
The `helpers` directory includes:
- `/helpers/cluster-setup`: Holds scripts and configuration for cluster and add-on setup/cleanup.
np: "add-on" here suggests that it is the EKS addon version of the ebs-csi-driver (along with the other addons) that we are installing, but looking at manage-cluster.sh below we are installing via Helm.
"Add-on" refers to the eks-pod-identity and snapshot-controller addons. I can remove "and add-on" from the sentence though, if that context is not needed.
What type of PR is this?
/kind documentation
What is this PR about? / Why do we need it?
Validate that the EBS CSI Driver is capable of managing volumes at the scale of your largest EKS clusters with our `scale-test` tool. See `/hack/ebs-scale-test/README.md` for more info.
Karpenter cluster-type, more scalability test-types, and more scalability test observability features coming soon.
Scope of this PR: one-step runs of pre-allocated `scale-sts` scalability tests.
Running `./scale-test create && ./scale-test run && ./scale-test clean` with no extra environment variables set will:
How was this change tested?
Please follow the README.md to run these scale tests yourself.
Timing:
Cluster Setup: ~18 min
Run: ~8 min for 1000 replica test
Cluster Cleanup: ~10 min
Does this PR introduce a user-facing change?