Create EBS CSI Driver scale-test tool #2292
base: master
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,76 @@ | ||
# EBS CSI Driver Scalability Tests | ||
|
||
EBS uses EBS CSI Driver scalability tests to validate that each release of our driver can manage EBS volume lifecycle for large-scale clusters. | ||
|
||
Set up and run an EBS CSI Driver scalability test with our `scale-test` tool: | ||
|
||
```shell | ||
# Set scalability parameters | ||
export CLUSTER_TYPE="pre-allocated" | ||
export TEST_TYPE="scale-sts" | ||
export REPLICAS="1000" | ||
|
||
# Set up an EKS scalability cluster and install the EBS CSI Driver. | ||
./scale-test setup | ||
|
||
# Run a scalability test and export results to S3. | ||
./scale-test run | ||
|
||
# Clean up all AWS resources related to the scalability cluster. | ||
./scale-test cleanup | ||
``` | ||
|
||
## Pre-requisites | ||
|
||
REVIEWER NOTE: I'm open to relying on `make tools` /bin dependencies. But that might be confusing to those just wanting to run scale tests. | ||
Review comment: Discuss!
Review comment: I think it might be best to rely on these so that we can ensure the smallest level of variance based on personal setups. It makes it easier to help someone if they are having trouble running the tests as we know what dependencies they are using. |
||
|
||
Install the following command-line tools: | ||
- [gomplate](https://github.com/hairyhenderson/gomplate) | ||
- [aws cli v2](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) | ||
- [eksctl](https://eksctl.io/installation/) | ||
- [kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) | ||
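
Before running `scale-test setup`, it may help to confirm that the tools are on `PATH`. The snippet below is a minimal sketch and not part of this PR: `missing_tools` is a hypothetical helper name, and `helm` is added to the example call only because the cluster-setup helper in this PR invokes `helm upgrade --install`.

```shell
# Hypothetical helper (not part of scale-test): print every tool from the
# argument list that cannot be found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Example call with the tools this README lists, plus helm:
#   missing_tools gomplate aws eksctl kubectl helm
```

An empty result means every listed tool was found; anything printed still needs to be installed.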
|
||
## Overridable parameters | ||
|
||
You can change the type of scalability cluster, the test that runs, or the names of script artifacts through environment variables. | ||
|
||
Note: The environment variables set when you run `scale-test setup` must remain the same for future `scale-test run`/`scale-test cleanup` commands on that scalability cluster. | ||
|
||
```sh | ||
# Affect test | ||
CLUSTER_TYPE # Type of scalability cluster to create. | ||
TEST_TYPE # Type of scale test to run. | ||
REPLICAS # Number of StatefulSet replicas to create. | ||
DRIVER_VALUES_FILEPATH # Custom values file passed to EBS CSI Driver Helm chart. | ||
|
||
# Names | ||
CLUSTER_NAME # Base name used by `eksctl` to create AWS resources. | ||
EXPORT_DIR # Where to export scale test metrics/logs locally. | ||
Review comment: Should add .gitignore for default export_dir path
Review comment: IMO yes, it will make making changes to scale tests after running them easier. |
||
S3_BUCKET # Name of S3 bucket used for holding scalability run results. | ||
SCALABILITY_TEST_RUN_NAME # Name of test run. Used as name of directory for adding run results in $S3_BUCKET. | ||
|
||
# Find default values at top of `scale-test` script. | ||
``` | ||
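
Because the same values must be in place for `setup`, `run`, and `cleanup`, one option is to record them in a small file and source it before each command. This is a sketch under that assumption: the function name `write_scale_env` and the file name `scale-test.env` are hypothetical, and `scale-test` does not read such a file on its own.

```shell
# Hypothetical helper: write the named environment variables to a file as
# `export` lines so later scale-test commands can reuse the same values.
# Assumes the values contain no single quotes.
write_scale_env() {
  env_file="$1"
  shift
  : >"$env_file"
  for name in "$@"; do
    # Only variables present in the environment are recorded.
    value="$(printenv "$name")" || continue
    printf "export %s='%s'\n" "$name" "$value" >>"$env_file"
  done
}

# Usage (file name is an assumption):
#   write_scale_env scale-test.env CLUSTER_TYPE TEST_TYPE REPLICAS CLUSTER_NAME
#   . ./scale-test.env && ./scale-test run
#   . ./scale-test.env && ./scale-test cleanup
```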
|
||
## Types of scalability tests | ||
|
||
Set the `CLUSTER_TYPE` and `TEST_TYPE` environment variables to set up and run different scalability tests. | ||
|
||
- `CLUSTER_TYPE` dictates what type of scalability cluster `scale-test` creates and which nodes are used during a scalability test run. Options include: | ||
- 'pre-allocated': Additional worker nodes are created during cluster setup. By default, we pre-allocate 1 `m7a.48xlarge` EC2 instance for every 100 StatefulSet replicas. | ||
|
||
Review comment: Extra newline, will fix in later revision. |
||
|
||
- `TEST_TYPE` dictates what type of scalability test we want to run. Options include: | ||
- 'scale-sts': Scales a StatefulSet to `$REPLICAS`. Waits for all pods to be ready. Deletes the StatefulSet. Waits for all PVs to be deleted. Exercises the complete dynamic provisioning lifecycle for block volumes. | ||
Review comment: 'Exercises '?
Review comment: Great catch. Was thinking about adding the sentence: "Exercises the complete dynamic provisioning lifecycle for block volumes." |
||
|
||
You can mix and match `CLUSTER_TYPE` and `TEST_TYPE`. | ||
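
As a worked example of the pre-allocation ratio described above (1 node per 100 StatefulSet replicas), the node count for a given `REPLICAS` can be computed with a ceiling division. This is only a sketch: whether `scale-test` rounds up in exactly this way is an assumption, and `nodes_for_replicas` is a hypothetical name.

```shell
# Hypothetical helper: nodes needed at 1 node per 100 StatefulSet replicas,
# rounding up so a partial batch of replicas still gets a node.
nodes_for_replicas() {
  replicas="$1"
  replicas_per_node="${2:-100}"
  echo $(( (replicas + replicas_per_node - 1) / replicas_per_node ))
}

nodes_for_replicas 1000   # prints 10
nodes_for_replicas 1001   # prints 11
```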
|
||
## Contributing scalability tests | ||
|
||
`scale-test` parses arguments and wraps scripts and configuration files in the `helpers` directory. These helper scripts manage the scalability cluster and test runs. | ||
|
||
We rely on [gomplate](https://github.com/hairyhenderson/gomplate) to render configuration files based on environment variables. | ||
Review comment: np: Can remove this snippet and include the relevant context directly in the
|
||
|
||
The `helpers` directory includes: | ||
- `/helpers/cluster-setup`: Holds scripts and configuration for cluster and add-on setup/cleanup. | ||
Review comment: np: add-on here suggests that it is the EKS add-on version of the ebs-csi-driver (along with the other add-ons) that we are installing, but looking at manage-cluster.sh below we are installing via Helm.
Review comment: Add-on refers to eks-pod-identity and snapshot-controller addons. I can remove |
||
- `/helpers/scale-test`: Holds directory for each scale test. Also holds utility scripts used by every test (like exporting logs/metrics to S3). |
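
When contributing a new template, note that gomplate's `.Env.NAME` lookup is, as I understand it, an error when the variable is missing (unlike `getenv`, which tolerates it), so a new helper may want to check its inputs before rendering. A minimal sketch follows; `require_env` is a hypothetical helper and does not exist in `helpers`.

```shell
# Hypothetical helper: return non-zero and list the names if any of the given
# variables is missing from the environment (unset or not exported).
require_env() {
  missing=""
  for name in "$@"; do
    printenv "$name" >/dev/null || missing="$missing $name"
  done
  if [ -n "$missing" ]; then
    echo "Missing environment variables:$missing" >&2
    return 1
  fi
}

# Example with variables referenced by scale-cluster-config.yaml in this PR:
#   require_env CLUSTER_NAME K8S_VERSION AWS_REGION CLUSTER_TYPE
```

Checking with `printenv` rather than a plain shell variable test matters here because gomplate only sees exported variables.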
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,57 @@ | ||
#!/bin/bash | ||
# Copyright 2025 The Kubernetes Authors. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
### Helper script to create/delete eks ebs-scale-test clusters and install add-ons. | ||
|
||
set -euo pipefail | ||
|
||
# We expect this helper script is sourced from hack/ebs-scale-test | ||
path_to_cluster_setup_dir="${BASE_DIR}/helpers/cluster-setup/" | ||
|
||
## Cluster | ||
|
||
create_cluster() { | ||
if eksctl get cluster --name "$CLUSTER_NAME" --region "$AWS_REGION" >/dev/null 2>&1; then | ||
echo "EKS cluster '$CLUSTER_NAME' already up in $AWS_REGION." | ||
aws eks update-kubeconfig --name "$CLUSTER_NAME" --region "$AWS_REGION" | ||
else | ||
echo "Deploying EKS cluster. See configuration in $EXPORT_DIR/cluster-config.yaml" | ||
gomplate -f "$path_to_cluster_setup_dir/scale-cluster-config.yaml" -o "$EXPORT_DIR/cluster-config.yaml" | ||
eksctl create cluster -f "$EXPORT_DIR/cluster-config.yaml" | ||
fi | ||
} | ||
|
||
cleanup_cluster() { | ||
eksctl delete cluster --name "$CLUSTER_NAME" --region "$AWS_REGION" | ||
} | ||
|
||
## EBS CSI Driver | ||
|
||
deploy_ebs_csi_driver() { | ||
path_to_chart="${BASE_DIR}/../../charts/aws-ebs-csi-driver" | ||
echo "Deploying EBS CSI driver from chart $path_to_chart" | ||
|
||
helm upgrade --install aws-ebs-csi-driver \ | ||
--namespace kube-system \ | ||
--values "$DRIVER_VALUES_FILEPATH" \ | ||
--wait \ | ||
--timeout 15m \ | ||
"$path_to_chart" | ||
} | ||
|
||
(return 0 2>/dev/null) || ( | ||
echo "This script is not meant to be run directly, only sourced as a helper!" | ||
exit 1 | ||
) |
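
For reviewers unfamiliar with the guard that closes this helper: `return` only succeeds inside a function or a sourced file, so `(return 0 2>/dev/null)` is a common Bash way to tell whether a file is being sourced or executed. The sketch below reproduces the guard in isolation; the temporary file and the echoed messages are illustrative only.

```shell
# Write a tiny helper that ends with the same guard as the scripts in this PR.
guard_demo="$(mktemp)"
cat >"$guard_demo" <<'EOF'
(return 0 2>/dev/null) || (
  echo "This script is not meant to be run directly, only sourced as a helper!"
  exit 1
)
EOF

# Executing the file trips the guard, so bash exits non-zero.
if bash "$guard_demo" >/dev/null; then
  echo "unexpected: direct run was allowed"
else
  echo "direct run rejected"
fi

# Sourcing the file passes the guard, so a helper's functions would load.
. "$guard_demo" && echo "sourced OK"
```

Note that the guard relies on being the last command in the file: the `exit 1` runs in a subshell, and the script's exit status comes from that final command.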
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
# Copyright 2025 The Kubernetes Authors. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
apiVersion: eksctl.io/v1alpha5 | ||
kind: ClusterConfig | ||
metadata: | ||
name: {{ .Env.CLUSTER_NAME }} | ||
version: {{ .Env.K8S_VERSION }} | ||
region: {{ .Env.AWS_REGION }} | ||
tags: | ||
karpenter.sh/discovery: {{ .Env.CLUSTER_NAME }} | ||
|
||
iam: | ||
withOIDC: true | ||
podIdentityAssociations: | ||
- namespace: kube-system | ||
serviceAccountName: ebs-csi-controller-sa | ||
wellKnownPolicies: | ||
ebsCSIController: true | ||
|
||
managedNodeGroups: | ||
{{- if eq ( getenv "CLUSTER_TYPE" ) "pre-allocated" }} | ||
Review comment: We have to use templating here because for the Karpenter cluster type, we'll need a different node group. |
||
- instanceType: m7a.48xlarge | ||
amiFamily: AmazonLinux2 | ||
name: pre-allocated-ng | ||
desiredCapacity: {{ .Env.PRE_ALLOCATED_NODES }} | ||
{{- end }} | ||
|
||
addons: | ||
- name: eks-pod-identity-agent | ||
- name: snapshot-controller | ||
Comment on lines +40 to +42
Review comment: I'm considering pinning these dependencies to a specific version... We'll eventually have snapshot scale tests. And add it as a step to our dependency upgrade runbook. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
# Copyright 2025 The Kubernetes Authors. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
# Default values.yaml for ebs-scale-test installation of aws-ebs-csi-driver | ||
image: | ||
pullPolicy: Always | ||
controller: | ||
logLevel: 7 | ||
# One replica so that every sidecar leader runs in the same controller pod. | ||
replicaCount: 1 | ||
Review comment: This is a very pragmatic solution for solving the issue of different controller pods having different leaders for the sidecars. Can we add a comment explaining that? It's not immediately obvious. |
||
enableMetrics: true | ||
sidecars: | ||
provisioner: | ||
additionalArgs: ["--http-endpoint=:8081"] | ||
resizer: | ||
additionalArgs: ["--http-endpoint=:8082"] | ||
attacher: | ||
additionalArgs: ["--http-endpoint=:8084"] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,61 @@ | ||
#!/bin/bash | ||
# Copyright 2025 The Kubernetes Authors. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
# This helper exports EBS CSI Driver metrics, logs, and manifests. | ||
# EXPORT_DIR, S3_BUCKET, SCALABILITY_TEST_RUN_NAME, and AWS_REGION are | ||
# expected to be set by the caller. | ||
|
||
### Helper script for exporting EBS CSI Driver metrics to S3 bucket | ||
|
||
set -euo pipefail | ||
|
||
export_to_s3() { | ||
echo "Port-forwarding" | ||
controller_pod_name=$(kubectl get pod -n kube-system -l app=ebs-csi-controller -o jsonpath='{.items[0].metadata.name}') | ||
kubectl port-forward "$controller_pod_name" 3301:3301 -n kube-system & | ||
kubectl port-forward "$controller_pod_name" 8081:8081 -n kube-system & | ||
kubectl port-forward "$controller_pod_name" 8082:8082 -n kube-system & | ||
kubectl port-forward "$controller_pod_name" 8084:8084 -n kube-system & | ||
|
||
echo "Collecting metrics" | ||
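for port in 3301 8081 8082 8084; do | ||
attempts=0 | ||
# Give up after 60 attempts (about 5 minutes) so a dead port-forward cannot hang the run. | ||
until curl "http://localhost:${port}/metrics" >>"$EXPORT_DIR/metrics.txt"; do | ||
attempts=$((attempts + 1)) | ||
if [ "$attempts" -ge 60 ]; then | ||
echo "Giving up on port ${port} after ${attempts} attempts." >&2 | ||
return 1 | ||
fi | ||
echo "Failed to collect metrics from port ${port}, retrying..." | ||
sleep 5 | ||
done | ||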
Comment on lines +35 to +39
Review comment: Can we get stuck in an infinite loop here? If so, we should time out at some point. |
||
done | ||
|
||
echo "Collecting ebs-plugin logs" | ||
kubectl logs "$controller_pod_name" -n kube-system >"$EXPORT_DIR/ebs-plugin-logs.txt" | ||
|
||
echo "Collecting ebs-csi-controller Deployment and ebs-csi-node Daemonset yaml" | ||
kubectl get deployment ebs-csi-controller -n kube-system -o yaml >"$EXPORT_DIR/ebs-csi-controller.yaml" | ||
kubectl get daemonset ebs-csi-node -n kube-system -o yaml >"$EXPORT_DIR/ebs-csi-node.yaml" | ||
|
||
echo "Exporting everything in $EXPORT_DIR to S3" | ||
if ! aws s3 ls "s3://$S3_BUCKET"; then | ||
aws s3 mb "s3://$S3_BUCKET" --region "${AWS_REGION}" | ||
fi | ||
|
||
aws s3 sync "$EXPORT_DIR" "s3://$S3_BUCKET/$SCALABILITY_TEST_RUN_NAME" | ||
echo "Metrics exported to s3://$S3_BUCKET/$SCALABILITY_TEST_RUN_NAME/" | ||
} | ||
|
||
(return 0 2>/dev/null) || ( | ||
echo "This script is not meant to be run directly, only sourced as a helper!" | ||
exit 1 | ||
) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
#!/bin/bash | ||
# Copyright 2025 The Kubernetes Authors. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
### Helper script for running EBS-backed StatefulSet scaling test | ||
|
||
# We expect this helper script is sourced from hack/ebs-scale-test | ||
path_to_scale_test_dir="${BASE_DIR}/helpers/scale-test/scale-sts-test" | ||
|
||
sts_scale_test() { | ||
manifest_path="$path_to_scale_test_dir/scale-sts.yaml" | ||
export_manifest_path="$EXPORT_DIR/scale-manifest.yaml" | ||
|
||
echo "Applying $manifest_path. Exported to $export_manifest_path" | ||
gomplate -f "$manifest_path" -o "$export_manifest_path" | ||
kubectl apply -f "$export_manifest_path" | ||
|
||
echo "Scaling StatefulSet to $REPLICAS replicas" | ||
kubectl scale sts --replicas "$REPLICAS" ebs-scale-test | ||
kubectl rollout status statefulset ebs-scale-test | ||
|
||
echo "Deleting StatefulSet" | ||
kubectl delete -f "$export_manifest_path" | ||
|
||
echo "Waiting for all PVs to be deleted" | ||
wait_for_pvs_to_delete | ||
} | ||
|
||
wait_for_pvs_to_delete() { | ||
while true; do | ||
pv_count=$(kubectl get pv --no-headers | wc -l) | ||
if [ "$pv_count" -eq 0 ]; then | ||
echo "No PVs exist in the cluster, proceeding..." | ||
break | ||
else | ||
echo "$pv_count PVs still exist, waiting..." | ||
sleep 5 | ||
fi | ||
done | ||
} | ||
|
||
(return 0 2>/dev/null) || ( | ||
echo "This script is not meant to be run directly, only sourced as a helper!" | ||
exit 1 | ||
) |
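
`wait_for_pvs_to_delete` above polls forever if a PV is never removed. One possible bound is sketched here as a generic helper. `wait_until`, `no_pvs_left`, and the argument order are hypothetical, and the 30-minute figure in the usage comment is an arbitrary example rather than a recommended value.

```shell
# Hypothetical helper: run a command every <interval> seconds until it
# succeeds, giving up once roughly <timeout> seconds of sleeping have passed.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command> [args...]
wait_until() {
  timeout="$1"
  interval="$2"
  shift 2
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "Timed out after ${elapsed}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}

# A check the StatefulSet test could pass in (succeeds when no PVs remain).
no_pvs_left() {
  [ "$(kubectl get pv --no-headers 2>/dev/null | wc -l)" -eq 0 ]
}

# Example: wait_until 1800 5 no_pvs_left
```

Keeping the polling logic separate from the check means the same helper could also bound other waits in the test, at the cost of one more function to source.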
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,68 @@ | ||
# Copyright 2025 The Kubernetes Authors. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
apiVersion: apps/v1 | ||
kind: StatefulSet | ||
metadata: | ||
name: ebs-scale-test | ||
spec: | ||
serviceName: "nginx" | ||
podManagementPolicy: "Parallel" | ||
replicas: 0 | ||
selector: | ||
matchLabels: | ||
app: ebs-scale-test | ||
template: | ||
metadata: | ||
labels: | ||
app: ebs-scale-test | ||
spec: | ||
containers: | ||
- name: nginx | ||
image: nginx:latest | ||
ports: | ||
- containerPort: 80 | ||
name: web | ||
volumeMounts: | ||
- name: vol | ||
mountPath: /usr/share/nginx/html | ||
resources: | ||
requests: | ||
memory: "256Mi" | ||
cpu: "250m" | ||
limits: | ||
memory: "256Mi" | ||
{{- if eq ( getenv "CLUSTER_TYPE" ) "karpenter" }} | ||
nodeSelector: | ||
karpenter.sh/nodepool: ebs-scale-test | ||
{{- end }} | ||
Comment on lines +46 to +49
Review comment: Example of why gomplate is useful for manifests. In an alternative version I was relying on Kustomize, but this approach was cleaner. |
||
volumeClaimTemplates: | ||
- metadata: | ||
name: vol | ||
spec: | ||
accessModes: [ "ReadWriteOnce" ] | ||
storageClassName: "ebs-scale-test" | ||
resources: | ||
requests: | ||
storage: 1Gi | ||
persistentVolumeClaimRetentionPolicy: | ||
whenDeleted: Delete | ||
--- | ||
apiVersion: storage.k8s.io/v1 | ||
kind: StorageClass | ||
Review comment: Thoughts on adding "ebs-scale-test: $CLUSTER_NAME" tags to each volume? Then when we clean up resources we can check for any leaked volumes (which doesn't happen in my testing, but better safe than sorry).
Review comment: +1. If there was for some reason an interruption in the test, like someone forgot to use nohup and disconnected from the network :), having the tags would also allow for easy manual deletion of leaked resources. |
||
metadata: | ||
name: ebs-scale-test | ||
provisioner: ebs.csi.aws.com | ||
reclaimPolicy: Delete | ||
volumeBindingMode: WaitForFirstConsumer |
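
On the tagging idea in the thread above: if each volume carried an `ebs-scale-test=$CLUSTER_NAME` tag (the driver's StorageClass `tagSpecification_N` parameters are one way to do that, and this manifest does not currently set them), cleanup could list leftovers with `aws ec2 describe-volumes`. The sketch below assumes that tag exists; `list_leaked_volumes` is a hypothetical name.

```shell
# Hypothetical helper: print the IDs of EBS volumes still tagged for this
# scale-test cluster. Assumes volumes were tagged ebs-scale-test=<cluster name>.
list_leaked_volumes() {
  cluster_name="$1"
  region="$2"
  aws ec2 describe-volumes \
    --region "$region" \
    --filters "Name=tag:ebs-scale-test,Values=${cluster_name}" \
    --query 'Volumes[].VolumeId' \
    --output text
}

# Example: list_leaked_volumes "$CLUSTER_NAME" "$AWS_REGION"
```

An empty result after `scale-test cleanup` would suggest no volumes leaked; any IDs printed could then be inspected or deleted manually.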
Review comment: Shall I add wording about results being exported to local dir + S3 bucket here? Or is that implicit?
Review comment: I would prefer us to be explicit.
Also, should we add a note about the permissions one needs to successfully run this test?