Create EBS CSI Driver scale-test tool #2292

Open. Wants to merge 1 commit into base: master
76 changes: 76 additions & 0 deletions hack/ebs-scale-test/README.md
@@ -0,0 +1,76 @@
# EBS CSI Driver Scalability Tests

EBS uses EBS CSI Driver scalability tests to validate that each release of our driver can manage the EBS volume lifecycle in large-scale clusters.

Set up and run an EBS CSI Driver scalability test with our `scale-test` tool:
Contributor Author

Shall I add wording about results being exported to a local directory and an S3 bucket here? Or is that implicit?

Contributor

I would prefer us to be explicit.

Also, should we add a note about the permissions one needs to successfully run this test?


```shell
# Set scalability parameters
export CLUSTER_TYPE="pre-allocated"
export TEST_TYPE="scale-sts"
export REPLICAS="1000"

# Set up an EKS scalability cluster and install the EBS CSI Driver.
./scale-test setup

# Run a scalability test and export results to S3.
./scale-test run

# Clean up all AWS resources related to the scalability cluster.
./scale-test cleanup
```

## Pre-requisites

REVIEWER NOTE: I'm open to relying on `make tools` /bin dependencies. But that might be confusing to those just wanting to run scale tests.
Contributor Author

Discuss!

Member

I think it might be best to rely on these so that we can ensure the smallest level of variance based on personal setups. It makes it easier to help someone if they are having trouble running the tests as we know what dependencies they are using.


Install the following command-line tools:
- [gomplate](https://github.com/hairyhenderson/gomplate)
- [aws cli v2](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html)
- [eksctl](https://eksctl.io/installation/)
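- [kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl)
- [helm](https://helm.sh/docs/intro/install/) (called by `manage-cluster.sh` to install the driver chart)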
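
A quick preflight check can catch a missing tool before any AWS resources are created. The sketch below is not part of this PR; `check_prerequisites` is a hypothetical helper name, and the tool list in the usage comment is an assumption based on the commands the helper scripts call.

```shell
# Hypothetical preflight helper (not part of this PR).
# Prints every missing tool to stderr and returns non-zero if any is missing.
check_prerequisites() {
  local missing=0
  local tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "Missing required tool: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example usage (tool list is an assumption):
# check_prerequisites gomplate aws eksctl kubectl helm || exit 1
```

Reporting all missing tools in one pass, rather than failing on the first, saves a round trip for someone setting up a fresh machine.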

## Overridable parameters

You can change the type of scalability cluster and test that are run, or the names of script artifacts, through environment variables.

Note: The environment variables set when you run `scale-test setup` must remain the same for future `scale-test run`/`scale-test cleanup` commands on that scalability cluster.

```sh
# Affect test
CLUSTER_TYPE # Type of scalability cluster to create.
TEST_TYPE # Type of scale test to run.
REPLICAS # Number of StatefulSet replicas to create.
DRIVER_VALUES_FILEPATH # Custom values file passed to EBS CSI Driver Helm chart.

# Names
CLUSTER_NAME # Base name used by `eksctl` to create AWS resources.
EXPORT_DIR # Where to export scale test metrics/logs locally.
Contributor Author

Should add .gitignore for default export_dir path

Member

IMO yes it will make making changes to scale tests after running them easier.

S3_BUCKET # Name of S3 bucket used for holding scalability run results.
SCALABILITY_TEST_RUN_NAME # Name of test run. Used as name of directory for adding run results in $S3_BUCKET.

# Find default values at top of `scale-test` script.
```

## Types of scalability tests

Set the `CLUSTER_TYPE` and `TEST_TYPE` environment variables to set up and run different scalability tests.

- `CLUSTER_TYPE` dictates what type of scalability cluster `scale-test` creates and which nodes are used during a scalability test run. Options include:
  - 'pre-allocated': Additional worker nodes are created during cluster setup. By default, we pre-allocate 1 `m7a.48xlarge` EC2 instance for every 100 StatefulSet replicas.

Contributor Author

Extra newline will fix in later revision


- `TEST_TYPE` dictates what type of scalability test we want to run. Options include:
  - 'scale-sts': Scales a StatefulSet to `$REPLICAS`. Waits for all pods to be ready. Deletes the StatefulSet. Waits for all PVs to be deleted. Exercises
Contributor

'Exercises '?

Contributor Author

Great catch. Was thinking about adding the sentence: "Exercises the complete dynamic provisioning lifecycle for block volumes."


You can mix and match `CLUSTER_TYPE` and `TEST_TYPE`.
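
To make the 'pre-allocated' sizing rule concrete, here is a small sketch of the arithmetic. It assumes the default ratio of 1 node per 100 replicas stated above and assumes the count is rounded up; `nodes_for_replicas` is a hypothetical helper, and how the real script derives `PRE_ALLOCATED_NODES` is not shown in this diff.

```shell
# Hypothetical helper (not part of this PR): number of pre-allocated nodes
# for a given replica count, using integer ceiling division.
# $1 = replicas, $2 = replicas per node (assumed default: 100).
nodes_for_replicas() {
  local replicas="$1"
  local replicas_per_node="${2:-100}"
  echo $(( (replicas + replicas_per_node - 1) / replicas_per_node ))
}

# Example usage:
# export PRE_ALLOCATED_NODES="$(nodes_for_replicas "$REPLICAS")"
```

With `REPLICAS=1000` this yields 10 nodes, and 1001 replicas would round up to 11.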

## Contributing scalability tests

`scale-test` parses arguments and wraps scripts and configuration files in the `helpers` directory. These helper scripts manage the scalability cluster and test runs.

We rely on [gomplate](https://github.com/hairyhenderson/gomplate) to render configuration files based on environment variables.
Member

np: Can remove this snippet and include the relevant context directly in the Pre-requisites section above, ie:

## Pre-requisites

Install the following commandline tools:
- [gomplate](https://github.com/hairyhenderson/gomplate) - used to render configuration files based on environment variables. 


The `helpers` directory includes:
- `/helpers/cluster-setup`: Holds scripts and configuration for cluster and add-on setup/cleanup.
Member

@ElijahQuinones ElijahQuinones Jan 21, 2025

np: add-on here suggests that it is the eks addon version of the ebs-csi-driver (along with the other addons ) that we are installing but looking at manage-cluster.sh below we are installing via helm.

Contributor Author

Add-on refers to eks-pod-identity and snapshot-controller addons.

I can remove "add-on" from the sentence, though, if that context is not needed.

- `/helpers/scale-test`: Holds directory for each scale test. Also holds utility scripts used by every test (like exporting logs/metrics to S3).
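
The `scale-test` entrypoint itself is not included in this diff, so the following is only a guess at its shape: a minimal subcommand dispatcher that calls the helper functions defined in the files below. The function `scale_test_main`, the mapping of subcommands to helpers, and the `scale-sts` default are all assumptions for illustration.

```shell
# Hypothetical sketch of a subcommand dispatcher (the real `scale-test`
# script is not shown in this diff). It assumes the helper scripts have
# already been sourced, so their functions are available.
scale_test_main() {
  case "${1:-}" in
    setup)
      create_cluster
      deploy_ebs_csi_driver
      ;;
    run)
      # Pick the test implementation from TEST_TYPE (default is an assumption).
      case "${TEST_TYPE:-scale-sts}" in
        scale-sts) sts_scale_test ;;
        *)
          echo "Unknown TEST_TYPE: ${TEST_TYPE}" >&2
          return 1
          ;;
      esac
      export_to_s3
      ;;
    cleanup)
      cleanup_cluster
      ;;
    *)
      echo "Usage: scale-test [setup|run|cleanup]" >&2
      return 1
      ;;
  esac
}
```

Under this layout, contributing a new test would mean adding a directory under `helpers/scale-test` and one more branch in the `TEST_TYPE` case.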
57 changes: 57 additions & 0 deletions hack/ebs-scale-test/helpers/cluster-setup/manage-cluster.sh
@@ -0,0 +1,57 @@
#!/bin/bash
# Copyright 2025 The Kubernetes Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

### Helper script to create/delete eks ebs-scale-test clusters and install add-ons.

set -euo pipefail

# We expect this helper script to be sourced from hack/ebs-scale-test
path_to_cluster_setup_dir="${BASE_DIR}/helpers/cluster-setup"

## Cluster

create_cluster() {
  if eksctl get cluster --name "$CLUSTER_NAME" --region "$AWS_REGION" >/dev/null 2>&1; then
    echo "EKS cluster '$CLUSTER_NAME' already up in $AWS_REGION."
    aws eks update-kubeconfig --name "$CLUSTER_NAME" --region "$AWS_REGION"
  else
    echo "Deploying EKS cluster. See configuration in $EXPORT_DIR/cluster-config.yaml"
    gomplate -f "$path_to_cluster_setup_dir/scale-cluster-config.yaml" -o "$EXPORT_DIR/cluster-config.yaml"
    eksctl create cluster -f "$EXPORT_DIR/cluster-config.yaml"
  fi
}

cleanup_cluster() {
  # Pass the region explicitly so cleanup targets the same cluster that setup created.
  eksctl delete cluster --name "$CLUSTER_NAME" --region "$AWS_REGION"
}

## EBS CSI Driver

deploy_ebs_csi_driver() {
  path_to_chart="${BASE_DIR}/../../charts/aws-ebs-csi-driver"
  echo "Deploying EBS CSI driver from chart $path_to_chart"

  helm upgrade --install aws-ebs-csi-driver \
    --namespace kube-system \
    --values "$DRIVER_VALUES_FILEPATH" \
    --wait \
    --timeout 15m \
    "$path_to_chart"
}

(return 0 2>/dev/null) || (
  echo "This script is not meant to be run directly, only sourced as a helper!"
  exit 1
)
@@ -0,0 +1,42 @@
# Copyright 2025 The Kubernetes Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: {{ .Env.CLUSTER_NAME }}
  # Quoted so that a version such as 1.30 is not parsed as the YAML number 1.3.
  version: "{{ .Env.K8S_VERSION }}"
  region: {{ .Env.AWS_REGION }}
  tags:
    karpenter.sh/discovery: {{ .Env.CLUSTER_NAME }}

iam:
  withOIDC: true
  podIdentityAssociations:
    - namespace: kube-system
      serviceAccountName: ebs-csi-controller-sa
      wellKnownPolicies:
        ebsCSIController: true

managedNodeGroups:
{{- if eq ( getenv "CLUSTER_TYPE" ) "pre-allocated" }}
Contributor Author

We have to use templating here because for Karpenter cluster-type, we'll need a different nodegroup.

  - instanceType: m7a.48xlarge
    amiFamily: AmazonLinux2
    name: pre-allocated-ng
    desiredCapacity: {{ .Env.PRE_ALLOCATED_NODES }}
{{- end }}

addons:
  - name: eks-pod-identity-agent
  - name: snapshot-controller
Comment on lines +40 to +42
Contributor Author

@AndrewSirenko AndrewSirenko Jan 16, 2025

I'm considering pinning these dependencies to a specific version... We'll eventually have snapshot scale tests.

And add it as a step to our dependency upgrade runbook.

28 changes: 28 additions & 0 deletions hack/ebs-scale-test/helpers/cluster-setup/scale-driver-values.yaml
@@ -0,0 +1,28 @@
# Copyright 2025 The Kubernetes Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Default values.yaml for ebs-scale-test installation of aws-ebs-csi-driver
image:
  pullPolicy: Always
controller:
  logLevel: 7
  replicaCount: 1
Member

This is a very pragmatic solution to the issue of different controller pods having different leaders for the sidecars. Can we add a comment explaining that? It's not immediately obvious.

  enableMetrics: true
sidecars:
  provisioner:
    additionalArgs: ["--http-endpoint=:8081"]
  resizer:
    additionalArgs: ["--http-endpoint=:8082"]
  attacher:
    additionalArgs: ["--http-endpoint=:8084"]
61 changes: 61 additions & 0 deletions hack/ebs-scale-test/helpers/scale-test/export-to-s3.sh
@@ -0,0 +1,61 @@
#!/bin/bash
# Copyright 2025 The Kubernetes Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

### Helper script for exporting EBS CSI Driver metrics, logs, and manifests to an S3 bucket

set -euo pipefail

export_to_s3() {
  echo "Port-forwarding"
  controller_pod_name=$(kubectl get pod -n kube-system -l app=ebs-csi-controller -o jsonpath='{.items[0].metadata.name}')
  kubectl port-forward "$controller_pod_name" 3301:3301 -n kube-system &
  kubectl port-forward "$controller_pod_name" 8081:8081 -n kube-system &
  kubectl port-forward "$controller_pod_name" 8082:8082 -n kube-system &
  kubectl port-forward "$controller_pod_name" 8084:8084 -n kube-system &

  echo "Collecting metrics"
  for port in 3301 8081 8082 8084; do
    while true; do
      curl "http://localhost:${port}/metrics" >>"$EXPORT_DIR/metrics.txt" && break
      echo "Failed to collect metrics from port ${port}, retrying..."
      sleep 5
    done
Comment on lines +35 to +39
Member

Can we get stuck in an infinite loop here? if so, we should time out at some point.

  done

  echo "Collecting ebs-plugin logs"
  kubectl logs "$controller_pod_name" -n kube-system >"$EXPORT_DIR/ebs-plugin-logs.txt"

  echo "Collecting ebs-csi-controller Deployment and ebs-csi-node Daemonset yaml"
  kubectl get deployment ebs-csi-controller -n kube-system -o yaml >"$EXPORT_DIR/ebs-csi-controller.yaml"
  kubectl get daemonset ebs-csi-node -n kube-system -o yaml >"$EXPORT_DIR/ebs-csi-node.yaml"

  echo "Exporting everything in $EXPORT_DIR to S3"
  if ! aws s3 ls "s3://$S3_BUCKET"; then
    aws s3 mb "s3://$S3_BUCKET" --region "${AWS_REGION}"
  fi

  aws s3 sync "$EXPORT_DIR" "s3://$S3_BUCKET/$SCALABILITY_TEST_RUN_NAME"
  echo "Metrics exported to s3://$S3_BUCKET/$SCALABILITY_TEST_RUN_NAME/"
}

(return 0 2>/dev/null) || (
  echo "This script is not meant to be run directly, only sourced as a helper!"
  exit 1
)
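
On the reviewer's question about the `while true` retry loop in `export_to_s3`: one possible way to bound it is a small retry wrapper with a deadline. This is a sketch only; `retry_until_timeout` is a hypothetical helper, and the 300-second timeout and per-port output file in the usage comment are illustrative assumptions, not values from this PR.

```shell
# Hypothetical helper (not part of this PR): rerun a command until it
# succeeds or a deadline passes.
# $1 = timeout in seconds, $2 = seconds between attempts, rest = command.
# Returns 0 on success and 1 on timeout.
retry_until_timeout() {
  local timeout_seconds="$1"
  local interval_seconds="$2"
  shift 2
  local deadline=$(( $(date +%s) + timeout_seconds ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "Timed out after ${timeout_seconds}s: $*" >&2
      return 1
    fi
    sleep "$interval_seconds"
  done
}

# Example usage (assumes port-forwarding is already running):
# retry_until_timeout 300 5 curl --silent --show-error --fail \
#   --output "$EXPORT_DIR/metrics-3301.txt" "http://localhost:3301/metrics"
```

The same wrapper could bound `wait_for_pvs_to_delete` in `scale-sts.sh`. Because the command runs as the condition of `until`, a failing attempt does not trip `set -e` in the sourcing script.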
56 changes: 56 additions & 0 deletions hack/ebs-scale-test/helpers/scale-test/scale-sts-test/scale-sts.sh
@@ -0,0 +1,56 @@
#!/bin/bash
# Copyright 2025 The Kubernetes Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

### Helper script for running EBS-backed StatefulSet scaling test

# We expect this helper script to be sourced from hack/ebs-scale-test
path_to_scale_test_dir="${BASE_DIR}/helpers/scale-test/scale-sts-test"

sts_scale_test() {
  manifest_path="$path_to_scale_test_dir/scale-sts.yaml"
  export_manifest_path="$EXPORT_DIR/scale-manifest.yaml"

  echo "Applying $manifest_path. Exported to $export_manifest_path"
  gomplate -f "$manifest_path" -o "$export_manifest_path"
  kubectl apply -f "$export_manifest_path"

  echo "Scaling StatefulSet to $REPLICAS replicas"
  kubectl scale sts --replicas "$REPLICAS" ebs-scale-test
  kubectl rollout status statefulset ebs-scale-test

  echo "Deleting StatefulSet"
  kubectl delete -f "$export_manifest_path"

  echo "Waiting for all PVs to be deleted"
  wait_for_pvs_to_delete
}

# Polls every 5 seconds until no PersistentVolumes remain in the cluster.
wait_for_pvs_to_delete() {
  while true; do
    pv_count=$(kubectl get pv --no-headers | wc -l)
    if [ "$pv_count" -eq 0 ]; then
      echo "No PVs exist in the cluster, proceeding..."
      break
    else
      echo "$pv_count PVs still exist, waiting..."
      sleep 5
    fi
  done
}

(return 0 2>/dev/null) || (
  echo "This script is not meant to be run directly, only sourced as a helper!"
  exit 1
)
@@ -0,0 +1,68 @@
# Copyright 2025 The Kubernetes Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ebs-scale-test
spec:
  serviceName: "nginx"
  podManagementPolicy: "Parallel"
  replicas: 0
  selector:
    matchLabels:
      app: ebs-scale-test
  template:
    metadata:
      labels:
        app: ebs-scale-test
    spec:
      containers:
        - name: nginx
          image: nginx:latest
          ports:
            - containerPort: 80
              name: web
          volumeMounts:
            - name: vol
              mountPath: /usr/share/nginx/html
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "256Mi"
      {{- if eq ( getenv "CLUSTER_TYPE" ) "karpenter" }}
      nodeSelector:
        karpenter.sh/nodepool: ebs-scale-test
      {{- end }}
Comment on lines +46 to +49
Contributor Author

Example of why gomplate is useful for manifests. In an alternative version I was relying on Kustomize, but this approach was cleaner.

  volumeClaimTemplates:
    - metadata:
        name: vol
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: "ebs-scale-test"
        resources:
          requests:
            storage: 1Gi
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
Contributor Author

Thoughts on adding "ebs-scale-test: $CLUSTER_NAME" tags to each volume? And then when we cleanup resources we can check for any leaked volumes (which doesn't happen in my testing but better safe than sorry).

Member

@ElijahQuinones ElijahQuinones Jan 21, 2025

+1 If there was for some reason an interruption in the test like someone forgot to use nohup and disconnected from the network :) having the tags would also allow for easy manual deletion of leaked resources.

metadata:
  name: ebs-scale-test
provisioner: ebs.csi.aws.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
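
If the volume tagging idea from the review thread above were adopted, a leak check during cleanup might look like the sketch below. It assumes each provisioned volume carries an `ebs-scale-test=<cluster name>` tag, which is only a proposal in this review and is not set by the manifest as written; `list_leaked_volumes` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of this PR): print the IDs of EBS volumes
# that still carry the proposed `ebs-scale-test=<cluster name>` tag.
# $1 = cluster name, $2 = AWS region.
list_leaked_volumes() {
  local cluster_name="$1"
  local region="$2"
  aws ec2 describe-volumes \
    --region "$region" \
    --filters "Name=tag:ebs-scale-test,Values=${cluster_name}" \
    --query 'Volumes[].VolumeId' \
    --output text
}

# Example usage:
# leaked="$(list_leaked_volumes "$CLUSTER_NAME" "$AWS_REGION")"
# [ -z "$leaked" ] || echo "Leaked volumes: $leaked" >&2
```

Listing rather than deleting keeps the check safe to run automatically; removing leaked volumes can remain a deliberate manual step.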