rework ansible-like loadtest helpers #48634
base: master
```
@@ -0,0 +1,2 @@
+vars.env
+state
```
```
@@ -1,16 +1,57 @@
-# Ansible-like OpenSSH sessions load test
+# ansible-like openssh loadtest
 
-This setup is designed to be ran from the home directory of a VM (the default working directory for a systemd user service); the proxy public address and cluster name should be changed in `gen_inventory.sh`, `proxy_templates.yaml` and `tbot.yaml` from `PROXYHOST` and `CLUSTERNAME` respectively. It requires openssh, jq, xargs, and dumb-init, as well as tbot and fdpass-teleport.
+This setup is intended to generate fake ansible-like load by spawning massive numbers
+of sessions against a large number of teleport nodes. It uses tbot/machineid with ssh multiplexing
+to support the needed volume of sessions (as one would do with an ansible master that manages
+a massive number of servers via teleport).
 
-This setup assumes that nodes are being ran by the `node-agent` Helm chart, and proxy templates are applied to do predicate-based dialing on the NODENAME label, as the chart sets up. Commenting or blanking the `proxy_templates.yaml` file (and restarting tbot) will change it to hostname-based dialing. Changing the `proxy_templates.yaml` file (and restarting tbot) can also be used to test a simpler predicate, or to test search-based dialing rather than predicate-based dialing.
+This setup is designed to be run on a fresh VM instance, and will perform various
+installs and system configuration actions.
 
-Bot and token can be created with `tctl -f loadtest-bot.yaml`, after editing the IAM account and role in it. Token-based joining with tbot is incredibly annoying, so IAM joining or some other ambient-based joining method is recommended. Running the `node-agent` chart is left as an exercise for the reader.
+It expects the following to already be installed on the system:
 
-The machine running the client should be scaled depending on how many nodes are targeted in the inventory; for 60000 nodes (i.e. 60k shell scripts and 120k ssh processes running at peak) the memory usage with Teleport 15 seems to be ~20GiB for tbot and ~200 for the scripts and SSH, so something like an AWS 32xlarge or 48xlarge might be necessary (maybe the compute-optimized variants, as memory isn't really a problem). Depending on the scale of the test and the runner machine, tuning GOMAXPROCS and GOMEMLIMIT in tbot.service might be useful.
+- `openssh`
+- `jq`
+- `xargs`
+
+It will perform installation of the following:
+
+- all default teleport binaries (namely, `tbot`)
+- `fdpass-teleport`
+- `dumb-init`
+
+By default, this test setup assumes that the `node-agents` loadtest helm chart is being
+used. The proxy templates generated rely on labels set by that helm chart. After setup
+is run, it is possible to customize the proxy template used by editing `/etc/tbot/proxy-templates.yaml`.
+
+Given the extreme scale of tests run with this setup, it is typically necessary to use a
+very large VM. For example, 60k agent tests are typically run from a 32xlarge or 48xlarge
+instance, either general purpose or compute optimized.
 
 ## Usage
 
-- Run `tbot_install.sh` to set up tbot (it will install a specific Teleport version as listed in the script, tweak it as required), or `systemctl --user restart tbot.service` if tbot is already set up.
-- Run the `gen_inventory.sh` script to produce a list of hosts in random order in the `inventory` file, check that it matches the expected list of hosts.
-- Choose a random host in the inventory and confirm that the setup is working with `ssh -F tbot_destdir_mux/ssh_config root@host`.
-- Run `run.sh >/dev/null` (in tmux, probably). In a different terminal or tab, check how many sockets are being opened in the ssh controlmaster directory with `ls -1 /run/user/1000/ssh-control | wc -l` to confirm that connections are being established and muxed by ssh. Logs for tbot can be viewed with `journalctl --user-unit tbot --follow`.
+- Copy `example.vars.env` to `vars.env` and edit the copy. The `PROXY_HOST` variable
+  and `BOT_TOKEN` variable *must* be changed.
+
+- Run `install.sh` once to install `tbot`, `fdpass-teleport`, and `dumb-init`. This only
+  ever needs to be run once.
+
+- Run `init.sh` to set up tbot directories/configuration and start `tbot.service`. If this needs
+  to be re-run (e.g. if the proxy host or token need to be changed), it may be necessary to first
+  manually halt the tbot service.
+
+- Run `journalctl -u tbot.service` to verify that `tbot` has successfully authenticated with the cluster.
+
+- Run `gen-inventory.sh` to generate a list of all target teleport nodes. This only needs to be re-run
+  if/when the set of agents changes.
+
+- Verify that the setup is functional by selecting a random host from `state/inventory` and attempting to
+  access it via `ssh -F /opt/machine-id/ssh_config root@host`.
+
+- Run `run.sh` to run the actual test scenario. This will invoke `run-node.sh` for each member of
+  the generated inventory and report success/failure of individual attempted sessions. Note that for
+  large-scale tests the output of this script is enormous and may need to be piped to `/dev/null`.
+  Long-running invocations should be performed within a `tmux` session or similar.
+
+- Verify that ssh connections are being established and multiplexed by monitoring the control master
+  directory with `ls -1 /run/user/1000/ssh-control | wc -l`.
```
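The control-socket check in the last step can be wrapped in a small helper. This is an illustrative sketch only; the `count_sockets` name and the `CONTROL_DIR` override are not part of the PR, though the default path matches the `-S` socket directory used by `run-node.sh`:

```shell
#!/bin/bash
# Sketch: count live ssh ControlMaster sockets to confirm multiplexing.
# CONTROL_DIR defaults to the -S path used by run-node.sh; count_sockets
# is a hypothetical helper name.
CONTROL_DIR="${CONTROL_DIR:-/run/user/1000/ssh-control}"

count_sockets() {
    # One filename per line; a missing directory simply counts as zero.
    ls -1 "$1" 2>/dev/null | wc -l
}

count_sockets "$CONTROL_DIR"
```

Watching this number climb toward the inventory size is a quick sanity check that sessions are being muxed rather than each spawning a fresh TCP connection.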
```
@@ -0,0 +1,24 @@
+# proxy host is the hostname of the target teleport cluster.
+export PROXY_HOST="<proxy-hostname>"
+
+# proxy port is the port of the teleport web api (typically 443 or 3080).
+export PROXY_PORT="443"
+
+# bot token is the join token that tbot will use to authenticate to the teleport
+# cluster. the provided token *must* have the requisite roles to allow for ssh
+# server access.
+export BOT_TOKEN="<tbot-join-token>"
+
+# bot user is the local user that the bot service should run as, and determines
+# the ownership of the credentials and sockets created in /opt/machine-id. this
+# must match the user that will be running the ssh load generation.
+export BOT_USER="$USER"
+
+# teleport artifact is the target artifact from which teleport binaries will be
+# installed.
+export TELEPORT_ARTIFACT="teleport-v17.0.0-alpha.4-linux-amd64-bin.tar.gz"
+
+# teleport CDN should likely be one of cdn.cloud.gravitational.io or cdn.teleport.dev,
+# the staging and prod cdns respectively.
+export TELEPORT_CDN="cdn.cloud.gravitational.io"
```
```
@@ -0,0 +1,17 @@
+#!/bin/bash
+
+set -euo pipefail
+
+cd "$(dirname "$0")"
+
+source vars.env
+
+mkdir -p state
+
+echo "attempting to build inventory..." >&2
+
+tsh -i /opt/machine-id/identity --proxy "${PROXY_HOST:?}:${PROXY_PORT:?}" ls --format=json > state/inventory.json
+
+jq -r '.[] | select(.metadata.expires > (now | strftime("%Y-%m-%dT%H:%M:%SZ"))) | .spec.hostname + ".scale-crdb.cloud.gravitational.io"' < state/inventory.json | sort -R > state/inventory
+
+echo "successfully generated inventory node_count=$(wc -l < state/inventory)" >&2
```
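The `jq` filter above drops expired nodes by comparing `.metadata.expires` against the current time as strings, which works because fixed-width ISO-8601 UTC timestamps sort lexicographically. The same trick in bash alone (the `is_live` helper is illustrative, not part of the script):

```shell
#!/bin/bash
# ISO-8601 UTC timestamps sort lexicographically, so a plain string
# comparison against "now" distinguishes live from expired entries,
# just like the .metadata.expires filter in gen-inventory.sh.
now="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

is_live() {
    # $1: an expiry timestamp such as 2030-01-01T00:00:00Z
    [[ "$1" > "$now" ]]
}

is_live "2999-01-01T00:00:00Z" && echo "live"
```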
This file was deleted.
```
@@ -0,0 +1,72 @@
+#!/bin/bash
+
+set -euo pipefail
+
+cd "$(dirname "$0")"
+
+source vars.env
+
+if systemctl is-active -q tbot.service; then
+    echo "stopping extant tbot.service..." >&2
+    sudo systemctl stop tbot.service
+fi
+
+sudo mkdir -p /etc/tbot
+
+sudo mkdir -p /var/lib/teleport/bot
+
+sudo chown -R "${BOT_USER:?}:${BOT_USER:?}" /var/lib/teleport/bot
+
+sudo mkdir -p /opt/machine-id
+
+sudo chown -R "${BOT_USER:?}:${BOT_USER:?}" /opt/machine-id
+
+echo "generating tbot config..." >&2
+
+sudo tee /etc/tbot.yaml > /dev/null <<EOF
+version: v2
+proxy_server: ${PROXY_HOST:?}:${PROXY_PORT:?}
+diag_addr: "0.0.0.0:3000"
+onboarding:
+  join_method: token
+  token: ${BOT_TOKEN:?}
+outputs:
+  - type: identity
+    destination:
+      type: directory
+      path: /opt/machine-id
+storage:
+  type: directory
+  path: /var/lib/teleport/bot
```

> Review comment on lines +40 to +41: With IAM we don't need a storage directory at all, which is sort of a recommended-ish stateless setup AFAIK.

```
+services:
+  - type: ssh-multiplexer
+    destination:
+      type: directory
+      path: /opt/machine-id
```

> Review comment on lines +38 to +46: We should not be using the same directory for competing outputs, this is very much not supported and will probably break as soon as the wrong ssh_config ends up being used.

```
+    enable_resumption: true
+    proxy_command:
+      - fdpass-teleport
+    proxy_templates_path: /etc/tbot/proxy-templates.yaml
+EOF
+
+echo "generating proxy templates..." >&2
+
+sudo tee /etc/tbot/proxy-templates.yaml > /dev/null <<EOF
+proxy_templates:
+  - template: "^(.*).${PROXY_HOST:?}:[0-9]+$" # <nodename>.<clustername>:<port>
+    query: 'contains(split(labels.NODENAME, ","), "\$1")'
+EOF
+
+echo "installing tbot systemd unit..." >&2
+
+sudo tbot install systemd --write --force --config /etc/tbot.yaml --user "${BOT_USER:?}" --group "${BOT_USER:?}"
+
+echo "starting tbot.service..." >&2
+
+sudo systemctl daemon-reload
+
+sudo systemctl start tbot.service
```
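One subtlety in the proxy-templates heredoc above: `${PROXY_HOST:?}` must expand when the config is generated, while the regex capture reference must survive into the YAML as a literal `$1`, hence the `\$1` escape. A minimal reproduction with placeholder values:

```shell
#!/bin/bash
# In an unquoted heredoc the shell expands ${PROXY_HOST}, but \$ is
# emitted as a literal $, preserving the $1 capture-group reference
# for the consumer of the generated YAML.
PROXY_HOST="teleport.example.com"  # placeholder value

rendered="$(cat <<EOF
template: "^(.*).${PROXY_HOST}:[0-9]+\$"
query: 'contains(split(labels.NODENAME, ","), "\$1")'
EOF
)"
echo "$rendered"
```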
```
@@ -0,0 +1,34 @@
+#!/bin/bash
+
+set -euo pipefail
+
+cd "$(dirname "$0")"
+
+source vars.env
+
+mkdir -p state
+
+cd state
+
+echo "installing teleport..." >&2
```

> Review comment: Can we just use the repository or at least the distro packages to install Teleport? We are installing from a real tarball anyway, we should also have the packages.

```
+
+wget -q "https://${TELEPORT_CDN:?}/${TELEPORT_ARTIFACT:?}"
+
+tar -xf "${TELEPORT_ARTIFACT:?}"
+
+rm "${TELEPORT_ARTIFACT:?}"
+
+sudo ./teleport/install
+
+echo "installing fdpass-teleport..." >&2
+
+sudo cp ./teleport/fdpass-teleport "$(dirname "$(which teleport)")"
+
+rm -rf ./teleport
+
+echo "installing dumb-init..." >&2
+
+sudo wget -q -O /usr/local/bin/dumb-init https://github.com/Yelp/dumb-init/releases/download/v1.2.5/dumb-init_1.2.5_x86_64
+
+sudo chmod +x /usr/local/bin/dumb-init
```

> Review comment on lines +30 to +34: dumb-init is just a package on ubuntu - is it not installable in amazon linux?
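A possible hardening for the download steps above, sketched under assumptions: the `verify_artifact` helper and the idea of pinning a known sha256 digest are not part of this PR, but checking the tarball before `tar -xf` would guard against truncated or tampered downloads:

```shell
#!/bin/bash
# Hypothetical hardening sketch: verify a downloaded artifact against a
# pinned sha256 digest before unpacking. Not part of install.sh.
verify_artifact() {
    # $1: artifact path, $2: expected sha256 hex digest.
    # sha256sum -c reads "digest  filename" lines from stdin.
    echo "$2  $1" | sha256sum -c --quiet -
}
```

Usage would look like `verify_artifact "${TELEPORT_ARTIFACT:?}" "$EXPECTED_SHA256" || exit 1` immediately after the `wget`, with `EXPECTED_SHA256` supplied via `vars.env` (an assumed variable).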
This file was deleted.
This file was deleted.
```
@@ -1,9 +1,12 @@
-#!/bin/sh
-cd "$( dirname -- "${0}" )" || exit 1
+#!/bin/bash
+
+set -euo pipefail
+
+cd "$(dirname "$0")"
 
 sleep "$( echo "90 * $(od -An -N4 -tu4 /dev/urandom) / 4294967295" | bc -l )"
 
-ssh_opts="-qn -F tbot_destdir_mux/ssh_config -S /run/user/1000/ssh-control/%C -o ControlMaster=auto -o ControlPersist=60s -o Ciphers=^[email protected] -l root"
+ssh_opts="-qn -F /opt/machine-id/ssh_config -S /run/user/1000/ssh-control/%C -o ControlMaster=auto -o ControlPersist=60s -o Ciphers=^[email protected] -l root"
 
 i=0
 while [ $i -lt 10000 ] ; do
```
---|---|---|
@@ -1,6 +1,9 @@ | ||
#!/bin/sh | ||
cd "$( dirname -- "${0}" )" || exit 1 | ||
#!/bin/bash | ||
|
||
set -euo pipefail | ||
|
||
cd "$(dirname "$0")" | ||
|
||
mkdir -p /run/user/1000/ssh-control | ||
|
||
exec dumb-init xargs -P 0 -I % ./run_node.sh % < inventory | ||
exec dumb-init xargs -P 0 -I % ./run-node.sh % < state/inventory |
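The fan-out in `run.sh` works by having `xargs` start one worker process per inventory line, with `-P 0` placing no limit on parallelism. A bounded miniature of the same pattern, with `echo` standing in for `./run-node.sh` and made-up node names:

```shell
#!/bin/bash
# Miniature of the run.sh fan-out: one worker per input line, at most
# 4 in flight (-P 4); run.sh uses -P 0, i.e. unlimited parallelism.
out="$(printf '%s\n' node-1 node-2 node-3 | xargs -P 4 -I % echo "session to %")"
echo "$out"
```

With `-P 0` and a 60k-line inventory this is what produces the 60k concurrent `run-node.sh` workers the README describes, which is why `dumb-init` is used as a minimal init to reap them.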
This file was deleted.
This file was deleted.
This file was deleted.
> Review comment: If we're planning on using AWS, why isn't this IAM?