
rework ansible-like loadtest helpers #48634

Draft · wants to merge 1 commit into base: master
2 changes: 2 additions & 0 deletions assets/loadtest/ansible-like/.gitignore
@@ -0,0 +1,2 @@
vars.env
state
59 changes: 50 additions & 9 deletions assets/loadtest/ansible-like/README.md
@@ -1,16 +1,57 @@
# Ansible-like OpenSSH sessions load test
# ansible-like openssh loadtest

This setup is designed to be run from the home directory of a VM (the default working directory for a systemd user service); the proxy public address and cluster name should be changed in `gen_inventory.sh`, `proxy_templates.yaml` and `tbot.yaml` from `PROXYHOST` and `CLUSTERNAME` respectively. It requires openssh, jq, xargs, and dumb-init, as well as tbot and fdpass-teleport.
This setup is intended to generate fake ansible-like load by spawning a very large number
of sessions against many teleport nodes. It uses tbot/machineid with ssh multiplexing
to support the needed volume of sessions (as one would with an ansible master that
manages a large fleet of servers via teleport).
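The multiplexing it relies on boils down to OpenSSH control sockets; a minimal sketch of the relevant client options (illustrative only, not the tbot-generated ssh_config — the option values mirror those passed on the command line by `run-node.sh`):

```
Host *
  ControlMaster auto
  ControlPath /run/user/1000/ssh-control/%C
  ControlPersist 60s
```

With these options, the first connection to a host becomes the control master and subsequent sessions to the same host are multiplexed over its socket instead of opening new transport connections.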

This setup assumes that nodes are being run by the `node-agent` Helm chart, and proxy templates are applied to do predicate-based dialing on the NODENAME label, as the chart sets up. Commenting or blanking the `proxy_templates.yaml` file (and restarting tbot) will change it to hostname-based dialing. Changing the `proxy_templates.yaml` file (and restarting tbot) can also be used to test a simpler predicate, or to test search-based dialing rather than predicate-based dialing.
This setup is designed to be run on a fresh VM instance, and will perform various
installs and system configuration actions.

Bot and token can be created with `tctl -f loadtest-bot.yaml`, after editing the IAM account and role in it. Token-based joining with tbot is incredibly annoying, so IAM joining or some other ambient-based joining method is recommended. Running the `node-agent` chart is left as an exercise for the reader.
It expects the following to already be installed on the system:

The machine running the client should be scaled depending on how many nodes are targeted in the inventory; for 60000 nodes (i.e. 60k shell scripts and 120k ssh processes running at peak) the memory usage with Teleport 15 seems to be ~20GiB for tbot and ~200 for the scripts and SSH, so something like an AWS 32xlarge or 48xlarge might be necessary (maybe the compute-optimized variants, as memory isn't really a problem). Depending on the scale of the test and the runner machine, tuning GOMAXPROCS and GOMEMLIMIT in tbot.service might be useful.
- `openssh`
- `jq`
- `xargs`

It will perform installation of the following:

- all default teleport binaries (namely, `tbot`)
- `fdpass-teleport`
- `dumb-init`

By default, this test setup assumes that the `node-agents` loadtest helm chart is being
used. The proxy templates generated rely on labels set by that helm chart. After setup
is run, it is possible to customize the proxy template used by editing `/etc/tbot/proxy-templates.yaml`.

Given the extreme scale of tests run with this setup, it is typically necessary to use a
very large VM. For example, 60k-agent tests are typically run from a 32xlarge or 48xlarge
instance, either general purpose or compute optimized.

## Usage

- Run `tbot_install.sh` to set up tbot (it will install a specific Teleport version as listed in the script, tweak it as required), or `systemctl --user restart tbot.service` if tbot is already set up.
- Run the `gen_inventory.sh` script to produce a list of hosts in random order in the `inventory` file, check that it matches the expected list of hosts.
- Choose a random host in the inventory and confirm that the setup is working with `ssh -F tbot_destdir_mux/ssh_config root@host`.
- Run `run.sh >/dev/null` (in tmux, probably). In a different terminal or tab, check how many sockets are being opened in the ssh controlmaster directory with `ls -1 /run/user/1000/ssh-control | wc -l` to confirm that connections are being established and muxed by ssh. Logs for tbot can be viewed with `journalctl --user-unit tbot --follow`.
- Copy `example.vars.env` to `vars.env` and edit the copy. The `PROXY_HOST` and `BOT_TOKEN`
variables *must* be changed.

- Run `install.sh` to install `tbot`, `fdpass-teleport`, and `dumb-init`. This only ever
needs to be run once.

- Run `init.sh` to set up the tbot directories/configuration and start `tbot.service`. If this needs
to be re-run (e.g. if the proxy host or token needs to be changed), it may be necessary to first manually
stop the tbot service.

- Run `journalctl -u tbot.service` to verify that `tbot` has successfully authenticated with the cluster.

- Run `gen-inventory.sh` to generate a list of all target teleport nodes. This only needs to be re-run
if/when the set of agents changes.

- Verify that the setup is functional by selecting a random host from `state/inventory` and attempting to
access it via `ssh -F /opt/machine-id/ssh_config root@host`

- Run `run.sh` to run the actual test scenario. This will invoke `run-node.sh` for each member of
the generated inventory and report success/failure of individual attempted sessions. Note that for
large-scale tests the output of this script is enormous and may need to be piped to `/dev/null`.
Long-running invocations should be performed within a `tmux` session or similar.

- Verify that ssh connections are being established and multiplexed by monitoring the control master
directory with `ls -1 /run/user/1000/ssh-control | wc -l`.
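The socket-count check in the last step can be tried in isolation; a self-contained simulation (a temp dir stands in for `/run/user/1000/ssh-control`, and the file names are stand-ins for control sockets):

```shell
#!/bin/bash
# Simulate counting ssh control sockets: on the load generator the directory
# is /run/user/1000/ssh-control and each entry is one multiplexed host.
control_dir="$(mktemp -d)"
touch "$control_dir/socket-a" "$control_dir/socket-b"

count="$(ls -1 "$control_dir" | wc -l)"
echo "open control sockets: $count"
rm -rf "$control_dir"
```

As connections are established by `run.sh`, the real count should climb toward the inventory size and then hold steady while sessions are multiplexed.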
24 changes: 24 additions & 0 deletions assets/loadtest/ansible-like/example.vars.env
@@ -0,0 +1,24 @@
# proxy host is the hostname of the target teleport cluster.
export PROXY_HOST="<proxy-hostname>"

# proxy port is the port of the teleport web api (typically 443 or 3080).
export PROXY_PORT="443"

# bot token is the join token that tbot will use to authenticate to the teleport
# cluster. the provided token *must* have the requisite roles to allow for ssh
# server access.
export BOT_TOKEN="<tbot-join-token>"

# bot user is the local user that the bot service should run at, and determines
# the ownership of the credentials and sockets created in /opt/machine-id. this
# must match the user that will be running the ssh load generation.
export BOT_USER="$USER"

# teleport artifact is the target artifact from which teleport binaries will be
# installed.
export TELEPORT_ARTIFACT="teleport-v17.0.0-alpha.4-linux-amd64-bin.tar.gz"

# teleport CDN should likely be one of cdn.cloud.gravitational.io or cdn.teleport.dev,
# the staging and prod cdns respectively.
export TELEPORT_CDN="cdn.cloud.gravitational.io"
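A hypothetical sanity check (not part of this PR) that mirrors the rule above: the `<...>` placeholders shipped in `example.vars.env` must not survive into `vars.env`:

```shell
#!/bin/bash
# Hypothetical validation helper: reject unset values and the unedited
# "<...>" placeholders that example.vars.env ships with.
check_vars() {
  local name val
  for name in PROXY_HOST BOT_TOKEN; do
    val="${!name:-}"
    case "$val" in
      ""|"<"*">")
        echo "error: $name is unset or still a placeholder" >&2
        return 1
        ;;
    esac
  done
}

# Demonstration: an unedited placeholder is rejected.
PROXY_HOST="<proxy-hostname>" BOT_TOKEN="tok" check_vars || echo "vars.env needs editing"
```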

17 changes: 17 additions & 0 deletions assets/loadtest/ansible-like/gen-inventory.sh
@@ -0,0 +1,17 @@
#!/bin/bash

set -euo pipefail

cd "$(dirname "$0")"

source vars.env

mkdir -p state

echo "attempting to build inventory..." >&2

tsh -i /opt/machine-id/identity --proxy "${PROXY_HOST:?}:${PROXY_PORT:?}" ls --format=json > state/inventory.json

jq -r '.[] | select(.metadata.expires > (now | strftime("%Y-%m-%dT%H:%M:%SZ"))) | .spec.hostname + ".scale-crdb.cloud.gravitational.io"' < state/inventory.json | sort -R > state/inventory

echo "successfully generated inventory node_count=$(cat state/inventory | wc -l)" >&2
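The jq filter in `gen-inventory.sh` can be exercised against a hand-written sample (hypothetical data; the field names and hard-coded DNS suffix are taken from the script above):

```shell
#!/bin/bash
# Hypothetical sample of `tsh ls --format=json` output, reduced to the two
# fields the filter reads.
sample='[
  {"metadata": {"expires": "2999-01-01T00:00:00Z"}, "spec": {"hostname": "node-1"}},
  {"metadata": {"expires": "2000-01-01T00:00:00Z"}, "spec": {"hostname": "node-2"}}
]'

# node-2 has already expired, so only node-1 survives the filter.
live="$(echo "$sample" | jq -r '.[]
  | select(.metadata.expires > (now | strftime("%Y-%m-%dT%H:%M:%SZ")))
  | .spec.hostname + ".scale-crdb.cloud.gravitational.io"')"
echo "$live"
```

The expiry comparison works because both sides are RFC 3339 timestamps, which sort lexicographically; stale heartbeats from agents that have churned out are dropped from the inventory.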
7 changes: 0 additions & 7 deletions assets/loadtest/ansible-like/gen_inventory.sh

This file was deleted.

72 changes: 72 additions & 0 deletions assets/loadtest/ansible-like/init.sh
@@ -0,0 +1,72 @@
#!/bin/bash

set -euo pipefail

cd "$(dirname "$0")"

source vars.env

if systemctl is-active -q tbot.service; then
echo "stopping extant tbot.service..." >&2
sudo systemctl stop tbot.service
fi

sudo mkdir -p /etc/tbot

sudo mkdir -p /var/lib/teleport/bot

sudo chown -R "${BOT_USER:?}:${BOT_USER:?}" /var/lib/teleport/bot

sudo mkdir -p /opt/machine-id

sudo chown -R "${BOT_USER:?}:${BOT_USER:?}" /opt/machine-id


echo "generating tbot config..." >&2

sudo tee /etc/tbot.yaml > /dev/null <<EOF
version: v2
proxy_server: ${PROXY_HOST:?}:${PROXY_PORT:?}
diag_addr: "0.0.0.0:3000"
onboarding:
join_method: token
token: ${BOT_TOKEN:?}
Review comment (Contributor) on lines +31 to +33: If we're planning on using AWS, why isn't this IAM?

outputs:
- type: identity
destination:
type: directory
path: /opt/machine-id
storage:
type: directory
path: /var/lib/teleport/bot
Review comment (Contributor) on lines +40 to +41: With IAM we don't need a storage directory at all, which is sort of a recommended-ish stateless setup AFAIK.

services:
- type: ssh-multiplexer
destination:
type: directory
path: /opt/machine-id
Review comment (Contributor) on lines +38 to +46: We should not be using the same directory for competing outputs, this is very much not supported and will probably break as soon as the wrong ssh_config ends up being used.

enable_resumption: true
proxy_command:
- fdpass-teleport
proxy_templates_path: /etc/tbot/proxy-templates.yaml
EOF


echo "generating proxy templates..." >&2

sudo tee /etc/tbot/proxy-templates.yaml > /dev/null <<EOF
proxy_templates:
- template: "^(.*).${PROXY_HOST:?}:[0-9]+$" # <nodename>.<clustername>:<port>
query: 'contains(split(labels.NODENAME, ","), "\$1")'
EOF


echo "installing tbot systemd unit..." >&2

sudo tbot install systemd --write --force --config /etc/tbot.yaml --user "${BOT_USER:?}" --group "${BOT_USER:?}"


echo "starting tbot.service..." >&2

sudo systemctl daemon-reload

sudo systemctl start tbot.service
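A rough bash-only simulation of what the generated proxy template is assumed to do (the regex is the one written to `/etc/tbot/proxy-templates.yaml` above; the target and label value are hypothetical):

```shell
#!/bin/bash
# Simulate the proxy template: capture the nodename from a dial target of the
# form <nodename>.<proxy-host>:<port>, then check it against the
# comma-separated NODENAME label set by the node-agent chart.
proxy_host="example.com"
target="node-7.example.com:3022"

nodename=""
if [[ "$target" =~ ^(.*)\."$proxy_host":[0-9]+$ ]]; then
  nodename="${BASH_REMATCH[1]}"
fi

node_labels="node-5,node-6,node-7"   # hypothetical NODENAME label value
case ",$node_labels," in
  *",$nodename,"*) echo "template matches: $nodename" ;;
esac
```

This mirrors the `contains(split(labels.NODENAME, ","), "$1")` predicate: the captured hostname fragment must appear as one element of the label's comma-separated list.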
34 changes: 34 additions & 0 deletions assets/loadtest/ansible-like/install.sh
@@ -0,0 +1,34 @@
#!/bin/bash

set -euo pipefail

cd "$(dirname "$0")"

source vars.env

mkdir -p state

cd state

echo "installing teleport..." >&2
Review comment (Contributor): Can we just use the repository or at least the distro packages to install Teleport? We are installing from a real tarball anyway, we should also have the packages.


wget -q "https://${TELEPORT_CDN:?}/${TELEPORT_ARTIFACT:?}"

tar -xf "${TELEPORT_ARTIFACT:?}"

rm "${TELEPORT_ARTIFACT:?}"

sudo ./teleport/install

echo "installing fdpass-teleport..." >&2

sudo cp ./teleport/fdpass-teleport "$(dirname "$(which teleport)")"

rm -rf ./teleport


echo "installing dumb-init..." >&2

sudo wget -q -O /usr/local/bin/dumb-init https://github.com/Yelp/dumb-init/releases/download/v1.2.5/dumb-init_1.2.5_x86_64

sudo chmod +x /usr/local/bin/dumb-init
Review comment (Contributor) on lines +30 to +34: dumb-init is just a package on ubuntu - is it not installable in amazon linux?
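A sketch of the reviewer's suggestion (package-manager availability and package names are assumptions, not verified per distro): prefer a distro package for `dumb-init` when one exists, falling back to the release binary that `install.sh` fetches today:

```shell
#!/bin/bash
# Hypothetical helper: report how dumb-init would be installed on this host,
# preferring a native package manager over the GitHub release binary.
pick_dumb_init_source() {
  local pm
  for pm in apt-get dnf yum; do
    if command -v "$pm" >/dev/null 2>&1; then
      echo "$pm install -y dumb-init"
      return
    fi
  done
  echo "wget https://github.com/Yelp/dumb-init/releases/download/v1.2.5/dumb-init_1.2.5_x86_64"
}

pick_dumb_init_source
```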

27 changes: 0 additions & 27 deletions assets/loadtest/ansible-like/loadtest-bot.yaml

This file was deleted.

5 changes: 0 additions & 5 deletions assets/loadtest/ansible-like/proxy_templates.yaml

This file was deleted.

@@ -1,9 +1,12 @@
#!/bin/sh
cd "$( dirname -- "${0}" )" || exit 1
#!/bin/bash

set -euo pipefail

cd "$(dirname "$0")"

sleep "$( echo "90 * $(od -An -N4 -tu4 /dev/urandom) / 4294967295" | bc -l )"

ssh_opts="-qn -F tbot_destdir_mux/ssh_config -S /run/user/1000/ssh-control/%C -o ControlMaster=auto -o ControlPersist=60s -o Ciphers=^[email protected] -l root"
ssh_opts="-qn -F /opt/machine-id/ssh_config -S /run/user/1000/ssh-control/%C -o ControlMaster=auto -o ControlPersist=60s -o Ciphers=^[email protected] -l root"

i=0
while [ $i -lt 10000 ] ; do
Expand Down
9 changes: 6 additions & 3 deletions assets/loadtest/ansible-like/run.sh
@@ -1,6 +1,9 @@
#!/bin/sh
cd "$( dirname -- "${0}" )" || exit 1
#!/bin/bash

set -euo pipefail

cd "$(dirname "$0")"

mkdir -p /run/user/1000/ssh-control

exec dumb-init xargs -P 0 -I % ./run_node.sh % < inventory
exec dumb-init xargs -P 0 -I % ./run-node.sh % < state/inventory
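The `xargs` invocation above can be seen in miniature (hypothetical host names; `-P 0` removes the parallelism cap so one process runs per inventory line, and `-I %` substitutes each line for `%`):

```shell
#!/bin/bash
# One process per input line, all in parallel, each with the line substituted
# for %. Output line order is therefore not guaranteed.
out="$(printf 'host-a\nhost-b\nhost-c\n' | xargs -P 0 -I % echo "run-node.sh %")"
echo "$out"
```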
16 changes: 0 additions & 16 deletions assets/loadtest/ansible-like/tbot.service

This file was deleted.

22 changes: 0 additions & 22 deletions assets/loadtest/ansible-like/tbot.yaml

This file was deleted.

11 changes: 0 additions & 11 deletions assets/loadtest/ansible-like/tbot_install.sh

This file was deleted.
