Refactor scripts to facilitate dynamic monitoring #7

Open · wants to merge 5 commits into base: main
Conversation

@thanethomson (Contributor) commented Jun 17, 2022

Closes #2

See the updated README for a high-level overview. I'd recommend testing this change from this branch before merging it.

This PR is a pretty big overhaul of the current testnet scripts that will allow us to implement use cases involving nodes being dynamically added to and removed from the network. It does this by swapping out Prometheus for a combination of InfluxDB and Telegraf.

The "monitor" server now runs an InfluxDB instance that listens for incoming data from Telegraf agents installed on all of the nodes. The Telegraf agents are configured to poll the Tendermint test app's Prometheus endpoint (collecting all Prometheus metrics) as well as system metrics (CPU, memory, disk usage, etc.). Telegraf regularly pushes these metrics to the monitor's InfluxDB server.

The InfluxDB server also provides a convenient web-based UI to explore stored data, with graphical visualization tools similar to what Prometheus provides.

This PR also:

  • simplifies the testnet deployment process,
  • refactors the Ansible playbooks into roles, making them more reusable across playbooks,
  • uses Terraform to reliably generate the Ansible hosts file (and delete it automatically once the infrastructure's been destroyed),
  • refactors the Terraform scripts according to a more standardized layout,
  • updates the usage instructions in the README.

This setup is supposed to facilitate log collection too (by tailing the Tendermint logs, which are output in JSON format), but that seems to be a little buggy right now. I'll look into fixing it ASAP.

Additionally, we should eventually be able to reuse most of the infrastructural config here in the follow-ups I'm planning for tendermint/tendermint#8754.
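
As an illustration of the resulting data flow, one way to sanity-check that metrics are actually landing in InfluxDB is a small Flux query run against the monitor via the influx CLI. This is only a sketch: the bucket name and measurement below are placeholders, not necessarily what these scripts configure.

# Hypothetical sanity check; "tendermint" is a placeholder bucket name.
influx query '
  from(bucket: "tendermint")
    |> range(start: -5m)
    |> filter(fn: (r) => r._measurement == "cpu")
    |> limit(n: 10)
'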

@williambanfield (Contributor) left a comment

I'm excited for a lot of this to land. Many of these fixes are clear improvements over what's there at the moment. I have basically no objection to any of the Ansible- and Terraform-based changes as long as the crucial functionality is preserved. That functionality includes:

  1. Starting many nodes at a particular version
  2. Updating the version of tendermint running to see if a change provides a fix
  3. Restarting a node/set of nodes to see if a problem that is observed can be undone if goroutines are restarted and connections are re-established.
  4. Monitoring all metrics, including host-level metrics such as memory usage, CPU usage, in-use file descriptors, etc.

Those are the main needs at the moment, although more may come up as we make more use of these tools.

To the point of the metrics integration: I have no specific problem with InfluxDB. I haven't worked with it personally, so it would be a bit of a learning curve for me to use. The main reason I would like to suggest we look a bit more for a Prometheus-based solution is that it's what most users are likely to use. As we run and re-run the code in these test networks, using the same tool that consumers of the code will use provides a few benefits.

First, it means that we're familiar with the tools someone asking us for debugging help will be using. No translation is necessary between what the user hands us and what we already understand. When someone links us to a Prometheus instance, the queries and such that we use to debug our code will be top of mind.

Second, it means that gaps in debugging and diagnosing issues from Prometheus-based queries on our metrics will be quickly caught by us when using the tool.

While the setup presented here definitely works to solve the problem, I think we may want to keep considering a Prometheus-based approach. Prometheus has a DigitalOcean service discovery configuration that appears to allow integration directly with the DO API. Using the __meta_digitalocean_tags label, we should be able to keep only the instances marked with our testnet tags as scrape targets for Prometheus.


[nodes]
%{ for node in nodes ~}
${node.name} ansible_host=${node.ipv4_address}
Contributor

Is ansible_host a special variable of some kind that Ansible knows to use as the IP address to connect to?

Contributor Author

Yes. The node name is used here as a kind of logical identifier for the host, and the IP address is given explicitly to Ansible through ansible_host. This makes the generated hosts file easier to debug: the node names match exactly those specified in testnet.toml, so logical testnet nodes can be correlated directly with DO VMs.
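
To illustrate, this also means ad-hoc Ansible commands can target nodes by their logical names. The sketch below assumes the inventory path and a node name (validator01) that are not taken from this PR.

# Hypothetical: ping one node by its testnet.toml name. Ansible connects to the
# IP from ansible_host instead of trying to resolve the name via DNS.
ansible validator01 -i ./ansible/hosts -m ping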

Comment on lines +8 to +12
# Extract the IP addresses of all of the nodes (excluding the monitoring
# server) from the Ansible hosts file. IP addresses will be in the same order
# as those generated in the docker-compose.yml file, and will be separated by
# newlines.
NEW_IPS=`cat ${ANSIBLE_HOSTS} | grep -v 'monitor' | grep 'ansible_host' | awk -F' ansible_host=' '{print $2}' | head -c -1 | tr '\n' ','`
Contributor

What ensures the order of these matches the order of OLD_IPS? The previous iteration of this relied on sorting the hosts by node name to match the OLD_IPS order, but I'm not sure we're doing that here.

Contributor Author

Oh, I was under the impression that the Docker Compose file's ordering was deterministic; is that not the case? The hosts file output is deterministic as far as I can tell.
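
For illustration, one way to make the ordering explicit rather than relying on generation order would be to sort by node name before extracting the IPs. This is a sketch, not part of this PR, and it assumes the list OLD_IPS comes from is sorted the same way.

# Hypothetical variant: sort the host lines by node name so the extracted IP
# order no longer depends on how the hosts file happened to be generated.
NEW_IPS=$(grep 'ansible_host' "${ANSIBLE_HOSTS}" \
  | grep -v 'monitor' \
  | sort \
  | awk -F' ansible_host=' '{print $2}' \
  | paste -sd, -)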

@@ -0,0 +1,25 @@
---
# This playbook must be executed as root.
Contributor

I'm assuming you mean root on the machine you're deploying to?

Contributor Author

Yes, I'll clarify that 🙂

set -euo pipefail

if [ -f "{{ tendermint.pid_file }}" ]; then
kill `cat {{ tendermint.pid_file }}` || true
Contributor

Does the || true here ensure that the kill line does not return a non-zero code if the process doesn't exist? Otherwise this script will exit if the process is already gone and we'll never clean up the PID file.

Contributor Author

Yeah, it's sometimes possible that the process was already killed, and this just makes that call idempotent. Otherwise the script terminates immediately (as per the set -euo pipefail line) and never cleans up the PID file.
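
For reference, the complete idempotent stop pattern might look something like the sketch below; the PID file path is a stand-in for the {{ tendermint.pid_file }} template variable, and the real template in this PR may differ.

#!/usr/bin/env bash
set -euo pipefail

PID_FILE="/var/run/tendermint.pid"   # stand-in for {{ tendermint.pid_file }}

if [ -f "${PID_FILE}" ]; then
  # `|| true` keeps `set -e` from aborting the script when the process is
  # already gone, so the stale PID file still gets removed below.
  kill "$(cat "${PID_FILE}")" || true
  rm -f "${PID_FILE}"
fi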


ANSIBLE_HOSTS=$1
LOAD_RUNNER_CMD=${LOAD_RUNNER_CMD:-"go run github.com/tendermint/tendermint/test/e2e/runner@51685158fe36869ab600527b852437ca0939d0cc"}
IP_LIST=`cat ${ANSIBLE_HOSTS} | grep -v 'monitor' | grep 'ansible_host' | awk -F' ansible_host=' '{print $2}' | head -c -1 | tr '\n' ','`
Contributor

I discovered recently that there's an Ansible command that lists the hosts in the inventory.

ansible all --list-hosts -i ./ansible/hosts

I'm not certain whether it resolves the ansible_host key used in the inventory file, but I've found it a bit more convenient than just cat'ing the hosts file.

Contributor Author

Good to know! I may refactor this to use that then 👍
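
A sketch of what such a refactor could look like, leaning on ansible-inventory's JSON output instead of --list-hosts; it assumes jq is available and that the monitor host is the only entry to exclude.

# Hypothetical replacement for the grep/awk pipeline: pull ansible_host values
# straight out of the parsed inventory.
IP_LIST=$(ansible-inventory -i "${ANSIBLE_HOSTS}" --list \
  | jq -r '._meta.hostvars | to_entries[]
           | select(.key != "monitor")
           | .value.ansible_host' \
  | paste -sd, -)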

@@ -1,13 +0,0 @@
[Unit]
Contributor

The node-exporter is a process that runs on each host and queries the /proc filesystem for things like open TCP connections, in-use file descriptors, and memory, network, and CPU usage. It exposes these as Prometheus metrics on a /metrics endpoint for scraping. I'm not seeing this duplicated in the new code, but it's very important for checking host-level metrics.

Contributor Author

Take a look at the input plugins section of the Telegraf config. Telegraf will automatically monitor all of those things and push those metrics, along with the polled Prometheus metrics from the Tendermint node, to the InfluxDB instance.
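
For illustration, one quick way to see exactly what the agent gathers on a node is Telegraf's test mode, which prints a single collection pass to stdout without pushing anything to InfluxDB; the config path below is an assumption about where the playbooks install it.

# Hypothetical check on a node: print one round of collected metrics (CPU,
# memory, disk, plus the scraped Prometheus metrics) and exit.
telegraf --config /etc/telegraf/telegraf.conf --test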

@thanethomson (Contributor Author)

> The main reason I would like to suggest we look a bit more for a Prometheus-based solution is that it's what most users are likely to use.

Which users do you mean here, and do we have evidence for this? Operators such as Cephalopod, for example, use Nagios for operational metric monitoring and are currently building an ELK stack for log monitoring.

> First, it means that we're familiar with the tools someone asking us for debugging help will be using. No translation is necessary between what the user hands us and what we already understand. When someone links us to a Prometheus instance, the queries and such that we use to debug our code will be top of mind.

Two things here:

  1. Is the vision for this repo to provide users with a comprehensive set of debugging tools for Tendermint testnets? I was under the impression that this is an internal tool for us to stress-test Tendermint.
  2. Just like Grafana, InfluxDB allows us to configure dashboards and export them as a JSON file that we can build into the testnet deployment scripts, such that we don't need to interpret users' queries (though I'm still not sure who those users are).

Personally I don't have a strong preference either way, as I've used neither of these systems before (InfluxDB v2 is basically a totally different product from InfluxDB v1, which I used several years ago). Learning how to construct queries in InfluxDB was about five minutes' work. Even Prometheus' own comparison between InfluxDB and Prometheus shows very little meaningful difference between the two products.

My main concern here is that the approach you're suggesting is not yet proven to work, and I could end up wasting more time implementing it. I already brought up the option of using InfluxDB nearly two weeks ago and, given there was no feedback on the idea, I assumed it would be fine to spend several days' worth of effort implementing it.

@thanethomson (Contributor Author)

A quick follow-up here: I've spent some time investigating options for getting InfluxDB to ingest the Tendermint logs, so that it provides more than just metrics monitoring, and it appears that Telegraf just isn't capable of grokking them without substantial effort.

I'll therefore refactor this to use Prometheus for now. A good longer-term follow-up would be to consider spinning up an ELK stack to handle both metric and log monitoring.


Successfully merging this pull request may close these issues.

Prometheus can monitor peers that are added to the network after startup