
Ported monitoring stack to k3s #449

Status: Open. Wants to merge 106 commits into base branch main.
Conversation

@wtripp180901 (Contributor) commented Oct 14, 2024

The monitoring stack (Prometheus, node exporter, Grafana, Alertmanager) is no longer installed from binaries during the site and fatimage builds; instead, the kube-prometheus-stack Helm chart is installed into the k3s cluster during the site run. Container images are pre-pulled with podman and imported into k3s during the fatimage build.
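The pre-pull/import flow during the fatimage build could look roughly like the following Ansible sketch. This is illustrative only: the task names, the `kube_prometheus_stack_images` variable, and the archive path are assumptions, not the actual role contents.

```yaml
# Hypothetical sketch of pre-pulling chart images and seeding them into k3s.
- name: Pre-pull kube-prometheus-stack container images with podman
  ansible.builtin.command: "podman pull {{ item }}"
  loop: "{{ kube_prometheus_stack_images }}"  # e.g. quay.io/prometheus/prometheus:<tag>

- name: Save pulled images to a single archive
  ansible.builtin.command: >-
    podman save --multi-image-archive
    -o /tmp/monitoring-images.tar
    {{ kube_prometheus_stack_images | join(' ') }}

- name: Import the archive into k3s' containerd image store
  ansible.builtin.command: k3s ctr images import /tmp/monitoring-images.tar
```

Seeding images at build time means the Helm install during the site run does not need to pull from external registries.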

As a consequence, the grafana, alertmanager and node exporter groups have been removed; the associated roles are now all managed via the prometheus group, which is shorthand for kube_prometheus_stack.

The metrics collected by node exporter have also been reduced to the minimal set described in docs/monitoring-and-logging.md, which was previously documented but unimplemented.
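With kube-prometheus-stack, restricting node exporter to a minimal collector set is typically done through chart values. The overlay below is a sketch under that assumption; the specific collectors re-enabled here are illustrative, not necessarily the set the docs describe.

```yaml
# Illustrative kube-prometheus-stack values overlay (not the appliance defaults):
# disable all default node exporter collectors, then re-enable a minimal set.
prometheus-node-exporter:
  extraArgs:
    - --collector.disable-defaults
    - --collector.cpu
    - --collector.meminfo
    - --collector.filesystem
    - --collector.netdev
```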

Note that because of how OOD's proxying interacts with Grafana's server configuration and Kubernetes, enabling OOD means Grafana is only accessible through the OOD proxy. In the CaaS environment this means that accessing Grafana requires authenticating with OOD's basic auth. As a result, accessing Grafana through CaaS no longer logs you in as the admin user; instead you access the dashboards anonymously.
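Serving Grafana behind a reverse proxy with anonymous access is usually expressed in the chart's grafana.ini values, along the lines of the sketch below. The `root_url` path is a hypothetical placeholder for wherever OOD proxies Grafana; the real values in the PR may differ.

```yaml
# Illustrative values sketch only: Grafana under a proxy subpath with anonymous auth.
grafana:
  grafana.ini:
    server:
      # hypothetical proxy subpath; OOD reverse-proxies Grafana here
      root_url: "%(protocol)s://%(domain)s/grafana/"
      serve_from_sub_path: true
    auth.anonymous:
      enabled: true
      org_role: Viewer   # anonymous users can view but not edit dashboards
```

An `org_role` of `Viewer` is consistent with the behaviour described above: CaaS users see the dashboards but cannot change them.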

Tests as of 8ca0407:

  • Prometheus data from the cloudalchemy roles was successfully migrated to the containerised Prometheus, although it will likely sit under a different job label than the one kube-prometheus-stack is hardcoded to use
  • Reimage and upgrade from cloudalchemy: TODO
  • Final CaaS test: TODO

@wtripp180901 (Contributor, Author) commented

No image changes since the last build, so the last commit should be ready to merge barring review changes.

@sjpb (Collaborator) left a comment


Please note in the PR (with better wording!) that

  • the "prometheus" group is basically shorthand for a "kube-prometheus-stack" group
  • the monitoring link in CaaS now accesses Grafana with anonymous auth (because it has to go via OOD), so CaaS users can't change their dashboards

Review comments on:

  • environments/common/inventory/group_vars/all/defaults.yml (outdated, resolved)
  • docs/monitoring-and-logging.md (outdated, resolved)
  • docs/monitoring-and-logging.md (outdated, resolved)
  • ansible/roles/kube_prometheus_stack/tasks/install.yml (outdated, resolved)
  • ansible/fatimage.yml (resolved)
@sjpb (Collaborator) commented Nov 15, 2024

@wtripp180901 not a high priority, but it would be nice to know whether this PR reduces the size of the data in the image, and/or whether we can reduce the required root disk size at all. The two aren't the same thing, because e.g. the dnf caches we throw away still require additional space during the build.

I think you'd need qemu-img info to see the former, and to monitor disk usage during the build to see the latter.
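A rough sketch of both measurements. The image filenames are placeholders for the actual build artefacts, and the sampling loop assumes it runs inside (or against) the builder VM during the build.

```shell
# Compare on-disk image sizes before and after the change (placeholder names).
qemu-img info --output=json old-fatimage.qcow2 | jq '."actual-size"'
qemu-img info --output=json new-fatimage.qcow2 | jq '."actual-size"'

# Sample root filesystem usage every 30s during the build; the largest sample
# approximates the peak disk space the build actually needed.
while sleep 30; do df -B1 --output=used / | tail -n1; done
```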

Base automatically changed from feature/k3s-ansible-init to main November 19, 2024 09:58
Development

Successfully merging this pull request may close these issues:

  • SELinux not disabled by default, causes Prometheus install to fail

3 participants