Loki Setup. Promtail and Prometheus-Adapter #6

Merged 3 commits on Apr 22, 2024
2 changes: 1 addition & 1 deletion .github/workflows/terraform-linting.yml
@@ -12,7 +12,7 @@ on:
  workflow_dispatch:

env:
  TF_VERSION: "1.7.5"
  TF_VERSION: "1.8.1"
  GITHUB_TOKEN: ${{ github.token }}

# A workflow run is made up of one or more jobs that can run sequentially or in parallel
119 changes: 111 additions & 8 deletions NOTES.md
@@ -49,7 +49,7 @@ spec:
```
Notice how this ingress for ArgoCD is created in the `traefik` namespace (metadata.namespace), and then the `services` definition within the `routes` now includes a `namespace` field pointing to `argocd`. This allows Traefik in the `traefik` namespace to forward traffic to ArgoCD, which has been set up in the `argocd` namespace.
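
Since the manifest itself is cut off in this diff view, here is a rough sketch of the shape being described; the hostname, port, and TLS secret are placeholders, and older Traefik chart versions use the `traefik.containo.us/v1alpha1` API group instead:
```yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: argocd-server
  namespace: traefik                      # the route lives alongside Traefik...
spec:
  entryPoints:
    - websecure
  routes:
    - kind: Rule
      match: Host(`argocd.example.com`)   # placeholder hostname
      services:
        - name: argocd-server
          namespace: argocd               # ...but targets the Service in the argocd namespace
          port: 80
  tls:
    secretName: argocd-tls                # placeholder certificate secret
```
Note that Traefik's Kubernetes CRD provider generally needs `allowCrossNamespace` enabled for the cross-namespace `services` reference to be honored.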

# The Traefik Dashboard Can Be Fully Secured With Basic Auth + Certificates From cert-manager Within its Helm Chart
## The Traefik Dashboard Can Be Fully Secured With Basic Auth + Certificates From cert-manager Within its Helm Chart
There's no documentation for this example scenario, and I guess in retrospect it is rather intuitive. But here is an example of how to set it up.

With Traefik set up via Helm, configure the following in your `values.yaml`:
@@ -109,7 +109,7 @@ You can also create this within the `extraObjects` section above, but I did it separately

Once all of that is applied, the changes may not be immediate, and that's because cert-manager still needs to provision your certificate. During that time, Traefik will serve the website using its default built-in certificate. Once the certificate is ready, Traefik will be reloaded with it!

# Setup DNS01 validation with non-wildcard domains and sub-domains using cert-manager and CloudFlare
## Setup DNS01 validation with non-wildcard domains and sub-domains using cert-manager and CloudFlare
This workflow is not well documented either. The configuration all exists, it just takes way more digging than it should.

At one point there was actually a bug in this workflow where cert-manager couldn't find the domain on CloudFlare. If you run into these issues, there are a couple of things you can try to resolve it:
@@ -126,10 +126,10 @@ dns01RecursiveNameservers: "1.1.1.1:53,1.0.0.1:53,8.8.8.8:53,8.8.4.4:53"
```
As a bonus, I also included Google's DNS servers; they are pretty quick to pick up changes as well.
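
For reference, here is a minimal sketch of the kind of `ClusterIssuer` this describes; the email, secret name, and zone are placeholders rather than this repo's actual values:
```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com                 # placeholder contact email
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token     # placeholder Secret holding the CloudFlare API token
              key: api-token
        selector:
          dnsZones:
            - example.com                    # covers the non-wildcard domain and its sub-domains
```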

# Setup Dev and Prod Issuers with LetsEncrypt
## Setup Dev and Prod Issuers with LetsEncrypt
Having both is helpful during the debugging process, as it allows you to create certificates end-to-end without running into rate limits. By using LetsEncrypt's dev (staging) endpoint, you can save yourself some debugging headaches.
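
As a sketch, the "dev" issuer is identical to the production one above except for the ACME endpoint (names are again placeholders):
```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory   # generous rate limits, untrusted certs
    email: admin@example.com
    privateKeySecretRef:
      name: letsencrypt-staging-account-key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: api-token
```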

# ArgoCD Needs to be setup with the --insecure flag if you want it to be public facing
## ArgoCD Needs to be setup with the --insecure flag if you want it to be public facing
ArgoCD has its own certificate to work with when you use the kubectl proxy. But if you want it to be public-facing, you'll need to disable this functionality. Otherwise, you will end up with constant redirect loops.

To have ArgoCD run insecure via Helm, configure the following in your `values.yaml`:
@@ -141,14 +141,14 @@ server:

Or, wherever you run the ArgoCD container, make sure to pass the `--insecure` argument to the binary.
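
As a hedged sketch (the exact key depends on your argo-cd chart version), the two common ways to express this in `values.yaml` are:
```yaml
# Newer argo-cd charts expose it as a config parameter:
configs:
  params:
    server.insecure: true

# Older charts take it as an extra server argument, which matches the snippet above:
server:
  extraArgs:
    - --insecure
```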

# Terraform Kubernetes provider 'kubernetes_manifest' has a bug for cluster setup workflows
## Terraform Kubernetes provider 'kubernetes_manifest' has a bug for cluster setup workflows
You'll need to use the kubectl provider instead, specifically the fork created by `alekc`, as the original provider had its own bug and is no longer regularly maintained by its owner.
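
A sketch of the provider block, assuming a recent release of the fork (pin whatever version you have actually tested):
```hcl
terraform {
  required_providers {
    kubectl = {
      source  = "alekc/kubectl" # maintained fork of the original kubectl provider
      version = "~> 2.0"        # assumed version constraint
    }
  }
}
```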

# cert-manager has a bug in how its CRDs are installed
## cert-manager has a bug in how its CRDs are installed

The deprecated `installCRDs` variable is actually the only way to install the CRDs via Helm. The replacement options, `crds.keep` and `crds.install`, do not actually work.
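
In `values.yaml` terms, this is the combination described above, sketched rather than copied from this repo:
```yaml
# cert-manager chart values
installCRDs: true    # deprecated, but the only flag that reliably installed the CRDs here
# crds:
#   install: true    # the replacement flags, which did not work in this setup
#   keep: true
```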

# Install your CRDs separately first
## Install your CRDs separately first

Not doing this is a pain in the ass when it comes to IaC and wanting to cross-configure various tools within your cluster.
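
One way to do that, sketched with cert-manager as the example (the release URL and version are illustrative, not pinned by this repo):
```bash
# Apply the tool's CRDs on their own, before the chart or manifests that depend on them
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.4/cert-manager.crds.yaml
```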

@@ -206,4 +206,107 @@ Oh not to mention, these days there is also an option where people are insisting
## Helm is a Design Flaw. It deviates away from Kubernetes design and architecture
CRDs are meant to be the powerhouse of Kubernetes. To make something cloud/Kubernetes native, you create CRDs, which are the building blocks for creating and configuring your application within the cluster.

Helm ignores this feature, and instead focuses on trying to template out all components. It leaves you working with the Kubernetes primitives: Pods, Services, and Secrets. Those are the basics, but they aren't the full capabilities of the framework. They are really just the surface, and Helm's workflows encourage people away from those more advanced and powerful capabilities.

## Prometheus-Adapter has a bug in it, out of the gate:
https://github.com/kubernetes-sigs/prometheus-adapter/issues/385

## S3 external storage documentation and secure configuration of keys is basically all out of date, scattered around, or broken!
The Grafana docs are complete shit. I'd read as much on multiple forums already, but this is my first experience where it's truly shown its colors. In order to get proper cloud storage set up, I've had to jump between a bunch of forums, blind-guess through a whole bunch of possibilities, and then stumble on a makeshift combination of a couple of options in order to get everything working.

Additionally, the Loki components won't emit their additional debugging and help output if your S3 configuration is incorrect. So you're stuck debugging blind until you get it mostly right!

Here are some of the places I looked that ended up being completely wrong:
* https://github.com/grafana/loki/issues/12218
* https://github.com/grafana/loki/issues/8572
* https://community.grafana.com/t/provide-s3-credentials-using-environment-variables/100132/2

This one ended up being half right, but the format is out of date compared to the latest versions and Helm charts:
* https://akyriako.medium.com/kubernetes-logging-with-grafana-loki-promtail-in-under-10-minutes-d2847d526f9e

And this one, from Digital Ocean itself, was a complete mess of outdated information:
* https://www.digitalocean.com/community/developer-center/how-to-install-loki-stack-in-doks-cluster

It was only through some blind guessing around with this example in the Grafana docs that I found something that accidentally worked: https://grafana.com/docs/loki/latest/configure/storage/#aws-deployment-s3-single-store

Unfortunately, I can't even really tell you why what I have works. But at the very least I can show you what did work for me.


## Setting Up S3 / Digital Ocean Backed Storage with Loki and Securely Storing Access Keys

I'm installing Loki via Helm using the loki-distributed chart (there are multiple Loki charts, and they seem to differ somewhat in what they can and cannot do). I am using chart version `0.79.0`.
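
For reference, the equivalent Helm CLI install looks roughly like this (this repo actually drives it through Terraform's `helm_release`, and the release and namespace names here are assumptions):
```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install loki grafana/loki-distributed \
  --namespace loki --create-namespace \
  --version 0.79.0 \
  -f loki-values.yaml   # the values discussed below
```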

My `storageConfig` section was set up like this:
```yaml
storageConfig:
  boltdb_shipper:
    shared_store: aws
    active_index_directory: /var/loki/index
    cache_location: /var/loki/cache
    cache_ttl: 1h
  filesystem:
    directory: /var/loki/chunks
  # -- Uncomment to configure each storage individually
  # azure: {}
  # gcs: {}
  aws:
    s3: s3://${S3_LOKI_ACCESS_KEY}:${S3_LOKI_SECRET_ACCESS_KEY}@nyc3
    bucketnames: k8stack-resources
    endpoint: nyc3.digitaloceanspaces.com
    region: nyc3
    s3forcepathstyle: false
    insecure: false
    http_config:
      idle_conn_timeout: 90s
      response_header_timeout: 0s
      insecure_skip_verify: false
```
It seems like using `secretAccessKey` or `accessKeyId` does not resolve variables from the environment; expansion only appears to work within the `s3` string. And that value's format is custom to this project too! This is not standard AWS S3 connection syntax, from what I have experienced.

This being Digital Ocean, I had to do a bit of tinkering with the `endpoint` value. Fortunately, the log output spelt that one out for me.

A key step I also had to take was going through this configuration and looking for all the `shared_store` values, as these were sometimes set to `s3`. From the Grafana docs I read that `s3` is an alias for `aws`, but I don't trust it, so I'd recommend changing those values to `aws`. I _think_ the `aws` value used elsewhere is what's used to look up the `aws` key under `storageConfig` and find the access credentials, etc.
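
In the rendered Loki config, the spots I would expect to double-check look roughly like this (a sketch; in the chart these live inside the templated config rather than as top-level values):
```yaml
compactor:
  shared_store: aws        # often shows up as s3 in examples; set it to aws to match the storage block
storage_config:
  boltdb_shipper:
    shared_store: aws
```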

I then configured my secrets within the `extraEnvFrom` section for each component I was deploying:
```yaml
extraEnvFrom:
  - secretRef:
      name: loki-s3-credentials
```
This is an opaque secret with the following data:
```
S3_LOKI_ACCESS_KEY: <my access key>
S3_LOKI_SECRET_ACCESS_KEY: <my secret access key>
```
Don't listen to some of the documentation talking about these values needing to be URL encoded. Pass them in exactly as you received them. Kubernetes will base64 encode them as always, but you don't need to do anything to them yourself. Copy, paste, and let Kubernetes do the rest.
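
A minimal sketch of that Secret; the namespace is an assumption, and `stringData` lets you paste the raw keys while Kubernetes handles the base64 encoding:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: loki-s3-credentials
  namespace: loki                      # assumed; use whatever namespace Loki is deployed into
type: Opaque
stringData:                            # raw values here; Kubernetes stores them base64 encoded
  S3_LOKI_ACCESS_KEY: <my access key>
  S3_LOKI_SECRET_ACCESS_KEY: <my secret access key>
```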


## Debugging and Post Deployment Checks

Once things appear to have booted successfully, I would check your S3 or Digital Ocean bucket. Loki should have filled it with some content. If you have no content, something has _definitely_ gone wrong without you knowing it. Loki doesn't seem to be very obvious or forthcoming about issues.
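
One quick way to confirm objects are landing, assuming you have the `aws` CLI configured with your Spaces keys (the DO console or `s3cmd` works too):
```bash
aws s3 ls s3://k8stack-resources --recursive \
  --endpoint-url https://nyc3.digitaloceanspaces.com | head
```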

Some helpful commands I used with `kubectl` were:
```bash
# Get an overview, are things running or rebooting and failing ?
kubectl get all -n loki

# Get details of a pod. This includes boot highlights, but also allows you to confirm what environment variables were passed to your container
kubectl describe pod <one of the loki pods> -n loki

# Finally, output the log output of the container. Again, this will be pretty useless until you have it mostly right!
kubectl logs <the compactor pod> -n loki --follow
```
These allowed me to deduce what the hell was going on

To get more verbose output, also pass these arguments in the `extraArgs` section of each of the components you are deploying:
```yaml
extraArgs:
  - -config.expand-env=true # you NEED this in order for environment variables to work in your storageConfig
  - --log.level=debug
  - --print-config-stderr
```
Again, `--log.level=debug` and `--print-config-stderr` are pretty useless until you get your `aws.s3` configuration correct. You'll be stuck with generic errors until you get that sorted


## Bonus Garbage
Oh, also: a whole bunch of these docs talk about using boltdb_shipper. That thing is deprecated! (https://grafana.com/docs/loki/latest/configure/storage/#boltdb-deprecated) There is a new one (https://grafana.com/docs/loki/latest/configure/storage/#tsdb-recommended), but man... documentation? Where is it? Nobody appears to be using it yet either.
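
For what it's worth, here is a hedged, untested sketch of what the tsdb equivalent looks like per that doc page:
```yaml
schema_config:
  configs:
    - from: "2024-04-01"
      store: tsdb                # replaces the deprecated boltdb-shipper
      object_store: aws
      schema: v13
      index:
        prefix: tsdb_index_
        period: 24h
storage_config:
  tsdb_shipper:
    active_index_directory: /var/loki/tsdb-index
    cache_location: /var/loki/tsdb-cache
    shared_store: aws
```
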
7 changes: 4 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
@@ -32,9 +32,10 @@ Below is a table of each piece installed in my cluster at the moment, and what r
| Traefik | Ingress Controller | |
| Kyverno | RBAC and Admissions Controller | |
| Prometheus | Observability - Metrics Server | |
| Grafana | Observability - Metrics Dashboard | |
| Elasticsearch | Observability - Logging Database | |
| Kibana | Observability - Logging Dashboard | Coming Soon |
| Prometheus Adapter | Metrics for Kubernetes Metrics API | Replaces metrics-server to work with Prometheus instead |
| Grafana | Observability - Metrics & Logging Dashboard | |
| Loki | Observability - Logging Database | |
| Promtail | Observability - Container Stdout/Stderr Log Scraping | Forwards to Loki |
| Vault | Secrets Manager | Coming Soon |

Below now is another table of the tech being used for managing and configuring my Kubernetes cluster:
9 changes: 6 additions & 3 deletions main.tf
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
terraform {
  required_version = "~> 1.7.5"
  required_version = "~> 1.8.1"

  required_providers {
    digitalocean = {
@@ -34,8 +34,7 @@ terraform {


module "k8infra" {
  source   = "./modules/k8infra"
  do_token = var.do_token
  source   = "./modules/k8infra"

  providers = {
    digitalocean = digitalocean
@@ -58,6 +57,10 @@ module "k8config" {
  cf_token = var.cf_token
  domain   = var.domain


  s3_access_key_id     = var.do_spaces_access_key_id
  s3_secret_access_key = var.do_spaces_secret_access_key

  providers = {
    kubernetes = kubernetes
    helm       = helm
38 changes: 34 additions & 4 deletions modules/k8config/main.tf
Original file line number Diff line number Diff line change
@@ -116,15 +116,45 @@ module "kyverno" {
  ]
}

module "elasticsearch" {
  source = "./modules/elasticsearch"

module "loki" {
  source = "./modules/loki"

  s3_access_key_id     = var.s3_access_key_id
  s3_secret_access_key = var.s3_secret_access_key

  providers = {
    kubectl = kubectl
    helm = helm
    helm    = helm
  }

  depends_on = [
    time_sleep.wait_60_seconds
  ]
}

module "promtail" {
  source = "./modules/promtail"

  providers = {
    helm = helm
  }

  depends_on = [
    time_sleep.wait_60_seconds,
    module.loki
  ]
}


module "prometheus-adapter" {
  source = "./modules/prometheus-adapter"

  providers = {
    helm = helm
  }

  depends_on = [
    time_sleep.wait_60_seconds,
    module.prometheus
  ]
}
@@ -33,3 +33,18 @@ resource "helm_release" "elasticsearch" {
    file("${abspath(path.module)}/res/elasticsearch-values.yaml")
  ]
}

resource "time_sleep" "wait_60_seconds" {
  depends_on      = [helm_release.elasticsearch]
  create_duration = "60s"
}

/*
resource "kubectl_manifest" "elasticsearch_cluster" {
  yaml_body = file("${abspath(path.module)}/res/elasticsearch.yaml")

  depends_on = [
    time_sleep.wait_60_seconds
  ]
}
*/
@@ -0,0 +1,39 @@
# https://github.com/elastic/cloud-on-k8s/blob/main/config/samples/elasticsearch/elasticsearch.yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch
  namespace: elasticsearch
spec:
  version: 8.13.2
  volumeClaimDeletePolicy: DeleteOnScaledownOnly
  nodeSets:
    - name: default
      count: 1 # 3 cluster nodes
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data
          spec:
            storageClassName: do-block-storage
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 5Gi
      podTemplate:
        spec:
          containers:
            - name: elasticsearch
              resources:
                requests:
                  memory: 1Gi
                  cpu: 1
                limits:
                  memory: 1Gi
                  cpu: 1
      #config:
      #  node.store.allow_mmap: false
  http:
    service:
      spec:
        type: ClusterIP
4 changes: 2 additions & 2 deletions modules/k8config/modules/_archive/grafana/main.tf
@@ -19,7 +19,7 @@ resource "helm_release" "grafana" {
  name = "grafana"

  repository = "https://grafana.github.io/helm-charts"
  chart      = "grafana"
  chart      = "grafana-agent-operator"

  atomic = true

@@ -34,7 +34,7 @@ resource "helm_release" "grafana" {
  dependency_update = true

  values = [
    file("${abspath(path.module)}/res/grafana-values.yaml")
    file("${abspath(path.module)}/res/grafana-agent-operator-values.yaml")
  ]

  depends_on = [