Loki Setup. Promtail and Prometheus-Adapter #6

Merged 3 commits on Apr 22, 2024
2 changes: 1 addition & 1 deletion .github/workflows/terraform-linting.yml
@@ -12,7 +12,7 @@ on:
  workflow_dispatch:

env:
  TF_VERSION: "1.7.5"
  TF_VERSION: "1.8.1"
  GITHUB_TOKEN: ${{ github.token }}

# A workflow run is made up of one or more jobs that can run sequentially or in parallel
119 changes: 111 additions & 8 deletions NOTES.md
@@ -49,7 +49,7 @@ spec:
```
Notice how this ingress for ArgoCD is created in the `traefik` namespace (metadata.namespace), and then the `services` definition within the `routes` now includes a `namespace` field pointing to `argocd`. This allows Traefik in the `traefik` namespace to forward traffic to ArgoCD, which has been set up in the `argocd` namespace.
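
Since the manifest itself is cut off in this diff view, here is a rough sketch of the shape being described; the hostname, port, and TLS secret are placeholders, and older Traefik chart versions use the `traefik.containo.us/v1alpha1` API group instead:
```yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: argocd-server
  namespace: traefik                      # the route lives alongside Traefik...
spec:
  entryPoints:
    - websecure
  routes:
    - kind: Rule
      match: Host(`argocd.example.com`)   # placeholder hostname
      services:
        - name: argocd-server
          namespace: argocd               # ...but targets the Service in the argocd namespace
          port: 80
  tls:
    secretName: argocd-tls                # placeholder certificate secret
```
Note that Traefik's Kubernetes CRD provider generally needs `allowCrossNamespace` enabled for the cross-namespace `services` reference to be honored.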

# The Traefik Dashboard Can Be Fully Secured With Basic Auth + Certificates From cert-manager Within its Helm Chart
## The Traefik Dashboard Can Be Fully Secured With Basic Auth + Certificates From cert-manager Within its Helm Chart
There's no documentation for this example scenario, and I guess in retrospect it is rather intuitive. But here is an example of how to set it up.

With Traefik set up via Helm, configure the following in your `values.yaml`:
@@ -109,7 +109,7 @@ You can also create this within the `extraObjects` section above, but I did it separately

Once all of that is applied, the changes may not be immediate, and that's because cert-manager still needs to provision your certificate. During that time, Traefik will serve the website using its default built-in certificate. Once the certificate is ready, Traefik will be reloaded with it!

# Setup DNS01 validation with non-wildcard domains and sub-domains using cert-manager and CloudFlare
## Setup DNS01 validation with non-wildcard domains and sub-domains using cert-manager and CloudFlare
This workflow is not well documented either. The configuration all exists, it just takes way more digging than it should.

At one point there was actually a bug in this workflow where cert-manager couldn't find the domain on CloudFlare. If you run into these issues, there are a couple of things you can try to resolve it:
@@ -126,10 +126,10 @@ dns01RecursiveNameservers: "1.1.1.1:53,1.0.0.1:53,8.8.8.8:53,8.8.4.4:53"
```
As a bonus, I also included Google's DNS servers; they are pretty quick to pick up changes as well.
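
For reference, here is a minimal sketch of the kind of `ClusterIssuer` this describes; the email, secret name, and zone are placeholders rather than this repo's actual values:
```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com                 # placeholder contact email
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token     # placeholder Secret holding the CloudFlare API token
              key: api-token
        selector:
          dnsZones:
            - example.com                    # covers the non-wildcard domain and its sub-domains
```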

# Setup Dev and Prod Issuers with LetsEncrypt
## Setup Dev and Prod Issuers with LetsEncrypt
Having both is helpful during the debugging process, as it allows you to create certificates end-to-end without running into rate limits. By using LetsEncrypt's dev (staging) endpoint, you can save yourself some debugging headaches.
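
As a sketch, the "dev" issuer is identical to the production one above except for the ACME endpoint (names are again placeholders):
```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory   # generous rate limits, untrusted certs
    email: admin@example.com
    privateKeySecretRef:
      name: letsencrypt-staging-account-key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: api-token
```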

# ArgoCD Needs to be setup with the --insecure flag if you want it to be public facing
## ArgoCD Needs to be setup with the --insecure flag if you want it to be public facing
ArgoCD has its own certificate to work with when you use the kubectl proxy. But if you want it to be public-facing, you'll need to disable this functionality. Otherwise, you will end up with constant redirect loops.

To have ArgoCD run insecure via Helm, configure the following in your `values.yaml`:
@@ -141,14 +141,14 @@ server:

Or, wherever you run the ArgoCD container, make sure to pass the `--insecure` argument to the binary.
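
As a hedged sketch (the exact key depends on your argo-cd chart version), the two common ways to express this in `values.yaml` are:
```yaml
# Newer argo-cd charts expose it as a config parameter:
configs:
  params:
    server.insecure: true

# Older charts take it as an extra server argument, which matches the snippet above:
server:
  extraArgs:
    - --insecure
```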

# Terraform Kubernetes provider 'kubernetes_manifest' has a bug for cluster setup workflows
## Terraform Kubernetes provider 'kubernetes_manifest' has a bug for cluster setup workflows
You'll need to use the kubectl provider instead, specifically the fork created by `alekc`, as the original provider had its own bug and is no longer regularly maintained by its owner.
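
A sketch of the provider block, assuming a recent release of the fork (pin whatever version you have actually tested):
```hcl
terraform {
  required_providers {
    kubectl = {
      source  = "alekc/kubectl" # maintained fork of the original kubectl provider
      version = "~> 2.0"        # assumed version constraint
    }
  }
}
```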

# cert-manager has a bug in how its CRDs are installed
## cert-manager has a bug in how its CRDs are installed

The deprecated `installCRDs` variable is actually the only way to install the CRDs via Helm. The replacement options, `crds.keep` and `crds.install`, do not actually work.
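
In `values.yaml` terms, this is the combination described above, sketched rather than copied from this repo:
```yaml
# cert-manager chart values
installCRDs: true    # deprecated, but the only flag that reliably installed the CRDs here
# crds:
#   install: true    # the replacement flags, which did not work in this setup
#   keep: true
```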

# Install your CRDs separately first
## Install your CRDs separately first

Not doing this is a pain in the ass when it comes to IaC and wanting to cross-configure various tools within your cluster.
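
One way to do that, sketched with cert-manager as the example (the release URL and version are illustrative, not pinned by this repo):
```bash
# Apply the tool's CRDs on their own, before the chart or manifests that depend on them
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.4/cert-manager.crds.yaml
```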

@@ -206,4 +206,107 @@ Oh not to mention, these days there is also an option where people are insisting
## Helm is a Design Flaw. It deviates away from Kubernetes design and architecture
CRDs are meant to be the powerhouse of Kubernetes. To make something cloud/Kubernetes native, you create CRDs, which are the building blocks for creating and configuring your application within the cluster.

Helm ignores this feature, and instead focuses on trying to template out all components. It leaves you working with the Kubernetes primitives: Pods, Services, and Secrets. Those are the basics, but they aren't the full capabilities of the framework. They are really just the surface, and Helm's workflows encourage people away from those more advanced and powerful capabilities.

## Prometheus-Adapter has a bug in it, out of the gate:
https://github.com/kubernetes-sigs/prometheus-adapter/issues/385

## S3 external storage documentation and secure configuration of keys is basically all out of date, scattered around, or broken!
The Grafana docs are complete shit. I'd read as much on multiple forums already, but this is my first experience where it's truly shown its colors. In order to get proper cloud storage set up, I've had to jump between a bunch of forums, blind-guess through a whole bunch of possibilities, and then stumble on a makeshift combination of a couple of options in order to get everything working.

Additionally, the Loki components won't emit their additional debugging and help output if your S3 configuration is incorrect. So you're stuck debugging blind until you get it mostly right!

Here are some of the places I looked that ended up being completely wrong:
* https://github.com/grafana/loki/issues/12218
* https://github.com/grafana/loki/issues/8572
* https://community.grafana.com/t/provide-s3-credentials-using-environment-variables/100132/2

This one ended up being half right, but the format is out of date compared to the latest versions and Helm charts:
* https://akyriako.medium.com/kubernetes-logging-with-grafana-loki-promtail-in-under-10-minutes-d2847d526f9e

And this one, from Digital Ocean itself, was a complete mess of outdated information:
* https://www.digitalocean.com/community/developer-center/how-to-install-loki-stack-in-doks-cluster

It was only through some blind guessing around with this example in the Grafana docs that I found something that accidentally worked: https://grafana.com/docs/loki/latest/configure/storage/#aws-deployment-s3-single-store

Unfortunately, I can't even really tell you why what I have works. But at the very least I can show you what did work for me.


## Setting Up S3 / Digital Ocean Backed Storage with Loki and Securely Storing Access Keys

I'm installing Loki via Helm using the loki-distributed chart (there are multiple Loki charts, and they seem to differ somewhat in what they can and cannot do). I am using chart version `0.79.0`.
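
For reference, the equivalent Helm CLI install looks roughly like this (this repo actually drives it through Terraform's `helm_release`, and the release and namespace names here are assumptions):
```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install loki grafana/loki-distributed \
  --namespace loki --create-namespace \
  --version 0.79.0 \
  -f loki-values.yaml   # the values discussed below
```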

My `storageConfig` section was set up like this:
```yaml
storageConfig:
  boltdb_shipper:
    shared_store: aws
    active_index_directory: /var/loki/index
    cache_location: /var/loki/cache
    cache_ttl: 1h
  filesystem:
    directory: /var/loki/chunks
  # -- Uncomment to configure each storage individually
  # azure: {}
  # gcs: {}
  aws:
    s3: s3://${S3_LOKI_ACCESS_KEY}:${S3_LOKI_SECRET_ACCESS_KEY}@nyc3
    bucketnames: k8stack-resources
    endpoint: nyc3.digitaloceanspaces.com
    region: nyc3
    s3forcepathstyle: false
    insecure: false
    http_config:
      idle_conn_timeout: 90s
      response_header_timeout: 0s
      insecure_skip_verify: false
```
It seems like using `secretAccessKey` or `accessKeyId` does not resolve variables from the environment; expansion only appears to work within the `s3` string. And that value's format is custom to this project too! This is not standard AWS S3 connection syntax, from what I have experienced.

This being Digital Ocean, I had to do a bit of tinkering with the `endpoint` value. Fortunately, the log output spelt that one out for me.

A key step I also had to take was going through this configuration and looking for all the `shared_store` values, as these were sometimes set to `s3`. From the Grafana docs I read that `s3` is an alias for `aws`, but I don't trust it, so I'd recommend changing those values to `aws`. I _think_ the `aws` value used elsewhere is what's used to look up the `aws` key under `storageConfig` and find the access credentials, etc.
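
In the rendered Loki config, the spots I would expect to double-check look roughly like this (a sketch; in the chart these live inside the templated config rather than as top-level values):
```yaml
compactor:
  shared_store: aws        # often shows up as s3 in examples; set it to aws to match the storage block
storage_config:
  boltdb_shipper:
    shared_store: aws
```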

I then configured my secrets within the `extraEnvFrom` section for each component I was deploying:
```yaml
extraEnvFrom:
  - secretRef:
      name: loki-s3-credentials
```
This is an opaque secret with the following data:
```
S3_LOKI_ACCESS_KEY: <my access key>
S3_LOKI_SECRET_ACCESS_KEY: <my secret access key>
```
Don't listen to some of the documentation talking about these values needing to be URL encoded. Pass them in exactly as you received them. Kubernetes will base64 encode them as always, but you don't need to do anything to them yourself. Copy, paste, and let Kubernetes do the rest.
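
A minimal sketch of that Secret; the namespace is an assumption, and `stringData` lets you paste the raw keys while Kubernetes handles the base64 encoding:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: loki-s3-credentials
  namespace: loki                      # assumed; use whatever namespace Loki is deployed into
type: Opaque
stringData:                            # raw values here; Kubernetes stores them base64 encoded
  S3_LOKI_ACCESS_KEY: <my access key>
  S3_LOKI_SECRET_ACCESS_KEY: <my secret access key>
```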


## Debugging and Post Deployment Checks

Once things appear to have booted successfully, I would check your S3 or Digital Ocean bucket. Loki should have filled it with some content. If you have no content, something has _definitely_ gone wrong without you knowing it. Loki doesn't seem to be very obvious or forthcoming about issues.
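
One quick way to confirm objects are landing, assuming you have the `aws` CLI configured with your Spaces keys (the DO console or `s3cmd` works too):
```bash
aws s3 ls s3://k8stack-resources --recursive \
  --endpoint-url https://nyc3.digitaloceanspaces.com | head
```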

Some helpful commands I used with `kubectl` were:
```bash
# Get an overview, are things running or rebooting and failing ?
kubectl get all -n loki

# Get details of a pod. This includes boot highlights, but also allows you to confirm what environment variables were passed to your container
kubectl describe pod <one of the loki pods> -n loki

# Finally, output the log output of the container. Again, this will be pretty useless until you have it mostly right!
kubectl logs <the compactor pod> -n loki --follow
```
These allowed me to deduce what the hell was going on

To get more verbose output, also pass these arguments in the `extraArgs` section of each of the components you are deploying:
```yaml
extraArgs:
  - -config.expand-env=true # you NEED this in order for environment variables to work in your storageConfig
  - --log.level=debug
  - --print-config-stderr
```
Again, `--log.level=debug` and `--print-config-stderr` are pretty useless until you get your `aws.s3` configuration correct. You'll be stuck with generic errors until you get that sorted


## Bonus Garbage
Oh, also: a whole bunch of these docs talk about using boltdb_shipper. That thing is deprecated! (https://grafana.com/docs/loki/latest/configure/storage/#boltdb-deprecated) There is a new one (https://grafana.com/docs/loki/latest/configure/storage/#tsdb-recommended), but man... documentation? Where is it? Nobody appears to be using it yet either.
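
For what it's worth, here is a hedged, untested sketch of what the tsdb equivalent looks like per that doc page:
```yaml
schema_config:
  configs:
    - from: "2024-04-01"
      store: tsdb                # replaces the deprecated boltdb-shipper
      object_store: aws
      schema: v13
      index:
        prefix: tsdb_index_
        period: 24h
storage_config:
  tsdb_shipper:
    active_index_directory: /var/loki/tsdb-index
    cache_location: /var/loki/tsdb-cache
    shared_store: aws
```
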
7 changes: 4 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
@@ -32,9 +32,10 @@ Below is a table of each piece installed in my cluster at the moment, and what r
| Traefik | Ingress Controller | |
| Kyverno | RBAC and Admissions Controller | |
| Prometheus | Observability - Metrics Server | |
| Grafana | Observability - Metrics Dashboard | |
| Elasticsearch | Observability - Logging Database | |
| Kibana | Observability - Logging Dashboard | Coming Soon |
| Prometheus Adapter | Metrics for Kubernetes Metrics API | Replaces metrics-server to work with Prometheus instead |
| Grafana | Observability - Metrics & Logging Dashboard | |
| Loki | Observability - Logging Database | |
| Promtail | Observability - Container Stdout/Stderr Log Scraping | Forwards to Loki |
| Vault | Secrets Manager | Coming Soon |

Below now is another table of the tech being used for managing and configuring my Kubernetes cluster:
9 changes: 6 additions & 3 deletions main.tf
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
terraform {
  required_version = "~> 1.7.5"
  required_version = "~> 1.8.1"

  required_providers {
    digitalocean = {
@@ -34,8 +34,7 @@ terraform {


module "k8infra" {
  source   = "./modules/k8infra"
  do_token = var.do_token
  source   = "./modules/k8infra"

  providers = {
    digitalocean = digitalocean
@@ -58,6 +57,10 @@ module "k8config" {
  cf_token = var.cf_token
  domain   = var.domain


  s3_access_key_id     = var.do_spaces_access_key_id
  s3_secret_access_key = var.do_spaces_secret_access_key

  providers = {
    kubernetes = kubernetes
    helm       = helm
38 changes: 34 additions & 4 deletions modules/k8config/main.tf
Original file line number Diff line number Diff line change
@@ -116,15 +116,45 @@ module "kyverno" {
  ]
}

module "elasticsearch" {
  source = "./modules/elasticsearch"

module "loki" {
  source = "./modules/loki"

  s3_access_key_id     = var.s3_access_key_id
  s3_secret_access_key = var.s3_secret_access_key

  providers = {
    kubectl = kubectl
    helm = helm
    helm    = helm
  }

  depends_on = [
    time_sleep.wait_60_seconds
  ]
}

module "promtail" {
  source = "./modules/promtail"

  providers = {
    helm = helm
  }

  depends_on = [
    time_sleep.wait_60_seconds,
    module.loki
  ]
}


module "prometheus-adapter" {
  source = "./modules/prometheus-adapter"

  providers = {
    helm = helm
  }

  depends_on = [
    time_sleep.wait_60_seconds,
    module.prometheus
  ]
}
@@ -33,3 +33,18 @@ resource "helm_release" "elasticsearch" {
    file("${abspath(path.module)}/res/elasticsearch-values.yaml")
  ]
}

resource "time_sleep" "wait_60_seconds" {
  depends_on      = [helm_release.elasticsearch]
  create_duration = "60s"
}

/*
resource "kubectl_manifest" "elasticsearch_cluster" {
  yaml_body = file("${abspath(path.module)}/res/elasticsearch.yaml")

  depends_on = [
    time_sleep.wait_60_seconds
  ]
}
*/
@@ -0,0 +1,39 @@
# https://github.com/elastic/cloud-on-k8s/blob/main/config/samples/elasticsearch/elasticsearch.yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch
  namespace: elasticsearch
spec:
  version: 8.13.2
  volumeClaimDeletePolicy: DeleteOnScaledownOnly
  nodeSets:
    - name: default
      count: 1 # 3 cluster nodes
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data
          spec:
            storageClassName: do-block-storage
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 5Gi
      podTemplate:
        spec:
          containers:
            - name: elasticsearch
              resources:
                requests:
                  memory: 1Gi
                  cpu: 1
                limits:
                  memory: 1Gi
                  cpu: 1
      #config:
      #  node.store.allow_mmap: false
  http:
    service:
      spec:
        type: ClusterIP
4 changes: 2 additions & 2 deletions modules/k8config/modules/_archive/grafana/main.tf
@@ -19,7 +19,7 @@ resource "helm_release" "grafana" {
  name = "grafana"

  repository = "https://grafana.github.io/helm-charts"
  chart      = "grafana"
  chart      = "grafana-agent-operator"

  atomic = true

@@ -34,7 +34,7 @@ resource "helm_release" "grafana" {
  dependency_update = true

  values = [
    file("${abspath(path.module)}/res/grafana-values.yaml")
    file("${abspath(path.module)}/res/grafana-agent-operator-values.yaml")
  ]

  depends_on = [