
# Your Stack

Here is a high-level diagram of the components used and how they interact:
```mermaid
flowchart-elk
subgraph ACI Monitoring Stack
G["Grafana"]
P[("Prometheus")]
L["Loki"]
PT["Promtail"]
SL["Syslog-ng"]
AM["Alertmanager"]
A["aci-exporter"]
G--"PromQL"-->P
G--"LogQL"-->L
P-->AM
PT-->L
SL-->PT
P--"Service Discovery"-->A
end
subgraph ACI
S["Switches"]
APIC["APIC"]
end
U["User"]
N["Notifications (Mail/Webex etc...)"]
V{Ver >= 6.1}
A--"API Queries"-->S
A--"API Queries"-->APIC
U-->G
AM-->N
S--"Syslog"-->V
APIC--"Syslog"-->V
V -->|Yes| PT
V -->|No| SL
```

# Stack Development
If you want to contribute to this project, start from [Here](docs/development.md).

# Stack Deployment

## Prerequisites
- Familiarity with Kubernetes: This installation guide is intended to assist with the setup of the ACI Monitoring Stack and assumes prior familiarity with Kubernetes; it is not designed to provide instruction on Kubernetes itself.
- A Kubernetes cluster: Currently the stack has been tested on `Upstream Kubernetes 1.30.x` and `Minikube`.
- Persistent Volumes: 10G should be plenty for a small/demo environment. Many storage provisioners support volume expansion, so it should be easy to increase this post-installation (a quick check is shown after this list).
- Ability to expose services for:
  - Access to the Grafana, Prometheus and Alertmanager dashboards: This is ideally achieved via an `Ingress Controller`.
    - (Optional) Wildcard DNS entries for the ingress controller domain.
  - Syslog ingestion from ACI: Since syslog can be sent via `UDP` or `TCP`, it is more flexible to expose these services directly via either a `NodePort` or a `LoadBalancer` service type.
- Cluster compute resources: This stack has been tested against a 500-node ACI fabric and consumed roughly 8GB of RAM; CPU did not seem to play a major role, and any modern CPU should suffice.
- 1 dedicated namespace per instance: One instance can monitor at least 500 switches.
  - This is not strictly required but is suggested to keep the Helm configuration simple so the default K8s service names can be re-used; see the [Config Preparation](#config-preparation) section for more details.
- Helm: This stack is distributed as a Helm chart and relies on 3rd-party Helm charts as well.
- Connectivity from your Kubernetes cluster to ACI, either over Out-of-Band or In-Band management.
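
Before installing, a quick sanity check that a StorageClass is available for the persistent volumes and that the nodes are ready (plain `kubectl`, nothing stack-specific):

```shell
# A default StorageClass is needed for the PVCs created by the stack
kubectl get storageclass
# All nodes should report Ready
kubectl get nodes -o wide
```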

# Installation

If you are installing on Minikube please follow the [Minikube Preparation Steps](docs/minikube.md) and then **come back here.**

## Config Preparation

The ACI Monitoring Stack is a combination of several [Charts](charts/aci-monitoring-stack/charts); if you are familiar with Helm, you know the struggle of propagating dynamic values to sub-charts. For example, it is not possible to pass the name of a service to a sub-chart in a dynamic way.

To simplify the user experience, the chart comes with a few pre-configured parameters that are populated in the configurations of the various sub-charts. For example, the aci-exporter service name is pre-configured as `aci-exporter-svc`, and this value is then passed to Prometheus as the service discovery URL.

All these values can be customized; if you need to, refer to the [Values](charts/aci-monitoring-stack/values.yaml) file.

*Note:* This is the first Helm chart `camrossi` created, and he is sure it can be improved. If you have suggestions, they are extremely welcome! :)

To understand the stack, it is helpful to break it down into separate functions, each focusing on a different aspect of monitoring the Cisco Application Centric Infrastructure (ACI) environment: fabric discovery, object scraping, syslog ingestion, data visualization and alerting.

### Fabric Discovery

The ACI Monitoring Stack uses Prometheus HTTP Service Discovery (HTTP SD): Prometheus periodically queries an HTTP endpoint (served by the aci-exporter) for a list of target configurations in JSON format and dynamically discovers and scrapes those targets.

The stack needs only the IP addresses of the APICs; the switches are auto-discovered. If switches are added to or removed from the fabric, no action is required from the end user.

```mermaid
flowchart-elk RL
P[("Prometheus")]
A["aci-exporter"]
APIC["APIC"]
APIC -- "API Query" --> A
A -- "HTTP SD" --> P
```
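
For reference, an HTTP SD endpoint returns a JSON list of target groups in the standard Prometheus format; a minimal sketch (the address and label values here are purely illustrative, not the aci-exporter's actual output):

```json
[
  {
    "targets": ["192.0.2.11:9643"],
    "labels": {
      "fabric": "fab1"
    }
  }
]
```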

### The aci-exporter

The aci-exporter is the bridge between your Cisco ACI environment and the Prometheus monitoring ecosystem. For it to work, it needs to know:
- `fabrics`: A list of fabrics and how to connect to the APICs.
  - Requires a **ReadOnly** **Admin** user.
- `service_discovery`: Configures whether devices are reachable via Out-of-Band (`oobMgmtAddr`) or In-Band (`inbMgmtAddr`) management addresses.

*Note:* The switches are auto-discovered.

This is done by setting the following values in Helm:

```yaml
aci_exporter:
  # Profiles for different fabrics
  fabrics:
    fab1:
      username: <username>
      password: <password>
      apic:
        - https://IP1
        - https://IP2
        - https://IP3
      # service_discovery oobMgmtAddr|inbMgmtAddr
      service_discovery: oobMgmtAddr
    fab2:
      username: <username>
      password: <password>
      apic:
        - https://IP1
        - https://IP2
        - https://IP3
      # service_discovery oobMgmtAddr|inbMgmtAddr
      service_discovery: inbMgmtAddr
```
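
If you prefer not to keep credentials in a values file, you can also pass them at install time; a sketch using the release and chart names from the [Chart Deployment](#chart-deployment) section below:

```shell
# Override the fab1 password on the command line instead of in the file
helm -n aci-mon-stack upgrade --install aci-mon-stack \
  aci-monitoring-stack/aci-monitoring-stack \
  -f aci-mon-stack-config.yaml \
  --set aci_exporter.fabrics.fab1.password='<password>'
```
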
### Prometheus and Alertmanager
Prometheus is installed via its [own Chart](https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus). The options you need to set are:
- The `ingress` config and the `baseURL`: These are most likely the same URLs used to access `prometheus` and `alertmanager`.
- Persistent Volume capacity.
- (Optional) `retentionSize`: Only needed if you want to limit retention by size. Keep in mind that if you run out of disk space Prometheus WILL stop working.
- (Optional) Alertmanager `route`: Used to send notifications via Mail/Webex etc.; the complete syntax is available [Here](https://prometheus.io/docs/alerting/latest/configuration/#receiver-integration-settings).

Below is an example:
```yaml
prometheus:
  server:
    ingress:
      enabled: true
      ingressClassName: "traefik"
      hosts:
        - aci-exporter-prom.apps.c1.cam.ciscolabs.com
    baseURL: "http://aci-exporter-prom.apps.c1.cam.ciscolabs.com"
    retentionSize: 5GB
    persistentVolume:
      accessModes: ["ReadWriteOnce"]
      size: 5Gi
  alertmanager:
    baseURL: "http://aci-exporter-alertmanager.apps.c1.cam.ciscolabs.com"
    ingress:
      enabled: true
      ingressClassName: "traefik"
      hosts:
        - host: aci-exporter-alertmanager.apps.c1.cam.ciscolabs.com
          paths:
            - path: /
              pathType: ImplementationSpecific
    config:
      route:
        group_by: ['alertname']
        group_interval: 30s
        repeat_interval: 30s
        group_wait: 30s
        receiver: 'webex'
      receivers:
        - name: webex
          webex_configs:
            - send_resolved: false
              api_url: "https://webexapis.com/v1/messages"
              room_id: "<room_id>"
              http_config:
                authorization:
                  credentials: "<credentials>"
```
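
Once deployed, a quick smoke test of both UIs (assuming the example hostnames above; both applications expose a `/-/healthy` endpoint):

```shell
# Each should return an HTTP 200 with a short "Healthy" message
curl -s http://aci-exporter-prom.apps.c1.cam.ciscolabs.com/-/healthy
curl -s http://aci-exporter-alertmanager.apps.c1.cam.ciscolabs.com/-/healthy
```
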
If you use Webex, here are some [config steps](docs/webex.md) for you!

### Grafana

Grafana is installed via its [own Chart](https://github.com/grafana/helm-charts/tree/main/charts/grafana). The main options you need to set are:

- The `ingress` config: The external URL used to access Grafana.
- Persistent Volume capacity.
- (Optional) `adminPassword`: If not set, it will be auto-generated and can be found in the `grafana` secret.
- (Optional) `viewers_can_edit`: Allows users with a `view only` role to modify the dashboards and access `Explore` to execute queries against `Prometheus` and `Loki`. However, the user will not be able to save any changes.
- (Optional) `deploymentStrategy`: If the Grafana `Persistent Volume` is of type `ReadWriteOnce`, rolling updates will get stuck because the new pod cannot start before the old one releases the PVC. Setting `deploymentStrategy.type` to `Recreate` destroys the original pod before starting the new one.

Below is an example:

```yaml
grafana:
  grafana.ini:
    users:
      viewers_can_edit: "True"
  adminPassword: <adminPassword>
  deploymentStrategy:
    type: Recreate
  ingress:
    ingressClassName: "traefik"
    enabled: true
    hosts:
      - aci-exporter-grafana.apps.c1.cam.ciscolabs.com
  persistence:
    enabled: true
    size: 2Gi
```
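
If you leave `adminPassword` unset, you can recover the generated password from the secret; a sketch assuming a release named `aci-mon-stack` in the `aci-mon-stack` namespace (the actual secret name depends on your release name):

```shell
# The Grafana chart stores the generated password under the admin-password key
kubectl -n aci-mon-stack get secret aci-mon-stack-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d
```
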
### Syslog config

The syslog config is the most complicated part, as it relies on 3 components (`promtail`, `loki` and `syslog-ng`), each with its own individual config. Furthermore, there are two issues to overcome:

- The syslog messages don't contain the ACI fabric name: to distinguish the messages of one fabric from another, the only solution is to use dedicated `external services` with a unique `IP:Port` pair per fabric.
- Until ACI 6.1, `syslog-ng` is needed between `ACI` and `Promtail` to convert messages from RFC 3164 to RFC 5424.
  *Note*: Promtail 3.1.0 adds support for RFC 3164; however, this **DOES NOT** work for Cisco switches and still requires syslog-ng. The syslog-ng `syslog-parser` has extensive logic to handle all the complexities (and inconsistencies) of RFC 3164 messages.
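
For illustration, the sample messages from the two RFCs show the difference the conversion bridges (no year or timezone and a loose header in 3164; a version field, full ISO timestamp and structured header in 5424):

```text
RFC 3164: <34>Oct 11 22:14:15 mymachine su: 'su root' failed for lonvick on /dev/pts/8
RFC 5424: <34>1 2003-10-11T22:14:15.003Z mymachine.example.com su - ID47 - 'su root' failed for lonvick on /dev/pts/8
```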

#### Loki

Loki is deployed with the [Simple Scalable](https://grafana.com/docs/loki/latest/get-started/deployment-modes/#simple-scalable) profile and is composed of a `backend`, `read` and `write` deployment with a replica count of 3.

The `backend` and `write` deployments require persistent volumes. This chart is pre-configured to allocate 2Gi volumes for each deployment (a total of 6 PVCs will be created):
- `3 x data-loki-backend-X`
- `3 x data-loki-write-X`

The PVC size can be easily changed if required.

Loki also requires an `Object Store`. This chart is pre-configured to deploy [minio](https://min.io/). *Note:* Currently the [Loki Chart](https://github.com/grafana/loki/tree/main/production/helm/loki) deploys a very old version of `Minio`, and there is already a [PR open](https://github.com/grafana/loki/pull/11409) to address this.

Loki also supports a `chunks-cache` via `memcached`. The default config allocates 8G of memory; I have decreased this to 1G by default.

If you want to change any of these parameters, check the `loki` section in the [Values](charts/aci-monitoring-stack/values.yaml) file.

Assuming the default parameters are acceptable, the only required config for Loki is to set the `rulerConfig.external_url` to point to the Grafana `ingress` URL:

```yaml
loki:
  loki:
    rulerConfig:
      external_url: http://aci-exporter-grafana.apps.c1.cam.ciscolabs.com
```
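
The ruler evaluates standard Loki alerting rules (Prometheus rule syntax with LogQL expressions). A sketch of what such a rule could look like; the `fabric` label is set by the Promtail config described below, while the rule name, match string and threshold are illustrative:

```yaml
groups:
  - name: aci-syslog
    rules:
      - alert: FabricCriticalSyslog
        # Count syslog lines containing "critical", per fabric, over 5 minutes
        expr: sum by (fabric) (count_over_time({fabric=~".+"} |= "critical" [5m])) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: 'Critical syslog received from fabric {{ $labels.fabric }}'
```
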
### ACI Object Scraping

`Prometheus` scraping is the process by which `Prometheus` periodically collects metrics data by sending HTTP requests to predefined endpoints on the monitored targets. The `aci-exporter` translates ACI-specific metrics into a format that `Prometheus` can ingest, ensuring that all crucial data points are captured and monitored effectively.

```mermaid
flowchart-elk RL
P[("Prometheus")]
A["aci-exporter"]
subgraph ACI
S["Switches"]
APIC["APIC"]
end
A--"Scraping"-->P
S--"API Queries"-->A
APIC--"API Queries"-->A
```
### Promtail and Syslog-ng

These two components are tightly coupled:

- Syslog-ng translates logs from RFC 3164 to RFC 5424 and forwards them to Promtail.
- Promtail ingests logs in RFC 5424 format and forwards them to Loki.

```mermaid
flowchart-elk LR
L["Loki"]
PT["Promtail"]
SL["Syslog-ng"]
PT-->L
SL-->PT
subgraph ACI
S["Switches"]
APIC["APIC"]
end
V{Ver >= 6.1}
S--"Syslog"-->V
APIC--"Syslog"-->V
V -->|Yes| PT
V -->|No| SL
```

Promtail is pre-configured with:

- Deployment mode with 1 replica.
- Loki push gateway URL: `loki-gateway`. This is the Loki gateway K8s service name.
- Auto-generated `scrapeConfigs` that map a fabric to an `IP:Port` pair.

Syslog-ng is pre-configured with:

- Deployment mode with 1 replica.

These settings can be easily changed if required; check the `promtail` section in the [Values](charts/aci-monitoring-stack/values.yaml) file for more details.
If you are happy with the defaults, the only configs required are the `extraPorts` for Promtail and the `services` for Syslog-ng (`Syslog-ng` is only needed for ACI < 6.1). You will need one entry per fabric, and the ports need to match; see the diagram below for a visual representation.

Below is a diagram of the goal for an ACI 6.1 fabric and an ACI 5.2 one:
```mermaid
flowchart-elk
subgraph K8s Cluster
subgraph Promtail
PT1513["TCP:1513 label:fab1"]
PT1514["TCP:1514 label:fab2"]
end
subgraph Syslog-ng
SL["UDP:1514"]
end
F1SVC["LoadBalancerIP TCP:1513"]
F2SVC["LoadBalancerIP UDP:1514"]
F1SVC --> PT1513
F2SVC --> SL
end
ACI61["ACI Fab1 Ver. 6.1"] --> F1SVC
ACI52["ACI Fab2 Ver. 5.2"] --> F2SVC
SL --> PT1514
```

The above architecture can be achieved with the following config:

- `name`: Sets the `fabric` label for the logs received by Loki.
- `containerPort`: The port the container listens on. This maps a log stream to a fabric.
- `service.type`: I would suggest setting this to either `NodePort` or `LoadBalancer`. Regardless, the allocated IP MUST be reachable by all the fabric nodes.
- `service.port`: The port the `LoadBalancer` service listens on; this is the port you set in the ACI syslog config.
- `service.nodePort`: The port the `NodePort` service listens on; this is the port you set in the ACI syslog config.

```yaml
promtail:
  extraPorts:
    fab1:
      name: fab1
      containerPort: 1513
      service:
        type: LoadBalancer
        port: 1513
    fab2:
      name: fab2
      containerPort: 1516
      service:
        type: ClusterIP
syslog:
  services:
    fab2:
      name: fab2
      containerPort: 1516
      protocol: UDP
      service:
        type: LoadBalancer
        port: 1516
```
### ACI Syslog Config

If you need a reminder on how to configure ACI syslog, take a look [Here](docs/syslog.md).

Here is an [Example Config for 4 Fabrics](docs/4-fabric-example.yaml).

## Data Visualization

Data visualization is handled by `Grafana`, an open-source analytics and monitoring platform that allows users to visualize, query, and analyze data from various sources through customizable and interactive dashboards. It supports a wide range of data sources, including `Prometheus` and `Loki`, enabling users to create real-time visualizations, alerts, and reports to monitor system performance and gain actionable insights.

```mermaid
flowchart-elk RL
G["Grafana"]
L["Loki"]
P[("Prometheus")]
U["User"]
P--"PromQL"-->G
L--"LogQL"-->G
G-->U
```
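
If you have not set up an ingress yet, you can still reach Grafana locally with a port-forward; a sketch assuming a release named `aci-mon-stack` (the service name depends on your release name; the Grafana chart's service listens on port 80 by default):

```shell
# Forward local port 3000 to the Grafana service, then browse http://localhost:3000
kubectl -n aci-mon-stack port-forward svc/aci-mon-stack-grafana 3000:80
```
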
## Alerting

`Alertmanager` is a component of the `Prometheus` ecosystem designed to handle alerts generated by `Prometheus`. It manages the entire lifecycle of alerts, including deduplication, grouping, silencing, and routing notifications to various communication channels such as email, `Webex`, `Slack`, and others, ensuring that alerts are delivered to the right people in a timely and organized manner.

In the ACI Monitoring Stack both `Prometheus` and `Loki` are configured with alerting rules.

```mermaid
flowchart-elk LR
L["Loki"]
P["Prometheus"]
AM["Alertmanager"]
N["Notifications (Mail/Webex etc...)"]
L --> AM
P --> AM
AM --> N
```
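
To verify the notification path end-to-end, you can fire a synthetic alert at the Alertmanager v2 API; a sketch assuming the example Alertmanager URL used earlier (the alert name and label are arbitrary):

```shell
# Alertmanager should route this through the configured receiver (e.g. Webex)
curl -XPOST http://aci-exporter-alertmanager.apps.c1.cam.ciscolabs.com/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels":{"alertname":"TestNotification","severity":"warning"}}]'
```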

## Chart Deployment

Create a file containing all your configs, e.g. `aci-mon-stack-config.yaml`, then install the chart:

```shell
helm repo add aci-monitoring-stack https://datacenter.github.io/aci-monitoring-stack
helm repo update
helm -n aci-mon-stack upgrade --install --create-namespace aci-mon-stack aci-monitoring-stack/aci-monitoring-stack -f aci-mon-stack-config.yaml
```
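
Once the release is installed, verify that all the pods come up (namespace as used above):

```shell
kubectl -n aci-mon-stack get pods
```
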
# Click here for the [Stack Development](docs/development.md) Guide