Nopo11y health check improvements (#43)

znsio · Dec 16, 2024 · 924a5c9
1 parent 7709000
commit 924a5c9

Showing 2 changed files with 218 additions and 45 deletions.
diff --git a/tools/system-health-check/README.md b/tools/system-health-check/README.md
@@ -1,53 +1,125 @@
 # System health check
 --------
-This program queries the prometheus running inside or outside of your cluster to check health of pods, pvcs, and nodes. This health check program checks for pods, pvcs and nodes health, it also checks for SLO alerts configured in your environment (using sloth) and reports the status to the kuberhealthy.
+
+This Kuberhealthy health check program queries a Prometheus server, running inside or outside your cluster, to:
+
+- Evaluate the health of pods, nodes, and PVCs based on customizable thresholds.
+- Detect any critical SLO alerts currently in a firing state.
+
+The health check program reports its findings directly to Kuberhealthy, enabling streamlined observability and alerting within your cluster.
 
 [READ][Kuberhealthy](https://github.com/kuberhealthy/kuberhealthy)
 
 
-## How it works.
+## Supported Checks  
 ---------------
+This Kuberhealthy health check program performs three types of checks: **Namespace**, **Node**, and **SLO**. The type of check to perform is determined by an environment variable (`HEALTH_CHECK_TYPE`). Based on the selected check type, the program functions as follows:  
 
-### Pod health check
-It checks for the pod's readiness, CPU, and Memory utilization, it marks the pod as unhealthy if it is not in ready state or pod's CPU or Memory utilization is above configured threshold (default is 80%), if pod is ready and its CPU and Memory utilization is not above the configured threshold then it mark that pod as ready. Then it checks for no of pods expected by the deployment against the healthy no of pods of that deployment, if the percentage of healthy pods of the deployment is lesser than the configured threshold (by default 30%) then it mark that deployment as un healthy and report the failure to the Kuberhealthy
+1. **Namespace Check**  
+   - If `HEALTH_CHECK_TYPE` is set to `namespace`, the program validates the health of **pods** and **PVCs** across all deployments within the specified namespace.  
+   - The namespace to monitor must be provided via the `NAMESPACE` environment variable.  
+   - The health evaluation is based on thresholds defined in the environment variables you set for pod and PVC health.
 
-### PVC health check
-It checks for the avialable space of PVCs in configured namespace, if the available space on PVC is less than configured threshold (default is  200mb) then it marks that PVC as unhealthy and resports the failur to the Kuberhealthy.
+2. **Node Check**  
+   - If `HEALTH_CHECK_TYPE` is set to `node`, the program assesses the health of all **nodes** in the Kubernetes cluster.  
+   - It ensures nodes meet the predefined thresholds for node health metrics.
 
-### Node health Check
-It checks for the node's readiness, root disk, CPU and Memory utilization, it marks node unhealthy if it is not in ready state or root disk space of node is lesser than the configured thresold (default 200mb) or CPU or Memory utilization of node is above the configured threshold (default 80% and 400m CPU available and 1000mb Memory available), if all of these checks are ok then it mark that node as healthy, if it find any unhealthy node then it reports the failure to Kuberhealthy.
+3. **SLO Check**  
+   - If `HEALTH_CHECK_TYPE` is set to `slo`, the program verifies if any **critical SLO alerts** are in an active firing state.  
+   - This check is particularly useful for ensuring compliance with service-level objectives and identifying critical issues promptly.   
 
-### SLO alert check
-It checks if any critical SLO alerts are in active state i cluster, if it find any active critical SLO alert then it reports the failure to the Kuberhealthy.
-[READ][Sloth SLOs](https://sloth.dev/)
+
 
-## How to configure
---------
-This health check program except below environment variables, if you don't provide any it takes default values.
-
-|Environment Variables|Default|Description|
-|---------------------|-------|-----------|
-|PROMETHEUS_ENDPOINT|http://nopo11y-stack-kube-prometh-prometheus:9090|Prometheus URL on which you want to run your queries|
-|NAMESPACE|default|Kubernetes namespace where you have your services deployed|
-|HEALTHY_PODS_PERCENTAGE|30%|Percentage of healthy pods for deployments in give namespace|
-|HEALTHY_POD_CPU_UTILIZATION_THRESHOLD|80%|CPU utilization threshold for healthy pods|
-|HEALTHY_POD_MEMORY_UTILIZATION_THRESHOLD|80%|Memory utilization threshold for healthy pods|
-|HEALTHY_PVC_FREE_SPACE|200mb|Available space threshold for healthy pvcs|
-|HEALTHY_NODE_CPU_UTILIZATION_THRESHOLD|90%|CPU utilization for healthy nodes|
-|HEALTHY_NODE_CPU_AVAILABLE|400m|CPU milicores available for healthy nodes|
-|HEALTHY_NODE_MEMORY_UTILIZATION_THRESHOLD|90%|Memory utilization threshold for healthy nodes|
-|HEALTHY_NODE_MEMORY_AVAILABLE|1000mb|Free Memory in mbs for healthy nodes|
-|HEALTHY_NODE_ROOT_DISK_AVAILABLR_SPACE|200mb|Free space available on node's root disk in mbs for healthy nodes|
-
-[Check][Example](./examples/health-check.yaml)
-
-## Sample Kuberhealthy check using nopoo11y-health-check
+---
+
+## How the Checks Work  
+--------------
+This program performs detailed health checks based on the selected `HEALTH_CHECK_TYPE`. Below is an explanation of how each check operates:  
+
+### 1. **Namespace Health Check**  
+   - **Scope:** Validates the health of pods within all deployments, and of PVCs, in the specified namespace.  
+   - **Checks Performed:**  
+     - **Pod Readiness:** Verifies if each pod in the namespace is in a ready state.  
+     - **Resource Utilization:** Ensures that each pod's CPU and memory usage is below the defined thresholds:
+       - CPU: `HEALTHY_POD_CPU_UTILIZATION_THRESHOLD`
+       - Memory: `HEALTHY_POD_MEMORY_UTILIZATION_THRESHOLD`  
+     - **Unhealthy Pod Evaluation:** If a pod is not ready or exceeds the CPU/memory thresholds, it is marked as **unhealthy**.  
+   - **Deployment Health:**  
+     - The program calculates the percentage of healthy pods in each deployment.  
+     - If that percentage falls below the threshold defined by the `HEALTHY_PODS_PERCENTAGE` environment variable, the deployment is marked **unhealthy**.  
+   - **Reporting:** Any unhealthy deployment, or any PVC whose free space is below `HEALTHY_PVC_FREE_SPACE`, triggers a failure report to Kuberhealthy.
+
+### 2. **Node Health Check**  
+   - **Scope:** Monitors the health of all nodes in the Kubernetes cluster.  
+   - **Checks Performed:**  
+     - **Node Readiness:** Ensures all nodes are in a ready state.  
+     - **Resource Utilization:**  
+       - **CPU Utilization:**  
+         - Verifies that each node’s CPU utilization is below the threshold set by `HEALTHY_NODE_CPU_UTILIZATION_THRESHOLD`.  
+         - Confirms sufficient available CPU (`HEALTHY_NODE_CPU_AVAILABLE`).  
+       - **Memory Utilization:**  
+         - Ensures each node's memory usage is below the threshold set by `HEALTHY_NODE_MEMORY_UTILIZATION_THRESHOLD`.  
+         - Confirms that sufficient memory is available on the node, defined by `HEALTHY_NODE_MEMORY_AVAILABLE`.  
+     - **Disk Space:** Validates that the root disk available space on each node meets the minimum requirement defined by `HEALTHY_NODE_ROOT_DISK_AVAILABLE_SPACE`.  
+   - **Unhealthy Node Identification:** If any node fails one or more of these checks, it is marked **unhealthy**.  
+   - **Reporting:** Unhealthy nodes are reported as failures to Kuberhealthy.  
+
+### 3. **SLO Alert Check**  
+   - **Scope:** Monitors for any active **critical SLO alerts** firing in the cluster.  
+   - **Checks Performed:**  
+     - Queries Prometheus to identify if any critical SLO alert is currently in a firing state.  
+   - **Reporting:** If any active critical SLO alert is detected, it is reported as a failure to Kuberhealthy.  
+
+---
+
+## Customizing the Health Check  
+-------------
+The behavior of the Kuberhealthy health check can be customized using the following environment variables. These variables allow users to configure thresholds, select check types, and define endpoints for Prometheus or Thanos Query.  
+
+#### **Environment Variables**  
+
+1. **Prometheus/Thanos Configuration**  
+   - `PROMETHEUS_ENDPOINT`: URL of the Prometheus server to query metrics from.  
+   - `THANOS_QUERY_ENDPOINT`: (Optional) URL of the Thanos Query endpoint. When set, the SLO check queries it instead of Prometheus.  
+
+2. **Health Check Type**  
+   - `HEALTH_CHECK_TYPE`: Specifies the type of health check to perform (defaults to `namespace`). Supported values are:  
+     - `namespace`: Check health of deployments within a specific namespace.  
+     - `node`: Check health of all nodes in the cluster.  
+     - `slo`: Check for active critical SLO alerts.  
+
+3. **Namespace Configuration (for Namespace Check)**  
+   - `NAMESPACE`: Specifies the namespace to monitor when `HEALTH_CHECK_TYPE` is set to `namespace` (defaults to `default`).  
+
+4. **Thresholds for Namespace Health Check**  
+   - `HEALTHY_PODS_PERCENTAGE`: Minimum percentage of healthy pods required for a deployment to be considered healthy.  
+   - `HEALTHY_POD_CPU_UTILIZATION_THRESHOLD`: Maximum CPU utilization (in percentage) for pods to be considered healthy.  
+   - `HEALTHY_POD_MEMORY_UTILIZATION_THRESHOLD`: Maximum memory utilization (in percentage) for pods to be considered healthy.  
+   - `HEALTHY_PVC_FREE_SPACE`: Minimum free space (in MB) required for PVCs to be considered healthy.  
+
+5. **Thresholds for Node Health Check**  
+   - `HEALTHY_NODE_CPU_UTILIZATION_THRESHOLD`: Maximum CPU utilization (in percentage) for nodes to be considered healthy.  
+   - `HEALTHY_NODE_CPU_AVAILABLE`: Minimum available CPU (in millicores) required for nodes to be considered healthy.  
+   - `HEALTHY_NODE_MEMORY_UTILIZATION_THRESHOLD`: Maximum memory utilization (in percentage) for nodes to be considered healthy.  
+   - `HEALTHY_NODE_MEMORY_AVAILABLE`: Minimum available memory (in MB) required for nodes to be considered healthy.  
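+   - `HEALTHY_NODE_ROOT_DISK_AVAILABLE_SPACE`: Minimum root disk available space (in MB) required for nodes to be considered healthy. Note that `health_check.py` currently reads this value from `HEALTHY_NODE_ROOT_DISK_AVAILABLR_SPACE`.  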
+
+---
+
+## Sample Kuberhealthy Checks  
+----------
+Below are sample configurations for integrating the Kuberhealthy health check with different types of monitoring:  
+
+---------
+
+#### 1. **Namespace Health Check**  
+Monitors the health of all deployments within the specified namespace (`apps`) using Prometheus metrics.  
 
 ```yaml
 apiVersion: comcast.github.io/v1
 kind: KuberhealthyCheck
 metadata:
-  name: sample-nopo11y-health-check
+  name: apps-nopo11y-health-check
   namespace: observability
 spec:
   podSpec:
@@ -57,8 +129,10 @@ spec:
         value: kuberhealthy:80
       - name: PROMETHEUS_ENDPOINT
         value: http://nopo11y-stack-kube-prometh-prometheus:9090/prometheus
+      - name: HEALTH_CHECK_TYPE
+        value: namespace
       - name: NAMESPACE
-        value: sample
+        value: apps
       image: ghcr.io/znsio/nopo11y/system-health-check:latest
       imagePullPolicy: IfNotPresent
       name: main
@@ -76,9 +150,84 @@ spec:
   timeout: 5m
 ```
 
-## Build docker image
----------------
-Dockerfile present in this directory has a instructions for building the docker image.
-```sh
-docker build -t <your-registry>:<docker-tag> .
+---
+
+#### 2. **Node Health Check**  
+Checks the health of all nodes in the cluster based on Prometheus metrics for CPU, memory, and disk utilization.  
+
+```yaml
+apiVersion: comcast.github.io/v1
+kind: KuberhealthyCheck
+metadata:
+  name: nodes-nopo11y-health-check
+  namespace: observability
+spec:
+  podSpec:
+    containers:
+    - env:
+      - name: KH_REPORTING_URL
+        value: kuberhealthy:80
+      - name: PROMETHEUS_ENDPOINT
+        value: http://nopo11y-stack-kube-prometh-prometheus:9090/prometheus
+      - name: HEALTH_CHECK_TYPE
+        value: node
+      image: ghcr.io/znsio/nopo11y/system-health-check:latest
+      imagePullPolicy: IfNotPresent
+      name: main
+      resources:
+        requests:
+          cpu: 10m
+          memory: 50Mi
+      securityContext:
+        allowPrivilegeEscalation: false
+        readOnlyRootFilesystem: true
+    securityContext:
+      fsGroup: 999
+      runAsUser: 999
+  runInterval: 1m
+  timeout: 5m
+```
+
+---
+
+#### 3. **SLO Alert Check**  
+Checks for any active critical SLO alerts in the cluster, querying Thanos Query when `THANOS_QUERY_ENDPOINT` is set and Prometheus otherwise.  
+
+```yaml
+apiVersion: comcast.github.io/v1
+kind: KuberhealthyCheck
+metadata:
+  name: slo-nopo11y-health-check
+  namespace: observability
+spec:
+  podSpec:
+    containers:
+    - env:
+      - name: KH_REPORTING_URL
+        value: kuberhealthy:80
+      - name: PROMETHEUS_ENDPOINT
+        value: http://nopo11y-stack-kube-prometh-prometheus:9090/prometheus
+      - name: THANOS_QUERY_ENDPOINT
+        value: http://nopo11y-stack-thanos-query:9090/thanos-query
+      - name: HEALTH_CHECK_TYPE
+        value: slo
+      image: ghcr.io/znsio/nopo11y/system-health-check:latest
+      imagePullPolicy: IfNotPresent
+      name: main
+      resources:
+        requests:
+          cpu: 10m
+          memory: 50Mi
+      securityContext:
+        allowPrivilegeEscalation: false
+        readOnlyRootFilesystem: true
+    securityContext:
+      fsGroup: 999
+      runAsUser: 999
+  runInterval: 1m
+  timeout: 5m
 ```
+
+---
+
+These configurations can be customized by updating the environment variables to suit your specific requirements for Prometheus endpoints, thresholds, and check types.
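
The deployment rule described in the README (a deployment is unhealthy when its share of healthy pods drops below `HEALTHY_PODS_PERCENTAGE`, default 30) can be sketched roughly as below. This is an illustration only: `deployment_is_healthy` is a hypothetical helper, not a function from `health_check.py`, and treating a deployment with zero expected pods as healthy is an assumption.

```python
def deployment_is_healthy(healthy_pods: int, expected_pods: int,
                          min_healthy_pct: float = 30.0) -> bool:
    """Return True when the share of healthy pods meets the minimum percentage.

    Hypothetical helper for illustration; min_healthy_pct mirrors the
    HEALTHY_PODS_PERCENTAGE default of 30 read in health_check.py.
    """
    if expected_pods <= 0:
        # Assumption: a deployment scaled to zero has no pods that can fail.
        return True
    healthy_pct = 100.0 * healthy_pods / expected_pods
    # Unhealthy only when the healthy share is strictly below the threshold.
    return healthy_pct >= min_healthy_pct
```

With the default threshold, a deployment expecting 10 pods stays healthy with 3 healthy pods and is flagged once only 2 remain.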
diff --git a/tools/system-health-check/health_check.py b/tools/system-health-check/health_check.py
@@ -15,7 +15,9 @@
 
 logger.info("Starting health check")
 
-prometheus = os.getenv("PROMETHEUS_ENDPOINT", "http://nopo11y-stack-kube-prometh-prometheus:9090")
+prometheus = os.getenv("PROMETHEUS_ENDPOINT", "http://nopo11y-stack-kube-prometh-prometheus.observability.svc.cluster.local:9090/prometheus")
+thanos_query = os.getenv("THANOS_QUERY_ENDPOINT", "")
+health_check_type = os.getenv("HEALTH_CHECK_TYPE","namespace")
 namespace = os.getenv("NAMESPACE","default")
 healthy_pods = os.getenv("HEALTHY_PODS_PERCENTAGE","30")
 pod_cpu_threshold = os.getenv("HEALTHY_POD_CPU_UTILIZATION_THRESHOLD","80")
@@ -28,6 +30,8 @@
 node_disk_available = os.getenv("HEALTHY_NODE_ROOT_DISK_AVAILABLR_SPACE", "200")
 
 prometheus_url = prometheus + '/api/v1/query?'
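+# Always define thanos_query_url; it stays empty when THANOS_QUERY_ENDPOINT is unset so slo_check() can fall back to Prometheus.
+thanos_query_url = thanos_query + '/api/v1/query?' if thanos_query != "" else ""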
 
 pods_with_resources_query = 'sum(kube_pod_container_resource_limits{namespace="'+ namespace +'",container!~"istio-proxy|", resource="cpu"}) by (pod) AND sum(kube_pod_container_resource_limits{namespace="'+ namespace +'",container!~"istio-proxy|", resource="memory"}) by (pod)'
 
@@ -223,7 +227,10 @@ def node_check():
 def slo_check():
     failed = []
     try:
-        response = requests.get(prometheus_url, params={'query': 'ALERTS{alertname=~".*availability.*|.*requests.*|.*latency.*|.*response time.*", severity="critical"}'})
+        slo_query = 'ALERTS{alertname=~".*availability.*|.*requests.*|.*latency.*|.*response time.*", severity="critical"}'
+        # Prefer Thanos Query when it is configured; otherwise query Prometheus.
+        query_url = thanos_query_url if thanos_query_url != "" else prometheus_url
+        response = requests.get(query_url, params={'query': slo_query})
         response_json = json.loads(response.text)
         if response_json['data']['result']:
             for result in response_json['data']['result']:
@@ -256,10 +263,27 @@ def pods_details():
 
 def main():
     errors = []
-    for failures in [pods_details(), pvc_check(), node_check(), slo_check()]:
-        for error in failures:
-            errors.append(error)
-    logger.info("health check errors %s",  str(errors))
+    if health_check_type == "namespace":
+        logger.info("Running health check for namespace - %s", str(namespace))
+        for failures in [pods_details(), pvc_check()]:
+            for error in failures:
+                errors.append(error)
+        logger.info("health check errors %s",  str(errors))
+
+    if health_check_type == "node":
+        logger.info("Running health check for nodes")
+        for failures in [node_check()]:
+            for error in failures:
+                errors.append(error)
+        logger.info("health check errors %s",  str(errors))
+
+    if health_check_type == "slo":
+        logger.info("Running health check for SLO")
+        for failures in [slo_check()]:
+            for error in failures:
+                errors.append(error)
+        logger.info("health check errors %s",  str(errors))
+
 
     if len(errors) > 0:
         logger.info("reporting failure")