Commit: more writing

monteiro-renato committed Jan 23, 2025
1 parent 908cccd commit 1c9c80e
Showing 1 changed file with 72 additions and 51 deletions.
123 changes: 72 additions & 51 deletions rfcs/0004-k6-tests.md
@@ -45,7 +45,9 @@ a small group of people to write them all.

# Guide-level explanation
[guide-level-explanation]: #guide-level-explanation
> [!WARNING]
> Workflow is outdated but the general idea is still valid.

![General workflow](./img/k6-workflow-overview.png)

The expected workflow should be:
- Developers write down their k6 scripts.
Expand All @@ -62,74 +64,92 @@ potential notifications from AlertManager.
[reference-level-explanation]: #reference-level-explanation

Onboarding a new Team requires some setup beforehand.
First, it's assumed that teams already have a [Service Principal](https://docs.github.com/en/actions/security-for-github-actions/security-hardening-your-deployments/configuring-openid-connect-in-azure)
that they are using to authenticate towards Azure from GitHub.
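
As a rough illustration (not a prescribed workflow), the authentication part of such a workflow typically looks like the sketch below; the secret names, resource group and cluster name are placeholders.

```
jobs:
  k6-test:
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # required for the OIDC token exchange with Entra ID
      contents: read
    steps:
      - uses: actions/checkout@v4
      - name: Log in to Azure with the team's Service Principal (federated credentials)
        uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - name: Fetch AKS credentials (resource group and cluster name are placeholders)
        run: |
          az aks get-credentials --resource-group <k6-tests-rg> --name <k6-tests-cluster>
          kubelogin convert-kubeconfig -l azurecli   # reuse the Azure CLI token for Entra ID RBAC
```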

On our end, we need to create a namespace for the team, create the necessary Entra ID Groups, add the necessary members to the groups and create the k8s RoleBindings. See the [Azure docs](https://learn.microsoft.com/en-us/azure/aks/azure-ad-rbac) for an overview of the needed steps.
Ideally, this will be done in an automated way, but as of right now we have to do it manually.

There are 4 general groups defined:
- Cluster Admin: has the [`Azure Kubernetes Service Cluster Admin Role`](https://learn.microsoft.com/en-us/azure/role-based-access-control/built-in-roles/containers#azure-kubernetes-service-cluster-admin-role) to allow us to list the cluster admin credential whenever we need to do anything in the cluster. It's also required by whatever ServicePrincipal we decide to use to manage resources in the cluster, e.g. to create the namespaces per team, create the role bindings, deploy services we might want to provide in the cluster, etc. It grants [super-user access](https://kubernetes.io/docs/reference/access-authn-authz/rbac/#user-facing-roles). In the future we might want to use this Group/Role only in exceptional cases and instead create a CustomRole with the permissions we know we need and another CustomRole for the permissions needed by the Service Principal. And for us, we can use PIM when needed.

- Cluster User: has the [`Azure Kubernetes Service Cluster User Role`](https://learn.microsoft.com/en-us/azure/role-based-access-control/built-in-roles/containers#azure-kubernetes-service-cluster-user-role) to allow users (humans and service principals) to list cluster user credentials. This is what will allow them to interact with the cluster, via kubectl for example.

- ${TEAM_NAME} Users - The object ID will be used in the k8s RoleBinding to set the permissions that developers need to debug issues on the cluster - [example config](https://github.com/Altinn/altinn-platform/blob/d37e379417b1886f6d17816ba70bfae5ac664c32/infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_rbac.tf#L1-L31). This is a group per team (see the sketch after this list).

- ${TEAM_SP} Users - The object ID will be used in the k8s RoleBinding to set the permissions that service principals need to deploy k8s resources on the cluster - [example config](https://github.com/Altinn/altinn-platform/blob/d37e379417b1886f6d17816ba70bfae5ac664c32/infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_rbac.tf#L33-L63). This is a group per team. If a team has multiple repos, we can add the various SPs to the same group. (If they prefer to keep them separate, we can also follow the normal process as if it were a completely different team.)
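
To make the RoleBinding part concrete, here is a minimal, hypothetical sketch in the spirit of the linked example configs; the namespace, group object ID and the exact permissions are placeholders and should be taken from the Terraform examples above.

```
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-debug
  namespace: <team-namespace>
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "configmaps", "events"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-users-debug
  namespace: <team-namespace>
subjects:
  - kind: Group
    name: "<object ID of the ${TEAM_NAME} Users Entra ID group>"
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: team-debug
  apiGroup: rbac.authorization.k8s.io
```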

Once we've done our part, developers need to set up their own GitHub Workflow that references our GitHub Action and fill out a config file that we use to generate the deployment manifests. The current implementation uses a simple YAML file as the config format and Jsonnet to create the needed manifests.

```
test_run:
  name: k6-enduser-search
  vus: 50
  duration: 10m
  parallelism: 10
  file_path: "/path/to/where/test/files/are/located"
```
The GitHub Action is a [composite action](https://docs.github.com/en/actions/sharing-automations/creating-actions/creating-a-composite-action) that runs a docker image with the needed tools and scripts to set up the environment, generate the needed k6 and k8s resources, and start the test by deploying the manifests into the cluster.
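
A hypothetical caller workflow could look roughly like the following; the action path, ref and input names are made up for illustration and need to match whatever the composite action actually exposes.

```
name: k6 load test
on:
  workflow_dispatch:

jobs:
  run-k6:
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4
      # Azure/AKS authentication as in the earlier login sketch goes here.
      - name: Run k6 test on the cluster
        uses: Altinn/altinn-platform/actions/k6-test@main   # hypothetical action path
        with:
          config: ./tests/k6/config.yaml                    # hypothetical input name
```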

Some of the steps include [creating a .tar archive](https://grafana.com/docs/k6/latest/misc/archive/), creating a ConfigMap to hold the archive file, creating optional SealedSecrets with encrypted data, and generating the actual [TestRun Custom Resource](https://grafana.com/docs/k6/latest/testing-guides/running-distributed-tests/#4-create-a-custom-resource). The action also handles other details that are useful for integrating with other systems, such as adding labels, default test id values, etc.

An example of a TestRun config can be seen below.

```
apiVersion: k6.io/v1alpha1
kind: TestRun
metadata:
  name: k6-create-transmissions
  namespace: dialogporten
spec:
  arguments: --out experimental-prometheus-rw --vus=10 --duration=5m --tag testid=k6-create-transmissions_20250109T082811
  parallelism: 5
  script:
    configMap:
      name: k6-create-transmissions
      file: archive.tar
  runner:
    env:
      - name: K6_PROMETHEUS_RW_SERVER_URL
        value: "http://kube-prometheus-stack-prometheus.monitoring:9090/api/v1/write"
      - name: K6_PROMETHEUS_RW_TREND_STATS
        value: "avg,min,med,max,p(95),p(99),p(99.5),p(99.9),count"
    metadata:
      labels:
        k6-test: k6-create-transmissions
    resources:
      requests:
        memory: 200Mi
```

As the test is running, Grafana can be used to check the behavior in real time.

For developers that would like smoke tests to run after every commit to main, it's possible to use the [GitHub API](https://docs.github.com/en/rest/commits/statuses?apiVersion=2022-11-28) to report the result as a commit status. For those who want it, it's also possible to use [AlertManager](https://prometheus.io/docs/alerting/latest/configuration/#receiver) to send notifications to systems such as Slack.
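
For the AlertManager path, a minimal Slack receiver could look like the sketch below; the webhook URL, channel name and route are placeholders, and the alerting rules that fire on failed or degraded test runs still need to be defined.

```
route:
  receiver: k6-slack
receivers:
  - name: k6-slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ   # placeholder webhook
        channel: "#k6-test-results"
        send_resolved: true
```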

![Grafana K6 Prometheus Dashboard](https://grafana.com/api/dashboards/19665/images/14905/image)

## Infrastructure
The main infrastructure needed is a k8s cluster (for running the tests and other supporting services) and an Azure Monitor Workspace for storing the Prometheus metrics generated by the test runs.

Some of the main requirements for the cluster are: [the enablement of OIDC issuer and workload identity](https://learn.microsoft.com/en-us/azure/aks/workload-identity-deploy-cluster), which are needed, for example, to configure Prometheus to write metrics into the Azure Monitor Workspace; [Entra ID with Kubernetes RBAC](https://learn.microsoft.com/en-us/azure/aks/azure-ad-rbac?tabs=portal), so that we can define permissions per namespace and per user type/role; and the [deployment of multiple node pools](https://learn.microsoft.com/en-us/azure/aks/manage-node-pools) with different labels in order to be able to define [where specific workloads run](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/).

The number of node pools needed will vary depending on the use cases we end up supporting, and it's a relatively easy process to add and remove node pools from the cluster. (TODO: I had some issues a few years ago where a node pool scaled to zero would not scale up again when workloads needed to be scheduled. There are mitigations for this, but hopefully the nodes will scale up automatically now.) The process of adding the necessary config into the TestRun k8s manifest should be abstracted away from users to avoid silly misconfigurations.
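
As an illustration of that last point, the generated TestRun could pin runner pods to a dedicated node pool with something like the following under `spec.runner`; the label and taint values are placeholders, and this assumes the k6-operator exposes the usual Pod scheduling options (`nodeSelector`, `tolerations`) on the runner spec.

```
runner:
  nodeSelector:
    workload: k6-tests        # placeholder label set on the dedicated node pool
  tolerations:
    - key: workload
      operator: Equal
      value: k6-tests
      effect: NoSchedule      # matching taint on the node pool keeps other workloads off it
```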

The cluster should also be configured in a way that requires the least amount of maintenance possible, e.g. by [allowing automatic updates](https://learn.microsoft.com/en-us/azure/aks/auto-upgrade-cluster?tabs=azure-cli#cluster-auto-upgrade-channels).


To visualize the data stored in the Azure Monitor Workspace, we need to add an `azure_monitor_workspace_integrations` block to the centralized monitoring `azurerm_dashboard_grafana`. A new datasource will then be available in Grafana for data querying.

Azure also provides a few out-of-the-box dashboards that can be used to monitor the state of the cluster. We also import other OSS dashboards as needed, such as the [K6 operator dashboard](https://grafana.com/grafana/dashboards/19665-k6-prometheus/).

### Services
There are also a few services we need to maintain, mainly a Prometheus instance that is used as [the remote write target by the test pods](https://grafana.com/docs/k6/latest/results-output/real-time/prometheus-remote-write/) and that then [forwards the metrics to the Azure Monitor Workspace](https://learn.microsoft.com/en-us/azure/azure-monitor/containers/prometheus-remote-write-managed-identity). Currently, the config is quite simple. The Prometheus instance was deployed via kube-prometheus-stack's Helm Chart together with AlertManager. Prometheus needs to be configured to use Workload Identity in order for it to be able to push metrics to the Azure Monitor Workspace. The rest of the Prometheus configs tweaked so far were: the addition of externalLabels (likely not needed if we only use a single cluster); enableRemoteWriteReceiver, to support receiving metrics via Remote Write from the test pods; a low retention period, as the objective at the moment is only to keep the metrics long enough until they are remote-written to the Azure Monitor Workspace (this might need to be tweaked depending on how we end up using AlertManager); the configuration of the volumeClaimTemplate to select an appropriate disk type and size; and a remote write configuration block that points to the Azure Monitor Workspace. The k8s manifests also need some tweaks: mainly, the ServiceAccount needs a Workload Identity annotation and the Pod a Workload Identity label.
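
A rough sketch of the relevant kube-prometheus-stack values is shown below, assuming the standard value layout of that chart; the client ID, remote write URL, retention and storage values are placeholders, and the exact remote write authentication against the Azure Monitor Workspace follows the linked Microsoft docs.

```
prometheus:
  serviceAccount:
    annotations:
      azure.workload.identity/client-id: "<managed-identity-client-id>"   # placeholder
  prometheusSpec:
    podMetadata:
      labels:
        azure.workload.identity/use: "true"   # opts the pod into Workload Identity
    externalLabels:
      cluster: k6-tests                       # likely unnecessary with a single cluster
    enableRemoteWriteReceiver: true           # test pods remote write into this instance
    retention: 2h                             # keep metrics only until they reach the AMW
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: managed-csi       # placeholder disk type
          resources:
            requests:
              storage: 64Gi                   # placeholder size
    remoteWrite:
      - url: "<azure-monitor-workspace-metrics-ingestion-endpoint>"   # placeholder
```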

The other major service we need is the [k6-operator](https://grafana.com/docs/k6/latest/testing-guides/running-distributed-tests/) which is responsible for actually running the tests based on the TestRun manifests being applied to the cluster. The k6 operator is also deployed via a Helm Chart.

The last service is [Sealed Secrets](https://github.com/bitnami-labs/sealed-secrets), which can be used by developers that need to inject any sort of secrets into the cluster. Sealed Secrets allows Secrets to be encrypted locally and committed to a GitHub repo. Only the controller running in the cluster is able to decrypt the secrets.
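
A SealedSecret committed to a repo looks roughly like the example below; the name, namespace, key and ciphertext are placeholders, and the ciphertext is produced locally with the kubeseal CLI against the controller's public key.

```
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: k6-test-credentials        # placeholder
  namespace: <team-namespace>
spec:
  encryptedData:
    API_TOKEN: AgBy3i4OJSWK+...    # ciphertext produced by kubeseal, safe to commit
  template:
    metadata:
      name: k6-test-credentials    # the unsealed Secret created by the controller
```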

### Automation
There are some boilerplate configs that can be abstracted away from developers.

Some things will be relatively static, such as the configuration of remote writing, the archiving of the .js files, the proper creation and referencing of the ConfigMap or Volume holding the .js code, the testid tag value, etc. (see the example TestRun above). We should provide a way to generate all of these on the fly instead of having developers maintain them.

### Potential Use-Cases
The [Grafana K6 documentation](https://grafana.com/docs/k6/latest/testing-guides/automated-performance-testing/#model-the-scenarios-and-workload) has a lot of good information to get started.
@@ -167,5 +187,6 @@ TODO: Get an overview of what Dagfinn, Core? and other teams were doing previously.

- Simplify manifest generation. Most of the setup is boilerplate so we should be able to abstract most things.
- Add support to deploy the tests with a volume mount instead of a ConfigMap.
- Improve Dashboards experience, e.g. easy linking between resource usage (both for individual pods and for nodes), tracing, logs, etc.
- Slack and/or GitHub integration so teams receive feedback on their test runs.
- Store pod logs and integrate the log solution from Microsoft with Grafana.
