
Commit

Merge pull request #5452 from ministryofjustice/update-runbook-for-cluster-upgrade

Update Cluster Upgrade Runbook
timckt authored Apr 10, 2024
2 parents a3ae838 + 09ca944 commit 30eb517
Showing 4 changed files with 203 additions and 11 deletions.
13 changes: 8 additions & 5 deletions runbooks/source/creating-a-live-like.html.md.erb
@@ -1,7 +1,7 @@
---
title: Creating a live-like Cluster
weight: 350
last_reviewed_on: 2024-01-26
last_reviewed_on: 2024-04-10
review_in: 6 months
---

@@ -16,8 +16,8 @@ to the configuration similar to the live cluster.

## Setting cluster size to match Live

1. Set the node group desired size to 48 (check the live cluster for up-to-date number) in the AWS console under Compute
2. Set the node_groups_count to same as live cluster (64) and default_ng_min_count to 48 in [terraform/aws-accounts/cloud-platform-aws/vpc/eks/cluster.tf]
1. Set the node group desired size to 60 (check the live cluster for up-to-date number) in the AWS console under Compute
2. Set the node_groups_count to same as live cluster (60) and default_ng_min_count to 60 in [terraform/aws-accounts/cloud-platform-aws/vpc/eks/cluster.tf]
3. Copy the node_size values from live to default, currently `["r6i.2xlarge", "r6i.xlarge", "r5.2xlarge"]`
4. Copy the monitoring_node_size values from live to default, currently `["r6i.8xlarge", "r5a.2xlarge"]`
5. Ensure that your Terraform workspace matches your cluster name
@@ -61,14 +61,17 @@ See documentation for upgrading a [cluster](upgrade-eks-cluster.html).
* `watch -n 1 "kubectl get nodes --sort-by=\".metadata.creationTimestamp\""` - get all nodes and sort by create timestamp

* Useful third party tools
* [k9s](https://k9scli.io/)
* [K9s](https://k9scli.io/)
* [Stern](https://github.com/stern/stern)

You may refer to the [Monitor EKS Cluster](/monitor-eks-cluster.html) section for more details.

## Final Tests

1. Run `make run-tests` from the root cloud-platform repository
1. Run `make run-tests` from the root cloud-platform-infrastructure repository
2. Update `cluster.tf` `cluster_version` to match version upgraded to
3. Run `terraform plan` to ensure there are no unexpected changes
4. Go to `component` layer and scale up and down the `starter_pack` module to ensure `terraform apply` can run smoothly
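
As a rough sketch of step 3, `terraform plan` accepts a `-detailed-exitcode` flag whose exit status distinguishes an empty plan from one with pending changes. The helper name `plan_status` below is hypothetical (it is not part of the repository) and only translates that exit status into a message.

```shell
# Hypothetical helper: translate the exit status of
# `terraform plan -detailed-exitcode` into a readable message.
#   0 = plan succeeded with no changes
#   1 = plan failed
#   2 = plan succeeded and changes are pending
plan_status() {
  case "$1" in
    0) echo "no changes" ;;
    1) echo "plan failed" ;;
    2) echo "changes pending" ;;
    *) echo "unexpected exit status: $1" ;;
  esac
}

# Example usage (run from the directory containing cluster.tf):
#   terraform plan -detailed-exitcode; plan_status "$?"
```

A result of "changes pending" is the case worth reviewing before moving on to the next step.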

## Tearing down

184 changes: 184 additions & 0 deletions runbooks/source/monitor-eks-cluster.html.md.erb
@@ -0,0 +1,184 @@
---
title: Monitor EKS Cluster
weight: 70
last_reviewed_on: 2024-04-10
review_in: 6 months
---

# Monitor EKS Cluster

## Monitoring with K9s
[K9s](https://k9scli.io/) provides a powerful terminal UI to interact with your Kubernetes clusters, allowing you to monitor and manage your resources efficiently. This section covers how to monitor nodes, pods, and events, and how to use filters to narrow your view to specific namespaces or to pods in a specific status.

### Installation
Before you begin, ensure that K9s is installed on your machine. If not, please follow the official [K9s installation instructions](https://k9scli.io/topics/install/).

### Launching K9s
To start K9s, open your terminal and type `k9s`.

This command launches the K9s interface, displaying your default namespace's pods.

### Monitoring Nodes
To view and monitor nodes:

Press `:` to activate command mode, type `nodes`, and press Enter.

Here, you can see a list of your cluster's nodes along with their status, CPU, memory usage, version, pods and age.

#### Sorting for nodes
K9s allows you to sort resources based on different metrics, providing flexibility in how you view your cluster's data.
This can be particularly useful when troubleshooting or during a cluster upgrade, when you need to quickly identify which nodes are under the most strain or which are the newest or oldest.

```
<shift-a> Sort Age
<shift-c> Sort CPU
<shift-m> Sort Memory
<shift-n> Sort Name
<shift-o> Sort Pods
<shift-r> Sort Role
```

By default, most metrics sort in descending order; age is the exception. You can toggle the sort order by pressing the same shortcut again. For example:

Press `shift-c` to sort nodes by CPU usage in descending order.

Press `shift-c` again to toggle to ascending order.

Sorting by age (`shift-a`) is different: it defaults to ascending order, so the newest nodes appear first. To see the oldest nodes first, press `shift-a` again to switch to descending order.

During an EKS cluster upgrade, it is recommended to sort nodes by age in ascending order, which allows you to:

- Identify Newly Created Nodes: Quickly determine which nodes are the newest additions to your cluster. This is especially useful to verify that nodes are being successfully created as part of the upgrade process.
- Monitor Node Replacement: Ensure that older nodes are being decommissioned as expected.
- Troubleshoot Issues: Identify and troubleshoot any anomalies with node creation times, such as unexpected delays or nodes not being created as planned.
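
Outside K9s, a similar check can be approximated with `kubectl`. The sketch below is an assumption-laden illustration rather than part of the runbook tooling: it assumes the default column layout of `kubectl get nodes --no-headers` (NAME, STATUS, ROLES, AGE, VERSION), and `count_nodes_by_version` is a hypothetical helper name.

```shell
# Hypothetical helper: read `kubectl get nodes --no-headers` output on stdin
# and print how many nodes report each kubelet version (5th column).
count_nodes_by_version() {
  awk '{ count[$5]++ } END { for (v in count) print count[v], v }' | sort -k2
}

# Example usage against a live cluster:
#   kubectl get nodes --no-headers | count_nodes_by_version
# To list nodes ordered by creation time instead:
#   kubectl get nodes --sort-by=.metadata.creationTimestamp
```

During an upgrade, the count for the old version should fall while the count for the new version rises.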

### Monitoring Pods

To view and monitor pods:

Press `:` to activate command mode, type `pods`, and press Enter.

Here, you can see a list of pods along with their namespace, name, status, IP, node and age.

Press `0` to monitor all pods across all namespaces.

#### Filtering Pods

Filter pods by specific namespace:

With the pods view open, press `/` to start a filter.
Type the namespace name and press Enter.
Only pods within the specified namespace will be displayed.

You can also monitor two or more namespaces at the same time by adding `|` in the filter, like `namespace-1|namespace-2` to view pods in both namespaces.

Filter pods by status:

With the pods view open, press `/` to start a filter.
Type `error` and press Enter to filter pods by error status.
You can also filter pods by two or more statuses at the same time by adding `|` in the filter, like `error|fail` to view pods in either of those statuses.

During an EKS cluster upgrade, it is recommended to filter pods by status `ContainerStatusUnknown|error|fail` to get all pods in an abnormal state.
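
For a non-interactive approximation of the same filter, the output of `kubectl` can be piped through `awk`. This is only a sketch: it assumes the default column layout of `kubectl get pods --all-namespaces --no-headers` (NAMESPACE, NAME, READY, STATUS, ...), and `filter_unhealthy_pods` is a hypothetical helper name. Unlike the K9s filter, it matches the STATUS column only, not the whole row.

```shell
# Hypothetical helper: read `kubectl get pods --all-namespaces --no-headers`
# output on stdin and print "<namespace>/<pod> <status>" for pods whose
# STATUS (4th column) matches the pattern, ignoring case.
filter_unhealthy_pods() {
  awk 'tolower($4) ~ /containerstatusunknown|error|fail/ { print $1 "/" $2, $4 }'
}

# Example usage against a live cluster:
#   kubectl get pods --all-namespaces --no-headers | filter_unhealthy_pods
```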

#### Sorting for Pods
The sorting concept for pods is similar to sorting for nodes. You may refer to [Sorting for nodes](#sorting-for-nodes) for more details.

```
<shift-a> Sort Age
<shift-c> Sort CPU
<ctrl-x> Sort CPU/L
<shift-x> Sort CPU/R
<shift-i> Sort IP
<shift-m> Sort MEM
<ctrl-q> Sort MEM/L
<shift-z> Sort MEM/R
<shift-n> Sort Name
<shift-p> Sort Namespace
<shift-o> Sort Node
<shift-r> Sort Ready
<shift-t> Sort Restart
<shift-s> Sort Status
```

### Monitoring Events

To view and monitor events:

Press `:` to activate the command mode and type `events` and press Enter.

Here, you can see a list of events along with their namespace, last seen, type, reason, object and count.

Press `0` to monitor all events across all namespaces.

Press `1` to monitor events in the default view, which is useful during a cluster upgrade.

#### Filtering Events
Filtering Events by Namespace:

With the events view open, press `/`.
Enter the namespace name and press Enter.
Only events related to the specified namespace will be shown.

#### Sorting for Events
The sorting concept for events is similar to sorting for nodes. You may refer to [Sorting for nodes](#sorting-for-nodes) for more details.

```
<shift-a> Sort Age
<shift-c> Sort Count
<shift-f> Sort FirstSeen
<shift-l> Sort LastSeen
<shift-n> Sort Name
<shift-p> Sort Namespace
<shift-r> Sort Reason
<shift-s> Sort Source
<shift-t> Sort Type
```

During an EKS cluster upgrade, it is recommended to sort events in the default view by last seen in ascending order. This sorting method enhances your understanding by providing a chronological sequence of events.
It ensures that you can easily track the progression of the upgrade and promptly identify any recent issues that may arise.

### Further reading
For more details, you may refer to the built-in help by pressing `?` within K9s, or to the pages below.

- [K9s Commands](https://k9scli.io/topics/commands/)
- [K9s Configuration](https://k9scli.io/topics/config/)

## Monitoring with Stern

[Stern](https://github.com/stern/stern) allows you to tail multiple pods on Kubernetes and multiple containers within the pod. Each result is color coded for quicker debugging.

Stern simplifies the process of monitoring logs from multiple pods within Kubernetes. It aggregates logs from various sources, allowing for real-time monitoring and troubleshooting.

### Basic Usage
To start using Stern, open your terminal and point it to your Kubernetes cluster by setting the correct context with `kubectl`.

To tail logs from all pods in a specific namespace, you may run

```
stern -n <namespace>
```

#### Tailing Logs from Specific Pods
To tail logs from specific pods in a specific namespace, you may run

```
stern -n <namespace> <pod-name>
```

Stern is particularly useful during the update process of EKS add-ons, offering visibility into how changes affect pod operations.

```
stern --namespace kube-system <coredns aws-node kube-proxy>
```
Stern may sometimes reach the maximum number of concurrent log requests, which is 50 by default. You can use the flag `--max-log-requests <number>` to raise the limit, for example:

```
stern -n kube-system kube-proxy --max-log-requests 500
```

### Further reading
For more details, you may refer to the pages below.

- [Stern Doc](https://github.com/stern/stern?tab=readme-ov-file#usage)
- [Tail Kubernetes with Stern](https://kubernetes.io/blog/2016/10/tail-kubernetes-with-stern/)
7 changes: 4 additions & 3 deletions runbooks/source/recycle-all-nodes.html.md.erb
@@ -1,7 +1,7 @@
---
title: Recycling all the nodes in a cluster
weight: 255
last_reviewed_on: 2024-03-20
last_reviewed_on: 2024-04-10
review_in: 6 months
---

@@ -55,7 +55,7 @@ To resolve the issue:

delete_pods() {
NAMESPACE=$(echo "$1" | sed -E 's/\/api\/v1\/namespaces\/(.*)\/pods\/.*/\1/')
POD=$(echo "$1" | sed -E 's/.*\/pods\/(.*)\/eviction/\1/')
POD=$(echo "$1" | sed -E 's/.*\/pods\/(.*)\/eviction\?timeout=.*/\1/')

echo $NAMESPACE

@@ -110,7 +110,8 @@ If you want to find the offending pod manually, follow these steps:
```
4. If there are results they will have a pattern like this:
`/api/v1/namespaces/$NAMESPACE/pods/$POD_NAME-$POD_ID/eviction?timeout=19s`
5. You can then run the following command to manually delete the pod
5. You may also go to the [CloudWatch Dashboard](https://eu-west-2.console.aws.amazon.com/cloudwatch/home?region=eu-west-2#dashboards/dashboard/cloud-platform-eks-live-pdb-eviction-status) directly to identify the offending pod.
6. You can then run the following command to manually delete the pod
`kubectl delete pod -n $NAMESPACE $POD_NAME-$POD_ID`
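
To illustrate how the namespace and pod name can be pulled out of such a path, here is a minimal sketch. It uses regular expressions similar in spirit to those in the script above, but it is not the script itself: `parse_eviction_path` is a hypothetical helper, and the sample values in the usage comment are made up.

```shell
# Hypothetical helper: print "<namespace> <pod>" for an eviction request
# path of the form /api/v1/namespaces/<ns>/pods/<pod>/eviction?timeout=<t>
parse_eviction_path() {
  namespace=$(printf '%s\n' "$1" | sed -E 's|^/api/v1/namespaces/([^/]+)/pods/.*|\1|')
  pod=$(printf '%s\n' "$1" | sed -E 's|.*/pods/([^/]+)/eviction\?timeout=.*|\1|')
  printf '%s %s\n' "$namespace" "$pod"
}

# Example usage with a made-up path:
#   parse_eviction_path '/api/v1/namespaces/my-ns/pods/my-app-7d9f8/eviction?timeout=19s'
```

The two values printed correspond to `$NAMESPACE` and `$POD_NAME-$POD_ID` in the `kubectl delete pod` command above.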

Nodes should continue to recycle and after a few moments there should be one less node with the status "Ready,SchedulingDisabled"
10 changes: 7 additions & 3 deletions runbooks/source/upgrade-eks-cluster.html.md.erb
@@ -1,8 +1,8 @@
---
title: Upgrade EKS cluster
weight: 53
last_reviewed_on: 2024-01-24
review_in: 3 months
last_reviewed_on: 2024-04-10
review_in: 6 months
---

# Upgrade EKS cluster
@@ -79,15 +79,19 @@ Run a `tf plan` against the cluster your upgrading to check to see if everything

Before you start the upgrade it is useful to have a few monitoring resources up and running so you can catch any issues quickly.

[k9s](https://k9scli.io/) is a useful tool to have open in a few terminal windows, the following views are helpful:
[K9s](https://k9scli.io/) is a useful tool to have open in a few terminal windows, the following views are helpful:

* nodes - see nodes recycling and coming up with new version
* events - check to see if there are any errors
* pods - you can use vim style searching to see pods in `Error` state.

You may refer to the [Monitoring with K9s](/monitor-eks-cluster.html#monitoring-with-k9s) section for more details.

When a node group version changes, this will cause all of the nodes to recycle. When AWS recycles the nodes, it will not evict pods if it will break the PDB.
This will cause the node to stall the update and the nodes will **not** continue to recycle.

[This](https://eu-west-2.console.aws.amazon.com/cloudwatch/home?region=eu-west-2#dashboards/dashboard/cloud-platform-eks-live-pdb-eviction-status) CloudWatch Dashboard is used to monitor the pod eviction status for the live cluster.

To rectify this, run the script mentioned in [Recycle-all-nodes Gotchas](/recycle-all-nodes.html#gotchas) section.

[This](https://kibana.cloud-platform.service.justice.gov.uk/_plugin/kibana/app/discover#/?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-15d,to:now))&_a=(columns:!(_source),filters:!(),index:'1f29f240-00eb-11ec-8a38-954e9fb3b0ba',interval:auto,query:(language:kuery,query:'%22failed%20to%20assign%20an%20IP%20address%20to%20container%22'),sort:!())) Kibana dashboard is used to monitor the IP assignment for pods when they are rescheduled. If there is a spike in errors, then there could be a starvation of IP addresses while scheduling pods.
