Clean up markup and alignment in the Troubleshooting section (#3255)
* chore: Fix markup and alignment in the Troubleshooting section

* chore: Fix code snippets in cluster-deployment
yuliiiah authored Jul 5, 2024
1 parent 3f6167b commit af234a2
Showing 6 changed files with 64 additions and 86 deletions.
34 changes: 18 additions & 16 deletions docs/docs-content/troubleshooting/cluster-deployment.md
@@ -17,8 +17,6 @@ The following steps will help you troubleshoot errors in the event issues arise
An instance is launched and terminated every 30 minutes before its deployment completes, and the **Events Tab**
lists errors with the following message:

```hideClipboard bash
Failed to update kubeadmControlPlane Connection timeout connecting to Kubernetes Endpoint
```
@@ -35,28 +33,30 @@ why a service may fail are:
user `spectro`. If you are initiating an SSH session into an installer instance, log in as user `ubuntu`.

```shell
ssh -i <_pathToYourSSHkey_> spectro@<_instanceIP_>
```

2. Elevate the user access.

```shell
sudo -i
```

3. Verify the Kubelet service is operational.

```shell
systemctl status kubelet.service
```
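If you only need a pass/fail signal from the status check above, `systemctl is-active` is easier to script than parsing `systemctl status` output. A minimal sketch, assuming the service name from the step above (the printed message is illustrative):

```shell
# Check whether the unit is active without parsing `systemctl status` output.
# `is-active --quiet` exits 0 only when the unit is active.
service="kubelet.service"
if systemctl is-active --quiet "$service" 2>/dev/null; then
  status_msg="running"
else
  status_msg="not running"
fi
echo "$service is $status_msg"
```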

4. If the Kubelet service does not work as expected, do the following. If the service operates correctly, you can skip
this step.

1. Navigate to the **/var/log/** folder.

```shell
cd /var/log/
```

2. Scan the **cloud-init-output** file for any errors. Take note of any errors and address them.

```shell
cat cloud-init-output.log
```
@@ -66,34 +66,36 @@ why a service may fail are:

- Export the kubeconfig file.

```shell
export KUBECONFIG=/etc/kubernetes/admin.conf
```

- Connect with the cluster's Kubernetes API.
```shell
kubectl get pods --all-namespaces
```
- When the connection is established, verify the pods are in a _Running_ state. Take note of any pods that are not in
_Running_ state.
```shell
kubectl get pods -o wide
```
- If all the pods are operating correctly, verify their connection with the Palette API.
- For clusters using Gateway, verify the connection between the Installer and Gateway instance:
```shell
curl -k https://<KUBE_API_SERVER_IP>:6443
```
- For Public Clouds that do not use Gateway, verify the connection between the public Internet and the Kube
endpoint:
```shell
curl -k https://<KUBE_API_SERVER_IP>:6443
```
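The manual `curl` checks above can be wrapped in a small script that reports reachability instead of dumping the raw response. A hedged sketch — `invalid.invalid` is a placeholder host and must be replaced with your `<KUBE_API_SERVER_IP>`:

```shell
# Probe an API endpoint the same way the manual curl check does, but capture
# the result for scripting. "invalid.invalid" is a placeholder hostname;
# substitute your cluster's <KUBE_API_SERVER_IP>.
endpoint="https://invalid.invalid:6443"
if curl -k --silent --max-time 5 "$endpoint" > /dev/null 2>&1; then
  reachable="yes"
else
  reachable="no"
fi
echo "endpoint reachable: $reachable"
```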
:::info
28 changes: 14 additions & 14 deletions docs/docs-content/troubleshooting/edge.md
@@ -42,23 +42,23 @@ adjust the values of related environment variables in the KubeVip DaemonSet with

2. Issue the following command:

```shell
kubectl edit ds kube-vip-ds --namespace kube-system
```

3. In the `env` of the KubeVip service, modify the environment variables to have the following corresponding values.

```yaml {4-9}
env:
  - name: vip_leaderelection
    value: "true"
  - name: vip_leaseduration
    value: "30"
  - name: vip_renewdeadline
    value: "20"
  - name: vip_retryperiod
    value: "4"
```
4. Within a minute, the old Pods in the unknown state are terminated, and new Pods come up with the updated values.
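As a quick sanity check on values like those above, leader-election timings are generally expected to satisfy leaseDuration > renewDeadline > retryPeriod — an assumption based on client-go's leader-election validation, not a rule stated in these docs. A minimal sketch:

```shell
# Sanity-check the leader-election timing values set in the DaemonSet above.
# The ordering rule leaseDuration > renewDeadline > retryPeriod is an
# assumption drawn from client-go's leader-election requirements.
lease=30; renew=20; retry=4
if [ "$lease" -gt "$renew" ] && [ "$renew" -gt "$retry" ]; then
  valid="yes"
else
  valid="no"
fi
echo "timing values valid: $valid"
```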
72 changes: 31 additions & 41 deletions docs/docs-content/troubleshooting/nodes.md
@@ -48,8 +48,6 @@ resulted in a node repave. The API payload is incomplete for brevity.

For detailed information, review the cluster upgrades [page](../clusters/clusters.md).

## Clusters

## Scenario - vSphere Cluster and Stale ARP Table
@@ -64,8 +62,6 @@ This is done automatically without any user action.
You can verify the cleaning process by issuing the following command on non-VIP nodes and observing that the ARP cache
is never older than 300 seconds.

```shell
watch ip -statistics neighbour
```
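If you prefer a one-shot check over watching interactively, the `used` field of each neighbour entry can be filtered for ages above 300 seconds. A sketch against sample output — the exact field layout of `ip -statistics neighbour` varies by kernel and iproute2 version, so treat the parsing as an assumption:

```shell
# Sample `ip -statistics neighbour` output (illustrative, simplified layout).
sample='10.0.0.1 dev eth0 lladdr aa:bb:cc:dd:ee:01 used 120/115/110 probes 1 REACHABLE
10.0.0.2 dev eth0 lladdr aa:bb:cc:dd:ee:02 used 450/445/440 probes 4 STALE'

# Print neighbours whose first "used" age (seconds) exceeds 300.
stale=$(printf '%s\n' "$sample" | awk '{ split($7, t, "/"); if (t[1] > 300) print $1 }')
echo "stale entries: $stale"
```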
@@ -77,8 +73,6 @@ Amazon EKS
[Runbook](https://docs.aws.amazon.com/systems-manager-automation-runbooks/latest/userguide/automation-awssupport-troubleshooteksworkernode.html)
for troubleshooting guidance.

## Palette Agents Workload Payload Size Issue

A cluster composed of many nodes can create a situation where the workload report data the agent sends to Palette
@@ -89,8 +83,6 @@ If you encounter this scenario, you can configure the cluster to stop sending workload reports to Palette. To disable
the workload report feature, create a _configMap_ with the following configuration. Use a cluster profile manifest layer
to create the configMap.

```yaml
apiVersion: v1
kind: ConfigMap
```

@@ -101,8 +93,6 @@ data:

```yaml
  feature.workloads: disable
```

## OS Patch Fails

When conducting [OS Patching](../clusters/cluster-management/os-patching.md), sometimes the patching process can time
@@ -128,39 +118,39 @@ To resolve this issue, use the following steps:

7. SSH into one of the cluster nodes and issue the following command.

```shell
rm /var/cache/debconf/config.dat && \
dpkg --configure -a
```

8. A prompt may appear asking you to select the boot device. Select the appropriate boot device and press **Enter**.

:::tip

If you are unsure of the boot device, use a disk utility such as `lsblk` or `fdisk` to identify the boot device.
Below is an example of using `lsblk` to identify the boot device. The output is abbreviated for brevity.

```shell
lsblk --output NAME,TYPE,MOUNTPOINT
```

```shell {10} hideClipboard
NAME TYPE MOUNTPOINT
fd0 disk
loop0 loop /snap/core20/1974
...
loop10 loop /snap/snapd/20092
loop11 loop /snap/snapd/20290
sda disk
├─sda1 part /
├─sda14 part
└─sda15 part /boot/efi
sr0 rom
```

The highlighted line displays the boot device. In this example, the boot device is `sda15`, mounted at `/boot/efi`.
The boot device may be different for your node.

:::

9. Repeat the previous step for all nodes in the cluster.
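The boot-device lookup from the tip in step 8 can also be scripted by filtering `lsblk` output on the mountpoint column. A sketch against sample output (the sample omits `lsblk`'s tree-drawing characters for simplicity):

```shell
# Sample `lsblk --output NAME,TYPE,MOUNTPOINT` output (illustrative).
sample='NAME    TYPE MOUNTPOINT
sda     disk
sda1    part /
sda15   part /boot/efi
sr0     rom'

# Select the device whose mountpoint is /boot/efi.
boot_dev=$(printf '%s\n' "$sample" | awk '$3 == "/boot/efi" { print $1 }')
echo "boot device: $boot_dev"
```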
8 changes: 0 additions & 8 deletions docs/docs-content/troubleshooting/palette-dev-engine.md
@@ -12,8 +12,6 @@ tags: ["troubleshooting", "pde", "app mode"]

Use the following content to help you troubleshoot issues you may encounter when using Palette Dev Engine (PDE).

## Resource Requests

All [Cluster Groups](../clusters/cluster-groups/cluster-groups.md) are configured with a default
@@ -25,18 +23,12 @@ to let the system manage the resources.
If you specify `requests` but not `limits`, the default limits imposed by the LimitRange will likely be lower than the
requests, causing the following error.

```shell hideClipboard
Invalid value: "300m": must be less than or equal to CPU limit spec.containers[0].resources.requests: Invalid value: "512Mi": must be less than or equal to memory limit
```

The workaround is to define both the `requests` and `limits`.
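For example, a container spec that defines both fields might look like the following sketch (the names, image, and values are illustrative, not Palette defaults):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app # hypothetical name
spec:
  containers:
    - name: app
      image: nginx:1.25 # illustrative image
      resources:
        requests:
          cpu: "300m"
          memory: "512Mi"
        limits: # limits at or above the requests avoid the LimitRange error
          cpu: "500m"
          memory: "1Gi"
```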

## Scenario - Controller Manager Pod Not Upgraded

If the `palette-controller-manager` pod for a virtual cluster is not upgraded after a Palette platform upgrade, use the
2 changes: 1 addition & 1 deletion docs/docs-content/troubleshooting/pcg.md
@@ -90,7 +90,7 @@ unavailable IP addresses for the worker nodes, or the inability to perform a Net
9. If the problem persists, download the cluster logs from Palette. The screenshot below will help you locate the button
to download logs from the cluster details page.

![A screenshot highlighting how to download the cluster logs from Palette.](/troubleshooting-pcg-download_logs.webp)

10. Share the logs with our support team at [[email protected]](mailto:[email protected]).

6 changes: 0 additions & 6 deletions docs/docs-content/troubleshooting/troubleshooting.md
@@ -11,8 +11,6 @@ tags: ["troubleshooting"]
Use the following troubleshooting resources to help you address issues that may arise. You can also reach out to our
support team by opening up a ticket through our [support page](http://support.spectrocloud.io/).

- [Cluster Deployment](cluster-deployment.md)

- [Edge](edge.md)
@@ -53,8 +51,6 @@ Follow the link for more details: [Download Cluster Logs](../clusters/clusters.md)
Spectro Cloud maintains an event stream with low-level details of the various orchestration tasks being performed. This
event stream is a good source for identifying issues when an operation fails to complete for a long time.

:::warning

Due to Spectro Cloud’s reconciliation logic, intermittent errors show up in the event stream. As an example, after
@@ -83,5 +79,3 @@ made to perform the task. Failed conditions are a great source of troubleshooting
For example, a failure to create a virtual machine in AWS because the vCPU limit was exceeded would cause this error to
be shown to end users. They could choose to bring down some workloads in the AWS cloud to free up capacity. The next
time a VM creation task is attempted, it would succeed, and the condition would be marked as a success.
