diff --git a/docs/docs-content/troubleshooting/cluster-deployment.md b/docs/docs-content/troubleshooting/cluster-deployment.md index 30221a7eeb..5ee1cfe18a 100644 --- a/docs/docs-content/troubleshooting/cluster-deployment.md +++ b/docs/docs-content/troubleshooting/cluster-deployment.md @@ -17,8 +17,6 @@ The following steps will help you troubleshoot errors in the event issues arise An instance is launched and terminated every 30 minutes prior to completion of its deployment, and the **Events Tab** lists errors with the following message: -
- ```hideClipboard bash Failed to update kubeadmControlPlane Connection timeout connecting to Kubernetes Endpoint ``` @@ -35,28 +33,30 @@ why a service may fail are: user `spectro`. If you are initiating an SSH session into an installer instance, log in as user `ubuntu`. ```shell - ssh --identity_file <_pathToYourSSHkey_> spectro@X.X.X.X + ssh --identity_file <_pathToYourSSHkey_> spectro@X.X.X.X ``` 2. Elevate the user access. ```shell - sudo -i + sudo -i ``` 3. Verify the Kubelet service is operational. ```shell - systemctl status kubelet.service + systemctl status kubelet.service ``` 4. If the Kubelet service does not work as expected, do the following. If the service operates correctly, you can skip this step. 1. Navigate to the **/var/log/** folder. + ```shell cd /var/log/ ``` + 2. Scan the **cloud-init-output** file for any errors. Take note of any errors and address them. ``` cat cloud-init-output.log @@ -66,34 +66,36 @@ why a service may fail are: - Export the kubeconfig file. - ```shell - export KUBECONFIG=/etc/kubernetes/admin.conf - ``` + ```shell + export KUBECONFIG=/etc/kubernetes/admin.conf + ``` - Connect with the cluster's Kubernetes API. - ```shell - kubectl get pods --all-namespaces - ``` + ```shell + kubectl get pods --all-namespaces + ``` - When the connection is established, verify the pods are in a _Running_ state. Take note of any pods that are not in _Running_ state. - ```shell - kubectl get pods -o wide - ``` + ```shell + kubectl get pods -o wide + ``` - If all the pods are operating correctly, verify their connection with the Palette API. - For clusters using Gateway, verify the connection between the Installer and Gateway instance: + ```shell - curl -k https://:6443 + curl -k https://:6443 ``` + - For Public Clouds that do not use Gateway, verify the connection between the public Internet and the Kube endpoint: ```shell - curl -k https://:6443 + curl -k https://:6443 ``` :::info diff --git a/docs/docs-content/troubleshooting/edge.md b/docs/docs-content/troubleshooting/edge.md index ce04332b4c..2a63aac898 100644 --- a/docs/docs-content/troubleshooting/edge.md +++ b/docs/docs-content/troubleshooting/edge.md @@ -42,23 +42,23 @@ adjust the values of related environment variables in the KubeVip DaemonSet with 2. Issue the following command: -```shell -kubectl edit ds kube-vip-ds --namespace kube-system -``` + ```shell + kubectl edit ds kube-vip-ds --namespace kube-system + ``` 3. In the `env` of the KubeVip service, modify the environment variables to have the following corresponding values. -```yaml {4-9} -env: - - name: vip_leaderelection - value: "true" - - name: vip_leaseduration - value: "30" - - name: vip_renewdeadline - value: "20" - - name: vip_retryperiod - value: "4" -``` + ```yaml {4-9} + env: + - name: vip_leaderelection + value: "true" + - name: vip_leaseduration + value: "30" + - name: vip_renewdeadline + value: "20" + - name: vip_retryperiod + value: "4" + ``` 4. Within a minute, the old Pods in unknown state will be terminated and Pods will come up with the updated values. diff --git a/docs/docs-content/troubleshooting/nodes.md b/docs/docs-content/troubleshooting/nodes.md index e076b657fc..b2675f9d8b 100644 --- a/docs/docs-content/troubleshooting/nodes.md +++ b/docs/docs-content/troubleshooting/nodes.md @@ -48,8 +48,6 @@ resulted in a node repave. The API payload is incomplete for brevity. For detailed information, review the cluster upgrades [page](../clusters/clusters.md). -
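If you want to observe a repave while it is underway, you can watch the nodes in the workload cluster being cordoned, drained, and replaced (a minimal sketch, assuming you have a kubeconfig for the affected cluster):

```shell
# Watch node status while the repave replaces nodes one at a time.
# A node being drained reports SchedulingDisabled before its replacement joins.
kubectl get nodes --watch
```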
- ## Clusters ## Scenario - vSphere Cluster and Stale ARP Table @@ -64,8 +62,6 @@ This is done automatically without any user action. You can verify the cleaning process by issuing the following command on non-VIP nodes and observing that the ARP cache is never older than 300 seconds. -
- ```shell watch ip -statistics neighbour ``` @@ -77,8 +73,6 @@ Amazon EKS [Runbook](https://docs.aws.amazon.com/systems-manager-automation-runbooks/latest/userguide/automation-awssupport-troubleshooteksworkernode.html) for troubleshooting guidance. -
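Before working through the runbook, a quick first check is to list the worker nodes and review the conditions reported for any node that is not in a **Ready** state (a minimal sketch; `<node-name>` is a placeholder for a node reported by the first command):

```shell
# List all nodes and their current status.
kubectl get nodes --output wide

# Review the conditions and recent events for a node that is not Ready.
kubectl describe node <node-name>
```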
- ## Palette Agents Workload Payload Size Issue A cluster comprised of many nodes can create a situation where the workload report data the agent sends to Palette @@ -89,8 +83,6 @@ If you encounter this scenario, you can configure the cluster to stop sending wo the workload report feature, create a _configMap_ with the following configuration. Use a cluster profile manifest layer to create the configMap. -
- ```shell apiVersion: v1 kind: ConfigMap @@ -101,8 +93,6 @@ data: feature.workloads: disable ``` -
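After the manifest layer is applied, you can optionally confirm that the ConfigMap exists and carries the expected key (a minimal sketch; substitute the name and namespace you used in your manifest):

```shell
# Confirm the ConfigMap was created and inspect its data.
kubectl get configmap <configmap-name> --namespace <namespace> --output yaml
```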
- ## OS Patch Fails When conducting [OS Patching](../clusters/cluster-management/os-patching.md), sometimes the patching process can time @@ -128,39 +118,39 @@ To resolve this issue, use the following steps: 7. SSH into one of the cluster nodes and issue the following command. -```shell -rm /var/cache/debconf/config.dat && \ -dpkg --configure -a -``` + ```shell + rm /var/cache/debconf/config.dat && \ + dpkg --configure -a + ``` 8. A prompt may appear asking you to select the boot device. Select the appropriate boot device and press **Enter**. -:::tip - -If you are unsure of the boot device, use a disk utility such as `lsblk` or `fdisk` to identify the boot device. Below -is an example of using `lsblk` to identify the boot device. The output is abbreviated for brevity. - -```shell -lsblk --output NAME,TYPE,MOUNTPOINT -``` - -```shell {10} hideClipboard -NAME TYPE MOUNTPOINT -fd0 disk -loop0 loop /snap/core20/1974 -... -loop10 loop /snap/snapd/20092 -loop11 loop /snap/snapd/20290 -sda disk -├─sda1 part / -├─sda14 part -└─sda15 part /boot/efi -sr0 rom -``` - -The highlighted line displays the boot device. In this example, the boot device is `sda15`, mounted at `/boot/efi`. The -boot device may be different for your node. - -::: + :::tip + + If you are unsure of the boot device, use a disk utility such as `lsblk` or `fdisk` to identify the boot device. + Below is an example of using `lsblk` to identify the boot device. The output is abbreviated for brevity. + + ```shell + lsblk --output NAME,TYPE,MOUNTPOINT + ``` + + ```shell {10} hideClipboard + NAME TYPE MOUNTPOINT + fd0 disk + loop0 loop /snap/core20/1974 + ... + loop10 loop /snap/snapd/20092 + loop11 loop /snap/snapd/20290 + sda disk + ├─sda1 part / + ├─sda14 part + └─sda15 part /boot/efi + sr0 rom + ``` + + The highlighted line displays the boot device. In this example, the boot device is `sda15`, mounted at `/boot/efi`. + The boot device may be different for your node. + + ::: 9. Repeat the previous step for all nodes in the cluster. diff --git a/docs/docs-content/troubleshooting/palette-dev-engine.md b/docs/docs-content/troubleshooting/palette-dev-engine.md index 04ccbf1e1b..124972afb8 100644 --- a/docs/docs-content/troubleshooting/palette-dev-engine.md +++ b/docs/docs-content/troubleshooting/palette-dev-engine.md @@ -12,8 +12,6 @@ tags: ["troubleshooting", "pde", "app mode"] Use the following content to help you troubleshoot issues you may encounter when using Palette Dev Engine (PDE). -
- ## Resource Requests All [Cluster Groups](../clusters/cluster-groups/cluster-groups.md) are configured with a default @@ -25,18 +23,12 @@ to let the system manage the resources. If you specify `requests` but not `limits`, the default limits imposed by the LimitRange will likely be lower than the requests, causing the following error. -
- ```shell hideClipboard Invalid value: "300m": must be less than or equal to CPU limit spec.containers[0].resources.requests: Invalid value: "512Mi": must be less than or equal to memory limit ``` -
- The workaround is to define both the `requests` and `limits`. -
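For example, pairing each request with a limit that is at least as large avoids the error above (a minimal sketch; the limit values are illustrative, and where the `resources` stanza lives depends on your manifest):

```yaml
resources:
  requests:
    cpu: "300m"
    memory: "512Mi"
  limits:
    cpu: "500m"
    memory: "1Gi"
```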
- ## Scenario - Controller Manager Pod Not Upgraded If the `palette-controller-manager` pod for a virtual cluster is not upgraded after a Palette platform upgrade, use the diff --git a/docs/docs-content/troubleshooting/pcg.md b/docs/docs-content/troubleshooting/pcg.md index f4dab562a0..08741110c4 100644 --- a/docs/docs-content/troubleshooting/pcg.md +++ b/docs/docs-content/troubleshooting/pcg.md @@ -90,7 +90,7 @@ unavailable IP addresses for the worker nodes, or the inability to perform a Net 9. If the problem persists, download the cluster logs from Palette. The screenshot below will help you locate the button to download logs from the cluster details page. -![A screenshot highlighting how to download the cluster logs from Palette.](/troubleshooting-pcg-download_logs.webp) + ![A screenshot highlighting how to download the cluster logs from Palette.](/troubleshooting-pcg-download_logs.webp) 10. Share the logs with our support team at [support@spectrocloud.com](mailto:support@spectrocloud.com). diff --git a/docs/docs-content/troubleshooting/troubleshooting.md b/docs/docs-content/troubleshooting/troubleshooting.md index 0b4fc030f2..fe13c40c8f 100644 --- a/docs/docs-content/troubleshooting/troubleshooting.md +++ b/docs/docs-content/troubleshooting/troubleshooting.md @@ -11,8 +11,6 @@ tags: ["troubleshooting"] Use the following troubleshooting resources to help you address issues that may arise. You can also reach out to our support team by opening up a ticket through our [support page](http://support.spectrocloud.io/). -
- - [Cluster Deployment](cluster-deployment.md) - [Edge](edge.md) @@ -53,8 +51,6 @@ Follow the link for more details: [Download Cluster Logs](../clusters/clusters.m Spectro Cloud maintains an event stream with low-level details of the various orchestration tasks being performed. This event stream is a good source for identifying issues when an operation has not completed after a long time. -
- :::warning Due to Spectro Cloud’s reconciliation logic, intermittent errors show up in the event stream. As an example, after @@ -83,5 +79,3 @@ made to perform the task. Failed conditions are a great source of troubleshootin For example, failure to create a virtual machine in AWS due to the vCPU limit being exceeded would cause this error to be shown to the end-users. They could choose to bring down some workloads in the AWS cloud to free up space. The next time a VM creation task is attempted, it would succeed and the condition would be marked as a success. - -
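If you suspect the AWS vCPU limit is the cause, you can check the account's applied quota before retrying (a minimal sketch; `L-1216C47A` is the quota code commonly associated with Running On-Demand Standard instances, so confirm it in the Service Quotas console for your account and Region):

```shell
# Check the applied On-Demand Standard instances vCPU quota in the current Region.
aws service-quotas get-service-quota \
  --service-code ec2 \
  --quota-code L-1216C47A
```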