Common issues and solutions
Here are some of the common issues, and their resolutions, that have been encountered in the environment.
The following DaemonSet is often already deployed as part of the platform Terraform:
https://gist.github.com/zachomedia/2a3a799a1468915de7414f2bcacda984
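To find the sh-* pod running on the node you are interested in, a listing like the following can help (pod names will differ in your cluster):

```shell
# List the sh-* pods along with the node each one is scheduled on
kubectl get pods -n kube-system -o wide | grep 'sh-'
```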
To connect to a node, first exec into one of the sh-* containers:
kubectl exec -it -n kube-system sh-4zfdd sh
Now you can execute the following chroot command, which will give you the node context:
chroot /mnt /bin/bash
You will then be greeted by the following prompt:
root@aks-nodepool1-XXXXXXXX-vmss00006Q:/#
If a Node has NodeDiskPressure, it may be difficult to use the connect-to-node workflow above. In that case, connect to another Node and then SSH onto the Node that has NodeDiskPressure.
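To identify which Nodes are currently reporting DiskPressure, the node conditions can be checked like this:

```shell
# Show each node's name together with its DiskPressure condition
kubectl describe nodes | grep -iE '^Name:|DiskPressure'
```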
First, exec into one of the sh-* containers on a Node that does not have NodeDiskPressure:
kubectl exec -it -n kube-system sh-4zfdd sh
Once inside the container, ensure the SSH client is installed:
apk --update add openssh-client
At this point you should be able to SSH into the node with NodeDiskPressure. You can get the SSH key from our Vault instance.
ssh -i id_rsa azureuser@aks-nodepool1-XXXXX-vmss0000XX
Once connected to the node, you can determine where the majority of the disk usage is occurring by using the df command:
df -h
If the df command is not sufficient, you can use the more interactive ncdu tool:
apt-get install ncdu -y
ncdu -x /
If gatekeeper has crashed, it will begin to block commands that require further validation. Unfortunately, due to a bug, gatekeeper will not restart on its own and requires manual intervention.
It may be seen when running commands such as kubectl logs, but can present in other ways. One common error message is:
Error from server (InternalError): Internal error occurred: Authorization error (user=$USER, verb=get, resource=nodes, subresource=proxy)
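Before patching, you can confirm that gatekeeper is actually down (this assumes it runs in the gatekeeper-system namespace, consistent with the commands below):

```shell
# A crashed gatekeeper typically shows 0/1 READY or CrashLoopBackOff here
kubectl -n gatekeeper-system get pods
```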
Update the failure policy:
kubectl patch validatingwebhookconfigurations.admissionregistration.k8s.io gatekeeper-validating-webhook-configuration --type=json -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'
Once gatekeeper is running (1/1), run the following to restore the failure policy:
kubectl patch validatingwebhookconfigurations.admissionregistration.k8s.io gatekeeper-validating-webhook-configuration --type=json -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Fail"}]'
In the event that gatekeeper is 1/1 and you are still having issues:
kubectl -n gatekeeper-system rollout restart deployment gatekeeper-controller-manager
kubectl -n gatekeeper-system scale rs gatekeeper-controller-manager-$OLDRSID --replicas=0
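The old ReplicaSet name ($OLDRSID above) can be found by listing the ReplicaSets; the stale one is typically the ReplicaSet whose pods never became ready:

```shell
# List ReplicaSets to identify the old one to scale down
kubectl -n gatekeeper-system get rs
```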
This should allow the new pod to successfully start.
In extremely rare cases, there can be intermittent communication issues between the nodes and the Kubernetes API server. After confirming in the logs that there have been disconnects, restart the tunnelfront pod.
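A minimal sketch of that workflow, assuming tunnelfront runs in kube-system with the component=tunnel label (typical on AKS, but worth verifying in your cluster):

```shell
# Check the tunnelfront logs for disconnects
kubectl -n kube-system logs -l component=tunnel --tail=100

# Delete the pod; its controller will recreate it
kubectl -n kube-system delete pod -l component=tunnel
```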
Sometimes a Pod may be stuck in Terminating, preventing it from being rescheduled. This can be caused by Boathouse being unable to unmount its drives.
To verify this, you can check the kubelet logs. On a Node, you may use the following:
journalctl -u kubelet --since "15 minutes ago"
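To narrow the output to mount-related failures, the logs can be filtered (the grep pattern is just a starting point):

```shell
# Look for unmount failures and orphaned pod messages from the kubelet
journalctl -u kubelet --since "15 minutes ago" | grep -iE 'unmount|orphan'
```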
To remove the Pod forcefully:
kubectl delete pod <pod_name> -n <namespace> --grace-period 0 --force
Try restarting the following components:
kubectl -n kubeflow rollout restart deploy centraldashboard
kubectl -n istio-system rollout restart statefulset authservice
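After restarting, you can wait for both components to come back up before retrying:

```shell
# Block until each rollout completes
kubectl -n kubeflow rollout status deploy centraldashboard
kubectl -n istio-system rollout status statefulset authservice
```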
The following deletes the profile, and its namespace along with it:
kubectl get profile
kubectl delete profile <username>
If the namespace still exists despite the profile being deleted, delete it directly:
kubectl delete ns <username>
Delete a corrupted index:
curl -X DELETE -v --user elastic:$(kubectl -n daaas get secret daaas-es-elastic-user '--template={{ .data.elastic }}' | base64 --decode) https://elastic.covid.cloud.statcan.ca/.kibana_5
Delete a corrupted document:
curl -X DELETE -v --user elastic:$PASSWORD https://elastic.covid.cloud.statcan.ca/.kibana_4/_doc/test-test-test
Delete a corrupted visualization:
curl -X DELETE -v --user elastic:$PASS https://elastic.covid.cloud.statcan.ca/.kibana_4/_doc/visualization:rl_learning_daily_testing_july4
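To identify problem indices in the first place, the cluster's indices and their health can be listed with the _cat API (same credentials as above):

```shell
# List all indices with their health, status, and size
curl -s --user elastic:$PASSWORD "https://elastic.covid.cloud.statcan.ca/_cat/indices?v"
```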
Connectivity issues to Managed DB from Kubernetes cluster:
- Enable VNET peering
- Add Containers subnet to firewall on database
- Add Istio ServiceEntry for db
- Enable Service Endpoints for Microsoft.Sql on Kubernetes virtual network
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: managed-postgresql-db
  namespace: default
spec:
  addresses:
  - XX.XXX.XXX.XXX
  hosts:
  - manageddb.postgres.database.azure.com
  location: MESH_EXTERNAL
  ports:
  - name: psql
    number: 5432
    protocol: TLS
  resolution: DNS
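Once the above are in place, connectivity can be verified from inside the cluster. A sketch using a throwaway client pod, with placeholder user and database names:

```shell
# Launch a temporary postgres client pod and attempt a TLS connection
kubectl run psql-test -it --rm --image=postgres:14 -- \
  psql "host=manageddb.postgres.database.azure.com port=5432 user=<user> dbname=<db> sslmode=require"
```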