
Support request for VirtualBox #1082

Closed
lyudmilalala opened this issue Mar 1, 2023 · 5 comments

@lyudmilalala

When I tried to deploy OpenFaaS CE, the gateway and queue-worker pods continuously crashed.

Environment

Self-built Kubernetes cluster
Two Ubuntu 22.04 VirtualBox VMs, one as master and one as worker
Docker 20.10
Kubernetes 1.21.14
Helm 3.4.2

Steps to Reproduce

I tried two deployment approaches.

Ideally, to avoid failures caused by losing the connection to GitHub, I want everything to be built from downloaded sources, so I followed the instructions here.

$ git clone https://github.com/openfaas/faas-netes.git
$ cd faas-netes
$ kubectl apply -f namespaces.yml
$ kubectl -n openfaas create secret generic basic-auth \
    --from-literal=basic-auth-user=admin \
    --from-literal=basic-auth-password=1234abcd
$ kubectl apply -f ./yaml
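
To confirm whether the rollout ever completes, the deployments can be waited on before testing the gateway (a minimal check; `rollout status` simply blocks until the pods become ready or the timeout expires):

$ kubectl rollout status -n openfaas deploy/gateway --timeout=120s
$ kubectl get pods -n openfaas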

I also tried deploying directly with Helm while debugging.

$ kubectl apply -f namespaces.yml
$ helm repo add openfaas https://openfaas.github.io/faas-netes/
$ helm repo update && helm upgrade openfaas --install openfaas/openfaas --namespace openfaas

Both approaches work with my local Docker Desktop Kubernetes, but both run into the errors below on my VirtualBox cluster.

Expected Behaviour

All of the OpenFaaS pods should reach the Running state.

Current Behaviour

nodes

NAME            STATUS   ROLES                  AGE    VERSION
spinq-master    Ready    control-plane,master   111m   v1.21.14
spinq-worker1   Ready    <none>                 73m    v1.21.14

pods

NAMESPACE     NAME                                   READY   STATUS             RESTARTS   AGE
kube-system   coredns-59d64cd4d4-7vvmf               1/1     Running            0          108m
kube-system   coredns-59d64cd4d4-qwqzj               1/1     Running            0          108m
kube-system   etcd-spinq-master                      1/1     Running            36         109m
kube-system   kube-apiserver-spinq-master            1/1     Running            36         108m
kube-system   kube-controller-manager-spinq-master   1/1     Running            24         109m
kube-system   kube-flannel-ds-ptlb7                  1/1     Running            0          105m
kube-system   kube-flannel-ds-wpr2p                  1/1     Running            0          71m
kube-system   kube-proxy-22h2k                       1/1     Running            0          108m
kube-system   kube-proxy-5m79w                       1/1     Running            0          71m
kube-system   kube-scheduler-spinq-master            1/1     Running            52         109m
openfaas      alertmanager-64554b5687-xrksb          1/1     Running            0          51m
openfaas      basic-auth-plugin-d4cbc7686-rlsqq      1/1     Running            0          51m
openfaas      gateway-7c447458db-4fckp               1/2     CrashLoopBackOff   20         51m
openfaas      nats-697d4bd9fd-dnq2w                  1/1     Running            0          51m
openfaas      prometheus-77f7cf8ddd-nw2sj            1/1     Running            0          51m
openfaas      queue-worker-5758796689-p26ld          0/1     CrashLoopBackOff   12         51m

services

NAMESPACE     NAME                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                  AGE
default       kubernetes          ClusterIP   10.96.0.1        <none>        443/TCP                  109m
kube-system   kube-dns            ClusterIP   10.96.0.10       <none>        53/UDP,53/TCP,9153/TCP   109m
openfaas      alertmanager        ClusterIP   10.98.97.53      <none>        9093/TCP                 51m
openfaas      basic-auth-plugin   ClusterIP   10.110.62.220    <none>        8080/TCP                 51m
openfaas      gateway             ClusterIP   10.100.124.192   <none>        8080/TCP                 51m
openfaas      gateway-external    NodePort    10.97.102.237    <none>        8080:31112/TCP           51m
openfaas      nats                ClusterIP   10.105.161.242   <none>        4222/TCP                 51m
openfaas      prometheus          ClusterIP   10.98.244.36     <none>        9090/TCP                 51m

journalctl -xeu kubelet on worker

Feb 28 08:12:34 spinq-worker1 kubelet[2930]: I0228 08:12:34.568147    2930 scope.go:111] "RemoveContainer" containerID="8321173eaf4f93512d64219b2f114f92ecfcbcffc967ee5b677107d06593e0e0"
Feb 28 08:12:44 spinq-worker1 kubelet[2930]: I0228 08:12:44.566767    2930 scope.go:111] "RemoveContainer" containerID="fe3967621cb0835012bb6cdf8b7fae4ea388f138bc076bd1ca591889d2fd32af"
Feb 28 08:12:44 spinq-worker1 kubelet[2930]: E0228 08:12:44.567352    2930 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"gateway\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=gateway pod=gateway-7c447458db-4fckp_openfaas(dcf77f24-2974-4e7b-b6af-740c194fc385)\"" pod="openfaas/gateway-7c447458db-4fckp" podUID=dcf77f24-2974-4e7b-b6af-740c194fc385
Feb 28 08:12:58 spinq-worker1 kubelet[2930]: I0228 08:12:58.573820    2930 scope.go:111] "RemoveContainer" containerID="fe3967621cb0835012bb6cdf8b7fae4ea388f138bc076bd1ca591889d2fd32af"
Feb 28 08:12:58 spinq-worker1 kubelet[2930]: E0228 08:12:58.576620    2930 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"gateway\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=gateway pod=gateway-7c447458db-4fckp_openfaas(dcf77f24-2974-4e7b-b6af-740c194fc385)\"" pod="openfaas/gateway-7c447458db-4fckp" podUID=dcf77f24-2974-4e7b-b6af-740c194fc385
Feb 28 08:13:13 spinq-worker1 kubelet[2930]: I0228 08:13:13.569488    2930 scope.go:111] "RemoveContainer" containerID="fe3967621cb0835012bb6cdf8b7fae4ea388f138bc076bd1ca591889d2fd32af"
Feb 28 08:13:13 spinq-worker1 kubelet[2930]: E0228 08:13:13.573155    2930 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"gateway\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=gateway pod=gateway-7c447458db-4fckp_openfaas(dcf77f24-2974-4e7b-b6af-740c194fc385)\"" pod="openfaas/gateway-7c447458db-4fckp" podUID=dcf77f24-2974-4e7b-b6af-740c194fc385
Feb 28 08:13:19 spinq-worker1 kubelet[2930]: I0228 08:13:19.266765    2930 scope.go:111] "RemoveContainer" containerID="8321173eaf4f93512d64219b2f114f92ecfcbcffc967ee5b677107d06593e0e0"
Feb 28 08:13:19 spinq-worker1 kubelet[2930]: I0228 08:13:19.267853    2930 scope.go:111] "RemoveContainer" containerID="c95396b1fce5396f4a1b24b601c5e7bf1d9fb6906c938b7f289804c4612fa856"
Feb 28 08:13:19 spinq-worker1 kubelet[2930]: E0228 08:13:19.268277    2930 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"queue-worker\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=queue-worker pod=queue-worker-5758796689-p26ld_openfaas(50dc53bc-0d23-46b6-99c1-4ad6efd857aa)\"" pod="openfaas/queue-worker-5758796689-p26ld" podUID=50dc53bc-0d23-46b6-99c1-4ad6efd857aa
Feb 28 08:13:26 spinq-worker1 kubelet[2930]: I0228 08:13:26.573069    2930 scope.go:111] "RemoveContainer" containerID="fe3967621cb0835012bb6cdf8b7fae4ea388f138bc076bd1ca591889d2fd32af"
Feb 28 08:13:26 spinq-worker1 kubelet[2930]: E0228 08:13:26.573599    2930 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"gateway\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=gateway pod=gateway-7c447458db-4fckp_openfaas(dcf77f24-2974-4e7b-b6af-740c194fc385)\"" pod="openfaas/gateway-7c447458db-4fckp" podUID=dcf77f24-2974-4e7b-b6af-740c194fc385
Feb 28 08:13:31 spinq-worker1 kubelet[2930]: I0228 08:13:31.566061    2930 scope.go:111] "RemoveContainer" containerID="c95396b1fce5396f4a1b24b601c5e7bf1d9fb6906c938b7f289804c4612fa856"
Feb 28 08:13:31 spinq-worker1 kubelet[2930]: E0228 08:13:31.572709    2930 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"queue-worker\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=queue-worker pod=queue-worker-5758796689-p26ld_openfaas(50dc53bc-0d23-46b6-99c1-4ad6efd857aa)\"" pod="openfaas/queue-worker-5758796689-p26ld" podUID=50dc53bc-0d23-46b6-99c1-4ad6efd857aa
Feb 28 08:13:38 spinq-worker1 kubelet[2930]: I0228 08:13:38.566159    2930 scope.go:111] "RemoveContainer" containerID="fe3967621cb0835012bb6cdf8b7fae4ea388f138bc076bd1ca591889d2fd32af"
Feb 28 08:13:38 spinq-worker1 kubelet[2930]: E0228 08:13:38.569156    2930 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"gateway\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=gateway pod=gateway-7c447458db-4fckp_openfaas(dcf77f24-2974-4e7b-b6af-740c194fc385)\"" pod="openfaas/gateway-7c447458db-4fckp" podUID=dcf77f24-2974-4e7b-b6af-740c194fc385

faas-netes container log (keeps running)

Trace[1437902002]: [30.004677407s] [30.004677407s] END
E0228 08:05:41.157731       1 reflector.go:138] github.com/openfaas/faas-netes/main.go:178: Failed to watch *v1.Deployment: failed to list *v1.Deployment: Get "https://10.96.0.1:443/apis/apps/v1/namespaces/openfaas-fn/deployments?resourceVersion=8576": dial tcp 10.96.0.1:443: i/o timeout
I0228 08:05:41.597921       1 trace.go:205] Trace[793909336]: "Reflector ListAndWatch" name:github.com/openfaas/faas-netes/main.go:184 (28-Feb-2023 08:05:11.596) (total time: 30001ms):
Trace[793909336]: [30.001335641s] [30.001335641s] END
E0228 08:05:41.597983       1 reflector.go:138] github.com/openfaas/faas-netes/main.go:184: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get "https://10.96.0.1:443/api/v1/namespaces/openfaas-fn/endpoints?resourceVersion=8734": dial tcp 10.96.0.1:443: i/o timeout
I0228 08:05:41.663422       1 trace.go:205] Trace[1149509107]: "Reflector ListAndWatch" name:github.com/openfaas/faas-netes/main.go:193 (28-Feb-2023 08:05:11.658) (total time: 30004ms):
Trace[1149509107]: [30.00471121s] [30.00471121s] END
E0228 08:05:41.663474       1 reflector.go:138] github.com/openfaas/faas-netes/main.go:193: Failed to watch *v1.Profile: failed to list *v1.Profile: Get "https://10.96.0.1:443/apis/openfaas.com/v1/namespaces/openfaas/profiles?resourceVersion=8210": dial tcp 10.96.0.1:443: i/o timeout
W0228 08:06:58.775549       1 reflector.go:436] github.com/openfaas/faas-netes/main.go:178: watch of *v1.Deployment ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0228 08:06:58.775658       1 reflector.go:436] github.com/openfaas/faas-netes/main.go:193: watch of *v1.Profile ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0228 08:06:58.775800       1 reflector.go:436] github.com/openfaas/faas-netes/main.go:184: watch of *v1.Endpoints ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
I0228 08:07:18.205958       1 trace.go:205] Trace[1342381351]: "Reflector ListAndWatch" name:github.com/openfaas/faas-netes/main.go:184 (28-Feb-2023 08:07:03.368) (total time: 14837ms):

gateway container log (exits)

&{0xc00009e280}
2023/02/28 08:03:37 HTTP Read Timeout: 1m5s
2023/02/28 08:03:37 HTTP Write Timeout: 1m5s
2023/02/28 08:03:37 Binding to external function provider: http://127.0.0.1:8081/
2023/02/28 08:03:37 Async enabled: Using NATS Streaming.
2023/02/28 08:03:37 Opening connection to nats://nats.openfaas.svc.cluster.local:4222
2023/02/28 08:03:37 Connect: nats://nats.openfaas.svc.cluster.local:4222

queue-worker container log (exits)

Starting queue-worker (Community Edition). Version: dev Git Commit:
Connect: nats://nats.openfaas.svc.cluster.local:4222
can't connect to nats://nats.openfaas.svc.cluster.local:4222: dial tcp: i/o timeout
panic: can't connect to nats://nats.openfaas.svc.cluster.local:4222: dial tcp: i/o timeout

goroutine 1 [running]:
log.Panic({0xc000151df8, 0x80471a, 0x1d})
        /usr/local/go/src/log/log.go:354 +0x65
main.main()
        /go/src/github.com/openfaas/nats-queue-worker/main.go:181 +0x759
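
Since both crashing containers fail while dialing nats.openfaas.svc.cluster.local, a quick way to test whether that name resolves at all inside the cluster is a throwaway pod (a sketch; busybox:1.28 is chosen because its nslookup output is reliable, and the pod name is arbitrary):

$ kubectl run -i --rm dns-test --image=busybox:1.28 --restart=Never \
    -- nslookup nats.openfaas.svc.cluster.local

If this times out or reports no server, the problem is cluster DNS rather than NATS itself.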

Since faas-netes reported a dial timeout, I tried to curl the same API address from the worker node's terminal. It sent back a response instead of failing, so I think there is no problem with the network connection between the two nodes.

$ curl https://10.96.0.1:443/api/v1/namespaces/openfaas-fn/endpoints?resourceVersion=4669
curl: (60) SSL certificate problem: unable to get local issuer certificate
More details here: https://curl.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
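
The certificate error here is actually expected: curl on the node does not trust the cluster CA, so this only proves TCP connectivity from the node, not from a pod. A closer reproduction of what faas-netes does is to call the API from inside a pod with the mounted service account credentials (a sketch, assuming default token auto-mounting; the pod name and the curlimages/curl image are arbitrary):

$ kubectl run -i --rm api-test -n openfaas --image=curlimages/curl --restart=Never --command -- sh -c \
    'curl -s --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
       -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
       https://10.96.0.1:443/api/v1/namespaces/openfaas-fn/endpoints'

Even a 403 Forbidden response here would show that the network path works; the faas-netes logs above show an i/o timeout instead.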

I found this post with a similar error message, but I do not use Istio and know nothing about it, so I have no idea whether these are related problems.

@alexellis
Member

/set title: Support request for VirtualBox

The derek bot changed the title from "gateway and queue-worker pods continuously get to CrashLoopBackOff" to "Support request for VirtualBox" on Mar 1, 2023.
@alexellis
Member

Hi there @lyudmilalala

Thanks for your interest in OpenFaaS CE

I can confirm that end to end tests ran 19 hours ago and passed: https://github.com/openfaas/faas-netes/actions/runs/4296169879/jobs/7487552935. We've made no changes since.

That means you probably have an issue with your own configuration. Bear in mind that GitHub was also having an outage today: https://twitter.com/alexellisuk/status/1630907921189011456?s=20. From the output it looks like you have networking issues, or haven't configured Kubernetes, routing, or DNS correctly, which is outside the bounds of OpenFaaS CE.

I'd suggest deploying to a cloud or creating a K3s cluster with one or more VMs: https://docs.openfaas.com/deployment/kubernetes/

https://www.openfaas.com/blog/openfaas-linode/

Finally, if VirtualBox and or Kubernetes are getting in your way of trying OpenFaaS, you have faasd which you may prefer:

http://github.com/openfaas/faasd

Alex

@lyudmilalala
Author

Thanks for your quick response, Alex @alexellis

I am trying to find a FaaS solution for an on-premises private cloud, so I tested it on a self-built k8s cluster on Ubuntu VMs.

I will try deploying on two VMware machines and on two physical servers. I will also definitely try faasd. As for K3s, I am still not clear about its role in a Kubernetes solution. In the past I took it as a solution for IoT devices, and it feels unnecessary for deploying a function cluster on typical Linux servers. I need to study this part more.

Do you have any recommended tutorials (posts or videos) about deploying OpenFaaS on a self-built Kubernetes cluster on raw VMs or servers? Most tutorials and posts I have seen deploy the cluster on a single node or in the cloud. It would be even better if the tutorial had a detailed explanation of the network setup.

Also, do you have any suggestions for the next step I should take to figure out this network error? After finding the dial error, I built two simple Python Flask microservices on the cluster, and they worked properly (users could call the exposed service by its external IP, and the exposed service could call the private service by its cluster IP). As a result, I have lost my direction for a next attempt. Are there any small tests I can try to locate the specific network configuration failure?
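
One small test that mirrors the failure more closely: hit the API server's service IP from a pod pinned to the worker node, since that is where the crashing pods are scheduled (a sketch; the nodeName override and pod name are only for illustration, and /version is typically readable without credentials):

$ kubectl run -i --rm net-test --image=curlimages/curl --restart=Never \
    --overrides='{"apiVersion":"v1","spec":{"nodeName":"spinq-worker1"}}' \
    --command -- curl -sk --max-time 5 https://10.96.0.1:443/version

A JSON version response (or even a 401/403) means the pod-to-service path works; a timeout reproduces the dial error from the logs.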

@lyudmilalala
Author

I did a few more experiments, and I think it is a DNS problem. If I do not add nameserver 8.8.8.8 to /etc/resolv.conf in the pod, I cannot resolve domain names to IPs.

I think a permanent fix requires debugging the network layer (I currently use Flannel). I will try Weave or Calico.
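
Before swapping CNIs, two quick checks might narrow this down (a sketch; the Flannel pod name comes from the listing above, and 10.96.0.10 is kube-dns's ClusterIP from the services table):

$ kubectl run -i --rm dns-check --image=busybox:1.28 --restart=Never \
    -- cat /etc/resolv.conf    # should point at 10.96.0.10, not need 8.8.8.8
$ kubectl -n kube-system logs kube-flannel-ds-wpr2p --tail=50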

I haven't found any useful posts on Google so far, I think because I failed to use accurate keywords. I hope someone can give me a hint on this.

@alexellis
Member

I've already advised you on what to do.

Use K3s, it's much simpler.

Try it on a public cloud, you'll get it working in less than 60 seconds.

https://www.openfaas.com/blog/openfaas-linode/

https://github.com/alexellis/k3sup
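
For reference, bootstrapping K3s onto an existing Ubuntu VM with k3sup takes only a couple of commands (a sketch; the IP and user are placeholders for your own VM):

$ curl -sLS https://get.k3sup.dev | sh
$ sudo install k3sup /usr/local/bin/
$ k3sup install --ip 192.168.56.10 --user ubuntu
$ export KUBECONFIG=$(pwd)/kubeconfig
$ kubectl get nodes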

If you like, you can even try it on your VirtualBox VM.

Perhaps you may also like to try multipass.run, which is less of a legacy product?

Alex
