kubectl get pods: lcm is ContainerCreating; prometheus, trainer, and trainingdata are in CrashLoopBackOff #161
Hi @Earl-chen, it looks like the volume configmap was not created. Can you run the following scripts to generate the necessary configmap? Thanks.

    pushd bin
    ./create_static_volumes.sh
    ./create_static_volumes_config.sh
    popd
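Assuming those scripts create the static-volumes-v2 configmap referenced in the errors later in this thread, a quick sanity check afterward would be:

```sh
# Confirm the configmap now exists in the namespace where FfDL is deployed
kubectl get configmap static-volumes-v2
```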
Thank you for taking the time to redeploy FfDL. It looks like many of the pods failed their liveness probes, which means those microservices might not be able to communicate with each other via the KubeDNS server on your cluster. Can you display some logs from your KubeDNS pod in the kube-system namespace?
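For reference, a minimal sketch for pulling those logs, assuming the standard k8s-app=kube-dns label (the pod name below is hypothetical):

```sh
# Find the DNS pods in kube-system
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Tail the logs of one of them
kubectl logs -n kube-system kube-dns-6f4fd4bdf-abcde --all-containers
```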
@Tomcli Thank you for your prompt reply. Unfortunately, I missed KubeDNS; it is not installed. I will install it now and then give you feedback.
@Tomcli I also have the problem above. I ran `kubectl describe pods ffdl-lcm-8d555c7bf-6pg7z --namespace kube-system`; 192.168.110.158 is the k8s node. I also ran the command:
@Eric-Zhang1990
@Tomcli I can get the following info; it shows static-volumes and v2 are there. However, after I restart FfDL, I still encounter the error "SetUp failed for volume "static-volumes-config-volume-v2" : configmap "static-volumes-v2" not found". And when I rerun `./create_static_volumes.sh` and `./create_static_volumes_config.sh`, I get these:
How can I solve it? Thanks.
@Eric-Zhang1990 It looks like you deployed the static-volumes at a different namespace from the rest of FfDL. The configmap has to be in the same namespace as the FfDL pods that mount it; otherwise volume setup keeps failing with "configmap not found".
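A quick way to check for that kind of namespace mismatch (a sketch; `kubectl get` cannot fetch a named resource across all namespaces, hence the grep):

```sh
# See which namespace the configmap actually landed in
kubectl get configmap --all-namespaces | grep static-volumes-v2

# Compare with the namespace of the failing FfDL pods
kubectl get pods --all-namespaces | grep ffdl
```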
@Tomcli Does the following state cause the problem above?
@Tomcli After running for one more hour, the status of 'ffdl-trainingdata*' is still changing; sometimes it is 'Running', sometimes 'CrashLoopBackOff'. When I run `kubectl logs ffdl-trainingdata-74f7cdf66c-lkk2p`, the log shows:

    time="2019-01-23T07:06:18Z" level=debug msg="Log level set to 'debug'"
    ...
    goroutine 1 [running]:

Is the problem "no available connection: no Elasticsearch node available"?
Thank you for taking time to debug this. Elasticsearch should be part of the FfDL deployment (it runs in the storage-0 pod). Can you check whether it is deployed and running on your cluster? Thanks.
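A minimal way to check that, assuming Elasticsearch runs inside the storage-0 pod as the later messages in this thread suggest:

```sh
# Check that the storage pod hosting Elasticsearch is up
kubectl get pod storage-0

# Look for Elasticsearch startup messages or connection errors
# (add -c <container> if the pod runs more than one container)
kubectl logs storage-0 | tail -n 50
```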
@Tomcli I checked that Elasticsearch is deployed, and the log of storage-0 shows "Failed to find a usable hardware address from the network interfaces; using random bytes: 64:4b:61:9d:da:79:4a:d3". What can cause this problem?
@Tomcli Today I ran FfDL again. All components are running, but they all show some number of RESTARTS. Is that all right? Can I use it for training? Thank you.
Hi @Eric-Zhang1990, sorry for the late reply. Regarding the Elasticsearch error, you are supposed to have the following logs at the end of the storage-0 pod's output:
The above logs indicate that the Elasticsearch schema table was created, after which the trainingdata service should be able to connect. Since I see all your pods are running today, you can go ahead and start using it for training. I can follow up if you encounter any further questions. Thank you.
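For the RESTARTS question, a quick way to see whether the pods have stabilized (the pod name below is hypothetical):

```sh
# The RESTARTS column shows how many times each container has restarted
kubectl get pods

# Inspect why a given pod last restarted
kubectl describe pod ffdl-trainer-xxxxx | grep -A 5 "Last State"
```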
@Tomcli Thank you for your patient reply. I checked the log of the storage-0 container, and it shows the same info as yours. One more thing: I run FfDL on 2 servers on a local area network; does the network affect the deployment of FfDL?
Hi @Eric-Zhang1990, it looks like some internal connections are either refused or timed out. If your local area network has low bandwidth, I recommend deploying FfDL without any monitoring services to reduce network throughput, e.g.:

    helm install . --set prometheus.deploy=false
@Tomcli I ran `helm install . --set prometheus.deploy=false` and found that ffdl-trainer is still either CrashLoopBackOff or Running, and it always shows "Back-off restarting failed container".
@Tomcli Thanks, it does seem to be an internal connection issue: I can run everything correctly on one server, but across two servers the status is unstable.
@Tomcli Sorry for bothering you. I have the same problem after deploying FfDL on two other servers (192.168.110.158 and 192.168.110.76 as nodes, 192.168.110.25 as master). Is it also an internal connection issue between pods on different servers? I don't know where the problem is, thanks.
Hi @Eric-Zhang1990, it looks like some of the services are not reachable between your two worker nodes. The errors you had before that failed the liveness probes also indicate that the gRPC calls between microservices on different nodes are not getting through. Since FfDL uses KubeDNS to discover and communicate with each microservice, your KubeDNS may not be set up correctly. Another possibility is that something is blocking inter-node communication (e.g. firewall settings, VLAN, etc.).
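A common way to test whether cluster DNS and service discovery work from inside a pod (a sketch using a throwaway busybox pod; the service being resolved is just an example):

```sh
# busybox 1.28 is often used here because later tags ship a broken nslookup
kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- \
  nslookup kubernetes.default
```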
@Tomcli Thank you for your kind reply. I also think the issue is a communication problem. After many tries, I deleted k8s, redeployed it with the kubeadm tool, and now it runs correctly.
@Tomcli Hello, I have a similar but not identical problem when I deploy FfDL. After I clean up FfDL and rebuild (`make deploy-plugin`), it shows:
You can check the list of storage classes on your cluster by running:
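Presumably the standard command for this:

```sh
# List all storage classes available on the cluster
kubectl get storageclass
```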
Once you have completed the above steps, you can continue with the deployment.
After I installed FfDL according to the prompts, I checked the status and got the following output:
Then I used `helm list` and got the following:
For these incorrect pods, I used `kubectl describe pods <pod name>` to view the information. The results are as follows:
Is there any friend who can give me some advice? What is the problem? I would like to express my heartfelt thanks.
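For pods stuck in CrashLoopBackOff, the previous container instance's logs are usually the most informative; a minimal sketch, using the same placeholder convention as above:

```sh
# Logs from the last crashed container instance of a pod
kubectl logs <pod name> --previous
```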