Container deployment issues seen on a Swarm mode cluster with rebooted workers running on VMs #1112

sisudhir · 2018-01-18T09:01:52Z

Description

In a mixed Swarm mode cluster (bare metal and VMs) with Contiv 1.1.7, docker service scale issues are seen after rebooting the worker VMs.
Before the reboot, the cluster had containers running on all nodes (bare metal and VMs) using the Contiv network and policy framework.

Expected Behavior

A VM reboot should not affect container deployment performance on the Contiv network.

Observed Behavior

After rebooting the VMs that were running containers, the containers moved successfully to the surviving worker nodes, but docker service scale takes an unusually long time. Connection errors also appear in the netmaster log:
Error dial tcp 10.65.121.129:9002: getsockopt: no route to host connecting to 10.65.121.129:%!s(uint16=9002). Retrying..
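
The error above means the TCP connection from netmaster to port 9002 on the rebooted worker never got established. A quick way to confirm this from the netmaster host is a plain TCP probe. The following is a minimal sketch, not part of Contiv; the function name can_reach is made up for illustration, and the address and port are the ones from the log line.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "no route to host", "connection refused", and timeouts.
        return False


# Example, using the address and port from the netmaster error:
# print(can_reach("10.65.121.129", 9002))
```

If the probe still fails after the VM is fully back up, the problem is more likely routing or the vNIC on the worker than netmaster itself.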

Steps to Reproduce (for bugs)

Created a Docker EE 17.06 cluster in swarm mode with a mixed topology of bare metal and VM nodes. Master nodes are on bare metal and worker nodes are on VMs.
Installed Contiv 1.1.7 and created the back-end Contiv network and policies. Applied the policies via a group with a Contiv tag and created the corresponding Docker network.
Created a Docker service using the Contiv network as backend and checked network endpoint connectivity between the containers and the SVIs. All working as expected.
Rebooted 2 worker VMs; the containers running on them moved successfully to the surviving nodes.
Tried scaling the same Docker service to add 5 more containers on the same Contiv network.
Service scale took unusually long: more than 30 minutes to add the 5 containers.
Saw connection errors to the rebooted worker VMs in the netmaster logs.
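
To put a number on "unusually long" and compare runs before and after the reboot, the scale command can be wrapped in a timer. This is only a rough sketch: timed_run is a hypothetical helper, and the service name web and the replica count in the usage comment are placeholders, not values from this setup.

```python
import subprocess
import time


def timed_run(cmd):
    """Run cmd (a list of arguments) and return (exit code, elapsed seconds)."""
    start = time.monotonic()
    completed = subprocess.run(cmd)
    return completed.returncode, time.monotonic() - start


# Example with a placeholder service name and replica count:
# rc, seconds = timed_run(["docker", "service", "scale", "web=10"])
# print(f"exit={rc} elapsed={seconds:.1f}s")
```

Without the --detach flag, docker service scale waits for the service to converge on 17.06, so the elapsed time should roughly reflect how long the new tasks take to come up.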

Your Environment

netctl version - 1.1.7/v2Plugin
Orchestrator version (e.g. kubernetes, mesos, swarm): Swarm 17.06/UCP-2.3*
Operating System and version: RHEL7.3
Contiv Data Path: physical vNIC exposed by ESXi on worker VMs in pass-through mode
contiv-logs.tar.gz


vhosakot · 2018-01-18T17:15:28Z

Looking at the logs in contiv-logs.tar.gz, this looks like an RPC issue when netmaster connects to Ofnet:

netmaster.log has:

time="Jan 18 08:36:21.576831134" level=warning msg="Error dial tcp 10.65.121.129:9002: getsockopt: no route to host connecting to 10.65.121.129:%!s(uint16=9002). Retrying.."
time="Jan 18 08:36:22.578994895" level=error msg="Failed to connect to Rpc server 10.65.121.129:9002"
time="Jan 18 08:36:22.579084442" level=error msg="Error calling RPC: OfnetAgent.AddMaster. Could not connect to server"
time="Jan 18 08:36:22.579133952" level=error msg="Error calling AddMaster rpc call on node {10.65.121.129 9002}. Err: Could not connect to server"
time="Jan 18 08:36:22.579152875" level=error msg="Error adding node {10.65.121.129 9002}. Err: Could not connect to server"

Can you send the docker daemon's logs when you see this issue?
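
To see which agents netmaster keeps failing to reach, the log can be summarized per endpoint. A minimal sketch, assuming the log lines keep the format shown in the excerpt above; unreachable_agents and the file path in the usage comment are illustrative names, not Contiv tooling.

```python
import re
from collections import Counter

# Matches the "Failed to connect to Rpc server" line format from the excerpt.
FAILED_RPC = re.compile(r'msg="Failed to connect to Rpc server (\S+?)"')


def unreachable_agents(lines):
    """Count 'Failed to connect to Rpc server' errors per host:port."""
    counts = Counter()
    for line in lines:
        match = FAILED_RPC.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Example, assuming the log was extracted to ./netmaster.log:
# with open("netmaster.log") as log:
#     for endpoint, n in unreachable_agents(log).most_common():
#         print(endpoint, n)
```

If only the rebooted workers show up in the output, that would point at stale agent entries or connectivity to those VMs rather than a general RPC problem.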

blaksmit · 2018-01-18T17:19:27Z

On today's call, there was an ask to see if this is an issue on K8s as well or just in Docker Swarm mode.

vhosakot · 2018-01-18T17:55:19Z

@blaksmit This issue is seen in Docker Swarm mode. I'm pretty sure it cannot be seen in k8s, as k8s does not have the docker service scale command that exposes it.

blaksmit · 2018-01-18T19:01:17Z

@vhosakot the comment was to see whether a similar VM scale issue is seen with K8s.

vhosakot · 2018-01-18T20:11:11Z

@blaksmit I see, got it. We could test if this issue is seen when kubectl scale is done (k8s equivalent of docker service scale).

sisudhir · 2018-01-23T08:28:47Z

Please note the changed title.
This is not just a scale issue: we also see it when deploying a new service with just a single container.

g1rana · 2018-03-29T00:58:37Z

@sisudhir, is this issue seen on every iteration of your failure test? Is it possible for you to share your setup with me? I can take a look at the setup while the error is occurring.

sisudhir changed the title ~~Docker service scale issues seen with rebooted workers running on VMs~~ Container deployment issues seen on a Swarm mode cluster with rebooted workers running on VMs Jan 23, 2018
