
[FLOC-2659] Manage docker storage #123

Open · wants to merge 5 commits into master

Conversation

@wallrj (Contributor) commented Jul 15, 2015

Design for https://clusterhq.atlassian.net/browse/FLOC-2659

I mentioned a few different strategies in the issue.
This branch sketches out how we might reconfigure Docker on our on-demand buildslaves so that its device mapper data file doesn't grow larger than the available space on the root file system.

We might also consider increasing the size of the root partition.

As a follow-up, we should consider upgrading to a newer version of Docker which is better at cleaning up unused container layers from the device mapper data store.
Or we could configure Docker to use a separate block device instead of loopback devices for its data pool, as suggested here:
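To make the two strategies concrete, here's a minimal sketch of the relevant daemon flags (these are Docker's devicemapper storage options; the device paths are assumptions, and on Ubuntu the flags would go in DOCKER_OPTS in /etc/default/docker):

    # Strategy 1: cap the sparse loopback data file (default ~100G) so it
    # can't outgrow the root filesystem:
    docker -d --storage-driver=devicemapper --storage-opt dm.loopdatasize=2G

    # Strategy 2: back the data pool with real block devices instead of
    # loopback files (device names assumed):
    docker -d --storage-driver=devicemapper \
        --storage-opt dm.datadev=/dev/xvdb \
        --storage-opt dm.metadatadev=/dev/xvdc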


@jongiddy (Contributor)

You should proceed with coding this up. Even if we end up doing some of the other options, it seems reasonable to also limit the loopback device to fit in the size of the underlying filesystem. Being sparse doesn't help if the device is being filled.
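(For what it's worth, the sparse-vs-allocated gap is easy to see on the loopback data file itself; path assumed for a stock devicemapper install:)

    # Apparent size is ~100G regardless of the disk underneath:
    sudo ls -lh /var/lib/docker/devicemapper/devicemapper/data
    # ...while the blocks actually allocated grow towards that cap as
    # container layers accumulate:
    sudo du -h /var/lib/docker/devicemapper/devicemapper/data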

@wallrj (Contributor, Author) commented Jul 16, 2015

I've:

  • Experimented with the recommended thinpooldev option on Ubuntu.

    But it's difficult, because it requires manually setting up the LVM PV, VG, and LV and then reconfiguring Docker (a rough sketch of those steps follows this list).

    On CentOS 7 it should be easier, because there's a package called docker-storage-setup which does it all for you, and I even managed to get that script working on Ubuntu too...

    ...but then realised that the version of Docker we're using on Ubuntu (1.3.3) is too old to support that configuration option.

  • So instead I stuck to the original plan of reducing the size of the Docker devicemapper loop devices to 2G.

    • This seems big enough to support ~750 openstack/busybox-http-app containers on Ubuntu 14.04.
    • But when the 2G is used up (on Ubuntu), the docker daemon becomes unresponsive and the docker command-line tools just hang, which I think will make for test failures just as annoying as the ones we already have.

  • The more I think about it, the more it seems like we should be using Docker 1.6 / 1.7 on the Ubuntu builders.
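
For reference, here's a rough sketch of the manual setup that docker-storage-setup automates. The device name (/dev/xvdb), the volume group name, and the pool sizing are all assumptions, and the dm.thinpooldev option needs a Docker newer than the 1.3.3 currently on the Ubuntu slaves:

    # Carve a thin pool out of a spare block device (assumed /dev/xvdb):
    pvcreate /dev/xvdb
    vgcreate docker /dev/xvdb
    lvcreate --wipesignatures y -n thinpool docker -l 95%VG
    lvcreate --wipesignatures y -n thinpoolmeta docker -l 1%VG
    lvconvert -y --zero n -c 512K \
        --thinpool docker/thinpool --poolmetadata docker/thinpoolmeta

    # Then point the daemon at the pool instead of loopback files:
    docker -d --storage-driver=devicemapper \
        --storage-opt dm.thinpooldev=/dev/mapper/docker-thinpool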

Anyway, the code in this branch at the moment has the smaller loop devices, and I've:

  • Built two new AMIs:
    • Ubuntu: ami-6dd6d45d
    • Centos: ami-add1d39d
  • Attempted to deploy a staging master on AWS us-west-2 configured to use those:
    • ec2-54-69-50-28.us-west-2.compute.amazonaws.com
    • 54.69.50.28
...
slaves:
    aws/ubuntu-14.04:
        distribution: "ubuntu-14.04"
        ami: "ami-6dd6d45d"
        slaves: 1
        instance_type: "c3.xlarge"
        max_builds: 4
    aws/centos-7:
        distribution: "centos-7"
        ami: "ami-add1d39d"
        slaves: 1
        instance_type: "c3.xlarge"
        max_builds: 4
...

But I'm getting RequestLimitExceeded errors when the buildmaster container attempts to create on-demand slaves.

Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-] while starting aws/ubuntu-14.04/0
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]         Traceback (most recent call last):
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]           File "/usr/lib64/python2.7/threading.py", line 786, in __bootstrap
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]             self.__bootstrap_inner()
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]           File "/usr/lib64/python2.7/threading.py", line 813, in __bootstrap_inner
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]             self.run()
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]           File "/usr/lib64/python2.7/threading.py", line 766, in run
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]             self.__target(*self.__args, **self.__kwargs)
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]         --- <exception caught here> ---
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]           File "/usr/lib64/python2.7/site-packages/twisted/python/threadpool.py", ..._worker
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]             result = context.call(ctx, function, *args, **kwargs)
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]           File "/usr/lib64/python2.7/site-packages/twisted/python/context.py", lin...Context
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]             return self.currentContext().callWithContext(ctx, func, *args, **kw)
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]           File "/usr/lib64/python2.7/site-packages/twisted/python/context.py", lin...Context
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]             return func(*args,**kw)
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]           File "/srv/buildmaster/flocker_bb/ec2.py", line 173, in thread_start
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]             self.image_metadata = self.driver.get_image_metadata()
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]           File "/srv/buildmaster/flocker_bb/ec2.py", line 381, in get_image_metadata
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]             image = self.get_image()
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]           File "/srv/buildmaster/flocker_bb/ec2.py", line 374, in get_image
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]             images=self.driver.list_images(ex_owner="self"),
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]           File "/usr/lib/python2.7/site-packages/libcloud/compute/drivers/ec2.py",..._images
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]             self.connection.request(self.path, params=params).object
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]           File "/usr/lib/python2.7/site-packages/libcloud/common/base.py", line 73...request
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]             response = responseCls(**kwargs)
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]           File "/usr/lib/python2.7/site-packages/libcloud/common/base.py", line 11..._init__
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]             raise Exception(self.parse_error())
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]         exceptions.Exception: RequestLimitExceeded: Request limit exceeded.

If someone else wants to pick this up while I'm away, feel free.

@wallrj (Contributor, Author) commented Jul 16, 2015

And an even quicker fix would be to increase the size of the root partition on the Ubuntu and CentOS buildslave instances.
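If we go that route, growing the filesystem after boot is a couple of commands on a resized EBS root volume (device names assumed; growpart ships in the cloud-utils / cloud-utils-growpart packages):

    sudo growpart /dev/xvda 1    # grow partition 1 to fill the larger volume
    sudo resize2fs /dev/xvda1    # ext4 root on Ubuntu 14.04
    # or, for the XFS root on CentOS 7:
    sudo xfs_growfs /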

@wallrj (Contributor, Author) commented Jul 16, 2015

The RequestLimit problem has now passed, and I notice that I was using incorrect AMI image references:

Jul 16 18:57:57 ip-172-31-46-201 buildmaster[17204]: [-]           File "/srv/buildmaster/flocker_bb/ec2.py", line 287, in get_newest_tagged_image
Jul 16 18:57:57 ip-172-31-46-201 buildmaster[17204]: [-]             raise ValueError("Unknown image.", name)
Jul 16 18:57:57 ip-172-31-46-201 buildmaster[17204]: [-]         exceptions.ValueError: ('Unknown image.', u'ami-6dd6d45d')

I can't remember how to reference unpromoted buildslave images.

@tomprince (Contributor)

...
aws:
    ...
    image_tags: {}
...

@exarkun (Contributor) commented Jul 17, 2015

> Attempted to deploy a staging master on AWS us-west-2 configured to use those:

Using a different region would help avoid the rate limit errors.
