
[FLOC-2659] Manage docker storage #123

Open · wants to merge 5 commits into master

Conversation

@wallrj (Contributor) commented Jul 15, 2015

Design for https://clusterhq.atlassian.net/browse/FLOC-2659

I mentioned a few different strategies in the issue.
This branch sketches out how we might reconfigure Docker on our on-demand buildslaves so that its device mapper data file doesn't grow larger than the available space on the root file system.

We might also consider increasing the size of the root partition.

As a follow-up, we should consider upgrading to a newer version of Docker which is better at cleaning up unused container layers from the device mapper data store.
Or we could configure Docker to use a separate block device instead of loopback devices for its data pool, as suggested here:
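To make the two strategies concrete, here's a minimal sketch of the relevant daemon flags (these are Docker's devicemapper storage options; the device paths are assumptions, and on Ubuntu the flags would go in DOCKER_OPTS in /etc/default/docker):

    # Strategy 1: cap the sparse loopback data file (default ~100G) so it
    # can't outgrow the root filesystem:
    docker -d --storage-driver=devicemapper --storage-opt dm.loopdatasize=2G

    # Strategy 2: back the data pool with real block devices instead of
    # loopback files (device names assumed):
    docker -d --storage-driver=devicemapper \
        --storage-opt dm.datadev=/dev/xvdb \
        --storage-opt dm.metadatadev=/dev/xvdc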


@jongiddy (Contributor)

You should proceed with coding this up. Even if we end up doing some of the other options, it seems reasonable to also limit the loopback device to fit in the size of the underlying filesystem. Being sparse doesn't help if the device is being filled.
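(For what it's worth, the sparse-vs-allocated gap is easy to see on the loopback data file itself; path assumed for a stock devicemapper install:)

    # Apparent size is ~100G regardless of the disk underneath:
    sudo ls -lh /var/lib/docker/devicemapper/devicemapper/data
    # ...while the blocks actually allocated grow towards that cap as
    # container layers accumulate:
    sudo du -h /var/lib/docker/devicemapper/devicemapper/data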

@wallrj (Contributor, Author) commented Jul 16, 2015

I've:

  • Experimented with the recommended thinpooldev option on Ubuntu.

    But it's difficult, because it requires manually setting up the LVM PV, VG, and LV and then reconfiguring Docker (a rough sketch of those steps follows this list).

    On CentOS 7 it should be easier, because there's a package called docker-storage-setup which does it all for you, and I even managed to get that script working on Ubuntu too...

    ...but then realised that the version of Docker we're using on Ubuntu (1.3.3) is too old to support that configuration option.

  • So instead I stuck to the original plan of reducing the size of the Docker devicemapper loop devices to 2G.

    • This seems big enough to support ~750 openstack/busybox-http-app containers on Ubuntu 14.04.
    • But when the 2G is used up (on Ubuntu), the docker daemon becomes unresponsive and the docker command-line tools just hang, which I think will make for test failures just as annoying as the ones we already have.

  • The more I think about it, the more it seems like we should be using Docker 1.6 / 1.7 on the Ubuntu builders.
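
For reference, here's a rough sketch of the manual setup that docker-storage-setup automates. The device name (/dev/xvdb), the volume group name, and the pool sizing are all assumptions, and the dm.thinpooldev option needs a Docker newer than the 1.3.3 currently on the Ubuntu slaves:

    # Carve a thin pool out of a spare block device (assumed /dev/xvdb):
    pvcreate /dev/xvdb
    vgcreate docker /dev/xvdb
    lvcreate --wipesignatures y -n thinpool docker -l 95%VG
    lvcreate --wipesignatures y -n thinpoolmeta docker -l 1%VG
    lvconvert -y --zero n -c 512K \
        --thinpool docker/thinpool --poolmetadata docker/thinpoolmeta

    # Then point the daemon at the pool instead of loopback files:
    docker -d --storage-driver=devicemapper \
        --storage-opt dm.thinpooldev=/dev/mapper/docker-thinpool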

Anyway, the code in this branch at the moment has the smaller loop devices, and I've:

  • Built two new AMIs:
    • Ubuntu: ami-6dd6d45d
    • Centos: ami-add1d39d
  • Attempted to deploy a staging master on AWS us-west-2 configured to use those:
    • ec2-54-69-50-28.us-west-2.compute.amazonaws.com
    • 54.69.50.28
...
slaves:
    aws/ubuntu-14.04:
        distribution: "ubuntu-14.04"
        ami: "ami-6dd6d45d"
        slaves: 1
        instance_type: "c3.xlarge"
        max_builds: 4
    aws/centos-7:
        distribution: "centos-7"
        ami: "ami-add1d39d"
        slaves: 1
        instance_type: "c3.xlarge"
        max_builds: 4
...

But I'm getting RequestLimitExceeded errors when the buildmaster container attempts to create on-demand slaves.

Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-] while starting aws/ubuntu-14.04/0
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]         Traceback (most recent call last):
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]           File "/usr/lib64/python2.7/threading.py", line 786, in __bootstrap
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]             self.__bootstrap_inner()
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]           File "/usr/lib64/python2.7/threading.py", line 813, in __bootstrap_inner
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]             self.run()
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]           File "/usr/lib64/python2.7/threading.py", line 766, in run
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]             self.__target(*self.__args, **self.__kwargs)
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]         --- <exception caught here> ---
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]           File "/usr/lib64/python2.7/site-packages/twisted/python/threadpool.py", ..._worker
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]             result = context.call(ctx, function, *args, **kwargs)
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]           File "/usr/lib64/python2.7/site-packages/twisted/python/context.py", lin...Context
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]             return self.currentContext().callWithContext(ctx, func, *args, **kw)
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]           File "/usr/lib64/python2.7/site-packages/twisted/python/context.py", lin...Context
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]             return func(*args,**kw)
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]           File "/srv/buildmaster/flocker_bb/ec2.py", line 173, in thread_start
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]             self.image_metadata = self.driver.get_image_metadata()
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]           File "/srv/buildmaster/flocker_bb/ec2.py", line 381, in get_image_metadata
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]             image = self.get_image()
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]           File "/srv/buildmaster/flocker_bb/ec2.py", line 374, in get_image
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]             images=self.driver.list_images(ex_owner="self"),
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]           File "/usr/lib/python2.7/site-packages/libcloud/compute/drivers/ec2.py",..._images
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]             self.connection.request(self.path, params=params).object
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]           File "/usr/lib/python2.7/site-packages/libcloud/common/base.py", line 73...request
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]             response = responseCls(**kwargs)
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]           File "/usr/lib/python2.7/site-packages/libcloud/common/base.py", line 11..._init__
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]             raise Exception(self.parse_error())
Jul 16 18:40:22 ip-172-31-46-201 buildmaster[17204]: [-]         exceptions.Exception: RequestLimitExceeded: Request limit exceeded.

If someone else wants to pick this up while I'm away, feel free.

@wallrj (Contributor, Author) commented Jul 16, 2015

And an even quicker fix would be to increase the size of the root partition on the Ubuntu and CentOS buildslave instances.
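If we go that route, growing the filesystem after boot is a couple of commands on a resized EBS root volume (device names assumed; growpart ships in the cloud-utils / cloud-utils-growpart packages):

    sudo growpart /dev/xvda 1    # grow partition 1 to fill the larger volume
    sudo resize2fs /dev/xvda1    # ext4 root on Ubuntu 14.04
    # or, for the XFS root on CentOS 7:
    sudo xfs_growfs /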

@wallrj (Contributor, Author) commented Jul 16, 2015

The RequestLimit problem has now passed, and I notice that I was using incorrect AMI image references:

Jul 16 18:57:57 ip-172-31-46-201 buildmaster[17204]: [-]           File "/srv/buildmaster/flocker_bb/ec2.py", line 287, in get_newest_tagged_image
Jul 16 18:57:57 ip-172-31-46-201 buildmaster[17204]: [-]             raise ValueError("Unknown image.", name)
Jul 16 18:57:57 ip-172-31-46-201 buildmaster[17204]: [-]         exceptions.ValueError: ('Unknown image.', u'ami-6dd6d45d')

I can't remember how to reference unpromoted buildslave images.

@tomprince (Contributor)

...
aws:
    ...
    image_tags: {}
...

@exarkun (Contributor) commented Jul 17, 2015

> Attempted to deploy a staging master on AWS us-west-2 configured to use those:

Using a different region would help avoid the rate limit errors.
