
adding option to store bdist files in S3 as a second-level cache #33

Merged Nov 9, 2014 (7 commits)

Conversation

adamfeuer
Contributor

This is so you can check out the code easily. I haven't done documentation yet - if this looks ok, I'll do the docs.

This change really accelerates building on ephemeral Elastic Bamboo instances - since their local filesystems have an empty cache when they start up.

  • stores the bdist files in S3 in addition to the local file system
  • uses the PIP_S3_CACHE_BUCKET environment variable to indicate
    S3 use. If this is not set, the S3 cache will not be used
  • the optional PIP_S3_CACHE_PREFIX environment variable sets a prefix
    (folder) to store the bdist files in, allowing for multiple
    OS-specific or platform-specific bdist folders. The user is responsible for manually managing this.

@adamfeuer mentioned this pull request Oct 10, 2014
@jzoldak

jzoldak commented Nov 4, 2014

I'm also interested in pip-accel with the upload to S3. What would help to get this PR moved along? Fix the broken test? @xolox are there other concerns?

@adamfeuer
Contributor Author

Since there's interest, I'll update documentation and see if I can fix the test. I also want to get this running on one of my build boxes at work.

@coveralls

Coverage Status

Coverage decreased (-2.43%) when pulling 124465f on adamfeuer:s3-cache into 84ab4bf on paylogic:master.

@jzoldak

jzoldak commented Nov 4, 2014

@adamfeuer Awesome. I fixed the test in PR #1 on your fork.
I'm working on adding some tests.

@adamfeuer
Contributor Author

I fixed the problem too, and pushed it - looks like the test passes now:

https://travis-ci.org/paylogic/pip-accel/builds/39992802

:-)


@adamfeuer
Contributor Author

I also committed some basic documentation.

@coveralls

Coverage Status

Coverage decreased (-2.43%) when pulling e7b8086 on adamfeuer:s3-cache into 84ab4bf on paylogic:master.

@@ -22,6 +25,9 @@
pip_accel_cache = expand_user('~/.pip-accel')

# Enable overriding the default locations with environment variables.
if 'PIP_S3_CACHE_BUCKET' in os.environ:
    s3_cache_bucket = expand_user(os.environ['PIP_S3_CACHE_BUCKET'])

From my read of the code, this does not need to be passed through the expand_user method, as that is used to modify references to a user's home directory for the local caches.

Contributor Author


You're right, I will take this out.

@xolox
Member

xolox commented Nov 4, 2014

@adamfeuer and @jzoldak: To clear up any worries, I'm planning to merge this soon (hopefully this week) because I definitely see how this is a useful feature to have for a lot of (potential) users, even if I don't have a direct use for this myself (at the moment). Sorry by the way for being slow to respond and thanks for the messages to remind me, I've been swamped at work due to recent organizational changes that I won't get into here ;-).


The most important thing for me is finding a way to make the Boto dependency optional while still keeping things KISS for the user (a bit more work for me in preparing releases would not be a problem, I always automate release management anyway :-).

I've been thinking about 1) the most elegant way to implement this (I want my projects to be of high software engineering quality) vs. 2) a pragmatic way to implement this (I want to have the feature merged soon so as not to waste the efforts of @adamfeuer and the enthusiasm of @jzoldak). Some ideas based on those conflicting considerations:

1. Architecture astronaut mode: Maybe the most elegant way would be a plug-in mechanism that uses e.g. setuptools entry points (the best thing the Python world has for automatic discovery of plug-ins installed as separate packages, AFAIK). The problem is that it requires a bit of "plumbing" and I'm not sure how easy it is to generalize this pull request into a generic "storage backend" plug-in layer (I still have to dive into the code).

2. KISS mode: A pragmatic way would be to use setuptools' support for 'extras' to define the Boto dependency, but keep all of the code in the main pip-accel package. Then installing pip-accel as pip install pip-accel[s3] could pull in the Boto dependency. The code base wouldn't need any knowledge of this extra, it would just do this:

try:
    import boto
    enable_the_feature()
except ImportError:
    disable_the_feature()
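Fleshed out slightly (the extras mapping and the helper function are illustrative sketches, not code from this pull request), option 2 could look like:

```python
# In setup.py, boto would be declared as an optional extra so that
# `pip install pip-accel[s3]` pulls it in (illustrative mapping):
EXTRAS_REQUIRE = {'s3': ['boto']}

def s3_feature_enabled():
    """Feature detection as in the pseudocode above: the S3 cache is
    enabled only when boto is importable."""
    try:
        import boto  # noqa: F401 -- we only probe for its presence
        return True
    except ImportError:
        return False
```

The code base needs no knowledge of the extra itself; the import probe alone decides whether the S3 backend is active.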

3. KISS^2 mode: Setuptools' extras support in pip seems to have some rough edges, so an even more pragmatic way would be to keep the pip-accel package and additionally introduce a pip-accel-s3 package that pulls in pip-accel and Boto and provides the glue needed to tie them together (i.e. the meat of this pull request). The main package could try to import the 'sub-package' and gracefully handle the ImportError.

I guess the latter two options provide the quickest (most pragmatic) way to get this merged without a hard dependency on Boto. I would still find the first option the most elegant one, but it requires a bit more work up front (to pay off in a hypothetical future ;-). Maybe option two is the sweet spot for merging this.

@xolox
Member

xolox commented Nov 4, 2014

I was wondering, what was your reasoning for putting the binary distribution cache in S3 but not the source distribution cache? (given what you've explained about using hosts with ephemeral local storage)


(deep into the internals of pip-accel)

The reason I ask is that pip-accel needs pip to generate a dependency set, and to do that pip requires unpacked source distributions. This implies that every pip-accel install invocation needs those source distributions to be available for unpacking, even if the complete dependency set can be satisfied using cached binary distributions :-(

This points to a really ugly and inefficient part of the communication between pip and pip-accel which is (AFAICT) very complex to solve because it would require pip-accel to re-implement large parts of the logic already available in pip but impossible to re-use effectively.


(coming back up to a high level)

Knowing what I've explained above, would you choose to also store the source distribution archives in S3? It may seem kind of silly because whether you're downloading from PyPI, an external distribution site or an S3 bucket, it's all HTTP, so who cares? Some reasons I can think of why you might care:

  • Connectivity between EC2 instances and S3 (versus EC2 and "the internet") should be much faster and is cheaper (source: "There is no Data Transfer charge for data transferred between Amazon EC2 and Amazon S3 within the same Region" from aws.amazon.com/s3/faqs).
  • PyPI and distribution sites can go down, effectively blocking dozens if not hundreds (if not etc.) of Python deployments worldwide :-)
  • Old(er) packages can be removed from PyPI, forcing you to upgrade to a possibly incompatible version that requires (time/monetary) investments in your code base (and of course you find this out during a deployment (hopefully while testing and not in production)).

@adamfeuer
Contributor Author

@xolox Regarding why I didn't put the sdists on S3 too - I was trying to keep things simple, and since I am depending on being able to reach PyPI anyway, I figured that was ok.

Regarding how to deal with the Boto dependency - I like your #2 best. I considered making a separate package that would depend on pip-accel (pip-accel-s3) but thought that most people would have a hard time finding out about it- often the hardest thing about using software is to know what software to use. :-)

Anyway, I'm willing to implement #2, update the docs, and push it to the branch. I'll take a shot at that tomorrow.

@jzoldak

jzoldak commented Nov 5, 2014

FWIW, because of the way that pip does unpacking/dependencies, and also because I'm primarily looking to optimize CI worker spinup, I would prefer an implementation that also uploads the source distributions to S3. I can refactor or add that feature after we get through this bit.

@adamfeuer
Contributor Author

@jzoldak Ok, that sounds good. I'd like to keep this PR as small as possible.

- not needed since these variables will be referring to S3
- removing boto module from requirements
- adding documentation to say that boto module is required to use
  S3 cache option
- so they can be used for sdists if needed
- removed boto imports from top of files
- inlined several helper methods so they work with the local import of boto
@coveralls

Coverage Status

Coverage decreased (-3.82%) when pulling 601d118 on adamfeuer:s3-cache into 84ab4bf on paylogic:master.

@adamfeuer
Copy link
Contributor Author

Ok, I implemented @xolox's suggestion #2 for how to deal with making boto an optional dependency - we now check if the S3 cache environment vars are configured - if so, then we check for boto. If either fails, we don't use the cache.

I moved the cache code to __init__.py so that it can be more easily used by other routines - for instance if we want to implement the sdist caching.

Finally, I updated the docs to indicate that if you want to use the S3 cache feature, it's up to you to install boto yourself, otherwise the S3 cache will not be used.

I ran some manual tests and it works. I can write some unit tests - using the moto mock_s3 class - let me know if you want me to do that. It would add a build-time dependency on moto, but would enable us to write fast tests that don't actually hit AWS S3.

If you're ok with the manual testing for now - this is ok to merge.

@jzoldak

jzoldak commented Nov 5, 2014

@adamfeuer I'm a strong advocate of tests, and would love to see unit tests added. Again, I can circle back and add them later if you don't have time or inclination right now.

Perhaps we create a test_requirements.txt file to separate the runtime dependencies from the test dependencies: leave the requirements.txt logic in setup.py alone, and in the .travis.yml file use install: pip install -r test_requirements.txt.

@adamfeuer
Contributor Author

@jzoldak I like tests too - and will add them, using a test_requirements.txt, and will update the travis.yml file as you suggest.

The reason I was hesitant was that the tests will depend on moto and boto, but with the setup you suggest, only people running the tests will be impacted.

logger.debug("S3_CACHE_BUCKET is set, attempting to read file from S3 cache.")
try:
    import boto
    from boto.s3.key import Key

this import isn't needed

@coveralls

Coverage Status

Coverage decreased (-3.82%) when pulling 6ddbb42 on adamfeuer:s3-cache into 84ab4bf on paylogic:master.

@xolox merged commit 6ddbb42 into paylogic:master Nov 9, 2014
xolox added a commit that referenced this pull request Nov 9, 2014
See also pull request #33 on GitHub:

  #33

As discussed in the pull request I definitely see the value of being
able to keep the binary cache in Amazon S3. What I didn't like about the
pull request was that it introduced a call to store_file_into_s3_cache()
inside the module pip_accel.bdist. Conceptually that module has
absolutely nothing to do with Amazon S3 so that had to change :-)

This change set merges pull request #33 but also introduces a new
pluggable cache backend registration mechanism that enables cache
backends to be added without changing pip-accel. This mechanism uses
setuptools' support for custom entry points to discover the relevant
modules and a trivial metaclass to automagically track cache backend
class definitions.

The local binary cache backend and the Amazon S3 cache backend
(introduced in the pull request) have been modified to use the pluggable
registration mechanism. Maybe more backends will follow. We'll see :-)
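The registration mechanism the commit message describes could be sketched roughly like this (class and method names are my assumptions; the actual implementation lives in the merge commit):

```python
class CacheBackendMeta(type):
    """Metaclass that records every concrete cache backend subclass,
    so new backends register themselves simply by being defined."""
    registry = []

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        if bases:  # skip the abstract base class itself
            CacheBackendMeta.registry.append(cls)

class CacheBackend(metaclass=CacheBackendMeta):
    """The simple get()/put() interface that suffices for the binary cache."""
    def get(self, filename):
        raise NotImplementedError

    def put(self, filename, handle):
        raise NotImplementedError

class LocalCacheBackend(CacheBackend):
    """Placeholder local backend; an S3 backend would subclass the
    same base and be discovered the same way."""
    def get(self, filename):
        return None

    def put(self, filename, handle):
        pass
```

Combined with setuptools entry points for discovering modules in separately installed packages, defining a subclass is all a new backend needs to do to be picked up.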
@xolox
Member

xolox commented Nov 9, 2014

Hi @adamfeuer and @jzoldak,

I've merged this pull request and published the result as pip-accel 0.14. Note that I changed literally all of the code introduced in the pull request. That's not to say there was anything wrong with the code, I just had bigger plans (the previously described cache backend plug-in mechanism).

If you are interested you can review the result in the merge commit 8ff50a9. I also renamed the relevant environment variables in order to consistently use the $PIP_ACCEL_ prefix, because I don't want anyone to get confused about where pip stops and pip-accel begins. The environment variables are documented in the readme as suggested in the pull request.

I hope this works well for you guys. Thanks to both of you, both for the pull request and the discussion. Feedback is welcome!

Some relevant notes regarding previous discussions in this pull request:

  • Right now there's no automated test yet, but I did verify manually that the Amazon S3 backend works in CPython 2.6, CPython 2.7, CPython 3.4 and PyPy 2.7 (you can run make test or just tox to verify, but note that CPython 2.7 and PyPy 2.7 generate conflicting binary distribution archives, I hope to fix this tonight).
  • I implemented the 'extra' in setup.py to enable pip install pip-accel[s3]. I'm crossing my fingers that this will work well for the average user interested in the feature :-).
  • Right now only the binary cache can be stored in Amazon S3. It is possible but non-trivial to add support for caching source distribution archives in Amazon S3. If @jzoldak really sees value in this I may try to implement it soon. The main difficulty is that for the binary cache a simple get() / put() interface suffices, but the source index requires an index.html that can be scanned by pip install --find-links=.... Because Amazon S3 does not support server side directory listings this will have to be implemented in pip-accel (one way or another).

xolox added a commit that referenced this pull request Nov 9, 2014
While testing the results of merging pull request #33 I noticed that
when I enabled the Amazon S3 cache backend and ran the test suite using
Tox (which runs the test suite under CPython 2.6, CPython 2.7, CPython
3.4 and PyPy 2.7) either the CPython 2.7 or the PyPy 2.7 tests would
fail. It turns out that the two don't go well together (who would have
thought? ;-). This change set permanently fixes the problem by encoding
the Python implementation in the filenames used to store binary
distribution archives in the binary cache.

See also pull request #33 on GitHub:
  #33
@xolox
Member

xolox commented Nov 9, 2014

Right now there's no automated test yet, but I did verify manually that the Amazon S3 backend works in CPython 2.6, CPython 2.7, CPython 3.4 and PyPy 2.7 (you can run make test or just tox to verify, but note that CPython 2.7 and PyPy 2.7 generate conflicting binary distribution archives, I hope to fix this tonight).

Fixed in d43e260.

@adamfeuer
Contributor Author

@xolox Awesome! I will try it out. I will also try to write an automated test with the moto mock_s3 class.

@jzoldak

jzoldak commented Nov 12, 2014

@xolox and @adamfeuer thanks, this is great.
I get the point about caching source distribution archives being non-trivial and don't see it as an urgent need.
