
adding option to store bdist files in S3 as a second-level cache #33

Merged Nov 9, 2014 (7 commits)

Conversation

adamfeuer
Contributor

This is so you can check out the code easily. I haven't done documentation yet - if this looks ok, I'll do the docs.

This change really accelerates building on ephemeral Elastic Bamboo instances - since their local filesystems have an empty cache when they start up.

  • stores the bdist files in S3 in addition to the local file system
  • uses the PIP_S3_CACHE_BUCKET environment variable to indicate
    S3 use. If this is not set, the S3 cache will not be used
  • the optional PIP_S3_CACHE_PREFIX environment variable sets a prefix
    (folder) to store the bdist files in, allowing for multiple
    OS-specific or platform-specific bdist folders. The user is responsible for manually managing this.

@adamfeuer mentioned this pull request Oct 10, 2014
@jzoldak

jzoldak commented Nov 4, 2014

I'm also interested in pip-accel with the upload to S3. What would help to get this PR moved along? Fix the broken test? @xolox are there other concerns?

@adamfeuer
Contributor Author

Since there's interest, I'll update documentation and see if I can fix the test. I also want to get this running on one of my build boxes at work.

@coveralls

Coverage Status

Coverage decreased (-2.43%) when pulling 124465f on adamfeuer:s3-cache into 84ab4bf on paylogic:master.

@jzoldak

jzoldak commented Nov 4, 2014

@adamfeuer Awesome. I fixed the test in PR #1 on your fork.
I'm working on adding some tests.

@adamfeuer
Contributor Author

I fixed the problem too, and pushed it - looks like the test passes now:

https://travis-ci.org/paylogic/pip-accel/builds/39992802

:-)


@adamfeuer
Contributor Author

I also committed some basic documentation.

@coveralls

Coverage Status

Coverage decreased (-2.43%) when pulling e7b8086 on adamfeuer:s3-cache into 84ab4bf on paylogic:master.

@@ -22,6 +25,9 @@
pip_accel_cache = expand_user('~/.pip-accel')

# Enable overriding the default locations with environment variables.
if 'PIP_S3_CACHE_BUCKET' in os.environ:
    s3_cache_bucket = expand_user(os.environ['PIP_S3_CACHE_BUCKET'])

From my read of the code, this does not need to be passed through the expand_user method, as that is used to modify references to a user's home directory for the local caches.

Contributor Author


You're right, I will take this out.

@xolox
Member

xolox commented Nov 4, 2014

@adamfeuer and @jzoldak: To clear up any worries, I'm planning to merge this soon (hopefully this week) because I definitely see how this is a useful feature to have for a lot of (potential) users, even if I don't have a direct use for this myself (at the moment). Sorry by the way for being slow to respond and thanks for the messages to remind me, I've been swamped at work due to recent organizational changes that I won't get into here ;-).


The most important thing for me is finding a way to make the Boto dependency optional while still keeping things KISS for the user (a bit more work for me in preparing releases would not be a problem, I always automate release management anyway :-).

I've been thinking about 1) the most elegant way to implement this (I want my projects to be of high software engineering quality) vs. 2) a pragmatic way to implement this (I want to have the feature merged soon so as not to waste the efforts of @adamfeuer and the enthusiasm of @jzoldak). Some ideas based on those conflicting considerations:

1. Architecture astronaut mode: Maybe the most elegant way would be a plug-in mechanism that uses e.g. setuptools entry points (the best thing the Python world has for automatic discovery of plug-ins installed as separate packages, AFAIK). The problem is that it requires a bit of "plumbing" and I'm not sure how easy it is to generalize this pull request into a generic "storage backend" plug-in layer (I still have to dive into the code).

2. KISS mode: A pragmatic way would be to use setuptools' support for 'extras' to define the Boto dependency, but keep all of the code in the main pip-accel package. Then installing pip-accel as pip install pip-accel[s3] could pull in the Boto dependency. The code base wouldn't need any knowledge of this extra, it would just do this:

try:
    import boto
    enable_the_feature()
except ImportError:
    disable_the_feature()
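Fleshed out slightly (the extras mapping and the helper function are illustrative sketches, not code from this pull request), option 2 could look like:

```python
# In setup.py, boto would be declared as an optional extra so that
# `pip install pip-accel[s3]` pulls it in (illustrative mapping):
EXTRAS_REQUIRE = {'s3': ['boto']}

def s3_feature_enabled():
    """Feature detection as in the pseudocode above: the S3 cache is
    enabled only when boto is importable."""
    try:
        import boto  # noqa: F401 -- we only probe for its presence
        return True
    except ImportError:
        return False
```

The code base needs no knowledge of the extra itself; the import probe alone decides whether the S3 backend is active.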

3. KISS^2 mode: Setuptools' extras support in pip seems to have some rough edges, so an even more pragmatic way would be to keep the pip-accel package and additionally introduce a pip-accel-s3 package that pulls in pip-accel and Boto and provides the glue needed to tie them together (i.e. the meat of this pull request). The main package could try to import the 'sub-package' and gracefully handle the ImportError.

I guess the latter two options provide the quickest (most pragmatic) way to get this merged without a hard dependency on Boto. I would still find the first option the most elegant one, but it requires a bit more work up front (to pay off in a hypothetical future ;-). Maybe option two is the sweet spot for merging this.

@xolox
Member

xolox commented Nov 4, 2014

I was wondering, what was your reasoning for putting the binary distribution cache in S3 but not the source distribution cache? (given what you've explained about using hosts with ephemeral local storage)


(deep into the internals of pip-accel)

The reason I ask is that pip-accel needs pip to generate a dependency set, and to do that pip requires unpacked source distributions. This implies that every pip-accel install invocation needs those source distributions to be available for unpacking, even if the complete dependency set can be satisfied using cached binary distributions :-(

This points to a really ugly and inefficient part of the communication between pip and pip-accel which is (AFAICT) very complex to solve because it would require pip-accel to re-implement large parts of the logic already available in pip but impossible to re-use effectively.


(coming back up to a high level)

Knowing what I've explained above, would you choose to also store the source distribution archives in S3? It may seem kind of silly because whether you're downloading from PyPI, an external distribution site or an S3 bucket, it's all HTTP, so who cares? Some reasons I can think of why you might care:

  • Connectivity between EC2 instances and S3 (versus EC2 and "the internet") should be much faster and is cheaper (source: "There is no Data Transfer charge for data transferred between Amazon EC2 and Amazon S3 within the same Region" from aws.amazon.com/s3/faqs).
  • PyPI and distribution sites can go down, effectively blocking dozens if not hundreds (if not etc.) of Python deployments worldwide :-)
  • Old(er) packages can be removed from PyPI, forcing you to upgrade to a possibly incompatible version that requires (time/monetary) investments in your code base (and of course you find this out during a deployment (hopefully while testing and not in production)).

@adamfeuer
Contributor Author

@xolox Regarding why I didn't put the sdists on S3 too - I was trying to keep things simple, and since I am depending on being able to reach PyPI anyway, I figured that was ok.

Regarding how to deal with the Boto dependency - I like your #2 best. I considered making a separate package that would depend on pip-accel (pip-accel-s3) but thought that most people would have a hard time finding out about it- often the hardest thing about using software is to know what software to use. :-)

Anyway, I'm willing to implement #2, update the docs, and push it to the branch. I'll take a shot at that tomorrow.

@jzoldak

jzoldak commented Nov 5, 2014

FWIW, because of the way that pip does unpacking/dependencies, and also because I'm primarily looking to optimize CI worker spinup, I would prefer an implementation that also uploads the source distributions to S3. I can refactor or add that feature after we get through this bit.

@adamfeuer
Contributor Author

@jzoldak Ok, that sounds good. I'd like to keep this PR as small as possible.

- not needed since these variables will be referring to S3
- removing boto module from requirements
- adding documentation to say that boto module is required to use
  S3 cache option
- so they can be used for sdists if needed
- removed boto imports from top of files
- inlined several helper methods so they work with the local import of boto
@coveralls

Coverage Status

Coverage decreased (-3.82%) when pulling 601d118 on adamfeuer:s3-cache into 84ab4bf on paylogic:master.

@adamfeuer
Copy link
Contributor Author

Ok, I implemented @xolox's suggestion #2 for how to deal with making boto an optional dependency - we now check if the S3 cache environment vars are configured - if so, then we check for boto. If either fails, we don't use the cache.

I moved the cache code to __init__.py so that it can be more easily used by other routines - for instance if we want to implement the sdist caching.

Finally, I updated the docs to indicate that if you want to use the S3 cache feature, it's up to you to install boto yourself, otherwise the S3 cache will not be used.

I ran some manual tests and it works. I can write some unit tests - using the moto mock_s3 class - let me know if you want me to do that. It would add a build-time dependency on moto, but would enable us to write fast tests that don't actually hit AWS S3.

If you're ok with the manual testing for now - this is ok to merge.

@jzoldak

jzoldak commented Nov 5, 2014

@adamfeuer I'm a strong advocate of tests, and would love to see unit tests added. Again, I can circle back and add them later if you don't have time or inclination right now.

Perhaps we create a test_requirements.txt file to separate the runtime dependencies from the test dependencies: leave the requirements.txt logic in setup.py alone, and in the .travis.yml file use install: pip install -r test_requirements.txt.

@adamfeuer
Contributor Author

@jzoldak I like tests too - and will add them, using a test_requirements.txt, and will update the travis.yml file as you suggest.

The reason I was hesitant was that the tests will depend on moto and boto, but with the setup you suggest, only people running the tests will be impacted.

logger.debug("S3_CACHE_BUCKET is set, attempting to read file from S3 cache.")
try:
    import boto
    from boto.s3.key import Key

this import isn't needed

@coveralls

Coverage Status

Coverage decreased (-3.82%) when pulling 6ddbb42 on adamfeuer:s3-cache into 84ab4bf on paylogic:master.

@xolox merged commit 6ddbb42 into paylogic:master Nov 9, 2014
xolox added a commit that referenced this pull request Nov 9, 2014
See also pull request #33 on GitHub:

  #33

As discussed in the pull request I definitely see the value of being
able to keep the binary cache in Amazon S3. What I didn't like about the
pull request was that it introduced a call to store_file_into_s3_cache()
inside the module pip_accel.bdist. Conceptually that module has
absolutely nothing to do with Amazon S3 so that had to change :-)

This change set merges pull request #33 but also introduces a new
pluggable cache backend registration mechanism that enables cache
backends to be added without changing pip-accel. This mechanism uses
setuptools' support for custom entry points to discover the relevant
modules and a trivial metaclass to automagically track cache backend
class definitions.

The local binary cache backend and the Amazon S3 cache backend
(introduced in the pull request) have been modified to use the pluggable
registration mechanism. Maybe more backends will follow. We'll see :-)
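The registration mechanism the commit message describes could be sketched roughly like this (class and method names are my assumptions; the actual implementation lives in the merge commit):

```python
class CacheBackendMeta(type):
    """Metaclass that records every concrete cache backend subclass,
    so new backends register themselves simply by being defined."""
    registry = []

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        if bases:  # skip the abstract base class itself
            CacheBackendMeta.registry.append(cls)

class CacheBackend(metaclass=CacheBackendMeta):
    """The simple get()/put() interface that suffices for the binary cache."""
    def get(self, filename):
        raise NotImplementedError

    def put(self, filename, handle):
        raise NotImplementedError

class LocalCacheBackend(CacheBackend):
    """Placeholder local backend; an S3 backend would subclass the
    same base and be discovered the same way."""
    def get(self, filename):
        return None

    def put(self, filename, handle):
        pass
```

Combined with setuptools entry points for discovering modules in separately installed packages, defining a subclass is all a new backend needs to do to be picked up.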
@xolox
Member

xolox commented Nov 9, 2014

Hi @adamfeuer and @jzoldak,

I've merged this pull request and published the result as pip-accel 0.14. Note that I changed literally all of the code introduced in the pull request. That's not to say there was anything wrong with the code, I just had bigger plans (the previously described cache backend plug-in mechanism).

If you are interested you can review the result in the merge commit 8ff50a9. I also renamed the relevant environment variables in order to consistently use the $PIP_ACCEL_ prefix, because I don't want anyone to get confused about where pip stops and pip-accel begins. The environment variables are documented in the readme as suggested in the pull request.

I hope this works well for you guys. Thanks to both of you, both for the pull request and the discussion. Feedback is welcome!

Some relevant notes regarding previous discussions in this pull request:

  • Right now there's no automated test yet, but I did verify manually that the Amazon S3 backend works in CPython 2.6, CPython 2.7, CPython 3.4 and PyPy 2.7 (you can run make test or just tox to verify, but note that CPython 2.7 and PyPy 2.7 generate conflicting binary distribution archives, I hope to fix this tonight).
  • I implemented the 'extra' in setup.py to enable pip install pip-accel[s3]. I'm crossing my fingers that this will work well for the average user interested in the feature :-).
  • Right now only the binary cache can be stored in Amazon S3. It is possible but non-trivial to add support for caching source distribution archives in Amazon S3. If @jzoldak really sees value in this I may try to implement it soon. The main difficulty is that for the binary cache a simple get() / put() interface suffices, but the source index requires an index.html that can be scanned by pip install --find-links=.... Because Amazon S3 does not support server side directory listings this will have to be implemented in pip-accel (one way or another).

xolox added a commit that referenced this pull request Nov 9, 2014
While testing the results of merging pull request #33 I noticed that
when I enabled the Amazon S3 cache backend and ran the test suite using
Tox (which runs the test suite under CPython 2.6, CPython 2.7, CPython
3.4 and PyPy 2.7) either the CPython 2.7 or the PyPy 2.7 tests would
fail. It turns out that the two don't go well together (who would have
thought? ;-). This change set permanently fixes the problem by encoding
the Python implementation in the filenames used to store binary
distribution archives in the binary cache.

See also pull request #33 on GitHub:
  #33
@xolox
Member

xolox commented Nov 9, 2014

Right now there's no automated test yet, but I did verify manually that the Amazon S3 backend works in CPython 2.6, CPython 2.7, CPython 3.4 and PyPy 2.7 (you can run make test or just tox to verify, but note that CPython 2.7 and PyPy 2.7 generate conflicting binary distribution archives, I hope to fix this tonight).

Fixed in d43e260.

@adamfeuer
Contributor Author

@xolox Awesome! I will try it out. I will also try to write an automated test with the moto mock_s3 class.

@jzoldak

jzoldak commented Nov 12, 2014

@xolox and @adamfeuer thanks, this is great.
I get the point about caching source distribution archives being non-trivial and don't see it as an urgent need.
