adding option to store bdist files in S3 as a second-level cache #33
Conversation
Adds an option to store bdist files in S3 as a second-level cache, in addition to the local file system:
- The PIP_S3_CACHE_BUCKET environment variable indicates that S3 should be used.
- The optional PIP_S3_CACHE_PREFIX environment variable sets a prefix (folder) to store the bdist files in, allowing for multiple OS-specific bdist folders (see the sketch below).
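To make the prefix idea concrete, here is a minimal sketch of how a bdist filename could be combined with the optional prefix to form an S3 key. The function name and the example values are purely illustrative, not taken from the PR:

```python
import os
import posixpath

def make_s3_key(filename):
    """Hypothetical helper: combine the optional prefix with a bdist filename
    to form the key under which the archive is stored in the bucket."""
    prefix = os.environ.get('PIP_S3_CACHE_PREFIX', '')
    # e.g. prefix 'linux-x86_64' turns 'Django-1.7.tar.gz' into
    # 'linux-x86_64/Django-1.7.tar.gz' inside the bucket.
    return posixpath.join(prefix, filename) if prefix else filename
```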
I'm also interested in pip-accel with the upload to S3. What would help to get this PR moved along? Fixing the broken test? @xolox are there other concerns?
Since there's interest, I'll update documentation and see if I can fix the test. I also want to get this running on one of my build boxes at work.
@adamfeuer Awesome. I fixed the test in PR #1 on your fork.
I also fixed the problem and pushed it - looks like the test passes: https://travis-ci.org/paylogic/pip-accel/builds/39992802 :-)
I also committed some basic documentation.
```
@@ -22,6 +25,9 @@
pip_accel_cache = expand_user('~/.pip-accel')

# Enable overriding the default locations with environment variables.
if 'PIP_S3_CACHE_BUCKET' in os.environ:
    s3_cache_bucket = expand_user(os.environ['PIP_S3_CACHE_BUCKET'])
```
From my read of the code, this does not need to be passed through the expand_user method, as that is used to modify references to a user's home directory for the local caches.
You're right, I will take this out.
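In other words, the change being agreed on would look roughly like this (a sketch using the variable names from the diff above; the bucket name is not a filesystem path, so it is read from the environment directly):

```python
import os

# Sketch of the suggested simplification: no expand_user() call,
# because the value names an S3 bucket rather than a local path.
if 'PIP_S3_CACHE_BUCKET' in os.environ:
    s3_cache_bucket = os.environ['PIP_S3_CACHE_BUCKET']
```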
@adamfeuer and @jzoldak: To clear up any worries, I'm planning to merge this soon (hopefully this week) because I definitely see how this is a useful feature to have for a lot of (potential) users, even if I don't have a direct use for this myself (at the moment). Sorry by the way for being slow to respond and thanks for the messages to remind me, I've been swamped at work due to recent organizational changes that I won't get into here ;-).

The most important thing for me is finding a way to make the Boto dependency optional while still keeping things KISS for the user (a bit more work for me in preparing releases would not be a problem, I always automate release management anyway :-). I've been thinking about 1) the most elegant way to implement this (I want my projects to be of high software engineering quality) vs. 2) a pragmatic way to implement this (I want to have the feature merged soon so as not to waste the efforts of @adamfeuer and the enthusiasm of @jzoldak). Some ideas based on those conflicting considerations:

1. Architecture astronaut mode: Maybe the most elegant way would be a plug-in mechanism that uses e.g. setuptools entry points (the best thing the Python world has for automatic discovery of plug-ins installed as separate packages, AFAIK). The problem is that it requires a bit of "plumbing" and I'm not sure how easy it is to generalize this pull request into a generic "storage backend" plug-in layer (I still have to dive into the code).

2. KISS mode: A pragmatic way would be to use setuptools' support for 'extras' to define the Boto dependency, but keep all of the code in the main pip-accel package. Installing pip-accel with the S3 extra would then pull in Boto, and at runtime the code could fall back gracefully:

```python
try:
    import boto
    enable_the_feature()
except ImportError:
    disable_the_feature()
```

3. KISS^2 mode: Setuptools' extras support in pip seems to have some rough edges, so an even more pragmatic way would be to keep the Boto dependency out of the requirements entirely and simply document that users who want the S3 cache need to install Boto themselves.

I guess the latter two options provide the quickest (most pragmatic) way to get this merged without a hard dependency on Boto. I would still find the first option the most elegant one, but it requires a bit more work up front (to pay off in a hypothetical future ;-). Maybe option two is the sweet spot for merging this.
I was wondering, what was your reasoning for putting the binary distribution cache in S3 but not the source distribution cache? (given what you've explained about using hosts with ephemeral local storage)

(deep into the internals of pip-accel) The reason I ask is that pip-accel needs pip to generate a dependency set, and to do that pip requires unpacked source distributions. This implies that every run still needs the source distributions to be available locally, even when all of the binary distributions can be fetched from the cache. This points to a really ugly and inefficient part of the communication between pip and pip-accel which is (AFAICT) very complex to solve, because it would require pip-accel to re-implement large parts of the logic already available in pip but impossible to re-use effectively.

(coming back up to a high level) Knowing what I've explained above, would you choose to also store the source distribution archives in S3? It may seem kind of silly because whether you're downloading from PyPI, an external distribution site or an S3 bucket, it's all HTTP, so who cares? Still, there are some reasons I can think of why you might care.
@xolox Regarding why I didn't put the sdists on S3 too - I was trying to keep things simple, and since I am depending on being able to reach PyPI anyway, I figured that was ok. Regarding how to deal with the Boto dependency - I like your option 2 best. I considered making a separate package that would depend on pip-accel (pip-accel-s3) but thought that most people would have a hard time finding out about it - often the hardest thing about using software is knowing what software to use. :-) Anyway, I'm willing to implement option 2, update the docs, and push it to the branch. I'll take a shot at that tomorrow.
FWIW, because of the way that pip does unpacking/dependencies, and also because I'm primarily looking to optimize CI worker spinup, I would prefer an implementation that allows uploading the distributions to S3. I can refactor or add that feature after we get through this bit.
@jzoldak Ok, that sounds good. I'd like to keep this PR as small as possible.
- not needed since these variables will be referring to S3
- removing boto module from requirements - adding documentation to say that boto module is required to use S3 cache option
- so they can be used for sdists if needed - removed boto imports from top of files - inlined several helper methods so they work with the local import of boto
Ok, I implemented @xolox's suggestion 2 for how to deal with making boto an optional dependency - we now check if the S3 cache environment variables are configured, and if so, we then check for boto. If either check fails, we don't use the S3 cache. I also moved the cache code. Finally, I updated the docs to indicate that if you want to use the S3 cache feature, it's up to you to install boto yourself, otherwise the S3 cache will not be used.

I ran some manual tests and it works. I can write some unit tests - using the moto mock_s3 class - let me know if you want me to do that. It would add a build-time dependency on moto, but would enable us to write fast tests that don't actually hit AWS S3. If you're ok with the manual testing for now - this is ok to merge.
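A rough sketch of the fallback logic described above; the function name, log message, and structure are illustrative, not the actual pip-accel code:

```python
import logging
import os

logger = logging.getLogger(__name__)

def s3_cache_enabled():
    """Illustrative sketch: use the S3 cache only when it is configured
    and boto is importable; otherwise fall back to the local cache."""
    if 'PIP_S3_CACHE_BUCKET' not in os.environ:
        return False
    try:
        import boto  # noqa: F401 -- optional dependency
        return True
    except ImportError:
        logger.warning("PIP_S3_CACHE_BUCKET is set but boto is not installed; "
                       "falling back to the local cache only.")
        return False
```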
@adamfeuer I'm a strong advocate of tests, and would love to see unit tests added. Again, I can circle back and add them later if you don't have time or inclination right now. Perhaps we create a test_requirements.txt requirements file to separate the runtime dependencies from the test dependencies: leave the requirements.txt logic in setup.py alone, and in the travis.yml file use `install: pip install -r test_requirements.txt`.
@jzoldak I like tests too - and will add them, using a test_requirements.txt, and will update the travis.yml file as you suggest. The reason I was hesitant was that the tests will depend on moto and boto, but with the setup you suggest, only people running the tests will be impacted.
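As a rough idea of what such a test could look like with moto's mock_s3 decorator (boto 2 era style; the bucket name, key, and contents are made up for the example):

```python
import boto
from boto.s3.key import Key
from moto import mock_s3

@mock_s3
def test_bdist_roundtrip_through_s3():
    # Create a fake in-memory bucket; no real AWS calls are made.
    conn = boto.connect_s3()
    conn.create_bucket('pip-accel-test-bucket')
    bucket = conn.get_bucket('pip-accel-test-bucket')

    # Store a fake bdist archive under a prefix, then read it back.
    key = Key(bucket)
    key.key = 'linux-x86_64/example-1.0.tar.gz'
    key.set_contents_from_string('fake bdist contents')

    stored = bucket.get_key('linux-x86_64/example-1.0.tar.gz')
    assert stored is not None
    assert stored.get_contents_as_string() == 'fake bdist contents'
```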
```python
logger.debug("S3_CACHE_BUCKET is set, attempting to read file from S3 cache.")
try:
    import boto
    from boto.s3.key import Key
```
this import isn't needed
See also pull request #33 on GitHub.

As discussed in the pull request I definitely see the value of being able to keep the binary cache in Amazon S3. What I didn't like about the pull request was that it introduced a call to store_file_into_s3_cache() inside the module pip_accel.bdist. Conceptually that module has absolutely nothing to do with Amazon S3, so that had to change :-)

This change set merges pull request #33 but also introduces a new pluggable cache backend registration mechanism that enables cache backends to be added without changing pip-accel. This mechanism uses setuptools' support for custom entry points to discover the relevant modules and a trivial metaclass to automagically track cache backend class definitions. The local binary cache backend and the Amazon S3 cache backend (introduced in the pull request) have been modified to use the pluggable registration mechanism. Maybe more backends will follow. We'll see :-)
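For readers unfamiliar with the pattern, here is a hedged sketch of how an entry-point based backend registry can work in general. The entry point group, class names, and registry layout are illustrative and not necessarily what pip-accel actually uses:

```python
# Sketch of entry-point based backend discovery (illustrative names only).
import pkg_resources

class CacheBackendMeta(type):
    """Trivial metaclass that records every concrete cache backend subclass."""
    registry = []

    def __init__(cls, name, bases, namespace):
        super(CacheBackendMeta, cls).__init__(name, bases, namespace)
        if bases:  # skip the abstract base class itself
            CacheBackendMeta.registry.append(cls)

def load_backends(group='pip_accel.cache_backends'):
    """Import every module registered under the entry point group so the
    metaclass above sees their class definitions, then return the registry."""
    for entry_point in pkg_resources.iter_entry_points(group):
        entry_point.load()
    return CacheBackendMeta.registry
```

A third-party package could then register its backend module under that entry point group in its own setup.py and be picked up without any changes to pip-accel itself.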
Hi @adamfeuer and @jzoldak, I've merged this pull request and published the result as pip-accel 0.14.

Note that I changed literally all of the code introduced in the pull request. That's not to say there was anything wrong with the code, I just had bigger plans (the previously described cache backend plug-in mechanism). If you are interested you can review the result in the merge commit 8ff50a9. I also renamed the relevant environment variables so that they consistently use the same prefix as the other pip-accel environment variables.

I hope this works well for you guys. Thanks to both of you, both for the pull request and the discussion. Feedback is welcome! Some relevant notes regarding previous discussions in this pull request:
While testing the results of merging pull request #33 I noticed that when I enabled the Amazon S3 cache backend and ran the test suite using Tox (which runs the test suite under CPython 2.6, CPython 2.7, CPython 3.4 and PyPy 2.7), either the CPython 2.7 or the PyPy 2.7 tests would fail. It turns out that the two don't go well together (who would have thought? ;-): they were sharing the same cached binary distribution filenames, so each would clobber the other's archives. This change set permanently fixes the problem by encoding the Python implementation in the filenames used to store binary distribution archives in the binary cache. See also pull request #33 on GitHub.
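For example, a cache filename can keep implementations apart by embedding a tag like the one built below; this is a sketch of the idea, not the exact naming scheme pip-accel uses:

```python
import platform
import sys

def implementation_tag():
    """Sketch: build a tag like 'CPython-2.7' or 'PyPy-2.7' that can be
    embedded in cached bdist filenames to keep implementations apart."""
    return '%s-%i.%i' % (platform.python_implementation(),
                         sys.version_info[0], sys.version_info[1])

# e.g. 'Django-1.7:CPython-2.7.tar.gz' vs. 'Django-1.7:PyPy-2.7.tar.gz'
```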
Fixed in d43e260.
@xolox Awesome! I will try it out. I will also try to write an automated test with the moto mock_s3 class. |
@xolox and @adamfeuer thanks, this is great.
This is so you can check out the code easily. I haven't done documentation yet - if this looks ok, I'll do the docs.
This change really accelerates building on ephemeral Elastic Bamboo instances, since their local filesystems have an empty cache when they start up.
- PIP_S3_CACHE_BUCKET - environment variable that indicates S3 use. If this is not set, the S3 cache will not be used.
- PIP_S3_CACHE_PREFIX - optional environment variable that sets a prefix (folder) to store the bdist files in, allowing for multiple OS-specific or platform-specific bdist folders. The user is responsible for manually managing this.