Load PyPI dependency info into a database #2
Ultimately what we want is to pass in a |
Hmm ... https://graphcommons.com/ |
Jackpot! |
Well, jackpot: https://github.com/anvaka/pm#individual-visualizations. Indexers for 13 ecosystems! |
As of https://mail.python.org/pipermail/distutils-sig/2015-January/025683.html we don't have to worry about mutability in PyPI. That means we never need to update info once we have it. It's possible to delete packages but I think we don't want to do that. We want to keep old info around. |
And we only need one-week granularity. If we update once a day we'll be well inside our loop. |
Where the mutability comes in is that dependencies are subject to a range. If I depend on |
What's the data structure we want? |
We want to support taking in a list of files of type
Take care to handle different files with the same name in the upload. |
Two more hops:
|
Briefly spoke on the phone with @whit537 and summarized the basic guidance for PyPI data analytics that require traversing dependency trees. Note that the metadata available via PyPI is limited right now, and may be for some time, as PEP 426 is indefinitely deferred; the deferment section of that PEP points to the other PEPs addressing these topics. In order to crawl PyPI for dependency links, you'll need general metadata for "indexing" as well as the package files themselves to obtain dependency information. Recommended approach:
All of the above endpoints and tools, with the exception of XML-RPC, are designed to minimize impact on the PyPI backend infrastructure, as they are easily cached in our CDN. |
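(The recommended-approach list itself didn't survive above, but as a rough sketch of the kind of read-only access being described — per-package metadata from the JSON API, plus the release-file URLs we'd have to download — something like the following works against today's pypi.org. At the time these endpoints lived at pypi.python.org, and the package name here is just an example.)

    import json
    from urllib.request import urlopen

    def package_metadata(name):
        """Fetch general metadata for one package from PyPI's JSON API."""
        with urlopen("https://pypi.org/pypi/{}/json".format(name)) as resp:
            return json.loads(resp.read().decode("utf-8"))

    if __name__ == "__main__":
        meta = package_metadata("requests")
        info = meta["info"]
        print(info["name"], info["version"], info["license"])
        # The release files we'd have to download to get at dependency info:
        for f in meta["urls"]:
            print(f["packagetype"], f["url"])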
A table for
PyPI gives us a changelog that includes
We'll need to keep a table of

    for package in might_depend_on(new_release):
        for release in package.releases:
            release.update_deps(new_release) |
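(A hedged sketch of how that loop could be driven by the changelog mentioned above, using PyPI's XML-RPC changelog_since_serial call — the one interface flagged in the earlier comment as not CDN-friendly, so it would want to be polled gently. might_depend_on() is stubbed here, and the Package/Release objects and the action strings are assumptions, not anything PyPI or bandersnatch provides.)

    import xmlrpc.client

    client = xmlrpc.client.ServerProxy("https://pypi.org/pypi")

    def might_depend_on(new_release):
        """Stub: in the real thing, a query over the dependency ranges we've stored."""
        return []

    def process_changes(last_serial):
        # changelog_since_serial returns (name, version, timestamp, action, serial)
        # tuples; the exact action strings ("new release", etc.) should be checked
        # against real output rather than trusted from this sketch.
        for name, version, timestamp, action, serial in client.changelog_since_serial(last_serial):
            if action == "new release":
                new_release = (name, version)
                for package in might_depend_on(new_release):
                    for release in package.releases:
                        release.update_deps(new_release)
            last_serial = max(last_serial, serial)
        return last_serial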
[edit—what he said :] |
So for ETL we can look at bandersnatch ... before the call I had been thinking we'd roll our own (I'd already started based on #2 (comment)) that would look like this:
|
Another point @ewdurbin made on the phone is that some projects vendor in their dependencies (e.g., that's how Requests uses urllib3), and an approach that looks only at |
I've downloaded and run a bit of bandersnatch. I am finding tarballs. I think we should be able to get what we need from those, without having to resort to the JSON API (bandersnatch does fetch JSON under the hood, but afaict it throws it away). The name, version, and license are in the
One issue with bandersnatch is that it doesn't download tarballs that aren't on PyPI. Another is that we don't actually need to keep the tarballs around after we process them; doing so would cost about $50/mo at Digital Ocean. Will we be able to easily convince bandersnatch not to redownload things we've already downloaded and then deleted? |
If we can delete old tarballs without tripping up bandersnatch, then we should be able to run a bandersnatch process, and a second process to consume tarballs: ETL them and then throw them away. This second process can run cronishly, offset from bandersnatch, and simply walk the tree looking for tarballs. |
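(As a sketch of that second process, under the assumption that bandersnatch's tree contains ordinary sdists — which normally carry a PKG-INFO and, for setuptools projects, a *.egg-info/requires.txt — the consumer could look something like this. The mirror path and the load_release() hook are made up for illustration.)

    import os
    import tarfile

    MIRROR_ROOT = "/srv/pypi/web/packages"  # assumed; wherever bandersnatch writes files

    def load_release(pkg_info, requires):
        """Stub: in the real thing, write the extracted info to the database."""

    def extract_metadata(path):
        """Return (PKG-INFO text, requires.txt text or None) from an sdist tarball."""
        pkg_info, requires = None, None
        with tarfile.open(path) as tar:
            for member in tar.getmembers():
                if member.name.endswith("/PKG-INFO") and pkg_info is None:
                    pkg_info = tar.extractfile(member).read().decode("utf-8", "replace")
                elif member.name.endswith(".egg-info/requires.txt"):
                    requires = tar.extractfile(member).read().decode("utf-8", "replace")
        return pkg_info, requires

    def chomp(root=MIRROR_ROOT):
        for dirpath, dirnames, filenames in os.walk(root):
            for filename in filenames:
                if not filename.endswith((".tar.gz", ".tar.bz2")):
                    continue
                path = os.path.join(dirpath, filename)
                pkg_info, requires = extract_metadata(path)
                load_release(pkg_info, requires)
                os.remove(path)  # ETL it, then throw the tarball away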
I've moved http://gdr.rocks/ over to NYC1 and am attaching a 500 GB volume. |
|
|
That puts us at about eight hours to finish. |
Okay! Let's do some local testing wrt snatching tarballs out from under bandersnatch. Also: ETL. |
From reading through |
Here's what it looks like when I
|
The weird thing is that on the first run through, it processes packages in alphabetical order by name, not in numeric order by serial. It only writes |
New status is 2410197. |
Re-ran, updated three packages, status is 2410203. |
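(Given that status file, one way for the offset consumer cron to notice new work — a guess at the mechanics, with both paths assumed: compare the serial bandersnatch last wrote against the serial our own ETL run last recorded.)

    from pathlib import Path

    STATUS = Path("/srv/pypi/status")              # written by bandersnatch (assumed path)
    LAST_PROCESSED = Path("/srv/gdr/last-serial")  # written by our own ETL job

    def needs_processing():
        """Return (should_run, serial_to_record_when_done)."""
        mirrored = int(STATUS.read_text().strip())
        processed = int(LAST_PROCESSED.read_text().strip()) if LAST_PROCESSED.exists() else 0
        return mirrored > processed, mirrored

    def mark_processed(serial):
        LAST_PROCESSED.write_text(str(serial))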
Okay! |
So it looks like there will be order of magnitude 1,000s of packages to update if we run bandersnatch once a day. Call it 10% of the total, so we could probably manage with 50 GB of disk for $5/mo. |
Looking at file types (h/t):
|
What if the only file we have for a release is an MSI? |
Or |
I guess let's focus on |
I guess it's a time/space trade-off. A fresh mirror stays within an hour of PyPI. If we update every 30 minutes then it should be 100s or maybe even 10s of packages, and we can probably manage with 5 GB or maybe even 1 GB. |
If it's under 1 GB then we can keep it on the droplet and not use a separate volume, though if we're going to run a database at Digital Ocean to back gdr.rocks then we should maybe store it on a volume for good decoupling. |
Managed Postgres at DO starts at $19/mo. |
Alright, let's keep this lightweight. One $5/mo droplet, local Postgres. The only reason we are mirroring PyPI is to extract dependency info, which we can't get from metadata. We don't need to store all metadata, because PyPI itself gives us a JSON API, which we can even hit from the client side if we want to (I checked: we have |
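(A sketch of how lightweight the database itself could be under that plan: one table of extracted dependency lines, version ranges and all, with everything else deferred to PyPI's JSON API. Table, column, and DSN names here are illustrative, and psycopg2 is assumed for talking to the local Postgres.)

    import psycopg2

    SCHEMA = """
    CREATE TABLE IF NOT EXISTS dependencies (
        package     text NOT NULL,  -- the package whose sdist we read
        version     text NOT NULL,  -- that package's release
        requirement text NOT NULL,  -- one requirement line, version range and all
        UNIQUE (package, version, requirement)
    );
    """

    def init_db(dsn="dbname=gdr"):
        with psycopg2.connect(dsn) as conn:
            with conn.cursor() as cur:
                cur.execute(SCHEMA)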
Okay! I think I've figured out incremental updates. Bandersnatch needs a
Now, it will over-download, but if we process frequently enough, we should be okay. It looks like if we process every 30 minutes then we'll have well less than 100 packages to update. Packages generally have well less than 100 release files, though when Requests or Django pushes a new release we'll have a lot of old ones to download. I guess we want to tune the cron to run frequently enough to keep the modal batch size small, while still giving us enough time to complete processing for the occasional larger batch. Logging ftw. |
On the other hand, the JSON is heavy (100s of kB), and the |
How about we grab |
Since we're going to be importing untrusted |
But we'd have it in the tarchomper process instead of in the web app. |
Extension finder died mid-write. :]
|
Okay! So! |
|
Nomenclature update:
|
http://setuptools.readthedocs.io/en/latest/setuptools.html#new-and-changed-setup-keywords
Okay, let's not worry about
On the other hand we should include |
Blorg. Tests are failing after installing bandersnatch, because it |
PR in #5. |
In light of the shift of focus at gratipay/gratipay.com#4135 (comment), I've removed the droplet, volume, and floating IP from Digital Ocean to avoid incurring additional cost. |
Resolving dependencies by running
pip install -r requirements.txt
and then sniffing the results is a reeeaalllyy inefficient way to go about it. Better is going to be loading each package index into a database. Let's do this once, starting with PyPI. First step is to basically download PyPI.