Maven Indexing
To obtain a representative test base for Jade, we decided to index the Maven Central repository.
The main repository is hosted by Sonatype. It is difficult to access programmatically.
Starting in 2015, Google began hosting a mirror of the repository on Google Cloud Storage. The root of this repository is located at https://storage.googleapis.com/maven-central/.
Although the robots.txt file disallows all user agents, we emailed Les Vogel through the [email protected] address, and he said we were free to do whatever we wanted (within reason).
Accessing files on Google Cloud Storage is relatively straightforward. A (very simple) example is given here under the "Cloud Storage" heading, but a more thorough walkthrough will be given below.
(All example code given is in Python 3.7.0.)
To access files on Google Cloud Storage (GCS), it appears necessary to have an account with Google Cloud Platform (GCP). Accounts can be created for free. Once an account is created, follow these instructions (under the heading "Obtaining and providing service account credentials manually" in the "GCP CONSOLE" box) to obtain an authentication .json file on your local machine. Almost nothing about the configuration matters except that the file correctly corresponds to your account.
Once you have your file (which I will refer to as auth.json) on your local machine, install the GCS Python library with pip:
$ pip install google-cloud-storage
(Note that if you use both Python 2 and 3, you may need to specify pip3 or else a full path to the appropriate pip executable for your Python interpreter of choice.)
To interact with the repository, we need to obtain a Bucket:
from google.cloud import storage
MAVEN_BUCKET = 'maven-central'
AUTH_FILE = 'auth.json'
client = storage.Client.from_service_account_json(AUTH_FILE)
bucket = client.get_bucket(MAVEN_BUCKET)
"Bucket" is the GCP term for what might otherwise be called a "repository". It is essentially just a collection of files (called "blobs" in the GCP lexicon) and metadata about those files. The maven-central bucket contains all of the files of the Maven repository and sufficient metadata to process those files (such as by recreating a local Maven clone, or determining which files are largest, etc.).
We now have access to the bucket in the form of a Bucket object in the interpreter. A Bucket has many methods, but we only care about bucket.list_blobs(), which provides an iterator over all the blobs (objects) in the bucket (repository). (This method accepts an optional parameter, max_results, which sets the maximum number of blobs to iterate through.)
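Before committing to a full listing, it can help to look at just a handful of blobs. The following is a minimal sketch, assuming `bucket` is the Bucket object obtained above; the helper name `preview_blobs` is ours for illustration and is not part of the library.

```python
def preview_blobs(bucket, limit=5):
    """Return (name, size) pairs for at most `limit` blobs in `bucket`.

    `preview_blobs` is a hypothetical helper, not a library function.
    """
    # max_results caps how many blobs the iterator yields, so this stays
    # cheap even on a bucket with tens of millions of entries.
    return [(blob.name, blob.size)
            for blob in bucket.list_blobs(max_results=limit)]
```

Calling `preview_blobs(bucket, limit=10)` should return quickly and gives a feel for the naming scheme before the long-running index build.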
A tab-separated index file (index.tsv) can be generated:
with open('index.tsv', 'w') as f:
    # Number the blobs from 1 so the first column matches the sample output.
    for i, blob in enumerate(bucket.list_blobs(), start=1):
        f.write(f"{i}\t{blob.name}\t{blob.size}\n")
The file will have three columns: the number of the blob in the index, the name of the blob (which is the full file name in the repository), and the size of that blob in bytes. For example, here are the first ten lines of the index.tsv file I generated:
1 README.md 1238
2 index.html 3155
3 repos/central/data/./94a8262a403880.properties 301
4 repos/central/data/./9e9bbc30f020cf.properties 310
5 repos/central/data/./9e9bbc30f020cf.properties.md5 32
6 repos/central/data/./9e9bbc30f020cf.properties.sha1 40
7 repos/central/data/./archetype-catalog.xml 6552513
8 repos/central/data/./archetype-catalog.xml.md5 32
9 repos/central/data/./archetype-catalog.xml.sha1 40
10 repos/central/data/./fb69c44c24b38.properties 307
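Reading the index back is the reverse of writing it. The sketch below assumes the three-column, tab-separated layout described above; `IndexEntry` and `parse_index_line` are illustrative names, not something the original tooling defines.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    """One row of index.tsv (hypothetical record type for illustration)."""
    number: int
    name: str
    size: int


def parse_index_line(line):
    """Parse one 'number<TAB>name<TAB>size' line into an IndexEntry."""
    # Split the number off the front and the size off the back, so a blob
    # name that happened to contain a tab would still be kept whole.
    number, rest = line.rstrip("\n").split("\t", 1)
    name, size = rest.rsplit("\t", 1)
    return IndexEntry(int(number), name, int(size))
```

Iterating over `open('index.tsv')` and mapping each line through `parse_index_line` keeps memory use flat, which matters for a file of this size.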
Note that building the complete index took just shy of 9 hours, and there does not appear to be a faster way to perform this operation.
To download a blob blob-name to a file file-name, simply do:
blob = bucket.get_blob('blob-name')
blob.download_to_filename('file-name')
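Note that `bucket.get_blob` returns `None` when no blob with the given name exists, so the second line above would fail with an `AttributeError` in that case. A slightly more defensive sketch follows; the wrapper name `download_blob` is our own.

```python
def download_blob(bucket, blob_name, file_name):
    """Download `blob_name` to `file_name`.

    Returns True on success and False if the blob does not exist.
    `download_blob` is a hypothetical wrapper, not a library function.
    """
    blob = bucket.get_blob(blob_name)
    if blob is None:  # get_blob signals a missing blob with None
        return False
    blob.download_to_filename(file_name)
    return True
```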
Now we have an index, index.tsv, that lists every file in Maven along with its size. As of this writing, a little processing provided the following statistics from the index:
- Index file is 8.3 GiB
- ~71M blobs (71,364,531)
- ~7.8M .jar files (7,752,139)
- ~270k artifacts (269,285)
- Total size of all files in repo: ~9TiB (10,103,426,642,816 bytes)
- Total size of just .jar files: ~4TiB (4,586,501,379,706 bytes)
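The blob and .jar figures above can be recomputed from index.tsv in a single pass. This is a minimal sketch, assuming the three-column layout described earlier; `summarize_index` and `to_tib` are illustrative names, and the artifact count is left out because the text does not say how it was derived.

```python
def summarize_index(lines):
    """Count blobs and .jar files and total their sizes in bytes."""
    blobs = jars = total_bytes = jar_bytes = 0
    for line in lines:
        # Columns: blob number, blob name, size in bytes.
        _, rest = line.rstrip("\n").split("\t", 1)
        name, size_text = rest.rsplit("\t", 1)
        size = int(size_text)
        blobs += 1
        total_bytes += size
        if name.endswith(".jar"):
            jars += 1
            jar_bytes += size
    return {"blobs": blobs, "jars": jars,
            "total_bytes": total_bytes, "jar_bytes": jar_bytes}


def to_tib(num_bytes):
    """Convert a byte count to tebibytes (1 TiB = 2**40 bytes)."""
    return num_bytes / 2**40
```

For example, `with open('index.tsv') as f: stats = summarize_index(f)` streams the file line by line, and `to_tib(stats["total_bytes"])` gives the repository size in TiB.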
Further processing of the list of files reveals that the predominant file types by extension are:
- .md5 (18,072,497)
- .sha1 (18,050,130)
- .asc (10,896,218)
- .jar (7,752,139)
- .json (6,307,140)
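A tally like the one above can be produced with a counter keyed on the final extension of each blob name. The sketch below is one way to do it, not necessarily how the original numbers were computed; `count_extensions` is an illustrative name.

```python
import posixpath
from collections import Counter


def count_extensions(names):
    """Count blob names by their final extension (e.g. '.md5', '.jar')."""
    counts = Counter()
    for name in names:
        # splitext keeps only the last suffix: 'foo.jar.md5' -> '.md5'.
        # Blob names use '/' separators, hence posixpath.
        ext = posixpath.splitext(name)[1]
        counts[ext or "(none)"] += 1
    return counts
```

`count_extensions(names).most_common(5)` then returns the five most frequent extensions with their counts.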
.md5 and .sha1 files contain only hashes used to verify the integrity of other files. That is, a file foo.bar may have a foo.bar.md5 or foo.bar.sha1 (or both), in which case foo.bar.md5 and/or foo.bar.sha1 contain hashes of the file foo.bar.
.asc files contain GPG signatures for a similar purpose. So a foo.bar.asc file contains the GPG signature of foo.bar.
It may be worth noting that most .asc files seem to also have corresponding .md5 and .sha1 files, such that it is common to see all of the following:
foo.bar
foo.bar.asc
foo.bar.asc.md5
foo.bar.asc.sha1
foo.bar.md5
foo.bar.sha1
We wanted to be sure of this assertion, though. We needed to verify that every foo.md5, foo.sha1, or foo.asc corresponds to an existing foo. To that end, we used some shell one-liners.
First, we produced a file containing just the filenames for every file in the index. We did this so we could later use the comm utility (which does a fast line-by-line comparison of two sorted files). This was done by:
$ perl -ane 'print "$F[1]\n"' < index.tsv > filenames.txt
This puts the filenames in the file filenames.txt.
Then we produced a file containing the names of files which we expect to exist based on the presence of their hash files (either .md5 or .sha1):
$ perl -ane 'if ($F[1] =~ /\.(md5|sha1)/) {print "$`\n"}' < index.tsv > hash-basenames.txt
Applied to the file names from the previous section, this would produce the following (foo.bar and foo.bar.asc themselves do not match the pattern, so nothing is printed for them):
foo.bar.asc
foo.bar.asc
foo.bar
foo.bar
We can see that there are some duplicates. To remove duplicates and also sort the output, we wrote small programs (TODO: link to those):
$ ./uniqsemisort < hash-basenames.txt > sorted-hash-basenames.txt
For example, the previous file names would be reduced to:
foo.bar
foo.bar.asc
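Since the small programs are not linked yet, here is a rough Python stand-in for the deduplicate-and-sort step. It is only a sketch of the effect described above, not the actual uniqsemisort implementation; `unique_sorted` is a name chosen for illustration.

```python
def unique_sorted(lines):
    """Drop duplicate lines and return the rest in sorted order."""
    # Python compares str values by code point, which gives the same order
    # as comparing their UTF-8 bytes. That should match what comm expects
    # under the C locale (an assumption about how comm is invoked here).
    return sorted({line.rstrip("\n") for line in lines})
```

This holds every distinct name in memory at once, which may or may not be acceptable for a list of this size.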
Now we can compare the expected filenames (in sorted-hash-basenames.txt) to the full list of existing filenames (filenames.txt):
$ comm -1 -3 filenames.txt sorted-hash-basenames.txt > hash-comparison.txt
The resulting output file, hash-comparison.txt, contains a list of files which we expected to exist (based on the presence of either a .md5 or .sha1 file) but which did not exist.
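The same check can be expressed as a set difference in Python. This is a sketch under stated assumptions rather than a port of the pipeline above: `missing_base_files` is a hypothetical helper, and it strips the suffix only when it appears at the very end of the name, whereas the Perl pattern cuts at the first `.md5` or `.sha1` anywhere in the name.

```python
def missing_base_files(names, suffixes=(".md5", ".sha1")):
    """Return base names implied by a suffix file but absent from `names`."""
    existing = set(names)
    # For each name ending in one of the suffixes, the name without that
    # suffix is a file we expect to find in the index.
    expected = {
        name[: -len(suffix)]
        for name in existing
        for suffix in suffixes
        if name.endswith(suffix)
    }
    return sorted(expected - existing)
```

Passing `suffixes=(".asc",)` gives the analogous check for signature files. Holding all ~71M names in a set needs several gigabytes of memory, which is one reason the comm approach may be preferable.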
We came up with some 2,985 missing files.
There are 54 central-metadata.json files. These are not listed when browsing Maven Central's folders through the browser, but they are accessible.
There are 495 maven-metadata.xml files. All of these are nested inside of dot-folders (either .DAV or .svn). There is one extra #maven-metadata.xml.
There are 154 *.gz files. 150 of these are in the top-level .index directory and appear to be concerned with the index itself. 4 exist in other places.
A similar process can be used for verifying the .asc (GPG signature) files.
Assuming we already have filenames.txt from the previous steps:
$ perl -ane 'if ($F[1] =~ /\.asc/) {print "$`\n"}' < index.tsv > asc-basenames.txt
Then we uniqsemisort it:
$ ./uniqsemisort < asc-basenames.txt > sorted-asc-basenames.txt
And compare to the original filenames:
$ comm -1 -3 filenames.txt sorted-asc-basenames.txt > asc-comparison.txt
The resulting output file, asc-comparison.txt, contains a list of files which we expected to exist (based on the presence of a .asc file) but which did not exist.
We found 4,471 such missing files, listed in asc-comparison.txt.