mikesaelim/arXivOAIHarvester

arXiv OAI Harvester

By: Mike Saelim, [email protected]

Current version: 0.1.2

This is an OAI Harvester library for the arXiv preprint repository, written in Java. At this point, it is still a work in progress.

Upon completion, this library will allow its users to query the arXiv OAI repository for items in the "arXivRaw" metadata format, giving them access to all the metadata of all the articles on the arXiv.

Informational links:

Covering my ass

I mostly designed this library to support a future larger project of mine, and for educational purposes to give me more practice coding and designing. No guarantees on current reliability and future support! If you prioritize these things, I might suggest using one of the more general Java OAI Harvesters out there, which (hopefully) have reached some kind of stability.

Try it out!

I've supplied a CommandLineInterface class with an executable main() method that executes one request to the arXiv OAI repository. You can run this either by building the jar and running it, or by typing

./gradlew run

in the repository directory. This should give you a feel for how it all behaves.

Publishing it locally

While the library is still in development, I haven't yet published its artifacts to a central Maven repository. Until I do, you can publish the jar locally with

./gradlew build publishToMavenLocal

Usage for development

I strongly recommend reading up on the above links before using this library, because this library will not insulate you from all the peculiarities of the OAI protocol.

The API is designed so that users construct a central harvester object, and then pass GetRecord and ListRecords requests to it. The harvester issues the appropriate HTTP calls to the arXiv OAI repository, parses the result, and returns the response to the user. Responses contain all the data in the "arXivRaw" metadata format.

It will probably take at least a few seconds, and possibly minutes, to process each request. This is because a lot of data may be transmitted by the repository, and because the repository throttles requests when they are coming in too fast. Both of these reasons are out of the control of the harvester. The harvester does have some parameters you can set to control how long it is allowed to wait before retrying, and how many times it will retry. This also means that harvesting should be thought of as more of a bulk-retrieval-once-a-day-to-a-local-database thing, and not thought of as a retrieval-on-demand thing.

A single harvester should be used for all of the requests, and this harvester should only be used by a single thread. Currently, the implementation of the harvester is blocking and not thread-safe. Furthermore, because the arXiv OAI repository throttles your requests based on your IP, multiple harvesters or threads will end up blocking each other anyway. In the future, I may rewrite the implementation to queue multiple requests and resolve them asynchronously.
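If several threads in your application need records, one way to honor the single-thread rule is to funnel every call through a dedicated worker thread. The following is only a sketch of that idea, not part of the library: SingleThreadHarvestQueue is a hypothetical name, and the Callable you submit is assumed to wrap a call such as harvester.harvest(request).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical wrapper (not part of the library): runs every submitted call on one worker thread, in order. */
final class SingleThreadHarvestQueue implements AutoCloseable {
    // A single-thread executor guarantees that queued calls never overlap.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /** Queues a call; in real use the Callable would wrap something like harvester.harvest(request). */
    <T> Future<T> submit(Callable<T> harvestCall) {
        return worker.submit(harvestCall);
    }

    /** Stops accepting new calls; calls that are already queued still run. */
    @Override
    public void close() {
        worker.shutdown();
    }
}
```

Callers on any thread get a Future back and can block on it when they need the result. The requests themselves still run one at a time, so this does not make harvesting any faster.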

Importing the library

At this point, while the library is still in development, you can just clone the repo, build a jar, and throw that jar into your project. Once this is ready for release, I will probably host the jar(s) somewhere.

Preparing the harvester

The simplest preparation is to pass a CloseableHttpClient into the constructor:

CloseableHttpClient httpClient = HttpClients.createDefault();
ArxivOAIHarvester harvester = new ArxivOAIHarvester(httpClient);

This will construct a harvester with the default settings for three important flow control parameters:

  • the maximum number of retries: 3,
  • the minimum wait time between retries: 10 seconds, and
  • the maximum wait time between retries: 5 minutes.

These are needed because the repository can respond to requests with a 503 Retry-After response, which says "Yo, either I'm super busy right now or you've been requesting too frequently, chill for X seconds and try again." And if you don't wait, you'll probably get another 503 Retry-After and have to wait longer. The default harvester respects the repository's flow control by retrying at most 3 times, waiting at least 10 seconds between requests, and timing out if the repository requests a wait longer than 5 minutes. You can also set these values yourself. For example, for 5 retries with waits between 20 seconds and 30 minutes:

CloseableHttpClient httpClient = HttpClients.createDefault();
ArxivOAIHarvester harvester = new ArxivOAIHarvester(httpClient, 5, Duration.ofSeconds(20), Duration.ofMinutes(30));
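To make the interplay of the three parameters concrete, here is a minimal sketch of the decision described above. It is not the library's actual implementation: RetryAfterPolicy and nextWait are hypothetical names, and the sketch assumes that a requested wait shorter than the minimum is raised to the minimum, and that a requested wait longer than the maximum ends the attempt.

```java
import java.time.Duration;
import java.util.Optional;

/** Hypothetical stand-in for the harvester's flow control; not the library's real code. */
final class RetryAfterPolicy {
    private final int maxRetries;
    private final Duration minWait;
    private final Duration maxWait;

    RetryAfterPolicy(int maxRetries, Duration minWait, Duration maxWait) {
        if (maxRetries < 0 || minWait.compareTo(maxWait) > 0) {
            throw new IllegalArgumentException("need maxRetries >= 0 and minWait <= maxWait");
        }
        this.maxRetries = maxRetries;
        this.minWait = minWait;
        this.maxWait = maxWait;
    }

    /**
     * Decides how long to wait before the next retry.
     *
     * @param retriesSoFar number of retries already made for this request
     * @param requested    the wait the repository asked for in its 503 Retry-After response
     * @return the wait to apply, or empty if the harvester should give up
     */
    Optional<Duration> nextWait(int retriesSoFar, Duration requested) {
        if (retriesSoFar >= maxRetries) {
            return Optional.empty(); // retry budget is used up
        }
        if (requested.compareTo(maxWait) > 0) {
            return Optional.empty(); // the repository wants a longer wait than we allow
        }
        // Never retry sooner than the minimum wait, even if the repository allows it.
        return Optional.of(requested.compareTo(minWait) < 0 ? minWait : requested);
    }
}
```

With the default values (3 retries, 10 seconds, 5 minutes), a requested wait of 2 seconds becomes a 10-second wait, while a requested wait of 6 minutes makes the sketch give up.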

It is also suggested that you supply "User-Agent" and "From" HTTP headers to identify yourself to the repository:

harvester.setUserAgentHeader("Dave's Super Curious Bot, v0.1");
harvester.setFromHeader("[email protected]");

Retrieving a single record from the repository

To retrieve a single record by its identifier, construct a GetRecordRequest and pass it into the harvester:

GetRecordRequest request = new GetRecordRequest("oai:arXiv.org:1302.2146");
GetRecordResponse response = harvester.harvest(request);
ArticleMetadata record = response.getRecord();

The identifier string will always be "oai:arXiv.org:" followed by the arXiv identifier you're probably more used to, so you can even leave the "oai:arXiv.org:" part off. See arXiv's pages about the OAI for more information.
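The relationship between the two identifier forms can be sketched as a small helper. This is an illustration only: ArxivIdentifiers and toOaiIdentifier are hypothetical names, and the library's own handling of a missing prefix may differ in detail.

```java
/** Hypothetical helper (not part of the library) illustrating the two identifier forms. */
final class ArxivIdentifiers {
    static final String OAI_PREFIX = "oai:arXiv.org:";

    /** Returns the full OAI identifier, adding the prefix only when it is missing. */
    static String toOaiIdentifier(String id) {
        String trimmed = id.trim();
        if (trimmed.isEmpty()) {
            throw new IllegalArgumentException("identifier must not be blank");
        }
        return trimmed.startsWith(OAI_PREFIX) ? trimmed : OAI_PREFIX + trimmed;
    }
}
```

Both "1302.2146" and "oai:arXiv.org:1302.2146" map to the same full identifier here.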

The resulting GetRecordResponse will contain the record, if it exists, as an ArticleMetadata object. If no record by that identifier was found, then response.getRecord() will return null. If there are any issues sending the request, receiving the response, or parsing the response, the harvester will throw a runtime exception or error - see the javadoc for ArxivOAIHarvester for a full list.

Retrieving a range of records from the repository

To retrieve a range of records between two dates, and/or of a specific set, construct a ListRecordsRequest and pass it into the harvester. Because many records may be returned, the repository will page the results (usually in pages of 1000 records), and you'll need to issue a new resumption request for each page. But the API takes care of the resumption token stuff for you, if you follow the recommended pattern:

ListRecordsRequest request = new ListRecordsRequest(LocalDate.of(2015, 6, 29), null, "physics:hep-ph");
while (request != ListRecordsRequest.NONE) {
    ListRecordsResponse response = harvester.harvest(request);
    List<ArticleMetadata> records = response.getRecords();
    // do whatever
    request = response.resumption();
}

The request takes three optional parameters: two LocalDate objects specifying the range, and the set to restrict the search to. The date does not refer to the original submission date of the article, but rather the last time the OAI record for that article was updated. Thus, a request to retrieve all the records between two dates will not necessarily grab all the articles submitted between those dates. The list of sets that the arXiv repository supports can be found here. See arXiv's pages about the OAI for more information.
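For orientation, the three optional parameters correspond to the "from", "until", and "set" arguments of an OAI-PMH ListRecords request. The sketch below shows how such a query URL could be assembled. It is not the library's code: ListRecordsQuery and build are hypothetical names, and the base URL is an assumption that you should check against arXiv's OAI pages.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of an OAI-PMH ListRecords query; not the library's real code. */
final class ListRecordsQuery {
    // Assumed base URL of the arXiv OAI endpoint; verify it before relying on it.
    static final String BASE_URL = "http://export.arxiv.org/oai2";

    /** Builds the query URL; any of the three arguments may be null to leave it out. */
    static String build(LocalDate from, LocalDate until, String set) {
        List<String> params = new ArrayList<>();
        params.add("verb=ListRecords");
        params.add("metadataPrefix=arXivRaw");
        if (from != null) {
            params.add("from=" + from); // LocalDate prints as yyyy-MM-dd
        }
        if (until != null) {
            params.add("until=" + until);
        }
        if (set != null) {
            params.add("set=" + URLEncoder.encode(set, StandardCharsets.UTF_8));
        }
        return BASE_URL + "?" + String.join("&", params);
    }
}
```

Calling build(LocalDate.of(2015, 6, 29), null, "physics:hep-ph") mirrors the request in the example above: a lower date bound, no upper bound, and one set.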

The resulting ListRecordsResponse objects will contain a list of records, if they exist, as ArticleMetadata objects. If no records were found in that range and set, then response.getRecords() will return an empty list. If there are any issues sending the request, receiving the response, or parsing the response, the harvester will throw a runtime exception or error - see the javadoc for ArxivOAIHarvester for a full list.
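If you want all pages in one list, the recommended loop can be wrapped in a helper. The sketch below shows the shape of that loop using stand-in types, so that it runs without the library or a network connection. Page, PagedHarvest, and collectAll are hypothetical names, records are plain strings, and a null token plays the role of ListRecordsRequest.NONE.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Stand-in for one page of results; resumptionToken is null on the last page. */
final class Page {
    final List<String> records;
    final String resumptionToken;

    Page(List<String> records, String resumptionToken) {
        this.records = records;
        this.resumptionToken = resumptionToken;
    }
}

/** Hypothetical helper (not part of the library) that follows resumption tokens to the end. */
final class PagedHarvest {
    /**
     * Collects the records of every page.
     *
     * @param fetch receives null for the initial request, or the previous page's token
     */
    static List<String> collectAll(Function<String, Page> fetch) {
        List<String> all = new ArrayList<>();
        String token = null;
        do {
            Page page = fetch.apply(token);
            all.addAll(page.records);
            token = page.resumptionToken; // null means there are no further pages
        } while (token != null);
        return all;
    }
}
```

Since a real result set can span many pages of about 1000 records each, processing each page as it arrives, as in the recommended loop, keeps memory use lower than collecting everything first.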
