Script/endpoint to aggregate coverage of sources across sources #9

BASE, openarchives, and others have a listing of their "sources". I plan to write a script which aggregates all of these into a single list.
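A rough sketch of what such an aggregator could look like, assuming each service exposes its source list as an HTML page of links (the BASE listing URL is the one mentioned later in this thread; the second entry and all parsing details are assumptions, and real pages will need per-site rules):

```python
# Sketch of a source-list aggregator. Assumes each listing is an HTML
# page whose <a href> targets identify the sources.
import requests
from bs4 import BeautifulSoup

LISTING_PAGES = [
    "http://www.base-search.net/about/en/about_sources_date_dn.php?menu=2",
    # "https://example.org/other-aggregator/sources",  # hypothetical
]

def extract_links(url):
    """Yield absolute link targets found on one listing page."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if href.startswith("http"):  # skip relative navigation links
            yield href

def aggregate(pages):
    """Union the link sets of all listing pages into one sorted list."""
    sources = set()
    for page in pages:
        sources.update(extract_links(page))
    return sorted(sources)

if __name__ == "__main__":
    for src in aggregate(LISTING_PAGES):
        print(src)
```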
Save yourself some work. Took me a week to gather and clean them all, back in August :)
@Jurnie how many sources does this include? Is there a list we can enumerate? BASE has around 85,000,000 documents and ~4,000 sources (we're trying to find the ones which are missing). You can browse the list here: http://www.base-search.net/about/en/about_sources_date_dn.php?menu=2 cc: @pietsch
Everything that isn't a journal. Just look at the URL path.
@Jurnie, sorry, which URL path? I am trying to see a list of which sources (institutions) JURN has covered, how many sources are included in total, and how many documents are available. Is this the list of sources: http://www.jurn.org/jurn-listoftitles.pdf, or this: http://www.jurn.org/directory/? Thanks for your help.
Ah, instructions required :) Ok, forget about JURN - I'm not giving you that. I'm giving you the GRAFT list. Go to the GRAFT link URL, the one that runs the search. MouseOver it. See that URL path, pointing to the HTML source that Google is using to power the on-the-fly CSE? Copy and load it. Right-click, 'View page source'.
Here's a group test of GRAFT, running it against the other public repository search tools. Albeit on a very hard search, so not many results for any of them.
Is this the link you're talking about? |
Nearly. If this were a simple little test, re: if I should join your project or not, you wouldn't be doing very well at this point :) http://www.jurn.org/graft/index4.html and right-click, View source. All known repository URLs A-Z (bar archive.org and a couple of other mega-positories that would clutter results), up-to-date and thoroughly cleaned. Enjoy.
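Pulling the repository URLs out of that page source could be automated along these lines (a sketch; it assumes the entries appear as plain http(s) URLs in the HTML, and the loose regex may need trimming):

```python
# Sketch: extract every URL from the GRAFT page source and deduplicate.
import re
import requests

GRAFT_PAGE = "http://www.jurn.org/graft/index4.html"  # from this thread

html = requests.get(GRAFT_PAGE, timeout=30).text
urls = sorted(set(re.findall(r"https?://[^\s\"'<>]+", html)))

with open("graft_sources.txt", "w") as out:
    out.write("\n".join(urls) + "\n")
print(f"{len(urls)} unique URLs written to graft_sources.txt")
```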
@Jurnie thanks, that worked. This is a great list, thanks for your efforts. It looks like there are just over 4,000 sources here. Would it be helpful for us to check this against BASE to see if there are any missing sources for you to add?
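One way to run that check, assuming both lists have been saved as plain-text files of one URL per line (the filenames are hypothetical), is to compare normalised hostnames rather than full URLs:

```python
# Sketch: find hosts present in one list but absent from the other.
from urllib.parse import urlparse

def normalise(netloc):
    """Lower-case the host and drop a leading 'www.'."""
    host = netloc.lower()
    return host[4:] if host.startswith("www.") else host

def hosts(path):
    """Read one URL per line and return the set of normalised hosts."""
    with open(path) as f:
        return {normalise(urlparse(line.strip()).netloc)
                for line in f if line.strip()}

graft = hosts("graft_sources.txt")  # from the GRAFT page source
base = hosts("base_sources.txt")    # hypothetical export of BASE's source list

print("In BASE but not GRAFT:", len(base - graft))
print("In GRAFT but not BASE:", len(graft - base))
```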
Of the 85M, most seem to be metadata records rather than full-text documents. Does anyone know how many full-text documents there are?
@cleegiles Detecting which records correspond to full texts is a very interesting challenge (beyond the existing classification based solely on the metadata itself). Could the CiteSeerX crawler do that? Basically using the URLs stored in BASE as a seed list, I guess?
It's probably not that hard. The crawler just looks for PDF files and […]
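A crude first pass at that kind of check (a sketch, not CiteSeerX's actual crawler logic): issue an HTTP HEAD request per URL and inspect the Content-Type. application/pdf very likely means full text; text/html usually means a landing page that would need a second hop to reach the PDF.

```python
# Sketch: classify a URL as full text (PDF) or landing page by its
# Content-Type header. This only does the first-order check; a real
# crawler would also follow landing pages to find the PDF link.
import requests

def looks_like_pdf(url):
    try:
        resp = requests.head(url, allow_redirects=True, timeout=15)
        return "application/pdf" in resp.headers.get("Content-Type", "")
    except requests.RequestException:
        return False

print(looks_like_pdf("http://example.org/paper.pdf"))  # hypothetical URL
```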
If we can get a list of the URLs, we and AI2's Semantic Scholar will […]. How do we go about getting it? Best, Lee
That would be awesome! @cleegiles, I recommend getting in touch officially with BASE via their contact form to request their data. @pietsch, what do you think? I am happy to help with generating the list, with your permission of course. @cleegiles, I have no idea how your pipeline works, but ideally it would be good if you could keep track of the relation between BASE's metadata and each PDF you download. The reason is that in my experience, BASE's metadata is cleaner than what you can extract from the PDF using heuristics. BASE already includes CiteSeerX metadata, so of course we need to filter out these records first.
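For that bookkeeping, one option is a manifest that records the BASE identifier alongside each downloaded file, skipping CiteSeerX-sourced records up front. A sketch only: the record schema ("id", "url", "collection" keys) and the CiteSeerX collection identifier are assumptions to verify against the real dump.

```python
# Sketch: download PDFs while keeping a BASE-id -> local-file manifest,
# so each file stays linked to its (cleaner) BASE metadata record.
import csv
import os

import requests

CITESEERX_COLLECTION = "ftciteseerx"  # assumed identifier; verify against the dump

def fetch(records, manifest_path="manifest.csv"):
    os.makedirs("pdfs", exist_ok=True)
    with open(manifest_path, "w", newline="") as mf:
        writer = csv.writer(mf)
        writer.writerow(["base_id", "url", "local_file"])
        for rec in records:
            if rec.get("collection") == CITESEERX_COLLECTION:
                continue  # record came from CiteSeerX; filter it out
            resp = requests.get(rec["url"], timeout=60)
            if "application/pdf" not in resp.headers.get("Content-Type", ""):
                continue  # landing page or other type; skip in this sketch
            safe_id = rec["id"].replace("/", "_").replace(":", "_")
            local = os.path.join("pdfs", f"{safe_id}.pdf")
            with open(local, "wb") as out:
                out.write(resp.content)
            writer.writerow([rec["id"], rec["url"], local])
```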
Hi @cleegiles, if all you need is a list of the URLs in BASE, there is no need to use the contact form. As you can see in ipfs-inactive/archives#3, BASE has already released a data dump containing all URLs via IPFS. Unfortunately, all IPFS copies of this dump were destroyed. BASE is preparing a fresh, larger dump right now. It will be available in about a week. You can either wait for it, or I can prepare a list containing URLs only.
Does each URL point to a unique document? How many URLs are there? 85M?
@cleegiles Each of these URLs belongs to one document. There may be duplicates if authors uploaded a document to several repositories. Many URLs point to an HTML landing page, others point to a PDF document, a few point to other document types. Based on a fresh dump, it should be 87M URLs. /cc @wetneb @davidar
Will the size be about 100 GB or less? I assume it will be compressed?
I think I recall it being <275 GB?
Uncompressed? If necessary, we can put it on our Amazon storage.
I believe it's compressed. @pietsch?
@cleegiles @mekarpeles As you can see in ipfs-inactive/archives#3, the previous dump was 23 GB (gzipped XML, 79M records). So the new dump will still be smaller than 30 GB (compressed). Of course, if you just need a list of URLs, then the file will be tiny in comparison.
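Once a dump like that is in hand, extracting just the URLs should be a single streaming pass, never holding the 23+ GB in memory. A sketch, assuming OAI-style Dublin Core records where URLs live in dc:identifier elements (the filename and the exact schema are assumptions):

```python
# Sketch: stream a gzipped XML dump and print every http(s) URL found
# in dc:identifier elements. Adjust the tag for the dump's real schema.
import gzip
import xml.etree.ElementTree as ET

DC_IDENTIFIER = "{http://purl.org/dc/elements/1.1/}identifier"

with gzip.open("base_dump.xml.gz", "rb") as f:  # hypothetical filename
    for _, elem in ET.iterparse(f):
        if elem.tag == DC_IDENTIFIER and (elem.text or "").startswith("http"):
            print(elem.text.strip())
        elem.clear()  # free memory as we stream
```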
@cleegiles I re-sent you an invitation to join the Archive Labs Slack channel -- several of us chat there about OpenJournal in the #scholar channel.
Do we just FTP or rsync to download?
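Whichever transfer method BASE ends up offering, if the dump is served over plain HTTP(S) a chunked streaming download is enough (the URL is hypothetical; rsync or FTP would replace this entirely):

```python
# Sketch: stream a large dump to disk in chunks rather than loading it
# into memory. URL is hypothetical; use rsync/FTP if that's what's offered.
import requests

DUMP_URL = "https://example.org/base_dump.xml.gz"  # hypothetical

with requests.get(DUMP_URL, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with open("base_dump.xml.gz", "wb") as out:
        for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
            out.write(chunk)
```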