
Script/endpoint to aggregate coverage of sources across sources #9

Open
mekarpeles opened this issue Dec 9, 2015 · 24 comments

@mekarpeles (Collaborator)

BASE, openarchives, and others each publish a listing of their "sources". I plan to write a script that aggregates all of these into a single list.
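A minimal sketch of what such an aggregation script could look like, assuming each provider's source listing has already been exported to a local CSV with "name" and "url" columns (the file names and column names below are hypothetical):

```python
import csv
from urllib.parse import urlparse

# Hypothetical per-provider exports; real files and column names will differ.
EXPORTS = ["base_sources.csv", "openarchives_sources.csv"]

def normalise(url):
    """Reduce a URL to its hostname so the same repository listed with
    http vs https or with a trailing path is only counted once."""
    host = urlparse(url.strip()).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def aggregate(paths):
    sources = {}
    for path in paths:
        with open(path, newline="", encoding="utf-8") as fh:
            for row in csv.DictReader(fh):
                key = normalise(row["url"])
                sources.setdefault(key, row["name"])  # keep first name seen per host
    return sources

if __name__ == "__main__":
    merged = aggregate(EXPORTS)
    print(f"{len(merged)} unique sources")
```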

mekarpeles self-assigned this Dec 9, 2015
@Jurnie commented Jan 30, 2016

Save yourself some work. Took me a week to gather and clean them all, back in August :)

@mekarpeles (Collaborator, Author)

@Jurnie how many sources does this include? Is there a list we can enumerate?

BASE has around 85,000,000 documents and ~4,000 sources (we're trying to find the ones which are missing). You can browse the list here: http://www.base-search.net/about/en/about_sources_date_dn.php?menu=2

cc: @pietsch

@Jurnie commented Jan 30, 2016

Everything that isn't a journal. Just look at the URL path.

@mekarpeles (Collaborator, Author)

@Jurnie, sorry, which URL path? I am trying to see a list of which sources (institutions) JURN covers, the total number of sources included, and how many documents are available.

Is this the list of sources: http://www.jurn.org/jurn-listoftitles.pdf, or this: http://www.jurn.org/directory/?

Thanks for your help.

@Jurnie commented Jan 30, 2016

Ah, instructions required :) Ok, forget about JURN - I'm not giving you that. I'm giving you the GRAFT list. Go to the GRAFT link URL, the one that runs the search. MouseOver it. See that URL path, pointing to the HTML source that Google is using to power the on-the-fly CSE? Copy and load it. Right-click, 'View page source'.

@Jurnie commented Jan 30, 2016

Here's a group test of GRAFT, running it against the other public repository search tools, albeit on a very hard search, so there aren't many results for any of them.

@mekarpeles (Collaborator, Author)

@Jurnie commented Jan 30, 2016

Nearly. If this were a simple little test as to whether I should join your project or not, you wouldn't be doing very well at this point :) http://www.jurn.org/graft/index4.html and right-click, View source. All known repository URLs A-Z (bar archive.org and a couple of other mega-positories that would clutter results), up-to-date and thoroughly cleaned. Enjoy.
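For reference, a rough sketch of pulling the repository URLs out of that page source, assuming the links appear as plain http(s) URLs in the HTML (the regex-based extraction is a guess, not how GRAFT itself is built):

```python
import re
import urllib.request

# The GRAFT page Jurnie points to; treat the extraction pattern as an assumption,
# since the exact markup of the page may differ.
GRAFT_URL = "http://www.jurn.org/graft/index4.html"

def repository_urls(page_url=GRAFT_URL):
    with urllib.request.urlopen(page_url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    # Grab anything that looks like an http(s) URL in the page source.
    urls = set(re.findall(r"https?://[^\s\"'<>]+", html))
    # Drop links back to jurn.org itself; we only want the repositories.
    return sorted(u for u in urls if "jurn.org" not in u)

if __name__ == "__main__":
    repos = repository_urls()
    print(f"{len(repos)} candidate repository URLs")
```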

@mekarpeles (Collaborator, Author)

@Jurnie thanks, that worked. This is a great list; thanks for your efforts. It looks like there are just over 4,000 sources here. Would it be helpful for us to check this against BASE to see if there are any missing sources for you to add?

@cleegiles (Collaborator)

Of the 85M, most seem to be metadata records rather than full-text documents.

Does anyone know how many full-text documents there are?

@wetneb (Collaborator) commented Jan 30, 2016

@cleegiles Detecting which records correspond to full texts is a very interesting challenge (beyond the existing classification based solely on the metadata itself). Could the CiteSeerX crawler do that? Basically using the URLs stored in BASE as a seed list, I guess?

@cleegiles (Collaborator)

It's probably not that hard. The crawler just looks for PDF files and, hopefully, associated metadata, only on those sites and nowhere else. Some sites prohibit crawling, however, with their robots.txt.
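Not the CiteSeerX crawler itself, but a minimal sketch of the robots.txt check such a crawl needs, using Python's standard urllib.robotparser; the user agent and seed URL below are placeholders:

```python
from urllib import robotparser
from urllib.parse import urlparse

# Placeholder values; a real crawl would use its own user agent and seed list.
USER_AGENT = "openjournal-crawler"
SEED = "https://repository.example.edu/handle/123/456"

def allowed_to_fetch(url, user_agent=USER_AGENT):
    """Return True if the site's robots.txt permits fetching this URL."""
    parts = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()
    except OSError:
        return True  # no readable robots.txt; adjust this policy as needed
    return rp.can_fetch(user_agent, url)

if __name__ == "__main__":
    print(allowed_to_fetch(SEED))
```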

@cleegiles (Collaborator)

If we can get a list of the URLs, we and AI2's Semantic Scholar will crawl for PDFs.

How do we go about getting it?

Best,
Lee

@wetneb (Collaborator) commented Jan 31, 2016

That would be awesome! @cleegiles, I recommend getting in touch officially with BASE via their contact form to request their data. @pietsch, what do you think? I am happy to help with generating the list, with your permission of course.

@cleegiles, I have no idea how your pipeline works, but ideally it would be good if you could keep track of the relation between BASE's metadata and each PDF you download. The reason is that in my experience, BASE's metadata is cleaner than what you can extract from the PDF using heuristics.

BASE already includes CiteSeerX metadata, so of course we need to filter out these records first.
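A small sketch of that bookkeeping, assuming each BASE record has already been parsed into a dict with hypothetical id, url, and source fields; the CiteSeerX filter is just a name match on the source field:

```python
import json

def plan_downloads(records, index_path="pdf_index.jsonl"):
    """Write one JSON line per record to crawl, so every downloaded PDF
    stays linked to its BASE metadata record; skip records whose source
    is CiteSeerX to avoid re-importing CiteSeerX's own data."""
    kept = 0
    with open(index_path, "w", encoding="utf-8") as out:
        for rec in records:
            if "citeseerx" in rec.get("source", "").lower():
                continue
            out.write(json.dumps({"base_id": rec["id"], "url": rec["url"]}) + "\n")
            kept += 1
    return kept

# Example with two made-up records:
records = [
    {"id": "rec-1", "url": "http://example.org/a.pdf", "source": "CiteSeerX"},
    {"id": "rec-2", "url": "http://example.org/b.pdf", "source": "Some University ePrints"},
]
print(plan_downloads(records))  # prints 1
```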

@pietsch (Collaborator) commented Jan 31, 2016

Hi @cleegiles, if all you need is a list of the URLs in BASE, there is no need to use the contact form. As you can see in ipfs-inactive/archives#3, BASE has already released a data dump containing all URLs via IPFS. Unfortunately, all IPFS copies of this dump were destroyed. BASE is preparing a fresh, larger dump right now. It will be available in about a week. You can either wait for it, or I can prepare a list containing URLs only.

@cleegiles (Collaborator)

Does each URL point to a unique document?
Does it point directly to a PDF?

How many URLs are there? 85M?

@pietsch (Collaborator) commented Jan 31, 2016

@cleegiles Each of these URLs belongs to one document. There may be duplicates if authors uploaded a document to several repositories. Many URLs point to an HTML landing page, others point to a PDF document, a few point to other document types. Based on a fresh dump, it should be 87M URLs. /cc @wetneb @davidar
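Since many of the URLs resolve to landing pages rather than PDFs, a crawler would probably want a cheap content-type probe before committing to a download. A sketch with the standard library (a heuristic only; some servers mishandle HEAD requests or mislabel PDFs):

```python
import urllib.request

def looks_like_pdf(url, timeout=10):
    """HEAD the URL and check the Content-Type header.
    Treat this as a cheap first-pass filter, not ground truth."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            ctype = resp.headers.get("Content-Type", "")
    except OSError:
        return False
    return "application/pdf" in ctype.lower()
```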

@cleegiles (Collaborator)

Size will be about 100G or less?

I assume it will be compressed?

@mekarpeles (Collaborator, Author)

I think I recall it being <275GB?

@cleegiles (Collaborator)

Uncompressed?

If necessary, we can put it on our Amazon storage.

@mekarpeles (Collaborator, Author)

I believe it's compressed. @pietsch?

@pietsch (Collaborator) commented Jan 31, 2016

@cleegiles @mekarpeles As you can see in ipfs-inactive/archives#3, the previous dump was 23 GB (gzipped XML, 79M records). So the new dump will still be smaller than 30 GB (compressed). Of course, if you just need a list of URLs, then the file will be tiny in comparison.
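Once the new dump is out, a URL-only list can be extracted from the gzipped XML in a stream, without unpacking it to disk. A sketch assuming the records expose their URLs in Dublin Core dc:identifier elements (the actual element names in the BASE dump may differ):

```python
import gzip
import xml.etree.ElementTree as ET

DC_IDENTIFIER = "{http://purl.org/dc/elements/1.1/}identifier"

def urls_from_dump(path):
    """Stream a gzipped OAI/DC XML dump and yield anything that looks
    like a URL from dc:identifier elements, without loading the whole
    file into memory."""
    with gzip.open(path, "rb") as fh:
        for _event, elem in ET.iterparse(fh):
            if elem.tag == DC_IDENTIFIER and (elem.text or "").startswith("http"):
                yield elem.text.strip()
            elem.clear()  # free memory as we go

# Usage (path is hypothetical):
# with open("base_urls.txt", "w") as out:
#     for url in urls_from_dump("base_dump.xml.gz"):
#         out.write(url + "\n")
```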

@mekarpeles (Collaborator, Author)

@cleegiles I re-sent an invitation for you to join the Archive Labs Slack -- several of us chat there about OpenJournal in the #scholar channel.

@cleegiles (Collaborator)

Do we just ftp or rsync to download?
