
Script/endpoint to aggregate coverage of sources across sources #9

Open
mekarpeles opened this issue Dec 9, 2015 · 24 comments

@mekarpeles (Collaborator)

BASE, openarchives, and others each publish a listing of their "sources". I plan to write a script that aggregates all of these into a single list.
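A minimal sketch of what such an aggregation script could look like, assuming each provider's source listing has already been exported to a local CSV with "name" and "url" columns (the file names and column names below are hypothetical):

```python
import csv
from urllib.parse import urlparse

# Hypothetical per-provider exports; real files and column names will differ.
EXPORTS = ["base_sources.csv", "openarchives_sources.csv"]

def normalise(url):
    """Reduce a URL to its hostname so the same repository listed with
    http vs https or with a trailing path is only counted once."""
    host = urlparse(url.strip()).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def aggregate(paths):
    sources = {}
    for path in paths:
        with open(path, newline="", encoding="utf-8") as fh:
            for row in csv.DictReader(fh):
                key = normalise(row["url"])
                sources.setdefault(key, row["name"])  # keep first name seen per host
    return sources

if __name__ == "__main__":
    merged = aggregate(EXPORTS)
    print(f"{len(merged)} unique sources")
```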

mekarpeles self-assigned this Dec 9, 2015
@Jurnie commented Jan 30, 2016

Save yourself some work. Took me a week to gather and clean them all, back in August :)

@mekarpeles (Collaborator, Author)

@Jurnie how many sources does this include? Is there a list we can enumerate?

BASE has around 85,000,000 documents and ~4,000 sources (we're trying to find the ones which are missing). You can browse the list here: http://www.base-search.net/about/en/about_sources_date_dn.php?menu=2

cc: @pietsch

@Jurnie commented Jan 30, 2016

Everything that isn't a journal. Just look at the URL path.

@mekarpeles (Collaborator, Author)

@Jurnie, sorry, which URL path? I am trying to see a list of which sources (institutions) JURN covers, the total number of sources included, and how many documents are available.

Is this the list of sources: http://www.jurn.org/jurn-listoftitles.pdf, or this: http://www.jurn.org/directory/?

Thanks for your help.

@Jurnie commented Jan 30, 2016

Ah, instructions required :) Ok, forget about JURN - I'm not giving you that. I'm giving you the GRAFT list. Go to the GRAFT link URL, the one that runs the search. MouseOver it. See that URL path, pointing to the HTML source that Google is using to power the on-the-fly CSE? Copy and load it. Right-click, 'View page source'.

@Jurnie commented Jan 30, 2016

Here's a group test of GRAFT, running it against the other public repository search tools, albeit on a very hard search, so there aren't many results for any of them.

@mekarpeles (Collaborator, Author)

@Jurnie commented Jan 30, 2016

Nearly. If this were a simple little test as to whether I should join your project or not, you wouldn't be doing very well at this point :) http://www.jurn.org/graft/index4.html and right-click, View source. All known repository URLs A-Z (bar archive.org and a couple of other mega-positories that would clutter results), up-to-date and thoroughly cleaned. Enjoy.
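For reference, a rough sketch of pulling the repository URLs out of that page source, assuming the links appear as plain http(s) URLs in the HTML (the regex-based extraction is a guess, not how GRAFT itself is built):

```python
import re
import urllib.request

# The GRAFT page Jurnie points to; treat the extraction pattern as an assumption,
# since the exact markup of the page may differ.
GRAFT_URL = "http://www.jurn.org/graft/index4.html"

def repository_urls(page_url=GRAFT_URL):
    with urllib.request.urlopen(page_url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    # Grab anything that looks like an http(s) URL in the page source.
    urls = set(re.findall(r"https?://[^\s\"'<>]+", html))
    # Drop links back to jurn.org itself; we only want the repositories.
    return sorted(u for u in urls if "jurn.org" not in u)

if __name__ == "__main__":
    repos = repository_urls()
    print(f"{len(repos)} candidate repository URLs")
```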

@mekarpeles (Collaborator, Author)

@Jurnie thanks, that worked. This is a great list; thanks for your efforts. It looks like there are just over 4,000 sources here. Would it be helpful for us to check this against BASE to see if there are any missing sources for you to add?

@cleegiles (Collaborator)

Of the 85M, most seem to be metadata records rather than full-text documents.

Does anyone know how many full-text documents there are?

@wetneb (Collaborator) commented Jan 30, 2016

@cleegiles Detecting which records correspond to full texts is a very interesting challenge (beyond the existing classification based solely on the metadata itself). Could the CiteSeerX crawler do that? Basically using the URLs stored in BASE as a seed list, I guess?

@cleegiles (Collaborator)

It's probably not that hard. The crawler just looks for PDF files and, hopefully, associated metadata, only on those sites and nowhere else. Some sites prohibit crawling, however, with their robots.txt.
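Not the CiteSeerX crawler itself, but a minimal sketch of the robots.txt check such a crawl needs, using Python's standard urllib.robotparser; the user agent and seed URL below are placeholders:

```python
from urllib import robotparser
from urllib.parse import urlparse

# Placeholder values; a real crawl would use its own user agent and seed list.
USER_AGENT = "openjournal-crawler"
SEED = "https://repository.example.edu/handle/123/456"

def allowed_to_fetch(url, user_agent=USER_AGENT):
    """Return True if the site's robots.txt permits fetching this URL."""
    parts = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()
    except OSError:
        return True  # no readable robots.txt; adjust this policy as needed
    return rp.can_fetch(user_agent, url)

if __name__ == "__main__":
    print(allowed_to_fetch(SEED))
```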

@cleegiles (Collaborator)

If we can get a list of the URLs, we and AI2's Semantic Scholar will crawl for PDFs.

How do we go about getting it?

Best,
Lee

@wetneb (Collaborator) commented Jan 31, 2016

That would be awesome! @cleegiles, I recommend getting in touch officially with BASE via their contact form to request their data. @pietsch, what do you think? I am happy to help with generating the list, with your permission of course.

@cleegiles, I have no idea how your pipeline works, but ideally it would be good if you could keep track of the relation between BASE's metadata and each PDF you download. The reason is that in my experience, BASE's metadata is cleaner than what you can extract from the PDF using heuristics.

BASE already includes CiteSeerX metadata, so of course we need to filter out these records first.
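A small sketch of that bookkeeping, assuming each BASE record has already been parsed into a dict with hypothetical id, url, and source fields; the CiteSeerX filter is just a name match on the source field:

```python
import json

def plan_downloads(records, index_path="pdf_index.jsonl"):
    """Write one JSON line per record to crawl, so every downloaded PDF
    stays linked to its BASE metadata record; skip records whose source
    is CiteSeerX to avoid re-importing CiteSeerX's own data."""
    kept = 0
    with open(index_path, "w", encoding="utf-8") as out:
        for rec in records:
            if "citeseerx" in rec.get("source", "").lower():
                continue
            out.write(json.dumps({"base_id": rec["id"], "url": rec["url"]}) + "\n")
            kept += 1
    return kept

# Example with two made-up records:
records = [
    {"id": "rec-1", "url": "http://example.org/a.pdf", "source": "CiteSeerX"},
    {"id": "rec-2", "url": "http://example.org/b.pdf", "source": "Some University ePrints"},
]
print(plan_downloads(records))  # prints 1
```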

@pietsch (Collaborator) commented Jan 31, 2016

Hi @cleegiles, if all you need is a list of the URLs in BASE, there is no need to use the contact form. As you can see in ipfs-inactive/archives#3, BASE has already released a data dump containing all URLs via IPFS. Unfortunately, all IPFS copies of this dump were destroyed. BASE is preparing a fresh, larger dump right now. It will be available in about a week. You can either wait for it, or I can prepare a list containing URLs only.

@cleegiles (Collaborator)

Does each URL point to a unique document?
Does it point directly to a PDF?

How many URLs are there? 85M?

@pietsch (Collaborator) commented Jan 31, 2016

@cleegiles Each of these URLs belongs to one document. There may be duplicates if authors uploaded a document to several repositories. Many URLs point to an HTML landing page, others point to a PDF document, a few point to other document types. Based on a fresh dump, it should be 87M URLs. /cc @wetneb @davidar
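Since many of the URLs resolve to landing pages rather than PDFs, a crawler would probably want a cheap content-type probe before committing to a download. A sketch with the standard library (a heuristic only; some servers mishandle HEAD requests or mislabel PDFs):

```python
import urllib.request

def looks_like_pdf(url, timeout=10):
    """HEAD the URL and check the Content-Type header.
    Treat this as a cheap first-pass filter, not ground truth."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            ctype = resp.headers.get("Content-Type", "")
    except OSError:
        return False
    return "application/pdf" in ctype.lower()
```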

@cleegiles (Collaborator)

Size will be about 100G or less?

I assume it will be compressed?

@mekarpeles (Collaborator, Author)

I think I recall it being <275GB?

@cleegiles (Collaborator)

Uncompressed?

If necessary, we can put it on our Amazon storage.

@mekarpeles (Collaborator, Author)

I believe it's compressed. @pietsch?

@pietsch (Collaborator) commented Jan 31, 2016

@cleegiles @mekarpeles As you can see in ipfs-inactive/archives#3, the previous dump was 23 GB (gzipped XML, 79M records). So the new dump will still be smaller than 30 GB (compressed). Of course, if you just need a list of URLs, then the file will be tiny in comparison.
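Once the new dump is out, a URL-only list can be extracted from the gzipped XML in a stream, without unpacking it to disk. A sketch assuming the records expose their URLs in Dublin Core dc:identifier elements (the actual element names in the BASE dump may differ):

```python
import gzip
import xml.etree.ElementTree as ET

DC_IDENTIFIER = "{http://purl.org/dc/elements/1.1/}identifier"

def urls_from_dump(path):
    """Stream a gzipped OAI/DC XML dump and yield anything that looks
    like a URL from dc:identifier elements, without loading the whole
    file into memory."""
    with gzip.open(path, "rb") as fh:
        for _event, elem in ET.iterparse(fh):
            if elem.tag == DC_IDENTIFIER and (elem.text or "").startswith("http"):
                yield elem.text.strip()
            elem.clear()  # free memory as we go

# Usage (path is hypothetical):
# with open("base_urls.txt", "w") as out:
#     for url in urls_from_dump("base_dump.xml.gz"):
#         out.write(url + "\n")
```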

@mekarpeles (Collaborator, Author)

@cleegiles I re-sent an invitation for you to join the Archive Labs Slack -- several of us chat there about OpenJournal in the #scholar channel.

@cleegiles (Collaborator)

Do we just ftp or rsync to download?
