BASE #3
The first trial delivery of a data dump has arrived in this directory: http://gateway.ipfs.io/ipfs/QmVUMQttFqwFKqu33AZL6gSkv89RFcPBSnT9kxrCDUNisz

The deal is that if this community succeeds in extracting all the full-text links (mainly PDF, but also PostScript, DjVu and perhaps other file formats), the remaining records will be released. The trouble is that in the current metadata, you will often find pointers to an HTML landing page instead of the full text. The task is then to identify the correct full text, discarding links to unrelated (e.g. policy or license) files. I am willing to help in both teams.

License: CC-BY-NC 4.0
@pietsch this sounds great! works for us :) if anyone else wants to assist in this effort, it would be very useful for everyone involved.
@pietsch I've put together a rough scraper to pull PDF links out of landing pages referenced by the BASE metadata: https://morph.io/davidar/base-data
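The scraper itself lives at that morph.io link; purely for illustration, the core idea is roughly the following. This is a sketch assuming the `requests` and `beautifulsoup4` packages and a placeholder URL, not the actual scraper code:

```python
# Minimal sketch: find candidate PDF links on a repository landing page.
# The URL below is a placeholder, not one taken from the BASE dump.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def pdf_links(landing_url):
    html = requests.get(landing_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for a in soup.find_all("a", href=True):
        href = urljoin(landing_url, a["href"])
        # crude heuristic: the link URL or link text mentions "pdf"
        if "pdf" in href.lower() or "pdf" in a.get_text().lower():
            links.add(href)
    return sorted(links)

if __name__ == "__main__":
    print(pdf_links("https://example-repository.org/record/123"))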
That is a very promising starting point, @davidar! Do let me know where I can help.
@pietsch It would be great if you could do a quick sanity check of the results so far. I know there are some false positives, but we should be able to filter those out later. Also, if you have any ideas about better ways to identify full-text links, that would be great --- I initially tried to use the
@pietsch There's about 3.6k records there now. I'll need to contact @openaustralia (cc @mlandauer @henare) about raising the time limit so we can process the whole thing, once you're happy to go ahead :)
Hi @davidar, let me warn you that there is bad weather in Bielefeld, Germany. Literally. Almost always. So don't be surprised about a dose of negativity:

a) Frankly, I do not quite see what you need morph.io for.

b) The PDF identification strategy will have to become smarter. For instance, the PDF download links in our institutional repository do not contain the string "pdf" at all, for some reason I forget. What should work is doing a HEAD request on all links and evaluating the response MIME type. If that turns out to be unreliable, file type sniffing as in
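A minimal sketch of that strategy, assuming the Python `requests` library: check the Content-Type from a HEAD request first, and fall back to sniffing the magic bytes (PDF files start with `%PDF`) when the header is missing or generic. Names here are illustrative only:

```python
# Sketch of the MIME-type strategy described above: HEAD request first,
# magic-byte sniffing as a fallback. Assumes the `requests` package.
import requests

PDF_TYPES = {"application/pdf", "application/x-pdf"}

def looks_like_pdf(url):
    try:
        head = requests.head(url, allow_redirects=True, timeout=30)
        ctype = head.headers.get("Content-Type", "").split(";")[0].strip().lower()
        if ctype in PDF_TYPES:
            return True
        if ctype and ctype != "application/octet-stream":
            return False  # e.g. text/html landing pages
        # Fallback: fetch just the start of the body and sniff the magic bytes.
        with requests.get(url, stream=True, timeout=30) as resp:
            first = next(resp.iter_content(chunk_size=8), b"")
            return first.startswith(b"%PDF")
    except requests.RequestException:
        return False
```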
@pietsch Not to rub salt in the wound, but it's a lovely day here today ;)
That's true, just an old habit from scraperwiki I guess. It probably would make more sense to run it on one of the storage nodes.
I was afraid you'd say that ;). I'm starting to think it would make sense to build an IPFS crawler that can pull all these pages and their links into IPFS first, which we can then use to experiment with different ways of identifying fulltext links (cf ipfs/infra#92).
The problem with that is it won't work for archives that don't allow bots to download PDFs (and instead redirect either to an error page or back to the original landing page). But I agree it would be helpful to pick up files the simple heuristic misses.
Yeah, I noticed that (wtf?). With any luck those types of links will follow a pattern, so we can filter them out afterwards.
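The IPFS-crawler idea mentioned above could start very small, e.g. by shelling out to a local daemon. A rough sketch, assuming `requests`, `beautifulsoup4` and a running `ipfs` binary; robots.txt handling, rate limiting and retries are omitted:

```python
# Rough sketch of "mirror landing pages plus their direct links into IPFS":
# fetch a page, fetch everything it links to, and add each response to a
# local IPFS node via the CLI.
import subprocess
import tempfile
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def ipfs_add(data):
    """Write bytes to a temp file and add it to IPFS, returning the hash."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(data)
        tmp.flush()
        out = subprocess.run(["ipfs", "add", "-q", tmp.name],
                             capture_output=True, check=True)
    return out.stdout.decode().strip()

def mirror(landing_url):
    """Return a mapping of URL -> IPFS hash for a page and its direct links."""
    page = requests.get(landing_url, timeout=30)
    hashes = {landing_url: ipfs_add(page.content)}
    soup = BeautifulSoup(page.text, "html.parser")
    for a in soup.find_all("a", href=True):
        target = urljoin(landing_url, a["href"])
        try:
            resp = requests.get(target, timeout=30)
            hashes[target] = ipfs_add(resp.content)
        except requests.RequestException:
            pass  # skip dead or blocked links in this sketch
    return hashes
```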
@davidar that's cool, you linked to a webcam. so your "today" claim now means "forever" :D -- beautiful webcam, btw. i miss how cool webcams were in 1995.

@davidar maybe more people can help clean things up if you leverage us? like maybe post here or in another issue what you're running up against, what sort of data, etc. lower barrier for us to take a look and make suggestions / scripts?
Every "today" is a beautiful day ;)
@jbenet shows his age... :p
Yes, more help would be fantastic; it's been a number of years since I did much in the way of web scraping, so I'm a little rusty :). I've extracted relevant URLs from the BASE metadata, along with a very basic script for finding PDF links, here: https://github.com/davidar/base-data

@jbenet I think what we need to do now is:
CC: @ikreymer my new web archiving expert :)
@pietsch Sorry that this isn't moving very quickly; I've been caught up with TeX.js recently (which will soon be applied to the arXiv corpus). I hope this deal doesn't have a deadline attached to it? :) I've started mirroring landing pages and their direct links, so we should have some concrete data to work with soon.
@davidar No worries, you have not missed any deadlines here. I do not think we have any. Btw: TeX.js looks great!
@davidar somehow i missed you made TeX.js. it's awesome, great work! :)
*sigh* Apparently I keep hitting ubuntu/wget#1002870, so either I need to wait until that gets fixed, or roll my own crawler...
Amazed to find such a fat bug in Ubuntu Trusty's version of wget. Do not despair! Building wget from sources is not too much pain. (Or switch to sweet Debian Jessie.) I did the compiling thing a few years ago because I needed a version with WARC support. WARC archives store HTTP headers and time-stamps in addition to the usual payload. You might want to use them for archiving in IPFS.
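A WARC-enabled wget is one route; purely as an illustration of the same idea from Python, here is a sketch assuming the `warcio` library (my assumption, not something discussed in the thread). It writes a response record that keeps the HTTP headers alongside the payload:

```python
# Illustrative only: write a WARC response record (HTTP headers + payload)
# for a fetched URL, assuming the `warcio` package.
import requests
from warcio.statusandheaders import StatusAndHeaders
from warcio.warcwriter import WARCWriter

def archive(url, warc_path="landing-pages.warc.gz"):
    with open(warc_path, "ab") as output:
        writer = WARCWriter(output, gzip=True)
        resp = requests.get(url, stream=True,
                            headers={"Accept-Encoding": "identity"})
        # Status line simplified for the sketch; a real crawler would build it
        # from resp.status_code and resp.reason.
        http_headers = StatusAndHeaders("200 OK", resp.raw.headers.items(),
                                        protocol="HTTP/1.1")
        record = writer.create_warc_record(url, "response",
                                           payload=resp.raw,
                                           http_headers=http_headers)
        writer.write_record(record)
```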
@pietsch @whyrusleeping If it were up to me, the storage nodes would be running plain Debian ;) /me passes the buck to @lgierth

I guess compiling from source is an option
oh wow, of course! There is way too much awesome stuff going on here... ;)
Hrm, the most recent version of
Running the crawl again, hopefully with better tolerance of crashes this time.
The weekly BASE meeting will discuss a full OAI-DC data release tomorrow.
The file size is 23 GB. Unless you stop me, I will add this TAR file in one piece while you sleep.
This is what I did:

```
$ ipfs add -p -r for_ipfs/
added QmUJvY7e4mBSEBqfvRbz4YhTBf9z2kx4Adz6ke9UqMF9G1 for_ipfs/README.markdown
added QmcH3PYRNt5dKbC7YcYjq2MMt9G7xeoSiaBXmJwxVGgtt6 for_ipfs/oai_dc-dump-2015-07-06.tar
added QmctbbiEcEapEcY2hGZ4puchfu2chtziMcuZHQHVM7zuds for_ipfs
```

I am sorry the dump dates from July, so it contains fewer than 80M records – more like 77M.
@pietsch this is fantastic. we'll help replicate it.
we have 2nd class tar support right now (will be even tighter integration later). sorry to make you add again, but if you add things with
afaik, cc @whyrusleeping take a look at the dedup graphviz graph -- it seems to not be deduping as much as i thought it would?
@davidar I just deleted the two IPFS nodes I was running here in Bielefeld, making a fresh start.
Ouch,
This happened with yesterday's ipfs 0.3.10-dev on a Xen VM with 8 GB RAM that had no problem adding this file when I did
@pietsch yeah, the perf issues are a real pita. Paging Dr Sleeping, Dr @whyrusleeping
I could not get
@pietsch cool, pinning now :)
Have you managed to get the data? When I first added the entire directory,
@pietsch Yes, I've mirrored it to one of our storage nodes :)
So, in OpenJournal/central#8 I mentioned using existing scrapers (zotero/translators) to retrieve metadata from HTML pages. These scrapers often return a link to the full text (without downloading it). I'm not sure how this would work for URLs extracted from BASE. I suspect BASE covers a lot of small repositories that are not supported by Zotero. Many of them are probably instances of generic repository software such as DSpace, EPrints, Fedora or OJS, but Zotero selects scrapers based on regexes run on the URL, so they are not likely to trigger on many different domains. I will do a few experiments and report the results here.
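To illustrate the limitation: because translator-style scrapers trigger on URL regexes, generic repository platforms hosted on arbitrary domains would have to be recognised some other way, e.g. from page markup. A sketch of that fallback idea; every pattern below is an illustrative guess, not anything taken from Zotero:

```python
# Sketch: guess the repository software (DSpace, EPrints, OJS, ...) from the
# landing-page HTML instead of the URL, then apply platform-specific rules.
# The markers are hypothetical examples, not verified generator strings.
import re

PLATFORM_HINTS = {
    "dspace":  re.compile(r'content="DSpace', re.I),
    "eprints": re.compile(r'content="EPrints', re.I),
    "ojs":     re.compile(r'content="Open Journal Systems', re.I),
}

def detect_platform(html):
    """Return the first platform whose marker appears in the page, else None."""
    for name, pattern in PLATFORM_HINTS.items():
        if pattern.search(html):
            return name
    return None
```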
@pietsch curious what failed in
@jbenet Have you noticed that I pasted the error message and the first third of the stack dump above?
@pietsch ahhh thanks, hadn't noticed.
@wetneb cool, that's good to know. Well, I've officially exhausted the remaining disk space on one of the storage nodes by crawling this stuff, so time to start analysing it I guess :)
Finally, a fresh BASE dump in OAI-DC metadata format is available. It is a directory of 88,104 xz-compressed XML files, each containing 1,000 records: QmbdLBA51HsQ9PpcED1epXAxLfHgrd2PDZ3ktmjhFTjg94
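For anyone analysing the dump, a minimal sketch of how one of those files might be read, assuming each `.xz` file decompresses to OAI-DC XML and that the landing-page or full-text links live in `dc:identifier` elements (check the real field names and namespaces against the actual files):

```python
# Sketch: pull URLs out of one xz-compressed OAI-DC file from the dump.
import lzma
import sys
import xml.etree.ElementTree as ET

# Dublin Core elements namespace; the dump's record structure is assumed here.
DC_IDENTIFIER = "{http://purl.org/dc/elements/1.1/}identifier"

def identifiers(path):
    with lzma.open(path) as f:
        tree = ET.parse(f)
    for elem in tree.iter(DC_IDENTIFIER):
        text = (elem.text or "").strip()
        if text.startswith("http"):
            yield text

if __name__ == "__main__":
    for url in identifiers(sys.argv[1]):
        print(url)
```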
@pietsch Awesome! Mirroring now.
@pietsch Is your ipfs node still online? I've partially mirrored the dump, but it seems to be stuck partway through.
@davidar The ipfs daemon is still running. Here are some messages it printed multiple times:
@pietsch could you try restarting the daemon?
@davidar Done.
@pietsch thanks, seems to be moving again :)
@davidar Can you confirm that all files have arrived on your side? It should look like this:
https://base-search.net