BASE #3
The first trial delivery of a data dump has arrived in this directory: http://gateway.ipfs.io/ipfs/QmVUMQttFqwFKqu33AZL6gSkv89RFcPBSnT9kxrCDUNisz

The deal is that if this community succeeds in extracting all the full-text links (mainly PDF, but also PostScript, DjVu and perhaps other file formats), the remaining records will be released. The trouble is that in the current metadata, you will often find pointers to an HTML landing page instead of the full text. The task is then to identify the correct full text, discarding links to unrelated (e.g. policy or license) files. I am willing to help in both teams.

License: CC-BY-NC 4.0
@pietsch this sounds great! works for us :) if anyone else wants to assist in this effort, it would be very useful for everyone involved.
@pietsch I've put together a rough scraper to pull PDF links out of landing pages referenced by the BASE metadata: https://morph.io/davidar/base-data
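The scraper itself lives at that morph.io link; purely for illustration, the core idea is roughly the following. This is a sketch assuming the `requests` and `beautifulsoup4` packages and a placeholder URL, not the actual scraper code:

```python
# Minimal sketch: find candidate PDF links on a repository landing page.
# The URL below is a placeholder, not one taken from the BASE dump.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def pdf_links(landing_url):
    html = requests.get(landing_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for a in soup.find_all("a", href=True):
        href = urljoin(landing_url, a["href"])
        # crude heuristic: the link URL or link text mentions "pdf"
        if "pdf" in href.lower() or "pdf" in a.get_text().lower():
            links.add(href)
    return sorted(links)

if __name__ == "__main__":
    print(pdf_links("https://example-repository.org/record/123"))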
That is a very promising starting point, @davidar! Do let me know where I can help.
@pietsch It would be great if you could do a quick sanity check of the results so far. I know there are some false positives, but we should be able to filter those out later. Also, if you have any ideas about better ways to identify full-text links, that would be great --- I initially tried to use the
@pietsch There's about 3.6k records there now. I'll need to contact @openaustralia (cc @mlandauer @henare) about raising the time limit so we can process the whole thing, once you're happy to go ahead :)
Hi @davidar, let me warn you that there is bad weather in Bielefeld, Germany. Literally. Almost always. So don't be surprised about a dose of negativity:

a) Frankly, I do not quite see what you need morph.io for.

b) The PDF identification strategy will have to become smarter. For instance, the PDF download links in our institutional repository do not contain the string "pdf" at all, for some reason I forget. What should work is doing a HEAD request on all links and evaluating the response MIME type. If that turns out to be unreliable, file type sniffing as in
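A minimal sketch of that strategy, assuming the Python `requests` library: check the Content-Type from a HEAD request first, and fall back to sniffing the magic bytes (PDF files start with `%PDF`) when the header is missing or generic. Names here are illustrative only:

```python
# Sketch of the MIME-type strategy described above: HEAD request first,
# magic-byte sniffing as a fallback. Assumes the `requests` package.
import requests

PDF_TYPES = {"application/pdf", "application/x-pdf"}

def looks_like_pdf(url):
    try:
        head = requests.head(url, allow_redirects=True, timeout=30)
        ctype = head.headers.get("Content-Type", "").split(";")[0].strip().lower()
        if ctype in PDF_TYPES:
            return True
        if ctype and ctype != "application/octet-stream":
            return False  # e.g. text/html landing pages
        # Fallback: fetch just the start of the body and sniff the magic bytes.
        with requests.get(url, stream=True, timeout=30) as resp:
            first = next(resp.iter_content(chunk_size=8), b"")
            return first.startswith(b"%PDF")
    except requests.RequestException:
        return False
```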
@pietsch Not to rub salt in the wound, but it's a lovely day here today ;)
That's true, just an old habit from scraperwiki I guess. It probably would make more sense to run it on one of the storage nodes.
I was afraid you'd say that ;). I'm starting to think it would make sense to build an IPFS crawler that can pull all these pages and their links into IPFS first, which we can then use to experiment with different ways of identifying fulltext links (cf ipfs/infra#92).
The problem with that is it won't work for archives that don't allow bots to download PDFs (and instead redirect either to an error page or back to the original landing page). But I agree it would be helpful to pick up files the simple heuristic misses.
Yeah, I noticed that (wtf?). With any luck those types of links will follow a pattern, so we can filter them out afterwards.
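The IPFS-crawler idea mentioned above could start very small, e.g. by shelling out to a local daemon. A rough sketch, assuming `requests`, `beautifulsoup4` and a running `ipfs` binary; robots.txt handling, rate limiting and retries are omitted:

```python
# Rough sketch of "mirror landing pages plus their direct links into IPFS":
# fetch a page, fetch everything it links to, and add each response to a
# local IPFS node via the CLI.
import subprocess
import tempfile
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def ipfs_add(data):
    """Write bytes to a temp file and add it to IPFS, returning the hash."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(data)
        tmp.flush()
        out = subprocess.run(["ipfs", "add", "-q", tmp.name],
                             capture_output=True, check=True)
    return out.stdout.decode().strip()

def mirror(landing_url):
    """Return a mapping of URL -> IPFS hash for a page and its direct links."""
    page = requests.get(landing_url, timeout=30)
    hashes = {landing_url: ipfs_add(page.content)}
    soup = BeautifulSoup(page.text, "html.parser")
    for a in soup.find_all("a", href=True):
        target = urljoin(landing_url, a["href"])
        try:
            resp = requests.get(target, timeout=30)
            hashes[target] = ipfs_add(resp.content)
        except requests.RequestException:
            pass  # skip dead or blocked links in this sketch
    return hashes
```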
@davidar that's cool, you linked to a webcam. so your "today" claim now means "forever" :D -- beautiful webcam, btw. i miss how cool webcams were in 1995.

@davidar maybe more people can help clean things up if you leverage us? like maybe post here or in another issue what you're running up against, what sort of data, etc. lower barrier for us to take a look and make suggestions / scripts?
Every "today" is a beautiful day ;)
@jbenet shows his age... :p
Yes, more help would be fantastic; it's been a number of years since I did much in the way of web scraping, so I'm a little rusty :). I've extracted relevant URLs from the BASE metadata, along with a very basic script for finding PDF links, here: https://github.com/davidar/base-data

@jbenet I think what we need to do now is:
CC: @ikreymer my new web archiving expert :)
@pietsch Sorry that this isn't moving very quickly; I've been caught up with TeX.js recently (which will soon be applied to the arXiv corpus). I hope this deal doesn't have a deadline attached to it? :) I've started mirroring landing pages and their direct links, so we should have some concrete data to work with soon.
@davidar No worries, you have not missed any deadlines here. I do not think we have any. Btw: TeX.js looks great!
@davidar somehow i missed you made TeX.js. it's awesome, great work! :)
*sigh* Apparently I keep hitting ubuntu/wget#1002870, so either I need to wait until that gets fixed, or roll my own crawler...
Amazed to find such a fat bug in Ubuntu Trusty's version of wget. Do not despair! Building wget from sources is not too much pain. (Or switch to sweet Debian Jessie.) I did the compiling thing a few years ago because I needed a version with WARC support. WARC archives store HTTP headers and time-stamps in addition to the usual payload. You might want to use them for archiving in IPFS.
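A WARC-enabled wget is one route; purely as an illustration of the same idea from Python, here is a sketch assuming the `warcio` library (my assumption, not something discussed in the thread). It writes a response record that keeps the HTTP headers alongside the payload:

```python
# Illustrative only: write a WARC response record (HTTP headers + payload)
# for a fetched URL, assuming the `warcio` package.
import requests
from warcio.statusandheaders import StatusAndHeaders
from warcio.warcwriter import WARCWriter

def archive(url, warc_path="landing-pages.warc.gz"):
    with open(warc_path, "ab") as output:
        writer = WARCWriter(output, gzip=True)
        resp = requests.get(url, stream=True,
                            headers={"Accept-Encoding": "identity"})
        # Status line simplified for the sketch; a real crawler would build it
        # from resp.status_code and resp.reason.
        http_headers = StatusAndHeaders("200 OK", resp.raw.headers.items(),
                                        protocol="HTTP/1.1")
        record = writer.create_warc_record(url, "response",
                                           payload=resp.raw,
                                           http_headers=http_headers)
        writer.write_record(record)
```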
@pietsch @whyrusleeping If it were up to me, the storage nodes would be running plain Debian ;) /me passes the buck to @lgierth

I guess compiling from source is an option
oh wow, of course! There is way too much awesome stuff going on here... ;)
Hrm, the most recent version of
Running the crawl again, hopefully with better tolerance of crashes this time.
The weekly BASE meeting will discuss a full OAI-DC data release tomorrow.
The file size is 23 GB. Unless you stop me, I will add this TAR file in one piece while you sleep.
This is what I did:

```
$ ipfs add -p -r for_ipfs/
added QmUJvY7e4mBSEBqfvRbz4YhTBf9z2kx4Adz6ke9UqMF9G1 for_ipfs/README.markdown
added QmcH3PYRNt5dKbC7YcYjq2MMt9G7xeoSiaBXmJwxVGgtt6 for_ipfs/oai_dc-dump-2015-07-06.tar
added QmctbbiEcEapEcY2hGZ4puchfu2chtziMcuZHQHVM7zuds for_ipfs
```

I am sorry the dump dates from July, so it contains fewer than 80M records – more like 77M.
@pietsch this is fantastic. we'll help replicate it.
we have 2nd class tar support right now (will be even tighter integration later). sorry to make you add again, but if you add things with
afaik, cc @whyrusleeping take a look at the dedup graphviz graph -- it seems to not be deduping as much as i thought it would?
@davidar I just deleted the two IPFS nodes I was running here in Bielefeld, making a fresh start.
Ouch,
This happened with yesterday's ipfs 0.3.10-dev on a Xen VM with 8 GB RAM that had no problem adding this file when I did
@pietsch yeah, the perf issues are a real pita. Paging Dr Sleeping, Dr @whyrusleeping
I could not get
@pietsch cool, pinning now :)
Have you managed to get the data? When I first added the entire directory,
@pietsch Yes, I've mirrored it to one of our storage nodes :)
So, in OpenJournal/central#8 I mentioned using existing scrapers (zotero/translators) to retrieve metadata from HTML pages. These scrapers often return a link to the full text (without downloading it). I'm not sure how this would work for URLs extracted from BASE. I suspect BASE covers a lot of small repositories that are not supported by Zotero. Many of them are probably instances of generic repository software such as DSpace, EPrints, Fedora or OJS, but Zotero selects scrapers based on regexes run on the URL, so they are not likely to trigger on many different domains. I will do a few experiments and report the results here.
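To illustrate the limitation: because translator-style scrapers trigger on URL regexes, generic repository platforms hosted on arbitrary domains would have to be recognised some other way, e.g. from page markup. A sketch of that fallback idea; every pattern below is an illustrative guess, not anything taken from Zotero:

```python
# Sketch: guess the repository software (DSpace, EPrints, OJS, ...) from the
# landing-page HTML instead of the URL, then apply platform-specific rules.
# The markers are hypothetical examples, not verified generator strings.
import re

PLATFORM_HINTS = {
    "dspace":  re.compile(r'content="DSpace', re.I),
    "eprints": re.compile(r'content="EPrints', re.I),
    "ojs":     re.compile(r'content="Open Journal Systems', re.I),
}

def detect_platform(html):
    """Return the first platform whose marker appears in the page, else None."""
    for name, pattern in PLATFORM_HINTS.items():
        if pattern.search(html):
            return name
    return None
```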
@pietsch curious what failed in
@jbenet Have you noticed that I pasted the error message and the first third of the stack dump above?
@pietsch ahhh thanks, hadn't noticed.
@wetneb cool, that's good to know. Well, I've officially exhausted the remaining disk space on one of the storage nodes by crawling this stuff, so time to start analysing it I guess :)
Finally, a fresh BASE dump in OAI-DC metadata format is available. It is a directory of 88,104 xz-compressed XML files, each containing 1,000 records: QmbdLBA51HsQ9PpcED1epXAxLfHgrd2PDZ3ktmjhFTjg94
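For anyone analysing the dump, a minimal sketch of how one of those files might be read, assuming each `.xz` file decompresses to OAI-DC XML and that the landing-page or full-text links live in `dc:identifier` elements (check the real field names and namespaces against the actual files):

```python
# Sketch: pull URLs out of one xz-compressed OAI-DC file from the dump.
import lzma
import sys
import xml.etree.ElementTree as ET

# Dublin Core elements namespace; the dump's record structure is assumed here.
DC_IDENTIFIER = "{http://purl.org/dc/elements/1.1/}identifier"

def identifiers(path):
    with lzma.open(path) as f:
        tree = ET.parse(f)
    for elem in tree.iter(DC_IDENTIFIER):
        text = (elem.text or "").strip()
        if text.startswith("http"):
            yield text

if __name__ == "__main__":
    for url in identifiers(sys.argv[1]):
        print(url)
```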
@pietsch Awesome! Mirroring now.
@pietsch Is your ipfs node still online? I've partially mirrored the dump, but it seems to be stuck partway through.
@davidar The ipfs daemon is still running. Here are some messages it printed multiple times:
@pietsch could you try restarting the daemon?
@davidar Done.
@pietsch thanks, seems to be moving again :)
@davidar Can you confirm that all files have arrived on your side? It should look like this:
https://base-search.net