This repository has been archived by the owner on Apr 16, 2020. It is now read-only.

Wikipedia #20

Open
davidar opened this issue Sep 16, 2015 · 57 comments

Comments

@davidar
Collaborator

davidar commented Sep 16, 2015

In terms of being able to view this on the web, I'm tempted to push Pandoc through a Haskell-to-JS compiler like Haste.

CC: @jbenet

@rht

rht commented Sep 17, 2015

In this case, why does the xml -> html have to be done client-side?

On the archiver's machine:

get-dump dump/  # using any of the tools listed at https://meta.wikimedia.org/wiki/Data_dumps/Download_tools (one of them uses rsync)
dump2html -r dump/
ipfs add -r dump/ # and publish it via ipns

(although yes, it'd be much more convenient to just use pandoc as a universal markup viewer)

@davidar
Collaborator Author

davidar commented Sep 17, 2015

That's also a possibility, but it's more time-consuming and less flexible.


David A Roberts
https://davidar.io

@DataWraith

I actually started on this a while ago, but then thought it would be silly for a single person to attempt this and stopped, but now that I see this issue, I think it might not have been such a bad idea:

I've been experimenting with using a 15GiB (compressed and without images) dump of the English Wikipedia and extracting HTML files using gozim and wget. This gave me a folder full of HTML pages that interlink nicely using relative links.

It took a couple of hours to extract every page reachable from 'Internet' within 2 hops, which amounted to about 1% of the articles in the dump, so it would take at least a week to create HTML pages for the entire dump. And since these HTML files are uncompressed, I'm not sure I have enough disk space available to do the complete dump, but I could repeat my initial trial and make it available in IPFS.

One problem I see with this approach, is that the Creative Commons License requires attribution, which is not embedded in the HTML files gozim creates. If it is decided that this way of doing it might not be such a bad idea, it might be possible to alter gozim to embed such license information. Or maybe we can simply put a LICENSE-file in the top-most directory.

@davidar
Collaborator Author

davidar commented Sep 19, 2015

@DataWraith Just had a look at the gozim demo, looks really cool. In the short-term, this does seem like the best option (apologies for my terse reply earlier @rht :). Would it also be possible to do client-side search with something like https://github.com/cebe/js-search ?

I'm not sure I have enough disk space available to do the complete dump

If you can give me a script, and an estimate of the storage requirements, I can run this on one of the storage nodes for you :)

One problem I see with this approach, is that the Creative Commons License requires attribution, which is not embedded in the HTML files gozim creates.

Are you sure? I can see:

This article is issued from Wikipedia. The text is available under the Creative Commons Attribution/Share Alike; additional terms may apply for the media files.

in the footer of http://scaleway.nobugware.com/zim/A/Wikipedia.html

Or maybe we can simply put a LICENSE-file in the top-most directory.

Definitely. See #25

@DataWraith

@DataWraith Just had a look at the gozim demo, looks really cool. In the short-term, this does seem like the best option (apologies for my terse reply earlier @rht :). Would it also be possible to do client-side search with something like https://github.com/cebe/js-search ?

I'm no JavaScript expert, but I don't see why not. We could pre-compile a search index and store it alongside the static files. However, resource usage on the client may or may not be prohibitively large.

I'm not sure I have enough disk space available to do the complete dump

If you can give me a script, and an estimate of the storage requirements, I can run this on one of the storage nodes for you :)

There is no real script. It's literally:

  1. gozimhttpd -path <wikipedia-dump> -port 8080 -mmap
  2. wget -e robots=off -m -k http://localhost:8080/zim/A/Internet.html

This will crawl everything reachable from 'Internet'. It may be possible to directly crawl the index of pages itself, but I haven't tried that yet.

You probably need to wrap gozimhttpd in a while loop, because it tends to crash once in a while. As for storage requirements: The 60.000 articles I extracted take up 5GiB of storage, so a full dump of the 5.000.000 articles in the dump is probably on the order of 500GiB.
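
The storage figure above is a straight linear extrapolation from the sample. A minimal sketch of that arithmetic, assuming size scales with article count (the function name is made up for illustration):

```python
def extrapolate_gib(sample_gib: float, sample_articles: int, total_articles: int) -> float:
    """Scale the size of a sample linearly up to the full article count."""
    return sample_gib / sample_articles * total_articles


# Figures quoted above: 60,000 extracted articles take about 5 GiB,
# and the dump holds roughly 5,000,000 articles.
full_dump_gib = extrapolate_gib(5, 60_000, 5_000_000)
print(f"estimated full dump: {full_dump_gib:.0f} GiB")
```

This lands a little above 400 GiB, which is where the rough "order of 500GiB" comes from. It only holds if the sampled articles are representative of the whole dump.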

One problem I see with this approach, is that the Creative Commons License requires attribution, which is not embedded in the HTML files gozim creates.

Are you sure? I can see:

This article is issued from Wikipedia. The text is available under the Creative Commons Attribution/Share Alike; additional terms may apply for the media files.

in the footer of http://scaleway.nobugware.com/zim/A/Wikipedia.html

Hm. Maybe that's because they are using a different dump, or a newer version of gozim (though the latter seems unlikely); the pages I extracted don't have that footer.

I'm currently running ipfs add on the pages I have extracted, to get a proof-of-concept going. It's inserting the pages alphabetically, but it tends to crash around the 'D's, with an unhelpful 'killed' message. Possibly ran out of memory.

@davidar
Collaborator Author

davidar commented Sep 19, 2015

I'm no JavaScript expert, but I don't see why not. We could pre-compile a search index and store it alongside the static files.

For context, this is what @brewsterkahle uses for his IPFS-hosted blog

However, resource usage on the client may or may not be prohibitively large.

Yeah, that was my concern too. If so, it might have to wait until #8

There is no real script. It's literally:

gozimhttpd -path <wikipedia-dump> -port 8080 -mmap
wget -e robots=off -m -k http://localhost:8080/zim/A/Internet.html

Too easy

a full dump of the 5.000.000 articles in the dump is probably on the order of 500GiB.

Ok, we'll have to wait until we get some more storage then.

I'm currently running ipfs add on the pages I have extracted, to get a proof-of-concept going. It's inserting the pages alphabetically, but it tends to crash around the 'D's, with an unhelpful 'killed' message. Possibly ran out of memory.

Thanks. Ping me on http://chat.ipfs.io to help debug.

@DataWraith

Short progress update: I'm now feeding files to ipfs add in batches of 25, that seems to have solved the memory issue for now. I hope that feeding in the files piecemeal will prevent the crash that occurs when adding the entire directory at once. I'll probably be able to try adding the entire thing again tomorrow.
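
For reference, the batching idea can be sketched roughly like this. It is not the actual script used here: the helper names are hypothetical, and the only `ipfs` flag used is `-q`, which appears elsewhere in this thread.

```python
import subprocess
from typing import Callable, Iterator, List, Sequence


def batches(paths: Sequence[str], size: int = 25) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` paths."""
    for start in range(0, len(paths), size):
        yield list(paths[start:start + size])


def add_in_batches(
    paths: Sequence[str],
    size: int = 25,
    run: Callable[[List[str]], object] = subprocess.check_call,
) -> int:
    """Invoke `ipfs add -q` once per batch and return the number of invocations.

    `run` is injectable so the batching can be exercised without ipfs installed.
    """
    count = 0
    for batch in batches(paths, size):
        run(["ipfs", "add", "-q", *batch])
        count += 1
    return count
```

Calling `add_in_batches(sorted_html_paths)` would start one short-lived `ipfs add` process per 25 files, so memory held by one invocation is released before the next begins.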

I also took another look at gozim. It is relatively easy to extract the HTML-files without going through wget first -- should've thought of that before coming up with the wget-scheme. That way we won't miss any articles; I'll have to do more research on redirects though.

Quick & dirty dumping program here.

@DataWraith

I had no luck getting ipfs add to ingest the HTML files; pre-adding the files in batches didn't do anything. ipfs (without the daemon running) consumed enough RAM to fill a 100GB swap file and then crashed with an error, runtime: out of memory. A script I wrote to add files one by one using the object patch subcommand was too slow, taking 3 to 5 seconds for a single page, so I abandoned that approach.

There are two related issues describing problems with ipfs add. I'll try again once those are resolved.

@davidar
Collaborator Author

davidar commented Sep 23, 2015

@DataWraith Hmm, that's no good 😕. For the moment, could you tar/zip all the files together and add that?

CC: @whyrusleeping

@DataWraith

Hi.

I've decided to delete the trial-files obtained using wget and go all out and try to actually dump the entire most-recent English Wikipedia snapshot (with images) with my program. It's currently in the 'D's (1.3 million articles done) and I estimate it will finish in another 60 to 70 hours. I'll try adding the dump using the undocumented ipfs tar add, which did not seem to blow up memory-wise in the small trial I did. Not sure why that would be different from the normal ipfs add, but apparently it is. If that still fails, I'll run the tar-archive through lrzip and upload that.

My initial estimate of space required was off, because the article sample I obtained using wget did not contain the small stub articles, of which there are many. The 1.3 million articles I have now add up to 40GiB, so, assuming that the distribution of article sizes is not skewed, we are looking at an overall size of about 160GiB plus maybe another 40GiB for the images. In addition, I'm using btrfs to store the dump, and its built-in compression support halves the actual amount of data stored, so size should not be a problem.

Edit: ipfs tar add is not much faster than the custom script I had cobbled together earlier. At 3 to 5 seconds per file, it'd take the better part of a year to add the entire dump. :/
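
A quick back-of-the-envelope check of that claim, assuming roughly 5,000,000 files (the article count mentioned earlier in the thread; images would add more) and no parallelism:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_add(files: int, seconds_per_file: float) -> float:
    """Total wall-clock days if files are added strictly one after another."""
    return files * seconds_per_file / SECONDS_PER_DAY


low = days_to_add(5_000_000, 3)   # optimistic end of the 3 to 5 second range
high = days_to_add(5_000_000, 5)  # pessimistic end
print(f"between {low:.0f} and {high:.0f} days")
```

That works out to somewhere between about half a year and most of a year of continuous adding.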

@davidar
Collaborator Author

davidar commented Sep 24, 2015

@DataWraith Awesome, can't wait to see it :)

Edit: ipfs tar add is not much faster than the custom script I had cobbled together earlier. At 3 to 5 seconds per file, it'd take the better part of a year to add the entire dump. :/

@whyrusleeping Please make ipfs add faster 🙏

@rht

rht commented Sep 24, 2015

@whyrusleeping

For scale (foo/ is 11 MB, 10 files of 1.1 MB each):

  • cp: cp -r foo bar 0.00s user 0.01s system 86% cpu 0.008 total
  • master: ipfs add -q -r foo >actual 0.13s user 0.04s system 10% cpu 1.582 total
  • master (no sync on flatfs): ipfs add -q -r foo > actual 0.11s user 0.03s system 102% cpu 0.136 total (the remaining time bloat comes from leveldb)
  • git: git add foo 0.00s user 0.00s system 84% cpu 0.006 total
  • rsync: rsync -r foo bar 5.16s user 1.18s system 108% cpu 5.840 total
  • tar: tar cvf foo.tar foo 0.00s user 0.01s system 95% cpu 0.013 total
  • ipfs tar add: ipfs tar add foo.tar 0.25s user 0.05s system 35% cpu 0.857 total

It appears that cp doesn't have an explicit call to fsync in its implementation https://github.com/coreutils/coreutils/search?utf8=%E2%9C%93&q=fsync.
(I think it's fine to not have explicit sync call?)

@whyrusleeping
Contributor

@davidar @rht okay, I'll make that top priority after UDT and ipns land.

@rht

rht commented Sep 24, 2015

(git does explicit sync https://github.com/git/git/blob/master/pack-write.c#L277
edit: but only on pack updates)

@rht

rht commented Sep 24, 2015

@davidar I get your point, which means either 1. "if someone can put the kernel in the browser, why not pandoc", or 2. "we need to be able to do more than just view a static simulated piece of paper" (more of what a "document"/"book" should be).
Though it is currently slow (e.g. pandoc pdf to html << (or maybe ~) pdf.js << browser plugin for pdf).

As for client-side search, it works for small sites, but for huge sites (Wikipedia?), transporting the index files to the client seems to be too much.

@rht

rht commented Sep 24, 2015

I wonder if some of the critical operations should be offloaded to FPGA.

@davidar
Collaborator Author

davidar commented Sep 25, 2015

  1. "if someone can put the kernel on the browser, why not pandoc", or 2. "we need to be able to do more than just viewing static simulated piece of paper" (more of what a "document"/"book" should be).

Uh oh, which side of this argument am I on now? #25 @jbenet

with the client-side search, it works for small sites, but for huge sites (wikipedia?), transporting the index files to the client seems to be too much.

The idea is that you'd encode the index as a trie and dump it into IPLD, so the client would only have to download small parts of the index to answer a query.
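
A toy sketch of that idea, under loud assumptions: a plain dict stands in for the block store, nodes are JSON blobs keyed by their SHA-256 hash, and none of the names below are real IPLD APIs. The point is only that a lookup touches the blocks along one path instead of the whole index.

```python
import hashlib
import json


def put_block(store, node):
    """Store a node under the SHA-256 of its canonical JSON and return that key."""
    data = json.dumps(node, sort_keys=True).encode("utf-8")
    key = hashlib.sha256(data).hexdigest()
    store[key] = data
    return key


def build_index(store, postings):
    """Build a character trie from {term: [article, ...]} and return the root key."""
    root = {"children": {}, "hits": []}
    for term, articles in postings.items():
        node = root
        for ch in term:
            node = node["children"].setdefault(ch, {"children": {}, "hits": []})
        node["hits"] = sorted(articles)

    def store_node(node):
        # Children are stored first so the parent can link to them by hash.
        links = {ch: store_node(child) for ch, child in node["children"].items()}
        return put_block(store, {"links": links, "hits": node["hits"]})

    return store_node(root)


def lookup(store, root_key, term):
    """Return (hits, blocks_fetched), reading only the blocks along the term's path."""
    node = json.loads(store[root_key])
    fetched = 1
    for ch in term:
        child_key = node["links"].get(ch)
        if child_key is None:
            return [], fetched
        node = json.loads(store[child_key])
        fetched += 1
    return node["hits"], fetched
```

With a real index, each `store[...]` read would be a network fetch of one small block, so the cost of a query grows with the length of the term rather than with the size of the index. A production design would probably pack several characters per node to cut round trips.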

@rht

rht commented Sep 25, 2015

The idea is that you'd encode the index as a trie and dump it into IPLD, so the client would only have to download small parts of the index to answer a query.

And this can be repurposed for any 'pre-computed' stuff, not just search indexes? e.g. (content sorted/filtered by paramX, or entire sql queries https://github.com/ipfs/ipfs/issues/82?)

@davidar
Collaborator Author

davidar commented Sep 26, 2015

@rht yes, I would think so, I don't see any reason why it wouldn't be possible to build a SQL database format on top of IPLD (albeit non-trivial)

@davidar
Collaborator Author

davidar commented Sep 26, 2015

@rht looks like someone already beat me to it: http://markup.rocks

@rht

rht commented Sep 27, 2015

@davidar by a few months. Very useful to know that it is fast.
Currently imagining the possibilities.

Also, found this http://git.kernel.org/cgit/git/git.git/tree/Documentation/config.txt#n693:

This is a total waste of time and effort on a filesystem that orders data writes properly, but can be useful for filesystems that do not use journalling (traditional UNIX filesystems) or that only journal metadata and not file contents (OS X's HFS+, or Linux ext3 with "data=writeback").

@whyrusleeping disable fsync by default and add a config flag to enable it? (wanted to close the gap with git, which is still 2 orders of magnitude away).

@davidar
Collaborator Author

davidar commented Sep 27, 2015

Very useful to know that it is fast.

Yeah, Haskell is high-level enough that it tends to compile to JS reasonably well. The FP Complete IDE is also written in a subset of Haskell.

Currently imagining the possibilities.

Something like the ipfs markdown viewer but using pandoc would be cool.

@davidar
Collaborator Author

davidar commented Sep 27, 2015

IPFS-hosted version of markup.rocks: https://ipfs.io/ipfs/QmSyfirfxBbgh8sZPzy4yyMQjHgzKX7iQeXG9Zet4VYk9P/

@rht

rht commented Sep 27, 2015

@davidar saw it, neat. i.e. it's a pandoc but without the huge GHC stuff, cabal-install ritual, etc.
It's a pandoc.

Yeah, Haskell is high-level enough that it tends to compile to JS reasonably well.

But so do Python, Ruby, ... You mean a sane type system?
https://github.com/faylang/fay/wiki says fay doesn't have GHC's STM, concurrency--which is fine.

This has nice things like:

Additionally, because all Fay code is Haskell code, certain modules can be shared between the ‘native’ Haskell and ‘web’ Haskell, most interestingly the types module of your project. This enables two things:
The enforced (by GHC) coherence of client-side and server-side data types. The transparent serializing and deserializing of data types between these two entities (e.g. over AJAX).

(haven't actually looked at a minimalist typed :lambda: calculus metacircular evaluator (the one people write (or chant) every day for the untyped ones))

@davidar
Collaborator Author

davidar commented Sep 28, 2015

... You mean sane type system?

Yeah, I meant of the languages with a strong enough type system to be able to produce optimised code

@davidar
Collaborator Author

davidar commented Sep 29, 2015

@davidar
Collaborator Author

davidar commented Oct 4, 2015

@DataWraith Awesome, downloading now :)

@davidar
Collaborator Author

davidar commented Oct 5, 2015

@DataWraith And now it's on IPFS 🎈

@whyrusleeping Looking forward to ipfs add being fast enough to handle the extracted version ;)

@DataWraith

@davidar Awesome!

@whyrusleeping
Contributor

@davidar it's very high on my todo list.

@davidar
Collaborator Author

davidar commented Oct 6, 2015

@whyrusleeping ❤️

@rht

rht commented Nov 28, 2015

This can proceed with ipfs/kubo#1964 + ipfs/kubo#1973 merged (pending @jbenet's CR).
nosync is still not sufficient.

@davidar
Collaborator Author

davidar commented Nov 28, 2015

@rht that's awesome :). Are you also testing perf on spinning disks (not just SSDs)? It seems to be the random access latency that really kills perf

Edit: also make sure the test files are created in a random order (not in lexicographical order)

@rht

rht commented Nov 28, 2015

The first reduces the number of operations needed (including disk io), so it will make add on HDD faster. For the second, channel iterators in golang have been reported to be slow (but I'm not sure of their direct impact on disk io), so it should make add on HDD faster.

@jbenet
Contributor

jbenet commented Dec 1, 2015

on it! (cr)

@DataWraith

I'm trying out those pull requests on the Wikipedia dump right now. ipfs tar add still crashed with an out-of-memory error, but plain ipfs add -r -H -p . is chugging along nicely. It's been running for almost 12 hours now, so hopefully it's not going to crash.

It has added the articles starting with numbers, and is now working on the articles starting with A, so it'll be a while until the whole dump is processed.

@jbenet
Contributor

jbenet commented Dec 2, 2015

@DataWraith thanks, good to hear -- btw, dev0.4.0 has many interesting perf upgrades, with flags like --no-sync which should make it much faster.

@dignifiedquire

ipfs add is much faster in 0.4; maybe we can revisit this and try to set up a script to constantly update the mirrored version in ipfs.

@eminence
Collaborator

Instead of working with the massive Wikipedia, I've been playing with the smaller, but still sizable Wikispecies project. It has 439,460 articles, and is about 4.5 GB on disk.

I've imported the static HTML dumps from the Kiwix openZIM dump files. The dump to disk took less than 10 minutes, and the import into ipfs (with ipfs040 with Datastore.NoSync: true) took about 3 or 4 hours.

It's browsable on my local gateway, but I've not been able to get the site to load on the ipfs public gateways. Can any of you try?

http://localhost:8120/ipfs/QmbZp1H1mCbVSiD2K8xpFFhzRGoLJTU6E4keY9WQpyuxP1/A/index.htm

(edit Jan 14th -- after upgrading my nodes to master branch, I stopped running my dev040 node, so this hash is no longer available. Stay tuned for updates)

@davidar
Collaborator Author

davidar commented Jan 12, 2016

I've not been able to get the site to load on the ipfs public gateways

Same :/

@eminence
Collaborator

Ok, here is my next iteration on this project:

http://v04x.ipfs.io/ipfs/QmV6H1quZ4VwzaaoY1zDxmrZEtXMTN1WLJHpPWY627dYVJ/A/20/8f/Main_Page.html

This is also an IPFS-hosted version of Wikispecies, but with one major change:

Instead of having every article in one massive folder, each article has been partitioned into sub-folders based on the hash of the filename. For articles, there are two levels of hashing, and for images there is one level of hashing.
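
The partitioning scheme could look roughly like the sketch below. The hash function and the two-hex-digit shard width are assumptions for illustration (the thread does not say which hash the actual tool uses), and `sharded_path` is a made-up name.

```python
import hashlib
from pathlib import PurePosixPath


def sharded_path(filename: str, levels: int = 2, width: int = 2) -> str:
    """Place `filename` under `levels` sub-folders derived from a hash of its name.

    MD5 is used only to spread names evenly; it is an assumption, not
    necessarily what the real rewriting tool uses.
    """
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    shards = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return str(PurePosixPath(*shards, filename))


# Two levels for articles, one level for images, as described above.
print(sharded_path("Main_Page.html", levels=2))
print(sharded_path("logo.png", levels=1))
```

With two hex digits per level, each folder has at most 256 sub-folders, so no single directory node has to link to every article. Relative links inside the HTML then need rewriting to match the new locations.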

The goal of this is to reduce the number of links in the A/ and I/m nodes, since they appeared to be too large to load via the public IPFS gateways. I think in this regard, this has been successful.

However, there still seem to be some issues. As I browse around the Main_Page.html link (see above), sometimes the page will load quickly and instantly. Other times, images will be missing, the page will load slowly, or maybe even not at all. This is true even for pages that I've visited already (and thus should be in the gateway's cache)

I can't really tell what's going on here. Running ipfs refs on these hashes from another node of mine works pretty flawlessly. So I conclude the problem might not be with my node. But I'm not sure what other debugging tricks I can use to get to the bottom of this. I think this is a fairly important issue to resolve.

Finally, here are the two tools I wrote in the process of working on this:

zim dumping takes a few minutes, wiki_rewriting takes less than an hour, and ipfs add -r probably took a few hours. In all cases, I appear to be disk-io bound.

@whyrusleeping
Contributor

@eminence this is great! It also further emphasizes the fact that we need to figure out directory sharding. I'll think on this today and see what I come up with.

Keep up the good work :)

@jbenet
Contributor

jbenet commented Jan 19, 2016

@whyrusleeping note that directory sharding will go on top of IPLD, and that it should work for arbitrary objects (not just unixfs directories). Take a look at the spec. We can use another directive there.

@davidar
Collaborator Author

davidar commented Jan 19, 2016

ipfs/notes#76

@davidar
Collaborator Author

davidar commented Feb 4, 2016

@rht

rht commented Feb 4, 2016

https://strategy.m.wikimedia.org/wiki/Proposal:Distributed_Wikipedia

(last updated ~3.5 years ago, but penned ~7 years ago)

@davidar
Collaborator Author

davidar commented Feb 4, 2016

@rht yeah I know, but it might still be relevant

@yuvipanda

@donothesitate

donothesitate commented Jan 19, 2017

The question is whether we want a static HTML-only version, a dynamic one, or both.
For the static version, the storage layer or filesystem where the data is stored can use compression.

For the dynamic version, a Service Worker could use zlib compression with a dictionary and XML entries stored compressed, so that one could quickly fetch an article, render it as HTML, and link it in a predetermined way, with an optional fallback in the Service Worker to the real Wikipedia.

The XML wiki dump compressed with xz in 256k chunks, without a dictionary, is the same size as the bzip2 XML dump: 13GB. Given English text and a pre-made zlib dictionary, I believe one can get to a nice number.

As for search, a JS variant covering terms only, with suggestions of top terms, could function well.

Edit: I'm being tempted by zim files. Having each cluster as a (raw) block.
Edit: Extracted a 1/1000 sparse sample of enwiki xml dump (105/13MB):
https://ipfs.io/ipfs/QmVYQwcq5jMnEjL1oXiFhED8Gp7S1um1wBHEjJrqWH3bzb/enwiki-20170101-pages-articles-1000th-sample.xml.7z

Edit: The only way to have good compression via widespread compression methods seems to be clustering. Compressing per record results in 4-5x the size. Which leads to storage compression.
The only other way could be a purpose crafted dictionary + huffman coder.
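
To illustrate the trade-off, here is a small sketch on synthetic records, not the real dump. It compares per-record zlib, clustered zlib, and per-record zlib with a preset dictionary. The record template, the cluster size, and the choice of dictionary are all made up for the example, and real ratios will differ.

```python
import zlib


def make_records(n: int = 200) -> list:
    """Synthetic XML-like records that share most of their markup."""
    return [
        f"<page><title>Article {i}</title><text>Text of article number {i}.</text></page>".encode()
        for i in range(n)
    ]


def per_record_size(records) -> int:
    """Compress every record on its own and sum the sizes."""
    return sum(len(zlib.compress(r)) for r in records)


def clustered_size(records, cluster: int = 50) -> int:
    """Compress records in groups so shared markup is encoded once per group."""
    total = 0
    for start in range(0, len(records), cluster):
        total += len(zlib.compress(b"".join(records[start:start + cluster])))
    return total


def per_record_with_dict_size(records, zdict: bytes) -> int:
    """Compress every record on its own, but primed with a preset dictionary."""
    total = 0
    for r in records:
        comp = zlib.compressobj(zdict=zdict)
        total += len(comp.compress(r) + comp.flush())
    return total


records = make_records()
print("per record:       ", per_record_size(records))
print("clustered:        ", clustered_size(records))
print("per record + dict:", per_record_with_dict_size(records, zdict=records[0]))
```

On data like this, both clustering and a preset dictionary come out well below plain per-record compression, because the repeated markup no longer has to be encoded from scratch in every record. Clustering costs random access within a cluster, while a dictionary keeps records independently readable as long as the reader has the same dictionary.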
