This repository has been archived by the owner on Apr 16, 2020. It is now read-only.

scholarpedia.org #32

Open
rht opened this issue Oct 19, 2015 · 11 comments

@rht

rht commented Oct 19, 2015

LICENSE: CC BY-NC-SA 3.0 [1]
Like SEP but for science, e.g. http://www.scholarpedia.org/article/Faddeev-Popov_ghosts by Faddeev himself.
There is an outdated archive at https://archive.org/details/wiki-scholarpediaorg_w.

[1] http://www.scholarpedia.org/article/Scholarpedia:Terms_of_Use#Scholarpedia.27s_Licenses_to_You.2C_and_Your_license_to_parties_other_than_Scholarpedia

@davidar
Collaborator

davidar commented Oct 20, 2015

SGTM! We can do this once #20 is resolved.

@vitzli

vitzli commented Nov 5, 2015

There is a newer version now available at https://archive.org/details/wiki-scholarpediaorg-20151102

@davidar
Collaborator

davidar commented Nov 6, 2015

📌 /ipfs/Qmaskk1Egq5zmZsGTd7dwNiiK1cwfmx7k1StG1WJQjwGDm

The articles are here.

@DataWraith Feel like converting these to HTML? :)

It's quite a bit smaller than wikipedia, so should hopefully be less problematic.

@DataWraith

Heh. Eventually I'd like to write a program that converts a MediaWiki dump to HTML (probably by running it through pandoc), but right now I'm fairly busy, sorry.

I could only do the Wikipedia dump because a third party provided a dump in the OpenZIM format, and an easy-to-use library was available for reading and converting that.

With a raw XML dump, I'd have to roll my own solution, which would take more time than I currently have.

@rht
Author

rht commented Nov 7, 2015

(@vitzli thanks for updating the archive in archive.org)

@davidar
Collaborator

davidar commented Nov 7, 2015

@DataWraith No worries. I might have a go at getting it to render with https://github.com/davidar/markup.rocks

@vitzli didn't realise you were the one who pushed the updated copy - thanks :)

@DataWraith

I took another look at this, and wanted to share what I found, in case it is useful to the next person.

Extracting the article markup from the XML dump is pretty easy, actually. But just having the article markup doesn't really gain you much. Simple articles can be rendered through pandoc, but more complicated elements (Images, Math, Templates) tend to break things.
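For anyone who wants to try the extraction step, here is a minimal sketch of one way to pull titles and wikitext out of a MediaWiki XML export with the Python standard library. It is not the script I used. The `iter_articles` and `local_name` names and the tiny `SAMPLE` document are made up for illustration, and a real dump would be opened as a file rather than built in memory.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace prefix, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(source):
    """Yield (title, wikitext) for each <page> in a MediaWiki XML export.

    If a page carries several revisions, the text of the last one wins.
    """
    title, text = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's children to save memory


# Made-up sample standing in for a real dump file.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><text>'''Hello''' [[world]]</text></revision>
  </page>
</mediawiki>"""

articles = list(iter_articles(io.BytesIO(SAMPLE.encode("utf-8"))))
for title, text in articles:
    print(title, len(text))
```

The wikitext that comes out of this could then be fed to pandoc's MediaWiki reader for the simple articles, which is where the problems with images, math and templates start.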

I think our best bet is for someone to actually set up a MediaWiki instance, use MWDumper to load the dump, and then export to HTML with mwoffliner. From what I can tell, this is the workflow that was used to create the HTML content for the ZIM files I used to dump Wikipedia.

The entire process is pretty convoluted though (Database, MediaWiki, Redis, Node...), so I'm currently not willing to tackle it.

If I were to do it, I'd probably try to set up everything in Docker containers with Docker Compose though, so that it is repeatable and applicable to other Wiki dumps.

Edit: Okay, so I couldn't resist fiddling around with this, despite my earlier words. Took much less time than I estimated too, because I could draw on pre-made docker images. The hard part (MWDumper) is yet to come, but I'm confident I'll have this figured out soonish, maybe even this weekend.

@DataWraith

sigh

This is much harder than it looked in the beginning. I realize I'm flip-flopping on this a lot -- should've kept my mouth shut from the beginning. Anyway. This post is as much for venting as for information's sake, so feel free to ignore it.

I wanted the process of creating HTML dumps from XML dumps to be repeatable, so I set up everything in Docker containers. Turns out that the pre-made docker containers for the necessary software I could find are mostly outdated, so I had to make them from scratch after running into problems with version incompatibilities.

I managed to set up a local MediaWiki instance with a MySQL database and import the Scholarpedia dump using MWDumper in an automated and repeatable fashion. Getting MediaWiki to render mathematical equations took the better part of the weekend, though: TeX didn't work at all, no matter what I tried, so I had to switch to Mathoid, which meant getting yet another web service up and running. It's still not working to my satisfaction (it occasionally returns HTTP 400 -- Bad Request). It doesn't help that the documentation on any of this is extremely sparse.

The entire process looks like this:

  1. Start MySQL and create the wiki database skeleton
  2. Run MWDumper to fill the database with the Scholarpedia articles
  3. Start the MediaWiki container
  4. Start the Mathoid container (for equation rendering)
  5. Start the Parsoid container (for HTML extraction)
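
The ordering above matters (the database has to be filled before MediaWiki starts serving pages), so a rough sketch of how the steps could be scripted follows. Every image name, container name and file path in it is a hypothetical placeholder, credentials are left out, and `run_pipeline` is an illustrative helper rather than what my setup actually runs. It only prints the commands unless `execute=True` is passed.

```python
import subprocess

# Every image name, container name and file path below is a hypothetical
# placeholder; database credentials are omitted for brevity.
PIPELINE = [
    ("start MySQL", "docker run -d --name wiki-db example/mysql"),
    ("create schema", "docker exec -i wiki-db mysql wikidb < tables.sql"),
    ("import dump", "java -jar mwdumper.jar --format=sql:1.5 dump.xml"
                    " | docker exec -i wiki-db mysql wikidb"),
    ("start MediaWiki", "docker run -d --name wiki-web example/mediawiki"),
    ("start Mathoid", "docker run -d --name wiki-math example/mathoid"),
    ("start Parsoid", "docker run -d --name wiki-parsoid example/parsoid"),
]


def run_pipeline(steps, execute=False):
    """Go through the steps in order and return the names of those handled.

    With execute=False (the default) the commands are only printed.
    With execute=True each command runs through the shell, and the first
    failure raises CalledProcessError so later steps never start.
    """
    done = []
    for number, (name, command) in enumerate(steps, start=1):
        print(f"[{number}/{len(steps)}] {name}: {command}")
        if execute:
            subprocess.run(command, shell=True, check=True)
        done.append(name)
    return done


names = run_pipeline(PIPELINE)  # dry run: prints the plan only
```

Stopping at the first failure is deliberate: there is no point starting Parsoid against a wiki whose import did not finish.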

Remaining work

  • Images need to be imported.

    There is a PHP script included with MediaWiki that should do that. But I'm not expecting it to be easy.

  • The Main_Page has custom CSS templates that MediaWiki isn't parsing out of the box, displaying them verbatim instead.

  • Actually creating static HTML files

    As I mentioned in the previous entry, mwoffliner should be able to use Parsoid to extract HTML via the MediaWiki API. However, it looks non-trivial to set up. It should be possible to create Docker containers for it, but that will take a while yet, so don't hold your breath. :/

@rht
Author

rht commented Dec 1, 2015

(sounds more doable, as in, less headache than latex->html)
@DataWraith is the conversion using parsoid lossless?

@DataWraith

Parsoid is intended to be able to convert from MediaWiki markup to HTML and back in a lossless fashion (they do 'round trip testing'). I haven't noticed any mistakes with the conversion, but from what I gather from the limited documentation available, the conversion process isn't 100% perfect yet.

The fact that they need to be able to make round trips also bloats the generated HTML somewhat. The files use absolute links too, so the additional step of using mwoffliner is necessary to produce an IPFS-suitable folder of files. I'll try to get that working next weekend (so that I have something to show even if the equations don't work quite right yet), but given my over-optimism so far, I don't want to promise anything.
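
To show what the absolute-link problem amounts to, here is a small sketch that rewrites links under one base URL into relative file names. This is not how mwoffliner does it, just an illustration. The `relativize_links` name, the `http://localhost/wiki/` base and the `.html` suffix are all assumptions, and titles containing slashes would need extra handling.

```python
import re


def relativize_links(html, base="http://localhost/wiki/", suffix=".html"):
    """Turn href="<base>Title#frag" into href="Title.html#frag".

    Links that do not start with `base` are left untouched.
    """
    pattern = re.compile(r'href="' + re.escape(base) + r'([^"#?]+)([^"]*)"')
    return pattern.sub(
        lambda m: f'href="{m.group(1)}{suffix}{m.group(2)}"', html
    )


page = (
    '<a href="http://localhost/wiki/Faddeev-Popov_ghosts#History">ghosts</a> '
    '<a href="http://example.org/">external</a>'
)
rewritten = relativize_links(page)
print(rewritten)
```

Relative links like these keep working no matter which hash the folder ends up under, which is the property an IPFS-hosted copy needs.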

@davidar
Collaborator

davidar commented Dec 2, 2015

Hrm, it's unfortunate that MediaWiki is such a beast.

I've also converted it to a GitHub Wiki (example). It's somewhat passable, but definitely not perfect.
