The original repo pulls Finnish data from en.wiktionary.org (hereinafter en.wikt), which lacks some information found in its Finnish counterpart, fi.wiktionary.org (hereinafter fi.wikt).
This fork is an attempt to pull data from fi.wikt instead, focusing on collecting exceptions in Finnish verb conjugation from NSK and KOTUS. Compare, for example, the entry for the Finnish word sortaa in fi.wikt with the one in en.wikt.
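To compare an entry such as sortaa across the two sites, one option is to request the raw wikitext through the MediaWiki API. The following is a minimal sketch, not part of this fork: the helper names `wikitext_url` and `fetch_wikitext` and the User-Agent string are hypothetical, and it assumes the standard `api.php` endpoint with `action=parse`.

```python
import json
import urllib.request
from urllib.parse import urlencode


def wikitext_url(lang: str, title: str) -> str:
    """Build a MediaWiki API URL for the raw wikitext of one page.

    `lang` is the Wiktionary language subdomain, e.g. "fi" or "en".
    (Hypothetical helper for illustration only.)
    """
    params = {
        "action": "parse",
        "page": title,
        "prop": "wikitext",
        "format": "json",
        "formatversion": "2",  # with version 2 the wikitext is a plain string
    }
    return f"https://{lang}.wiktionary.org/w/api.php?{urlencode(params)}"


def fetch_wikitext(lang: str, title: str) -> str:
    """Download the wikitext of `title` from the given Wiktionary edition.

    Needs network access. The User-Agent value is a placeholder assumption;
    replace it with something that identifies your own script.
    """
    request = urllib.request.Request(
        wikitext_url(lang, title),
        headers={"User-Agent": "fi-wikt-comparison-sketch/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return data["parse"]["wikitext"]
```

Calling `fetch_wikitext("fi", "sortaa")` and `fetch_wikitext("en", "sortaa")` should return the two versions of the entry for a side-by-side look. For bulk extraction the dumps are the better source; this is only handy for spot checks.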
- Write a fi.lua file in languages/lua
- Write src/wiktextract/data/fi/config.json, copied from ../en
- Write extractor scripts in src/wiktextract/extractor/fi
- Find enough time for all tasks.
- All modules at fi.wikt: https://fi.wiktionary.org/wiki/Toiminnot:Kaikki_sivut?from=&to=&namespace=828 via https://fi.wiktionary.org/wiki/Toiminnot:Tilastot via https://meta.wikimedia.org/wiki/Wiktionary#List_of_Wiktionaries
- at en.wikt: https://en.wiktionary.org/wiki/Category:Modules
- Download from kaikki.org
- All templates at fi.wikt: https://fi.wiktionary.org/wiki/Luokka:Mallineet
- en.wikt template download also from kaikki.org
- All categories concerning Finnish at fi.wikt: https://fi.wiktionary.org/wiki/Toiminnot:Kaikki_sivut?from=Suom&to=&namespace=14
- https://kaiko.getalp.org/about-dbnary/ via Ylonen, T. 2022, Wiktextract: Wiktionary as Machine-Readable Structured Data. In N. Calzolari, F. Béchet, P. Blache, et al. (eds), Proceedings of the 13th Conference on Language Resources and Evaluation (LREC). European Language Resources Association (ELRA), Paris, pp. 1317-1325, International Conference on Language Resources and Evaluation, Marseille, France, 20/06/2022.
- Using Lua scripts with Scribunto
- Lua for beginners
- MediaWiki API
- Parsing wikitext (Stack Overflow)
- Parsing fi.wikt dump: a bash script
- Parse wikitext->xml in Java