WARNING: You can run the notebooks on Colab, but some of them require a subscription. Have a look at the `src` folder and run most of the jobs on a local machine. More details below.
The trained model is available on the 🤗 Hub (a minimal loading sketch follows the list below).
- Team members can access the data via Google Drive. Click here
- The STT model was trained on the latest Hungarian Common Voice corpus
- For output correction, we trained a floret model on the Hungarian sub-corpus of OSCAR 2019 and all the articles from nyest.hu
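A minimal sketch of loading the published model for inference with 🤗 Transformers; the model ID below is a placeholder, not the actual repository name:

```python
# Minimal inference sketch with 🤗 Transformers.
# NOTE: "user/wav2vec2-large-hu" is a placeholder model ID, not the real repo name.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="user/wav2vec2-large-hu")
print(asr("sample.wav")["text"])  # transcribe a local audio file
```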
Scripts in the `src` folder are almost identical to the scripts in `notebooks`. The main difference between the two versions is that the scripts use relative paths, while the notebooks contain absolute paths on Google Drive. The other minor difference is that bash commands, such as unzipping and concatenating files, are not included in the scripts; running command-line utilities like floret and KenLM is shown in dedicated cells in the notebooks.
Our merged corpus (nyest + OSCAR19) contains 4,466,526 lines:

```bash
wc -l data/interim/merged_corpus.txt
```
On a single laptop/PC, training a language model takes ages. You can take a sample from the corpus using the following command:

```bash
shuf -n 1000 data/interim/merged_corpus.txt > data/interim/sample1000.txt
```
Train the floret model:

```bash
../../opt/floret/floret cbow -dim 300 -minn 3 -maxn 6 -mode floret -hashCount 4 -bucket 50000 -input data/interim/merged_corpus.txt -output models/lms/hufloret_
```
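Once training finishes, the vectors can be queried for correction candidates. A sketch assuming the floret Python bindings (`pip install floret`), whose API mirrors fastText; the `.bin` path follows the output prefix of the command above:

```python
# Query the trained floret vectors for OOV correction candidates.
# Assumes the floret Python package; the API mirrors fastText's.
import floret

model = floret.load_model("models/lms/hufloret_.bin")

# Subword hashing lets floret build vectors even for misspelled/OOV words,
# so nearest neighbours can serve as correction candidates.
for score, word in model.get_nearest_neighbors("macsak", k=5):
    print(f"{score:.3f}\t{word}")
```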
First things first, we have to clean up our corpus so that it contains only characters of the Hungarian alphabet. Run `src/data_tasks/preprocess_merged.py`. We need a vocabulary file too; generate it by running `src/data_tasks/get_unigrams.py`.
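We don't reproduce the repo scripts here, but this is roughly what the two steps amount to; the allowed character set and the vocabulary file name are assumptions, and the real logic lives in the scripts above:

```python
# Hypothetical sketch of preprocess_merged.py + get_unigrams.py combined.
import re
from collections import Counter

# Assumption: keep only lowercase Hungarian letters and spaces.
NOT_HUNGARIAN = re.compile(r"[^a-záéíóöőúüű ]")

counts = Counter()
with open("data/interim/merged_corpus.txt", encoding="utf-8") as src, \
     open("data/interim/merged_corpus_cleaned.txt", "w", encoding="utf-8") as dst:
    for line in src:
        cleaned = NOT_HUNGARIAN.sub(" ", line.lower())
        cleaned = " ".join(cleaned.split())  # collapse runs of whitespace
        if cleaned:
            dst.write(cleaned + "\n")
            counts.update(cleaned.split())

# Vocabulary: one word per line, most frequent first (file name is an assumption).
with open("data/interim/vocab.txt", "w", encoding="utf-8") as vocab_file:
    for word, _ in counts.most_common():
        vocab_file.write(word + "\n")
```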
Now, let's build a 4-gram model (`-o 4` sets the order):

```bash
../../opt/kenlm/build/bin/lmplz -o 4 < data/interim/merged_corpus_cleaned.txt > models/lms/hu_kenlm.arpa
```
You have to modify the resulting language model, since it doesn't contain a few special tokens required by Hugging Face's decoder. Run `src/data_tasks/post_process_kenml.py`.
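The usual issue, documented in Hugging Face's guide on boosting Wav2Vec2 with n-grams, is that `lmplz` emits a `<s>` unigram but no `</s>`. A sketch of that fix (the repo script may differ), writing to the corrected file name used below:

```python
# Sketch: add a </s> unigram mirroring <s> and bump the 1-gram count.
# Follows the fix described in the HF n-gram guide; the repo script may differ.
with open("models/lms/hu_kenlm.arpa") as fin, \
     open("models/lms/hu_kenlm_corrected.arpa", "w") as fout:
    eos_added = False
    for line in fin:
        if not eos_added and line.startswith("ngram 1="):
            count = int(line.strip().split("=")[1])
            fout.write(f"ngram 1={count + 1}\n")  # one extra unigram: </s>
        elif not eos_added and "<s>" in line:
            fout.write(line)
            fout.write(line.replace("<s>", "</s>"))  # reuse <s>'s probability
            eos_added = True
        else:
            fout.write(line)
```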
Let's make a smaller, binary version of the LM:

```bash
../../opt/kenlm/build/bin/build_binary models/lms/hu_kenlm_corrected.arpa models/lms/hu_kenlm.binary
```
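At inference time, the binary LM can be wired into CTC beam search via pyctcdecode (this is what `Wav2Vec2ProcessorWithLM` wraps). A sketch, again with a placeholder model ID:

```python
# Sketch: attach the binary KenLM model to CTC decoding with pyctcdecode.
# "user/wav2vec2-large-hu" is a placeholder model ID.
from transformers import AutoProcessor
from pyctcdecode import build_ctcdecoder

processor = AutoProcessor.from_pretrained("user/wav2vec2-large-hu")
vocab = processor.tokenizer.get_vocab()
labels = [token for token, _ in sorted(vocab.items(), key=lambda item: item[1])]

decoder = build_ctcdecoder(labels, kenlm_model_path="models/lms/hu_kenlm.binary")
# decoder.decode(logits) then turns the acoustic model's logits into text.
```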
WER is high while CER is low. What's the problem? Word segmentation is bad. Ideas to try:

- try the model without KenLM?
- remove whitespace from the output
- use wordninja to split words not found in the vocabulary, i.e. slice up long words (see the sketch after this list)
- use DeepPavlov, which can split up OOV words where possible -> is it possible to train a Hungarian version of it?
- use floret to correct out-of-vocabulary words
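A sketch of the wordninja idea with a custom Hungarian word list. wordninja expects a gzipped file with one word per line, sorted by descending frequency; note that its default tokenizer is ASCII-oriented, so accented characters may need extra handling. The vocabulary path is an assumption:

```python
# Sketch: segment glued-together STT output with wordninja.
import gzip
import shutil
import wordninja

# wordninja.LanguageModel expects a gzipped word list, one word per line,
# sorted by descending frequency; vocab.txt is an assumed path.
with open("data/interim/vocab.txt", "rb") as src, \
        gzip.open("data/interim/vocab.txt.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

lm = wordninja.LanguageModel("data/interim/vocab.txt.gz")
print(lm.split("nagyonhosszuszo"))  # e.g. -> ['nagyon', 'hosszu', 'szo']
```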