Koffair/colab_pipelines

WARNING: You can run the notebooks on Colab, but some of them require a subscription. Have a look at the src folder and run most of the jobs on a local machine. More details below.

Model

The trained model is available on 🤗

Data

Running the scripts

Scripts in the src folder are almost identical to the scripts in the notebooks. The main difference between the two versions is that the scripts use relative paths, while the notebooks contain absolute paths on Google Drive.

The other minor difference is that shell commands, such as unzipping and concatenating files, are not included in the scripts. Running command-line utilities like floret and KenLM is shown in special cells in the notebooks.

Build language models

Our merged corpus (nyest + OSCAR19) contains 4,466,526 lines:

wc -l data/interim/merged_corpus.txt

Training a language model on a single laptop/PC takes ages, so you may want to work with a sample of the corpus instead. The following command takes a random sample of 1,000 lines:

shuf -n 1000 data/interim/merged_corpus.txt > data/interim/sample1000.txt
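
If shuf is unavailable (e.g. on macOS without coreutils), the same sampling step can be sketched in Python with reservoir sampling; sample_lines below is a hypothetical helper, not a script from this repo:

```python
import random

def sample_lines(path, n, seed=0):
    """Keep a uniform random sample of n lines (reservoir sampling),
    so the whole corpus never has to fit in memory."""
    rng = random.Random(seed)
    reservoir = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i < n:
                reservoir.append(line)
            else:
                # replace a kept line with decreasing probability n/(i+1)
                j = rng.randrange(i + 1)
                if j < n:
                    reservoir[j] = line
    return reservoir
```

If the file has fewer than n lines, the whole file is returned.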

Training a floret model

../../opt/floret/floret cbow -dim 300 -minn 3 -maxn 6 -mode floret -hashCount 4 -bucket 50000 -input data/interim/merged_corpus.txt -output models/lms/hufloret_
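
The -mode floret -hashCount 4 -bucket 50000 flags make floret store subword vectors in a small hashed table ("Bloom embeddings"): each character n-gram is hashed several times into the same 50,000-row table and the resulting rows are combined. A toy illustration of that idea follows; the hash function and the pseudo-random "table rows" are illustrative, not floret's actual internals:

```python
import zlib
import random

def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams with boundary markers, as fastText/floret use."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(minn, maxn + 1)
            for i in range(len(w) - n + 1)]

def bucket_rows(gram, hash_count=4, buckets=50_000):
    # hash the same n-gram hash_count times into a table of `buckets` rows
    return [zlib.crc32(f"{gram}#{k}".encode("utf-8")) % buckets
            for k in range(hash_count)]

def row_vector(row, dim=300):
    # stand-in for a learned embedding-table row: deterministic pseudo-random values
    r = random.Random(row)
    return [r.uniform(-1.0, 1.0) for _ in range(dim)]

def embed(word, dim=300):
    # a word vector is the average of the rows all its n-grams hash to
    rows = [r for g in char_ngrams(word) for r in bucket_rows(g)]
    vectors = [row_vector(r, dim) for r in rows]
    return [sum(vals) / len(vectors) for vals in zip(*vectors)]
```

Because every n-gram maps into the shared 50,000-row table, the model stays small while still producing a vector for any word, including misspellings.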

Training a KenLM language model

First things first, we have to clean up the corpus so that it contains only characters of the Hungarian alphabet: run src/data_tasks/preprocess_merged.py. We also need a vocabulary file; generate it by running src/data_tasks/get_unigrams.py.
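
A minimal sketch of what those two steps might do (the authoritative logic lives in the scripts named above; the exact character set kept here is an assumption):

```python
import re
from collections import Counter

# Hypothetical approximation of the cleanup: keep lowercase Hungarian
# letters and spaces only.
KEEP = re.compile(r"[^a-záéíóöőúüű ]")

def clean_line(line):
    """Lowercase a line and drop everything outside the Hungarian alphabet."""
    return KEEP.sub("", line.lower()).strip()

def unigram_counts(lines):
    """Vocabulary with frequencies, roughly what a unigram script would emit."""
    counts = Counter()
    for line in lines:
        counts.update(clean_line(line).split())
    return counts
```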

Now, let's build a 4-gram model (the -o 4 flag sets the model order):

../../opt/kenlm/build/bin/lmplz -o 4 < data/interim/merged_corpus_cleaned.txt > models/lms/hu_kenlm.arpa

You have to modify the resulting language model, since it doesn't contain a few token types required by Hugging Face: run src/data_tasks/post_process_kenml.py. Then let's make a smaller, binary version of the LM:

../../opt/kenlm/build/bin/build_binary models/lms/hu_kenlm_corrected.arpa models/lms/hu_kenlm.binary
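
A common fix of this kind is adding an end-of-sentence unigram, since lmplz emits an ARPA file with <s> but no </s>, which Hugging Face's decoder expects. The sketch below is hypothetical; the repo's actual logic is in src/data_tasks/post_process_kenml.py:

```python
def add_eos(in_path, out_path):
    """Copy an ARPA file, inserting a </s> unigram modeled on the <s>
    entry and bumping the 1-gram count accordingly (hypothetical sketch)."""
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        added = False
        for line in fin:
            if line.startswith("ngram 1="):
                # one more unigram will be in the file
                count = int(line.strip().split("=")[1])
                fout.write(f"ngram 1={count + 1}\n")
            elif not added and "<s>" in line:
                # duplicate the <s> entry as </s>
                fout.write(line)
                fout.write(line.replace("<s>", "</s>", 1))
                added = True
            else:
                fout.write(line)
```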

TODOs

WER is high while CER is low. What's the problem? Word segmentation is bad.

  • try the model without KenLM?
  • remove spurious whitespace from the output
  • use wordninja to slice up words not found in the vocabulary, i.e. slice up long words
  • use DeepPavlov, which can split up out-of-vocabulary words -> is it possible to train a Hungarian version of DeepPavlov?
  • use floret to correct out-of-vocabulary words
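
The "slice up long words" idea can be illustrated with a toy dynamic-programming segmenter over a vocabulary (wordninja does something similar, weighted by word frequency); the vocabulary and helper below are purely illustrative:

```python
def segment(word, vocab):
    """Split word into in-vocabulary pieces by dynamic programming,
    preferring the fewest pieces; returns None if no full split exists."""
    best = [None] * (len(word) + 1)   # best[i] = fewest-piece split of word[:i]
    best[0] = []
    for i in range(1, len(word) + 1):
        for j in range(i):
            piece = word[j:i]
            if best[j] is not None and piece in vocab:
                candidate = best[j] + [piece]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[len(word)]

# toy vocabulary, purely illustrative
vocab = {"nagy", "mama", "kert", "je"}
segment("nagymama", vocab)  # -> ["nagy", "mama"]
```

A frequency-weighted cost instead of piece count would bring this closer to what wordninja actually does.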
