Koffair/colab_pipelines

WARNING: You can run the notebooks on Colab, but some of them require a subscription. Have a look at the src folder and run most of the jobs on a local machine. More details below.

Model

The trained model is available on 🤗

Data

Running the scripts

Scripts in the src folder are almost identical to the scripts in the notebooks. The main difference between the two versions is that the scripts use relative paths, while the notebooks contain absolute paths on Google Drive.

The other minor difference is that shell commands, such as unzipping and concatenating files, are not included in the scripts. Running command-line utilities like floret and KenLM is shown in special cells in the notebooks.

Build language models

Our merged corpus (nyest + OSCAR19) contains 4,466,526 lines:

wc -l data/interim/merged_corpus.txt

Training a language model on a single laptop/PC takes ages, so you may want to work with a sample of the corpus instead. The following command takes a random sample of 1,000 lines:

shuf -n 1000 data/interim/merged_corpus.txt > data/interim/sample1000.txt
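
If shuf is unavailable (e.g. on macOS without coreutils), the same sampling step can be sketched in Python with reservoir sampling; sample_lines below is a hypothetical helper, not a script from this repo:

```python
import random

def sample_lines(path, n, seed=0):
    """Keep a uniform random sample of n lines (reservoir sampling),
    so the whole corpus never has to fit in memory."""
    rng = random.Random(seed)
    reservoir = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i < n:
                reservoir.append(line)
            else:
                # replace a kept line with decreasing probability n/(i+1)
                j = rng.randrange(i + 1)
                if j < n:
                    reservoir[j] = line
    return reservoir
```

If the file has fewer than n lines, the whole file is returned.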

Training a floret model

../../opt/floret/floret cbow -dim 300 -minn 3 -maxn 6 -mode floret -hashCount 4 -bucket 50000 -input data/interim/merged_corpus.txt -output models/lms/hufloret_
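
The -mode floret -hashCount 4 -bucket 50000 flags make floret store subword vectors in a small hashed table ("Bloom embeddings"): each character n-gram is hashed several times into the same 50,000-row table and the resulting rows are combined. A toy illustration of that idea follows; the hash function and the pseudo-random "table rows" are illustrative, not floret's actual internals:

```python
import zlib
import random

def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams with boundary markers, as fastText/floret use."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(minn, maxn + 1)
            for i in range(len(w) - n + 1)]

def bucket_rows(gram, hash_count=4, buckets=50_000):
    # hash the same n-gram hash_count times into a table of `buckets` rows
    return [zlib.crc32(f"{gram}#{k}".encode("utf-8")) % buckets
            for k in range(hash_count)]

def row_vector(row, dim=300):
    # stand-in for a learned embedding-table row: deterministic pseudo-random values
    r = random.Random(row)
    return [r.uniform(-1.0, 1.0) for _ in range(dim)]

def embed(word, dim=300):
    # a word vector is the average of the rows all its n-grams hash to
    rows = [r for g in char_ngrams(word) for r in bucket_rows(g)]
    vectors = [row_vector(r, dim) for r in rows]
    return [sum(vals) / len(vectors) for vals in zip(*vectors)]
```

Because every n-gram maps into the shared 50,000-row table, the model stays small while still producing a vector for any word, including misspellings.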

Training a KenLM language model

First things first, we have to clean up the corpus so that it contains only characters of the Hungarian alphabet: run src/data_tasks/preprocess_merged.py. We also need a vocabulary file; generate it by running src/data_tasks/get_unigrams.py.
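
A minimal sketch of what those two steps might do (the authoritative logic lives in the scripts named above; the exact character set kept here is an assumption):

```python
import re
from collections import Counter

# Hypothetical approximation of the cleanup: keep lowercase Hungarian
# letters and spaces only.
KEEP = re.compile(r"[^a-záéíóöőúüű ]")

def clean_line(line):
    """Lowercase a line and drop everything outside the Hungarian alphabet."""
    return KEEP.sub("", line.lower()).strip()

def unigram_counts(lines):
    """Vocabulary with frequencies, roughly what a unigram script would emit."""
    counts = Counter()
    for line in lines:
        counts.update(clean_line(line).split())
    return counts
```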

Now, let's build a 4-gram model (the -o 4 flag sets the model order):

../../opt/kenlm/build/bin/lmplz -o 4 < data/interim/merged_corpus_cleaned.txt > models/lms/hu_kenlm.arpa

You have to modify the resulting language model, since it doesn't contain a few token types required by Hugging Face: run src/data_tasks/post_process_kenml.py. Then let's make a smaller, binary version of the LM:

../../opt/kenlm/build/bin/build_binary models/lms/hu_kenlm_corrected.arpa models/lms/hu_kenlm.binary
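
A common fix of this kind is adding an end-of-sentence unigram, since lmplz emits an ARPA file with <s> but no </s>, which Hugging Face's decoder expects. The sketch below is hypothetical; the repo's actual logic is in src/data_tasks/post_process_kenml.py:

```python
def add_eos(in_path, out_path):
    """Copy an ARPA file, inserting a </s> unigram modeled on the <s>
    entry and bumping the 1-gram count accordingly (hypothetical sketch)."""
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        added = False
        for line in fin:
            if line.startswith("ngram 1="):
                # one more unigram will be in the file
                count = int(line.strip().split("=")[1])
                fout.write(f"ngram 1={count + 1}\n")
            elif not added and "<s>" in line:
                # duplicate the <s> entry as </s>
                fout.write(line)
                fout.write(line.replace("<s>", "</s>", 1))
                added = True
            else:
                fout.write(line)
```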

TODOs

WER is high while CER is low. What's the problem? Word segmentation is bad.

  • try the model without KenLM?
  • remove spurious whitespace from the output
  • use wordninja to slice up words not found in the vocabulary, i.e. slice up long words
  • use DeepPavlov, which can split up out-of-vocabulary words -> is it possible to train a Hungarian version of DeepPavlov?
  • use floret to correct out-of-vocabulary words
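
The "slice up long words" idea can be illustrated with a toy dynamic-programming segmenter over a vocabulary (wordninja does something similar, weighted by word frequency); the vocabulary and helper below are purely illustrative:

```python
def segment(word, vocab):
    """Split word into in-vocabulary pieces by dynamic programming,
    preferring the fewest pieces; returns None if no full split exists."""
    best = [None] * (len(word) + 1)   # best[i] = fewest-piece split of word[:i]
    best[0] = []
    for i in range(1, len(word) + 1):
        for j in range(i):
            piece = word[j:i]
            if best[j] is not None and piece in vocab:
                candidate = best[j] + [piece]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[len(word)]

# toy vocabulary, purely illustrative
vocab = {"nagy", "mama", "kert", "je"}
segment("nagymama", vocab)  # -> ["nagy", "mama"]
```

A frequency-weighted cost instead of piece count would bring this closer to what wordninja actually does.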
