
# Central Kurdish Neural Spell Corrector

[🔥 Best model] [📀 Models] [🤗 Demo]

Note: The documentation for this project is currently being written. I am working hard to make this project easily hackable so people can add new heuristics and train more models.

This repository contains a collection of neural spell correctors for the Central Kurdish language. These models have been trained on an extensive corpus of synthetically generated data and can correct a wide range of spelling errors, including typos and grammatical errors.

Using various heuristics, we generate a rich dataset by mapping sequences containing misspellings to their correct counterparts. Errors are injected by randomly inserting valid characters; deleting characters or patterns; substituting characters with random ones or their keyboard neighbors; swapping two adjacent characters; shuffling sentences; and replacing specific predefined patterns with targeted alternatives.
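For illustration, here is a minimal Python sketch of such a character-level distorter. The function name, the alphabet string, and the `distortion_ratio` parameter are assumptions for this example; the actual implementation lives in prepare_data (see the get_text_distorter function mentioned below).

```python
import random

# Illustrative subset of the Central Kurdish alphabet (an assumption for
# this sketch, not the repo's actual character inventory).
KURDISH_CHARS = "ئابپتجچحخدرڕزژسشعغفڤقکگلڵمنهەوۆیێ"

def distort(sentence: str, distortion_ratio: float = 0.1) -> str:
    """Randomly corrupt roughly `distortion_ratio` of the characters."""
    chars = list(sentence)
    n_edits = max(1, int(len(chars) * distortion_ratio))
    for _ in range(n_edits):
        op = random.choice(["insert", "delete", "substitute", "swap"])
        i = random.randrange(len(chars))
        if op == "insert":
            chars.insert(i, random.choice(KURDISH_CHARS))
        elif op == "delete" and len(chars) > 1:
            chars.pop(i)
        elif op == "substitute":
            chars[i] = random.choice(KURDISH_CHARS)
        elif op == "swap" and i < len(chars) - 1:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

# Each training pair then maps the distorted sentence back to the original:
# (distort(clean_sentence), clean_sentence)
```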

## Experiments

The error injection framework in prepare_data offers a method to inject errors according to a distortion ratio. I conducted the following experiments to determine the optimal ratio that allows the model to achieve the lowest Word Error Rate (WER) and Character Error Rate (CER) on the synthetic test set.

| Model Name | Dataset Distortion | CER | WER |
|---|---|---|---|
| bart-base | 5% | 5.39% | 34.73% |
| bart-base | 10% | 2.15% | 11.19% |
| bart-base | Mixed (5% + 10%) | 1.54% | 8.31% |
| bart-base | 15% | 2.17% | 12.3% |
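For reference, WER and CER can be computed with the Hugging Face `evaluate` library (backed by `jiwer`). This is only a sketch of the metrics, and may differ from what eval.sh actually runs:

```python
import evaluate  # pip install evaluate jiwer

wer = evaluate.load("wer")
cer = evaluate.load("cer")

# Placeholder data; in practice these come from the synthetic test set.
predictions = ["model output sentence"]
references = ["gold corrected sentence"]

print("WER:", wer.compute(predictions=predictions, references=references))
print("CER:", cer.compute(predictions=predictions, references=references))
```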

## Evaluation on ASOSOFT Spelling Benchmark

The benchmark for this project is designed exclusively for single-word spelling corrections. The script create_asosoft_benchmark.py processes each word from the Amani dataset by searching for sentences containing the correct spelling, checking that the sentence is not included in train.csv, and replacing the correct word with the provided misspelling (a sketch of this procedure follows the results table below). This is a hacky way to get a gold-standard benchmark. The current best-performing model achieves the following results:

| Metric | Value |
|---|---|
| CER | 9.6545 |
| WER | 21.7558 |
| BLEU | 68.1724 |
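Here is the promised sketch of the benchmark-construction procedure. The function and argument names are hypothetical; see create_asosoft_benchmark.py for the real implementation.

```python
def build_benchmark(amani_pairs, corpus_sentences, train_sentences):
    """amani_pairs: iterable of (misspelling, correct_word) tuples."""
    train_set = set(train_sentences)
    # Only use sentences the model has never seen during training.
    held_out = [s for s in corpus_sentences if s not in train_set]
    rows = []
    for misspelling, correct in amani_pairs:
        for sentence in held_out:
            if correct in sentence.split():
                # Inject the known misspelling into an unseen sentence
                # (replaces every occurrence of the word in the sentence).
                rows.append((sentence.replace(correct, misspelling), sentence))
                break
    return rows  # (distorted_sentence, gold_sentence) pairs
```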

## Evaluation on Sorani Script Normalization Benchmark

The final generated dataset is also concatenated with the training dataset from the Script Normalization for Unconventional Writing project. Therefore, the model not only corrects spelling but also normalizes unconventional writing. "Unconventional writing" means using the writing system of one language to write in another language.

They also employ a similar approach to generate their data. However, it is not wise to evaluate a model on the synthetic test set, since the model can memorize the underlying patterns from the training set. Hence, they provide a gold-standard benchmark for Central Kurdish and use BLEU & chrF to measure the performance of their model.

| Model | BLEU | chrF |
|---|---|---|
| Script Normalization | 12.7 | 69.6 |
| Bart-kurd-spell-base | 13.8 | 73.9 |

Keep in mind that both of these models have seen the same data for script normalization; ours performs slightly better due to the additional spell-correction data (the scoring itself is sketched below).
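Both metrics can be computed with the `sacrebleu` library; the hypotheses and references below are placeholders:

```python
import sacrebleu  # pip install sacrebleu

hypotheses = ["normalized model output"]
references = [["gold normalized sentence"]]  # one list per reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")
```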

## Train a New Model

Since the problem is framed as mapping a sequence containing misspellings to a correct sequence, we can train different encoder-decoder models such as T5.

1. Run train_tokenizer.py with the --tokenizer_name argument to build a tokenizer for your chosen model.
2. Create data.txt and put it in the data dir. Check inspect_data.ipynb.
3. Check the arguments of prepare_data/process_data.py and run it to get train.csv and test.csv.
4. Change the arguments in train.sh if you want to train a model other than BART. In case you want to train T5, you need to add --source_prefix "correct: " (the sketch after this list shows the matching prefix at inference time).
5. Evaluate the model on both data/asosoft_benchmark.csv and data/Sorani-Arabic.csv using eval.sh.
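Once training finishes, a minimal inference sketch with `transformers` looks like the following. The checkpoint path is a placeholder, and the commented prefix applies only to T5 models trained with --source_prefix "correct: ":

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("path/to/checkpoint")
model = AutoModelForSeq2SeqLM.from_pretrained("path/to/checkpoint")

text = "a possibly misspelled sentence"
# For a T5 model, prepend the training prefix: text = "correct: " + text
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```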

## Observations

Different heuristics could be added to the pipeline, for example replacing ر at the start of every word with ڕ, or replacing ك with ک. Both of these errors occur quite often in Central Kurdish texts online, but neither needs to be learned from data; it is more practical to address them with rule-based solutions such as KLPT.
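As a sketch, such rules could be applied with a couple of regular expressions (KLPT remains the more robust choice):

```python
import re

RULES = [
    (re.compile(r"\bر"), "ڕ"),  # word-initial ر is written ڕ in Central Kurdish
    (re.compile("ك"), "ک"),     # Arabic kaf -> the kaf used in Kurdish
]

def normalize(text: str) -> str:
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```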

If you can think of more heuristics, they can easily be added to the pipeline in the get_text_distorter function.

PRs with additional models, evaluation, or data generation heuristics are welcome! 👍

