A set of files used to train tesseract to read Georgian Mkhedruli script.
- kat.word.bigrams.clean: A list of the 40,000 most frequent word bigrams in the text of the Georgian-language Wikipedia, in descending order of frequency
- kat.wordlist.clean: A list of all unique words in the text of the Georgian-language Wikipedia, in descending order of frequency.
- kat.unicharambigs: A manually generated file of character ambiguities following tesseract's unicharambigs format
- kat.training_text: A text to use with tesstrain.sh or text2image to generate training images for tesseract.
- font_properties: A manually generated file containing the attributes of the fonts that I used to train tesseract, in tesseract's font_properties format
- count_stuff: Folder of Python scripts for generating bigrams and wordlists
- README.md: This README document
Steps I took to generate the training files
kat.training_text is based on the text available in the langdata repository, with the following manual modifications:
- As noted here, the text in the langdata repository contains characters from archaic Georgian scripts that are not used in modern Georgian. These characters have been removed.
- Several samples of the numero character (№), which is fairly frequent in modern Georgian texts, have been added.
- Examples of Roman numerals have been added; these are often used in Georgian texts for ordinal numbers.
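The first modification above, removing archaic characters, could be reproduced with a short filter along these lines. This is only a sketch: the exact set of characters that was removed is not recorded here, so the code assumes that "modern Georgian" means the 33 Mkhedruli letters U+10D0 to U+10F0 and that everything else in the listed Georgian ranges should go. The name strip_archaic is hypothetical.

```python
import re

# Assumption: the modern alphabet is the 33 Mkhedruli letters U+10D0..U+10F0.
# The ranges below are treated as "archaic" for the purposes of this sketch:
#   U+10A0..U+10CF  Asomtavruli capitals (plus unassigned code points)
#   U+10F1..U+10FA  Mkhedruli letters outside the modern Georgian alphabet
#   U+2D00..U+2D2F  Nuskhuri (Georgian Supplement block)
ARCHAIC = re.compile("[\u10A0-\u10CF\u10F1-\u10FA\u2D00-\u2D2F]")


def strip_archaic(text: str) -> str:
    """Return text with the assumed archaic Georgian characters removed."""
    return ARCHAIC.sub("", text)
```

Characters outside these ranges, such as № and Latin letters used for Roman numerals, pass through unchanged.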
kat.wordlist.clean and kat.word.bigrams.clean were generated using a database dump from Wikipedia, roughly as follows:
- Download the latest Georgian database dump from Wikipedia: https://dumps.wikimedia.org/backup-index.html
- Run WikiExtractor.py to extract the Georgian text.
- Concatenate the output into a single file with:
  find <extraction_folder> -type f | xargs cat > kawikitext.txt
- Remove the remaining tags with:
  sed -i '/^<doc/ d' kawikitext.txt
  sed -i '/^<\/doc/ d' kawikitext.txt
- Extract words and/or bigrams from the Wikipedia text with:
  python count_stuff/wordcounts.py --count-what [words|bigrams] --clean --no-counts kawikitext.txt > [kat.wordlist.clean|kat.word.bigrams.clean.full]
- Limit the number of bigrams, which would otherwise be very large (~2 million), with:
  head -n 40000 kat.word.bigrams.clean.full > kat.word.bigrams.clean
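The tag removal, counting, and truncation steps above can be sketched in Python roughly as follows. This is not the code in count_stuff/wordcounts.py, whose cleaning rules are not described here. It assumes that a word is a run of modern Mkhedruli letters (U+10D0 to U+10F0) and that bigrams are adjacent words on the same line. The function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of modern Mkhedruli letters (U+10D0..U+10F0).
WORD = re.compile("[\u10D0-\u10F0]+")


def count_words_and_bigrams(lines):
    """Count words and adjacent word pairs in an iterable of text lines."""
    words, bigrams = Counter(), Counter()
    for line in lines:
        # Skip the <doc ...> and </doc> wrapper lines left by WikiExtractor.
        if line.startswith("<doc") or line.startswith("</doc"):
            continue
        tokens = WORD.findall(line)
        words.update(tokens)
        # Simplification: punctuation between two words is ignored, so
        # words on either side of a comma still form a bigram.
        bigrams.update(zip(tokens, tokens[1:]))
    return words, bigrams


def most_frequent(counter, limit=None):
    """Return items in descending order of frequency, without counts."""
    return [item for item, _ in counter.most_common(limit)]
```

With these helpers, most_frequent(bigrams, 40000) plays the role of the head -n 40000 step, and each bigram can be written out as its two words joined by a space.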
I selected freely licensed fonts covering monospace, serif, and sans-serif styles. In addition, several Georgian letters can be written with two different glyphs, so I made sure to include fonts that cover both variants (see here for details). A good selection of freely licensed, Unicode Georgian fonts is available from BPG InfoTech. Other fonts are available in various places, but note that many commonly used Georgian fonts, such as AcadNusx and LitNusx, map Georgian glyphs onto Latin letters, which makes them unsuitable for automatically generating training images.
Tesseract was trained using tesstrain.sh without any modifications (except manual application of this patch).
The specific command executed to train tesseract was:
./tesstrain.sh \
--bin_dir /usr/local/bin/ \
--fonts_dir /usr/share/fonts/ \
--lang kat \
--langdata_dir /home/pi/tesseract/kat_train/staging/ \
--output_dir /home/pi/tesseract/kat_train/output/ \
--training_text /home/pi/tesseract/kat_train/staging/kat.training_text \
--wordlist /home/pi/tesseract/kat_train/staging/kat.wordlist.clean \
--tessdata_dir /usr/local/share/tessdata \
--fontlist "BPG Chkoni+BPG Chveulebrivi GPL&GNU+BPG Classic Medium,+BPG Courier GPL&GNU+BPG DedaEna+BPG Elite GPL&GNU+BPG Glaho GPL&GNU+BPG Glaho Traditional Arial+BPG Lia+BPG Rioni+Sylfaen"
(Yes, this was done on a Raspberry Pi.)
The count_stuff.py script can theoretically also generate files containing punctuation and
numeral patterns, which tesstrain.sh can use to create DAWG files for punctuation and numbers.
However, I decided to forgo these files in order to simplify the first pass at training.
The results ended up being good enough that I haven't seen the need to add the punctuation and
number pattern files so far, so this feature of count_stuff.py may not work perfectly, or at all.
Copyright 2015, Derek Dohler. I do not claim any copyright over kat.wordlist.clean or kat.word.bigrams.clean. I claim copyright over only the alterations which I made to kat.training_text, and not over the remainder of the file. Licensed under the Apache License, Version 2.0 (the "License"); you may not use these files except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0