ocr-arabic-script

Experiments in OCR for historical texts written in Arabic script.

Prerequisites

GNU Make
GNU gawk
mmv utility
xmllint, part of the libxml2-utils package.
A working Python3 environment
pip, updated to the latest version.
GNU parallel, for running kraken operations in parallel, which may be somewhat faster than kraken batched operations on multicore machines with lower core counts and no GPU.
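
Before installing, it can help to confirm that the tools listed above are actually on the PATH. The following is a minimal sketch, not part of the repository; the helper name check_prereqs is hypothetical, and the command names passed to it (for example pip versus pip3) are assumptions that may differ on your system.

```shell
# Report which of the given commands are missing from PATH.
# Hypothetical helper, not part of this repository.
check_prereqs() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    # Returns 0 when every command was found, 1 otherwise.
    return "$missing"
}
```

For example, `check_prereqs make gawk mmv xmllint python3 pip parallel` prints one line per missing tool and returns a non-zero status if any are absent.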

Installation

make deps

Configuration

The system is configured via environment variables set in a local, non-versioned file, ./config.

PyTorch device

To use a GPU, set, for example:

DEVICE=cuda:0

The default device is cpu.

Number of threads for OCR step

This parameter is passed to kraken's ocr command. For a 4-core system, set, for example:

NUM_THREADS=4

The default is 1.
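
Putting the two settings together, a complete ./config for a 4-core machine with one GPU might look like the sketch below. The values are only examples, and the comment line assumes the file is read in a way that accepts # comments (as both Make includes and shell sourcing do); delete it if that turns out not to be the case.

```shell
# ./config: local settings, not under version control (example values)
DEVICE=cuda:0
NUM_THREADS=4
```

On a machine without a GPU, leaving DEVICE unset falls back to the cpu default described above.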

Test Runs

Binarization

make binarize-all

This will binarize all the images in data/fas, yielding image files ending in -bin.png.

Optionally, use the parallelized version of this target:

make binarize-all-par

Segmentation

make segment-all

This will segment all the binarized images in data/fas, yielding ALTO XML files ending in -seg.xml.

Optionally, use the parallelized version of this target:

make segment-all-par

Because the parallelized version runs multiple processes, the overhead of initially loading the neural model is multiplied by the number of processes, which by default is the number of cores available on the machine (the GNU parallel default). Experiment to determine whether parallelization is beneficial on your hardware. On a MacBook Pro (2019) the speedup is considerable.
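
One way to run that experiment is to time each variant of the target. The helper below is a rough sketch and not part of the repository; the name time_cmd is hypothetical, and it only measures wall-clock time in whole seconds.

```shell
# Run a command and print how many wall-clock seconds it took.
# Hypothetical helper; resolution is whole seconds.
time_cmd() {
    start=$(date +%s)
    status=0
    "$@" || status=$?
    end=$(date +%s)
    echo "$* took $((end - start))s (exit status $status)"
    return "$status"
}
```

For example, run `time_cmd make segment-all`, then `time_cmd make segment-all-par`, and compare the two figures. Make may skip work whose outputs already exist, so the generated -seg.xml files probably need to be removed between the two runs for the comparison to be fair.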

Recognition

make ocr-all

This target will run kraken's OCR over the segmented images, again yielding ALTO XML files, this time with the recognized text in the CONTENT attributes of the String elements. The output filenames end in -rec.xml.

Optionally, use the parallelized version of this target:

make ocr-all-par

The same caveats about model-loading overhead apply as for segmentation.

Evaluation

make extract-gold-all
make create-eval-dirs
make eval-all

These final steps construct the evaluation datasets and run programs in ./bin that write a character accuracy report to report.txt.

Everything

To run the entire sequence, including installation of dependencies:

make go

And wait.
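
To resume after a failure, or to see which stage is slow, the same steps can be run one target at a time. The sketch below assumes that make go simply chains the targets documented above in this order; that is an assumption, and the Makefile is the authority. The function name run_pipeline and the MAKE override are hypothetical conveniences, not part of the repository.

```shell
# Run the documented targets one at a time, stopping at the first failure.
# Assumption: this order mirrors what `make go` does; check the Makefile.
run_pipeline() {
    for target in deps binarize-all segment-all ocr-all \
                  extract-gold-all create-eval-dirs eval-all; do
        echo "==> make $target"
        # MAKE may be set to use a different make binary (defaults to make).
        "${MAKE:-make}" "$target" || return 1
    done
}
```

Calling `run_pipeline` from the repository root prints each target name before running it, so the last line shown before an error identifies the stage that failed.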
