Experiments in OCR for historical texts written in Arabic script.
- GNU Make
- GNU gawk
- mmv utility
- xmllint, part of the libxml2-utils package.
- A working Python3 environment
- pip, updated to the latest version.
- GNU parallel, for running kraken operations in parallel, which may be somewhat faster than kraken batched operations on multicore machines with lower core counts and no GPU.
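
Before installing anything, it can help to confirm that the tools listed above are on your PATH. The following is a minimal sketch, not part of the repository: the function name missing_tools is hypothetical, and the command names in the usage comment are assumptions (for example, GNU Make may be installed as gmake on some systems).

```shell
#!/bin/sh
# Print the name of every listed command that is not found on PATH.
# Returns non-zero if at least one command is missing.
# (missing_tools is a hypothetical helper, not part of this repository.)
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Example call; the command names are assumed from the requirements list:
# missing_tools make gawk mmv xmllint python3 pip parallel
```

An empty output means every named command was found. The check only tests for presence, not for versions or for GNU variants.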
make deps
The system is configured via environment variables set in a local, non-versioned file, ./config.
To point to a GPU, set, for example:
DEVICE=cuda:0
The default device is cpu.
NUM_THREADS is passed to kraken's ocr command. For a 4-core system, set:
NUM_THREADS=4
The default is 1.
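
As an illustration of the two settings above, the sketch below writes a ./config file containing both variables. It assumes that the file consists of plain NAME=value lines, which the examples above suggest but do not confirm. The helper name write_config is hypothetical; its fallback values mirror the documented defaults (cpu and 1).

```shell
#!/bin/sh
# Write a config file with DEVICE and NUM_THREADS settings.
# Usage: write_config PATH [DEVICE] [NUM_THREADS]
# (write_config is a hypothetical helper; the NAME=value file format is assumed.)
write_config() {
    path=$1
    device=${2:-cpu}     # documented default device
    threads=${3:-1}      # documented default thread count
    printf 'DEVICE=%s\nNUM_THREADS=%s\n' "$device" "$threads" > "$path"
}

# Example: GPU 0 with four threads.
# write_config ./config cuda:0 4
```

Note that the helper overwrites the target file, so any other settings already in ./config would need to be re-added by hand.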
make binarize-all
This will binarize all the images in data/fas, yielding image files ending in -bin.png.
Optionally, use the parallelized version of this target:
make binarize-all-par
make segment-all
This will segment all the binarized images in data/fas, yielding ALTO XML files ending in -seg.xml.
Optionally, use the parallelized version of this target:
make segment-all-par
Because the parallelized version runs multiple processes, the overhead of the initial load of the neural model is multiplied by the number of jobs, which defaults to the number of cores available on the machine (the default for GNU parallel). Experiment to determine whether parallelization is beneficial on your hardware. On a MacBook Pro (2019) the speedup is considerable.
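
One way to run that experiment is to time both variants of a target. The helper below is a rough sketch: time_cmd is a hypothetical name, it measures only whole seconds of wall-clock time, and it discards the command's output. It also assumes that each timed run actually does the work; if the output files from an earlier run are still present, make may consider the target up to date and finish immediately, so remove them first.

```shell
#!/bin/sh
# Run a command, print its wall-clock duration in whole seconds,
# and return the command's exit status.
# (time_cmd is a hypothetical helper, not part of this repository.)
time_cmd() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    status=$?
    end=$(date +%s)
    echo "$((end - start))"
    return "$status"
}

# Example comparison (clear previous outputs between the two runs):
# serial=$(time_cmd make segment-all)
# par=$(time_cmd make segment-all-par)
# echo "serial: ${serial}s, parallel: ${par}s"
```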
make ocr-all
This target will run kraken's OCR over the segmented images, again yielding ALTO XML files, this time containing <CONTENT> elements. The filenames of the output end in -rec.xml.
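
After the OCR step, a quick sanity check is to confirm that every segmentation file has a recognition counterpart. The sketch below assumes that the two files share the same stem and differ only in the -seg.xml and -rec.xml suffixes, which the description above implies but does not state outright. The function name missing_rec is hypothetical.

```shell
#!/bin/sh
# List segmentation files in a directory that have no matching
# recognition output (same stem, -rec.xml instead of -seg.xml).
# (missing_rec is a hypothetical helper; the stem pairing is assumed.)
missing_rec() {
    dir=$1
    for seg in "$dir"/*-seg.xml; do
        [ -e "$seg" ] || continue          # glob matched nothing
        rec="${seg%-seg.xml}-rec.xml"
        [ -e "$rec" ] || echo "$seg"
    done
}

# Example:
# missing_rec data/fas
```

An empty output means no segmentation file is missing its recognition output under that naming assumption.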
Optionally, use the parallelized version of this target:
make ocr-all-par
The same caveats about parallelization apply.
make extract-gold-all
make create-eval-dirs
make eval-all
These final steps will construct the evaluation datasets and run programs in ./bin that yield a character accuracy report in report.txt.
To run the entire sequence, including installation of dependencies:
make go
And wait.
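
To run the steps one at a time and stop at the first failure, a wrapper along the following lines may be useful. This is a sketch only: run_pipeline is a hypothetical name, and the target order is assumed from the order in which the targets are described in this README, not read from the Makefile, so it may not match what make go does internally.

```shell
#!/bin/sh
# Run each pipeline target in sequence, stopping at the first failure.
# Set MAKE to override the make command (for example, gmake).
# (run_pipeline is a hypothetical helper; the target order is assumed.)
run_pipeline() {
    for target in deps binarize-all segment-all ocr-all \
                  extract-gold-all create-eval-dirs eval-all; do
        "${MAKE:-make}" "$target" || {
            echo "failed at: $target" >&2
            return 1
        }
    done
}

# Example:
# run_pipeline
```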