
Workflow Guide generic transformations

Konstantin Baierer edited this page Feb 9, 2022 · 3 revisions

Sometimes PAGE-XML annotations need special processing so that a workflow's processors interoperate properly. For example, a text-producing processor might fail to keep TextEquiv consistent between hierarchy levels, or it might be necessary to remove specific region types. Export and visualization also usually require repairing minor syntactic or semantic deficiencies, such as removing an empty ReadingOrder and dead @regionRefs, ensuring that each TextEquiv has a Unicode element, or fixing negative or floating-point coordinates. While this can always be done ad hoc with scripts, it often helps to formulate it as a proper workflow step via a processor CLI.
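To make the "ad hoc script" route concrete, here is a minimal sketch of one such repair: clamping negative coordinates to zero and rounding floating-point coordinates in `Coords/@points`. It uses only the Python standard library. The function names `fix_points` and `fix_coords` are hypothetical, the sample document is made up for illustration, and the 2019-07-15 PAGE namespace is an assumption about the files being processed.

```python
import xml.etree.ElementTree as ET

# Assumption: the files use the 2019-07-15 version of the PAGE namespace.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def fix_points(points: str) -> str:
    """Round each coordinate to an integer and clamp negatives to 0."""
    fixed = []
    for pair in points.split():
        x, y = pair.split(",")
        fixed.append(f"{max(0, round(float(x)))},{max(0, round(float(y)))}")
    return " ".join(fixed)


def fix_coords(root: ET.Element) -> int:
    """Repair every Coords/@points below root; return how many were changed."""
    changed = 0
    for coords in root.iter(f"{{{PAGE_NS}}}Coords"):
        old = coords.get("points", "")
        new = fix_points(old)
        if new != old:
            coords.set("points", new)
            changed += 1
    return changed


# Made-up sample page with one negative and two floating-point values.
SAMPLE = (
    f'<PcGts xmlns="{PAGE_NS}"><Page imageWidth="100" imageHeight="100">'
    '<TextRegion id="r1"><Coords points="-3,0 10.6,0 10.6,20.2 -3,20.2"/>'
    "</TextRegion></Page></PcGts>"
)

root = ET.fromstring(SAMPLE)
n_changed = fix_coords(root)
print(n_changed, root.find(f".//{{{PAGE_NS}}}Coords").get("points"))
```

A script like this works, but it sits outside the workflow: it has no notion of file groups and is not recorded as a processing step, which is the motivation for wrapping such repairs in a processor.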

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-page-transform | -P xsl page-remove-regions.xsl -P xslt-params "-s type=ImageRegion" | Many useful XSLTs come as preinstalled resources, but any XSL file can be passed. Specify mimetype if the output is no longer PAGE-XML. | ocrd-page-transform |
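For intuition about what removing regions of one type amounts to, the following sketch drops all elements of a given region type from a PAGE tree using the Python standard library. It is not the XSLT shipped with the processor and makes no claim about that stylesheet's exact behavior. The helper `remove_regions`, the sample document, and the 2019-07-15 PAGE namespace are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the files use the 2019-07-15 version of the PAGE namespace.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def remove_regions(root: ET.Element, region_type: str) -> int:
    """Remove every element named region_type; return the number removed."""
    tag = f"{{{PAGE_NS}}}{region_type}"
    removed = 0
    # Snapshot the parents first, because the tree is modified while looping.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == tag:
                parent.remove(child)
                removed += 1
    return removed


# Made-up sample page with one text region and one image region.
SAMPLE = (
    f'<PcGts xmlns="{PAGE_NS}"><Page imageWidth="100" imageHeight="100">'
    '<TextRegion id="r1"><Coords points="0,0 10,0 10,10 0,10"/></TextRegion>'
    '<ImageRegion id="r2"><Coords points="20,20 30,20 30,30 20,30"/></ImageRegion>'
    "</Page></PcGts>"
)

root = ET.fromstring(SAMPLE)
n_removed = remove_regions(root, "ImageRegion")
remaining = [el.get("id") for el in root.iter() if el.get("id")]
print(n_removed, remaining)
```

Note that this sketch leaves any ReadingOrder entries pointing at the removed regions in place, so dead @regionRefs would still need a separate cleanup.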

Notes
