TemplateDrivenSegmentation
Create a template segmentation (including some padding) with a PAGE editor like Aletheia or LAREX.
Then apply it to many similar pages by setting @imageFilename or AlternativeImage/@filename to the respective input image.
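The retargeting step can be sketched with Python's standard library. This is a minimal sketch, not a definitive implementation: the function name `retarget_page` and its parameters are hypothetical, and it assumes the template is a PAGE-XML document whose root element has a `Page` child carrying `imageFilename`.

```python
import xml.etree.ElementTree as ET


def retarget_page(template_xml, image_filename, alternative_filename=None):
    """Return a copy of the template PAGE-XML that points at another image.

    Hypothetical helper: sets Page/@imageFilename and, if requested,
    AlternativeImage/@filename on every AlternativeImage element.
    """
    root = ET.fromstring(template_xml)
    # Derive the namespace from the root tag, so any PAGE schema version works.
    ns = root.tag[1:].split("}")[0] if root.tag.startswith("{") else ""
    prefix = "{%s}" % ns if ns else ""
    if ns:
        # Serialize with a default namespace instead of generated ns0: prefixes.
        ET.register_namespace("", ns)
    page = root.find(prefix + "Page")
    if page is None:
        raise ValueError("template has no Page element")
    page.set("imageFilename", image_filename)
    if alternative_filename is not None:
        for alt in root.iter(prefix + "AlternativeImage"):
            alt.set("filename", alternative_filename)
    return ET.tostring(root, encoding="unicode")
```

Calling this once per input image yields one PAGE file per page, all sharing the template's regions and lines. Without alignment, those coordinates only fit as well as the images happen to match the template.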
The actual problem with realistic data: input images will always be slightly off to some degree, combining elements of
- translation due to horizontal/vertical offset of the scan or of the paper in the press/typewriter,
- rotation due to skew of the scan or of the paper in the press/typewriter,
- scaling due to photography instead of scanning, or photocopying of the original before scanning, or bad digital processing history,
potentially all occurring together.
Thus, we need to align our template to the actual images.
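To make the three components concrete, here is a minimal NumPy sketch that combines translation, rotation and uniform scaling into one homogeneous matrix and applies it to template coordinates. The function names and the order of operations (scale, then rotate about the origin, then translate) are assumptions chosen for illustration.

```python
import numpy as np


def similarity_matrix(dx, dy, angle_deg, scale):
    """3x3 homogeneous matrix: scale, then rotate about the origin, then translate.

    In image coordinates (y pointing down), a positive angle appears
    clockwise on screen.
    """
    a = np.deg2rad(angle_deg)
    c, s = scale * np.cos(a), scale * np.sin(a)
    return np.array([[c, -s, dx],
                     [s, c, dy],
                     [0.0, 0.0, 1.0]])


def apply_to_points(matrix, points):
    """Map an (N, 2) sequence of x/y points through a 3x3 homogeneous matrix."""
    pts = np.asarray(points, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    return (homog @ matrix.T)[:, :2]


# Example: move a template polygon by a small offset, skew and scale.
polygon = [(100, 200), (900, 200), (900, 260), (100, 260)]
moved = apply_to_points(similarity_matrix(12, -5, 0.8, 1.02), polygon)
```

Once such a matrix has been estimated, one can either warp the input image with its inverse, or leave the image untouched and map the template's coordinates instead.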
We could approach this either as a machine learning problem in itself (using neural networks to align input images with our template), or as an optimization task (using classical computer vision algorithms to find an affine transform aligning the two).
As a machine learning problem:
- a CNN backbone computes a feature map for the images
- extra layers lead up to regression heads for the shift, skew and scale operations
- train on artificially misaligned (i.e. augmented) off-domain images, or on corrected in-domain images, or both
As an optimization task: since the template is merely a structural description, whereas the input images contain concrete content that has to be abstracted away, we first need to transform both into a robust shared representation. Let's assume binary textline masks can do that for us:
- convert the template PAGE into a binary mask depicting where its textlines are
- binarize the input image, do a coarse textline detection, convert to a binary mask (and apply some morphological closing or dilation)
- find an affine transform which makes the input image's mask most similar (if not equal) to the template's mask:
  - via correlation method, namely:
    - take a 2D FFT of both masks
    - transform the spectra into log-polar coordinates
    - find the peak of the cross-correlation between the two log-polar spectra (its φ-coordinate yields the rotation angle, its ρ-coordinate the scale factor)
    - compensate the input mask's rotation and scaling accordingly
    - find the peak of the cross-correlation between the mask images (its x-coordinate yields the x-offset, its y-coordinate the y-offset)
    - compensate the input image's rotation, scaling and translation accordingly, then apply template annotation to it
  - via projection profiles analogously
  - via corner detection on the masks
  - via ORB on the masks
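The translation step of the correlation method can be sketched with NumPy's FFT. This is a minimal sketch under stated assumptions: both masks already have the same shape, rotation and scaling have been compensated beforehand, and the shift is treated as circular. The function name `phase_correlation_shift` is hypothetical. The log-polar step for rotation and scale is not covered here.

```python
import numpy as np


def phase_correlation_shift(template_mask, input_mask):
    """Estimate how far input_mask is shifted relative to template_mask.

    Returns (dy, dx, score); score is the height of the correlation peak
    and could serve as an alignment score.
    """
    f_t = np.fft.fft2(np.asarray(template_mask, dtype=float))
    f_i = np.fft.fft2(np.asarray(input_mask, dtype=float))
    # Normalized cross-power spectrum: keep only the phase difference.
    cross = f_i * np.conj(f_t)
    cross /= np.maximum(np.abs(cross), 1e-12)
    corr = np.fft.ifft2(cross).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Peaks beyond the midpoint correspond to negative shifts (wrap-around).
    dy, dx = (int(p) if p <= n // 2 else int(p) - n
              for p, n in zip(peak, corr.shape))
    return dy, dx, float(corr[peak])


# Synthetic check: three "textlines", shifted by 5 rows down and 7 columns left.
template = np.zeros((64, 64))
template[10:13, 8:51] = 1
template[20:23, 8:41] = 1
template[30:33, 8:33] = 1
shifted = np.roll(template, (5, -7), axis=(0, 1))
dy, dx, score = phase_correlation_shift(template, shifted)
```

To compensate, the input mask (and the input image) would be shifted by `(-dy, -dx)`. On real scans the peak will be less sharp than in this synthetic case, which is why the masks are smoothed by closing or dilation first.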
(For a simplified example of the correlation method, which ignores scaling and compensates for skew externally, see the Census project.)
Another aspect is matching against several templates to achieve page classification, including rejection. Any of the alignment methods mentioned above supports this by way of its alignment score (correlation / difference / ...). Precomputing distances between all the templates would help avoid unnecessary computational cost; again, cf. here.
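Classification with rejection can be sketched as picking the best-scoring template and refusing the page if even that score is too low. This is a minimal sketch: the function name, the template names and the threshold are hypothetical, and scores are assumed to be "higher is better" (as with a correlation peak). A difference-based score would need the comparison reversed.

```python
def classify_page(scores, reject_below):
    """Return the name of the best-aligning template, or None to reject.

    scores: dict mapping template name to alignment score (higher is better).
    reject_below: hypothetical threshold; pages whose best score falls
    below it match no template.
    """
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= reject_below else None


# Example with made-up scores for two templates.
label = classify_page({"letter": 0.82, "form": 0.31}, reject_below=0.5)
```

A suitable threshold depends on the alignment method and the data, so it would have to be calibrated on pages with known template assignments.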