TemplateDrivenSegmentation

Robert Sachunsky edited this page Aug 27, 2020 · 4 revisions

Semi-fixed segmentation processor

Idea

Create a template segmentation (including some padding) with a PAGE editor like Aletheia or LAREX.

Then apply it to many similar pages by replacing @imageFilename or AlternativeImage/@filename with the respective input image.
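Swapping the image reference in a template PAGE-XML can be sketched with the standard library alone. (This is a minimal sketch: the namespace URI, file names and element content below are assumptions — real PAGE files carry a versioned PRImA namespace that must match the template's actual header.)

```python
import xml.etree.ElementTree as ET

# assumed PAGE namespace version; adjust to the template's actual header
NS = 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15'
ET.register_namespace('', NS)

TEMPLATE = f'''<?xml version="1.0"?>
<PcGts xmlns="{NS}">
  <Page imageFilename="template.png" imageWidth="2000" imageHeight="3000">
    <TextRegion id="region0"/>
  </Page>
</PcGts>'''

def retarget(template_xml, image_filename):
    """Return a copy of the template PAGE-XML pointing at another image."""
    root = ET.fromstring(template_xml)
    page = root.find(f'{{{NS}}}Page')
    page.set('imageFilename', image_filename)
    return ET.tostring(root, encoding='unicode')
```

Calling `retarget(TEMPLATE, 'scan_0007.png')` yields the same annotation, now referencing the new scan.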

The actual problem with realistic data: input images will always be slightly off to some degree, comprising elements of

  • translation due to horizontal/vertical offset of the scan or of the paper in the press/typewriter,
  • rotation due to skew of the scan or of the paper in the press/typewriter,
  • scaling due to photography instead of scanning, or photocopying of the original before scanning, or bad digital processing history,

– potentially all occurring together.

Thus, we need to align our template to the actual images.

Alignment

We could either approach this as a machine learning problem in itself (using neural networks to align input images with our template), or as an optimization task (using classical computer vision algorithms to find an affine transform aligning the two).

As a machine learning problem

  • CNN backbone computes a feature map for images
  • extra layers leading up to regression heads for shift, skew and scale operations
  • train on artificially misaligned (i.e. augmented) off-domain images, or on corrected in-domain images, or both
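The augmentation step above can be sketched as follows (a minimal numpy sketch; the parameter ranges and function name are assumptions): sample a random shift, skew and scale, build the corresponding affine matrix to warp a training image, and keep the sampled parameters as regression targets for the heads.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_misalignment(max_shift=20.0, max_angle=2.0, max_scale=0.05):
    """Sample (dx, dy, angle, scale) and the 3x3 affine matrix realising them.

    The matrix can be used to warp a well-aligned image into a misaligned
    training sample; the parameters double as the regression targets.
    """
    dx, dy = rng.uniform(-max_shift, max_shift, size=2)
    angle = np.deg2rad(rng.uniform(-max_angle, max_angle))
    scale = 1.0 + rng.uniform(-max_scale, max_scale)
    c, s = np.cos(angle), np.sin(angle)
    matrix = np.array([[scale * c, -scale * s, dx],
                       [scale * s,  scale * c, dy],
                       [0.0,        0.0,       1.0]])
    return (dx, dy, angle, scale), matrix
```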

As a classical optimization task

Since the template is merely a structural description, and input images have concrete substance to be abstracted from, we first need to transform both into a robust shared representation. Let's assume binary textline masks can do that for us:

  1. convert the template PAGE into a binary mask depicting where its textlines are
  2. binarize the input image, do a coarse textline detection, convert to a binary mask (and apply some morphological closing or dilation)
  3. find an affine transform which makes the input image's mask most similar (if not equal) to the template's mask:
    • via correlation method, namely:
      1. take a 2d FFT of both masks
      2. transform into log-polar coordinates
      3. find the phase with maximal cross-correlation in the spatial spectra (which yields the rotation angle from the φ-coordinate and the scale factor from the ρ-coordinate)
      4. compensate the input mask's rotation and scaling accordingly
      5. find the phase with maximal cross-correlation in the mask images (which yields the x-offset from the x-coordinate and the y-offset from the y-coordinate)
      6. compensate the input image's rotation, scaling and translation accordingly, then apply template annotation to it
    • via projection profiles analogously
    • via corner detection on the masks
    • via ORB on the masks

(Simplified example of the correlation method, without scaling and compensating for skew externally: Census project.)
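Steps 5–6 of the correlation method — recovering the translation once rotation and scale are already compensated — can be sketched in plain numpy as a phase correlation of the two masks (a sketch, not the Census project's implementation):

```python
import numpy as np

def phase_correlation(template_mask, input_mask):
    """Recover the (dy, dx) by which input_mask is shifted against template_mask.

    Computes the normalised cross-power spectrum, whose inverse FFT peaks
    at the translation (rotation and scale are assumed compensated already).
    """
    F_t = np.fft.fft2(template_mask)
    F_i = np.fft.fft2(input_mask)
    cross = np.conj(F_t) * F_i
    cross /= np.abs(cross) + 1e-12           # whiten: keep only the phase
    corr = np.fft.ifft2(cross).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = corr.shape                         # wrap offsets into signed range
    if dy > h // 2:
        dy -= h
    if dx > w // 2:
        dx -= w
    return int(dy), int(dx)

# toy check: a box mask shifted by (5, -3) relative to the template
a = np.zeros((64, 64)); a[20:30, 20:40] = 1
b = np.roll(np.roll(a, 5, axis=0), -3, axis=1)
print(phase_correlation(a, b))  # → (5, -3)
```

Compensating the input then amounts to translating it back by the negated offsets before applying the template annotation.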

Classification

Another aspect is matching against different templates to achieve page classification, including rejection. All alignment methods mentioned above readily support this by way of their alignment score (correlation / difference / ...). Precomputing distances between all the templates would help avoid unnecessary computational cost – again, cf. here.
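A minimal sketch of score-based classification with rejection (the IoU score, threshold value and function name are assumptions; any of the alignment scores above can be substituted):

```python
import numpy as np

def classify(input_mask, templates, reject_below=0.5):
    """Pick the template whose (already aligned) mask best overlaps the input.

    Returns (name, scores); name is None when even the best score falls
    under the rejection threshold. Score here is intersection-over-union.
    """
    def iou(a, b):
        inter = np.logical_and(a, b).sum()
        union = np.logical_or(a, b).sum()
        return inter / union if union else 0.0
    scores = {name: iou(input_mask, mask) for name, mask in templates.items()}
    best = max(scores, key=scores.get)
    return (best if scores[best] >= reject_below else None), scores
```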
