fix/discuss recommended workflows #172

Open · Tracked by #337
bertsky opened this issue Oct 6, 2020 · 26 comments

bertsky commented Oct 6, 2020

I am surprised to see the following in our current recommendations:

  • Ocropy nlbin instead of one of the Olena algorithms
  • slow skimage binarize/denoise processors instead of Olena/Ocropy
  • only Tesseract deskewing (no Ocropy)
  • region clipping after Ocropy page segmentation (not necessary)
  • Ocropy line segmentation after Ocropy page segmentation (redundant)
  • line clipping after Ocropy line segmentation (not necessary)
  • Tesseract line segmentation without resegmentation or line clipping (to remove bbox overlaps)
  • Tesseract vs Calamari recognition (should be exchangeable regardless of workflow up to that point)

EDIT (thanks @jbarth-ubhd for reminding me): also

  • region clipping after region deskewing (impossible!)

Do these choices have some empirical grounding (measuring quality and/or performance on GT)?

jbarth-ubhd commented Oct 7, 2020

If using Olena binarization (I tested the other methods a while back), I would recommend wolf:

("Wiener" snippet is from a larger image. Within 1 method left-to-right: +noise. Lines of same contrast at various levels (always dark on bright in this experiment) )

[image]

PS: Default settings applied.

PS2: One could argue "preprocess low-contrast images first", but how would one do this without knowing the noise levels, what to skip, etc.? I think this is the primary task of binarization.

jbarth-ubhd commented Oct 7, 2020

But in Step 5 cis-ocropy-deskew is mentioned (yet it is not "recommended" in Step 9?)

jbarth-ubhd commented Oct 7, 2020

Hmm.. we are talking about https://ocr-d.de/en/workflows ?

The "Recommendations" at the end of that page?

bertsky commented Oct 7, 2020

Hmm.. we are talking about https://ocr-d.de/en/workflows ?

The "Recommendations" at the end of that page?

Yes!

But in Step 5 cis-ocropy-deskew is mentioned ( but not "recommended" in Step 9 (?) )

I don't mind that it is not recommended in step 9, because IMHO on the region level orientation is more important than skew, and can differ between region and page, whereas skew is usually uniform across a page (otherwise you usually need dewarping anyway).

The above list really only concerns the overall workflow configuration recommendations.

if olena binarization (other methods tested long before), I would recommend wolf

Thanks again for that test suite and for paying attention in general on that front, which is not appreciated enough IMO. We've briefly discussed this in the chat already, and I would like to elaborate on some open points:

  1. The example image is interesting and helpful to see what's going on, but it is also misleading, because it is highly artificial: Algorithms with local thresholding are quite sensitive to highly localized contrast/brightness changes, esp. if they are discontinuous. But realistically, except for the special case of inverted text, these would be spread more widely and continuously across the page. Even more so for noise, which usually appears equally across the page, which is why raw denoising typically measures noise levels globally.
  2. Binarization does not (have to) stand alone. If we know contrast/brightness is far from normal, and noise is perceptible, then we would run normalization and raw denoising before anyway. Of course some algorithms are more robust against either of those than others. But if we want a fair competition, we should eliminate them (because we can).
  3. Most algorithms have 2 degrees of freedom: the window size (influencing locality; dependent on pixel density) and the threshold level (influencing stroke weight). One should allow optimising for them, or at least represent different choices for them. For example, ocrd-skimage-binarize and ocrd-olena-binarize (since v1.1.11) already set the window size automatically based on a DPI rule of thumb by default (see the sketch after this list). (But this requires having correct DPI annotated.) Niblack is one of those algorithms which is extremely sensitive to the correct window size.
  4. As in point 2, but after binarization: Some algorithms produce noise which can be easily removed with a simple binary denoiser. So for a fair comparison all methods should enjoy that benefit.
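To illustrate point 3, here is a minimal sketch of the DPI rule of thumb, assuming plain skimage as a stand-in (not the actual ocrd-skimage-binarize or ocrd-olena-binarize code; the k default here is simply skimage's):

```python
from skimage.filters import threshold_sauvola

def binarize_sauvola(gray, dpi=300, k=0.2):
    """Sauvola binarization with a DPI-derived window size (sketch only)."""
    # rule of thumb: window size ~ pixel density, forced to the closest odd value
    window = int(round(dpi))
    if window % 2 == 0:
        window += 1
    threshold = threshold_sauvola(gray, window_size=window, k=k)
    return gray > threshold  # True = background, False = foreground ink
```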

So I think we need a different artificial test bed. And then we also need a set of true images of various appearances and qualities.

PS2: One could argue "preprocess low-contrast first", but how to do this without knowing noise levels to skip etc... I think this is the primary task for binarization

I disagree. IMO there should be specialisation and modularisation, so that binarization processors can concentrate on their core problem, and others can try to solve related ones. If a processor chooses to do 2 or 3 steps in one, fine (we've seen this elsewhere), but we should always have the option to freely combine. And that in turn means we must do it for a fair evaluation, too.

jbarth-ubhd commented Oct 8, 2020

1. The example image is interesting and helpful to see what's going on, but it is also misleading, 
because it is highly artificial: Algorithms with local thresholding are quite sensitive to highly 
localized contrast/brightness changes, esp. if they are discontinuous. 

The word "Wiener" is just a snippet from a 1238 × 388 px image, but I now realize that I forgot to set the DPI.
[image]

2. Binarization does not (have to) stand alone. If we know contrast/brightness is far
from normal, and noise is perceptible, then we would run normalization and raw
denoising before anyway. Of course some algorithms are more robust against either
of those than others. But if we want a fair competition, we should eliminate them (because we can).

We have images with weak contrast, partly with black surrounding background. Then we must do preprocessing after the clipping step, too, I assume.

Will redo with DPI set.

jbarth-ubhd commented Oct 8, 2020

With imagemagick convert -contrast-stretch 1%x7% # assumption: black on white and 300 DPI:

[image]

bertsky commented Oct 8, 2020

The word "Wiener" is just a snippet from a 1238 × 388 px image, but I now realize that I forgot to set the DPI.

Wow! They rarely come in as pristine in quality as this!

We have images with weak contrast, partly with black surrounding background. Then we must do preprocessing after the clipping step, too, I assume.

If you mean cropping instead of clipping, then yes. And contrast/brightness normalization afterwards.

With imagemagick convert -contrast-stretch 1%x7% # assumption: black on white and 300 DPI:

Thanks! Very interesting. You immediately see the class of algorithms that are most sensitive to level dynamics: Sauvola type (including Kim). They are usually implemented without determining the R from the input. (For example, Olena/Scribo just uses #define SCRIBO_DEFAULT_SAUVOLA_R 128 in a uint8 space.)

But just to be sure: did you use the window size zero default for Olena (which should set it to the odd-valued number closest to DPI)? I would expect Niblack to look somewhat better...

Also, your contrast-stretch recipe is somewhat different from normalize as used by Fred's IM script textcleaner or by one of ocrd_wrap's presets. I wonder what made you consider increasing the white-out tolerance up to 7%. Perhaps we can get to an OCR-specific optimum preset here?
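For comparison, here is a minimal sketch of what such a percentile stretch does, assuming numpy/skimage (the file name is hypothetical, and the 1%/7% values simply mirror the recipe above):

```python
import numpy as np
from skimage import io, exposure

gray = io.imread("page.png", as_gray=True)                 # hypothetical input image
# clip 1% of the darkest and 7% of the brightest pixels, stretch the rest to [0, 1]
black_point, white_point = np.percentile(gray, (1, 100 - 7))
stretched = exposure.rescale_intensity(gray, in_range=(black_point, white_point),
                                        out_range=(0, 1))
```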

jbarth-ubhd commented Oct 8, 2020

cropping instead of clipping

yes

did you use the window size zero default for Olena

I did not add any parameter -- defaults only.

I wonder what made you consider increasing the white-out tolerance up to 7%

from here: https://sourceforge.net/p/localcontrast/code/ci/default/tree/ctmf.c , line 362.

7/1 = white/black ratio when using the "Stempel Garamond" font (black on white) with reasonable leading. This is what I usually do when I don't know better. The basic idea is that black and white pixels are Gaussian distributed and far enough apart, so I crop "equally" on both ends.

bertsky commented Oct 9, 2020

from here: https://sourceforge.net/p/localcontrast/code/ci/default/tree/ctmf.c , line 362.

7/1 = white/black ratio when using the "Stempel Garamond" font (black on white) with reasonable leading. This is what I usually do when I don't know better. The basic idea is that black and white pixels are Gaussian distributed and far enough apart, so I crop "equally" on both ends.

Oh I see. So IIUC your logic goes:

  • take an average Antiqua-style (modern serif) font,
  • look at its white/black ratio when set with an average leading
  • assume this will be the minimum white/black ratio, because between text blocks there will be even more white
  • expect input to be a binary distribution of the signal (non-normalized) plus some Gaussian noise (but not large enough to make both peaks coincide)
  • generally, try to normalize the dynamic range by stretching the histogram to the full range before thresholding
  • but prior to that, try to make better use of the dynamic range in the interesting area between both histogram peaks (pushing them further apart) by clipping the extremes beyond the peaks
  • since both peaks must be expected to be of different heights, with at least as much more white than black as in the artificial consideration above, clip 7 times more of the brightest pixels to absolute white than of the darkest pixels to absolute black

So you could have used convert -contrast-stretch 0.5%x3.5% by the same argument, right?

Also, I'd like to verify that ratio point for concrete scans, because on average I assume initials, separators, ornaments and images will increase the share of black. (I'll make a coarse measurement on a real corpus to check.)
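Such a coarse measurement could be sketched like this (a sketch only, assuming a hypothetical corpus/ directory of already-binarized page images):

```python
import glob
import numpy as np
from skimage.io import imread

ratios = []
for path in sorted(glob.glob("corpus/*.png")):    # hypothetical location of binarized pages
    page = imread(path, as_gray=True)
    black = np.count_nonzero(page < 0.5)          # foreground pixels
    ratios.append((page.size - black) / max(black, 1))
print("mean white/black ratio:", np.mean(ratios))
```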

@jbarth-ubhd

Yes, this is the idea.

An interesting binarization approach with many more equations that someone else pointed me to: https://arxiv.org/pdf/2007.07350.pdf

bertsky commented Oct 9, 2020

An interesting binarization approach with many more equations that someone else pointed me to: https://arxiv.org/pdf/2007.07350.pdf

Yes, we've briefly discussed that in the Lobby. Here is the implementation. Unfortunately, it does not combine with existing local thresholding algorithms yet.

bertsky commented Oct 9, 2020

So I think we need a different artificial test bed.

As a first step, @jbarth-ubhd could you please change your code to do each point in your matrix (i.e. noise columns, brightness rows) on the full image, and only tile it in the final summary image for visualisation?

@jbarth-ubhd

Yes, this is what I've done: process the "full" 1238 × 388 px image (from PDF, 300 DPI, DIN A7) and extract the word "Wiener" for compact comparison.

bertsky commented Oct 9, 2020

Yes, this is what I've done: process the "full" 1238 × 388 px image (from PDF, 300 DPI, DIN A7) and extract the word "Wiener" for compact comparison.

Oh, I see! Then I misunderstood you above. In that case you can scratch my point 1 entirely.

So how about running ocrd-skimage-normalize in comparison to your contrast-stretch, and running ocrd-skimage-denoise-raw even before that, and running ocrd-skimage-denoise after binarization? That would be points 2 and 4. Finally, point 3 would be running with different window sizes, say 101 (i.e. smaller than default, more localized) and 401 (i.e. larger), and thresholds, say k=0.1 (i.e. heavier foreground) and k=0.4 (i.e. lighter foreground). If you point me to your implementation, I can help with a PR/patch...
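As a rough stand-in for that grid, a sketch in plain skimage (the TV denoiser and the small-object removal here are only substitutes for ocrd-skimage-denoise-raw and ocrd-skimage-denoise, not the actual processors, and the file name is hypothetical):

```python
import numpy as np
from skimage import exposure, io, morphology, restoration
from skimage.filters import threshold_sauvola

gray = io.imread("page.png", as_gray=True)                    # hypothetical input image
gray = restoration.denoise_tv_chambolle(gray, weight=0.05)    # raw denoising (substitute)
lo, hi = np.percentile(gray, (1, 93))                         # contrast normalization (substitute)
gray = exposure.rescale_intensity(gray, in_range=(lo, hi), out_range=(0, 1))

results = {}
for window in (101, 401):           # smaller / larger than the DPI-based default
    for k in (0.1, 0.4):            # heavier / lighter foreground
        binary = gray > threshold_sauvola(gray, window_size=window, k=k)
        # binary denoising (substitute): drop tiny foreground specks
        specks_removed = morphology.remove_small_objects(~binary, min_size=20)
        results[(window, k)] = ~specks_removed
```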

jbarth-ubhd commented Oct 9, 2020

https://digi.ub.uni-heidelberg.de/diglitData/v/various-levels-black-white-noise.tgz . Feel free to do anything with it. Sorry, no docs. "gen" generates the various bXXXwXXXnX.ppm files from Beethoven-testtext.pgm, downsampling to 25%. Sorry, width and height are hard-coded in gen.c++. Convert the output to .tif. In methods/, run do.cmd, afterwards montage.pl. I could do it on Monday.

bertsky commented Oct 13, 2020

Feel free to do anything with it. Sorry, no docs. "gen" generates the various bXXXwXXXnX.ppm files from Beethoven-testtext.pgm, downsampling to 25%. Sorry, width and height are hard-coded in gen.c++. Convert the output to .tif. In methods/, run do.cmd, afterwards montage.pl.

Thanks! I'll give it a shot soon.

But first, I must correct myself significantly regarding my previous recommendations on binarization.

Looking at a representative subset of pages from Deutsches Textarchiv, I found that (contrary to what I said before)…

A. re-binarization after cropping or
B. binarization after contrast normalization

…may actually impair quality for most algorithms!

Here is a sample image with heavy show-through:
[image: OCR-D-IMG_0001]

This is its histogram:
[image: grayscale_histogram_explained]

And that is (a section of) the result of binarization in 7 of Olena's algorithms, on

  • the original image (left column)
  • the cropped original image (middle column)
  • the cropped and normalized image (right column)

where normalization is ocrd-skimage-normalize (i.e. contrast stretching, now with 1% black-point and 7% white-point clipping by default):
[image: tiled-spots]

As you can see:

  • niblack is quite invariant (but unusable; perhaps a search for better k might help)
  • otsu, wolf and sauvola-ms already become unusable due to show-through when using only the cropped image
  • sauvola, kim and singh become unusable due to show-through when using the cropped image and normalizing it
  • wolf and sauvola-ms (but not sauvola proper!) look pretty much like otsu when the dynamics are improper
  • kim is always too "light" (perhaps a search for better k might help)
  • singh is always a little noisy (perhaps binary denoising afterwards might help)
  • it is best to use the full image without normalization here (for any algorithm but niblack)

IMHO the explanation for this is in the above histogram: Cropping to the page will also cut out the true black from the histogram, leaving foreground ink very close to show-through.

Where do we go from here? How do we formalise this problem so we can include it in the artificial test set above, and possibly address it in the processor implementations already? Would Generalized Histogram Thresholding help?
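One way to pin the effect down numerically would be to compare the dark end of the histogram before and after cropping, roughly like this (a sketch; the file name and crop margins are made up for illustration):

```python
import numpy as np
from skimage.io import imread

page = imread("OCR-D-IMG_0001.png", as_gray=True)    # the sample page above (file name assumed)
cropped = page[50:-50, 100:-100]                     # made-up margins standing in for cropping

for name, img in (("full", page), ("cropped", cropped)):
    hist, edges = np.histogram(img, bins=256, range=(0.0, 1.0))
    darkest = np.flatnonzero(hist)[0]                # first populated (darkest) bin
    print(f"{name}: darkest populated bin = {darkest} (gray value {edges[darkest]:.2f})")
```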

@jbarth-ubhd

I think show-through can't be binarized correctly in all cases. What if this were a blank page and the whole reverse page showed through? Perhaps we could build some statistics over all pages of a book so we can estimate the average minimum (local) contrast, but what then when a page has weak ink...

@jbarth-ubhd

sbb-binarize:

[image]

jbarth-ubhd commented Oct 15, 2020

No contrast stretch applied beforehand:

[image]

bertsky commented Oct 15, 2020

It would be interesting to see how sbb-binarize copes with normalized and with cropped images. But the message is already clear: good neural modelling is superior.

I think show-through can't be binarized correctly in all cases. What if this were a blank page and the whole reverse page showed through?

I don't think this is really an issue in practice. We will need (and have) page classification anyway, and having a class "empty page" besides "title", "index", "content" (or whatever) should not be difficult.

Perhaps we could build some statistics over all pages of a book so we can estimate the average minimum (local) contrast, but what then when a page has weak ink...

Good idea, but robust heuristic binarization needs to be locally adaptive, so it might be difficult to go global even across the page. Perhaps some algorithms are more suited for this than others. And certainly quality estimation will build on such global statistics.
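Such book-wide statistics could be sketched roughly like this (a sketch only; the file layout is hypothetical, and a rank-filter local contrast is just one possible per-page measure):

```python
import glob
import numpy as np
from skimage.filters import rank
from skimage.io import imread
from skimage.morphology import disk
from skimage.util import img_as_ubyte

medians = []
for path in sorted(glob.glob("book/OCR-D-IMG/*.png")):   # hypothetical fileGrp layout
    gray = img_as_ubyte(imread(path, as_gray=True))
    # local contrast = local maximum minus local minimum in a small neighbourhood
    contrast = rank.maximum(gray, disk(15)) - rank.minimum(gray, disk(15))
    medians.append(np.median(contrast))
print("book-wide median of per-page local contrast:", np.median(medians))
```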

@jbarth-ubhd

Just for completeness: ocropus-nlbin with -n; not normalized before:
[image]

bertsky commented Oct 19, 2020

Just for completeness: ocropus-nlbin with -n; not normalized before:

Should be the same as ocrd-cis-ocropy-binarize, right?

@jbarth-ubhd
Copy link

(venv) xx@yy:~/ocrd_all> find . -name "*.py"|egrep -v '/venv/'|xargs grep -3  percentile_filter
./ocrd_anybaseocr/ocrd_anybaseocr/cli/ocrd_anybaseocr_binarize.py-            # if not, we need to flatten it by estimating the local whitelevel
./ocrd_anybaseocr/ocrd_anybaseocr/cli/ocrd_anybaseocr_binarize.py-            LOG.info("Flattening")
./ocrd_anybaseocr/ocrd_anybaseocr/cli/ocrd_anybaseocr_binarize.py-            m = interpolation.zoom(image, self.parameter['zoom'])
./ocrd_anybaseocr/ocrd_anybaseocr/cli/ocrd_anybaseocr_binarize.py:            m = filters.percentile_filter(
./ocrd_anybaseocr/ocrd_anybaseocr/cli/ocrd_anybaseocr_binarize.py-                m, self.parameter['perc'], size=(self.parameter['range'], 2))
./ocrd_anybaseocr/ocrd_anybaseocr/cli/ocrd_anybaseocr_binarize.py:            m = filters.percentile_filter(
./ocrd_anybaseocr/ocrd_anybaseocr/cli/ocrd_anybaseocr_binarize.py-                m, self.parameter['perc'], size=(2, self.parameter['range']))
./ocrd_anybaseocr/ocrd_anybaseocr/cli/ocrd_anybaseocr_binarize.py-            m = interpolation.zoom(m, 1.0/self.parameter['zoom'])
./ocrd_anybaseocr/ocrd_anybaseocr/cli/ocrd_anybaseocr_binarize.py-            if self.parameter['debug'] > 0:
grep: ./ocrd_olena/repo/olena/dynamic-use-of-static-c++/swig/python/ltihooks.py: Datei oder Verzeichnis nicht gefunden
--
./ocrd_cis/ocrd_cis/ocropy/common.py-        warnings.simplefilter('ignore')
./ocrd_cis/ocrd_cis/ocropy/common.py-        # calculate at reduced pixel density to save CPU time
./ocrd_cis/ocrd_cis/ocropy/common.py-        m = interpolation.zoom(image, zoom, mode='nearest')
./ocrd_cis/ocrd_cis/ocropy/common.py:        m = filters.percentile_filter(m, perc, size=(range_, 2))
./ocrd_cis/ocrd_cis/ocropy/common.py:        m = filters.percentile_filter(m, perc, size=(2, range_))
./ocrd_cis/ocrd_cis/ocropy/common.py-        m = interpolation.zoom(m, 1. / zoom)
./ocrd_cis/ocrd_cis/ocropy/common.py-    ##w, h = np.minimum(np.array(image.shape), np.array(m.shape))
./ocrd_cis/ocrd_cis/ocropy/common.py-    ##flat = np.clip(image[:w, :h] - m[:w, :h] + 1, 0, 1)
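For context: these lines are the background ("whitelevel") flattening step of ocropus-nlbin. A self-contained sketch of that step, following the snippets above (assuming scipy, a grayscale image normalized to [0, 1] with white = 1, and illustrative default parameter values):

```python
import numpy as np
from scipy.ndimage import percentile_filter, zoom as ndi_zoom

def flatten(image, zoom=0.5, perc=80, range_=20):
    """nlbin-style flattening: estimate the local whitelevel and subtract it out."""
    # calculate at reduced pixel density to save CPU time
    m = ndi_zoom(image, zoom, mode='nearest')
    m = percentile_filter(m, perc, size=(range_, 2))
    m = percentile_filter(m, perc, size=(2, range_))
    m = ndi_zoom(m, 1.0 / zoom)
    w, h = np.minimum(np.array(image.shape), np.array(m.shape))
    return np.clip(image[:w, :h] - m[:w, :h] + 1, 0, 1)
```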

bertsky commented Oct 23, 2020

@jbarth-ubhd I'm not sure what you want to say with this. But here's a comparison of both wrappers for old ocropus-nlbin:

| anybaseocr-binarize | cis-ocropy-binarize |
| --- | --- |
| OCR-D wrapper only formally correct | OCR-D wrapper adequate |
| exposes all params as is | exposes only relevant ones, controls for others (e.g. zoom via DPI) |
| original code without changes | includes some fixes: pixel-correct image size; robustness against NaN; zoom and plausibilise sizes relative to DPI; opt-in for additional deskewing and/or denoising; opt-in for grayscale normalization |

@jbarth-ubhd

Should be the same as ocrd-cis-ocropy-binarize, right?

Didn't know whether ocropus-nlbin is the same as cis-ocropy-binarize, so I tried to find out and found some lines that look very similar.

bertsky commented Oct 23, 2020

Should be the same as ocrd-cis-ocropy-binarize, right?

Didn't know whether ocropus-nlbin is the same as cis-ocropy-binarize, so I tried to find out and found some lines that look very similar.

Oh, now I get it. (Sorry, I have never found the time to summarise my ocropy/ocrolib changes and re-expose the old ocropus CLIs from it.)

My question was actually whether the picture looks any different when you use the OCR-D wrapper.
