fix/discuss recommended workflows #172
If using Olena binarization (other methods were tested long before), I would recommend wolf: (The "Wiener" snippet is taken from a larger image. Within one method, left to right: increasing noise. Lines of the same contrast at various levels (always dark on bright in this experiment).) PS: default settings applied. PS2: One could argue "preprocess low-contrast images first", but how would one do that without knowing the noise levels to skip etc.? I think this is the primary task for binarization. |
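To make that recommendation concrete, here is a minimal, hedged Python sketch of Wolf/Jolion-style binarization; the formula is the one from Wolf & Jolion (2004) as I recall it, and the window size, k value and file names are assumptions rather than Olena's actual defaults:

```python
import numpy as np
from scipy.ndimage import uniform_filter
from skimage import io

def wolf_binarize(image, window=101, k=0.5):
    # local mean and standard deviation over a square window
    m = uniform_filter(image, size=window)
    s = np.sqrt(np.clip(uniform_filter(image ** 2, size=window) - m ** 2, 0, None))
    R = s.max() if s.max() > 0 else 1.0   # maximum standard deviation over the image
    M = image.min()                       # global minimum gray value
    t = m - k * (1.0 - s / R) * (m - M)   # Wolf/Jolion threshold (as recalled)
    return image > t                      # True = paper, False = ink

# hypothetical test image with noise
image = io.imread("wiener_noisy.png", as_gray=True).astype(np.float64)
io.imsave("wiener_wolf.png", (wolf_binarize(image) * 255).astype(np.uint8))
```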
But in Step 5, cis-ocropy-deskew is mentioned (though not "recommended" in Step 9?). |
Hmm.. we are talking about https://ocr-d.de/en/workflows ? The "Recommendations" at the end of that page? |
Yes!
I don't mind that it is not recommended there. The above list really only concerns the overall workflow configuration recommendations.
Thanks again for that test suite and for paying attention to that front in general, which is not appreciated enough IMO. We've briefly discussed this in the chat already, and I would like to elaborate on some open points:
So I think we need a different artificial test bed. And then we also need a set of real images of various appearances and qualities.
I disagree. IMO there should be specialisation and modularisation. So binarization processors can concentrate on their core problem, and others can try to solve related ones. If a processor chooses to do 2 or 3 steps in one, fine (we've seen this elsewhere), but we should always have the option to freely combine. And that in turn means we must do it for a fair evaluation, too. |
Wow! They rarely come in as pristine in quality as this!
If you mean cropping instead of clipping, then yes. And contrast/brightness normalization afterwards.
Thanks! Very interesting. You immediately see the class of algorithms that are most sensitive to level dynamics: Sauvola type (including Kim). They are usually implemented without determining the … But just to be sure: did you use the window-size-zero default for Olena (which should set it to the odd-valued number closest to the DPI)? I would expect Niblack to look somewhat better... Also, your … |
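For reference, a minimal sketch of what I mean by tying the Sauvola window to the DPI, using scikit-image; the file name and DPI are assumptions, and this only emulates (rather than reproduces) Olena's window-size-zero behaviour:

```python
from skimage import io
from skimage.filters import threshold_sauvola

image = io.imread("wiener_300dpi.png", as_gray=True)  # hypothetical test image
dpi = 300

# emulate the window-size-zero default: the odd number closest to the DPI
window = dpi if dpi % 2 == 1 else dpi + 1

binary = image > threshold_sauvola(image, window_size=window)
io.imsave("wiener_sauvola.png", (binary * 255).astype("uint8"))
```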
yes
I did not add any parameter -- defaults only.
from here: https://sourceforge.net/p/localcontrast/code/ci/default/tree/ctmf.c , line 362. 7/1 = white/black ratio when using the "Stempel Garamond" font (black on white) with reasonable leading. This is what I am used to doing when I don't know better. The base idea is that black and white pixels are Gaussian distributed and far enough apart, so I crop "equally" on both ends. |
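If I read this correctly, the normalization amounts to something like the following sketch; this is my interpretation (not the ctmf.c code), assuming roughly 1 part black to 7 parts white and clipping at the centre of each mode ("equally on both ends"):

```python
import numpy as np
from skimage import io

image = io.imread("page.png", as_gray=True).astype(np.float64)  # hypothetical scan

black_share = 1.0 / 8.0  # assumed ink share for a 7:1 white/black ratio

# clip at the middle of the black mode and the middle of the white mode (assumption)
lo = np.percentile(image, 100 * black_share / 2)
hi = np.percentile(image, 100 * (black_share + (1.0 - black_share) / 2))

normalized = np.clip((image - lo) / max(hi - lo, 1e-6), 0.0, 1.0)
io.imsave("page_norm.png", (normalized * 255).astype(np.uint8))
```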
Oh, I see. So IIUC your logic goes: …
So you could have used … Also, I'd like to verify that ratio point for concrete scans, because on average I assume initials, separators, ornaments and images will increase the share of black. (I'll make a coarse measurement on a real corpus to check.) |
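For the record, the coarse measurement could be as simple as this sketch over a directory of already binarized pages (directory name and threshold are placeholders):

```python
import glob
import numpy as np
from skimage import io

shares = []
for path in glob.glob("gt-bin/*.png"):         # hypothetical binarized GT pages
    page = io.imread(path, as_gray=True)
    shares.append(float(np.mean(page < 0.5)))  # fraction of black pixels

print(f"mean black share: {np.mean(shares):.3f} "
      f"(a 7:1 white/black ratio would give about 0.125)")
```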
Yes, this is the idea. Here is an interesting binarization approach with many more equations that someone else pointed me to: https://arxiv.org/pdf/2007.07350.pdf |
Yes, we've briefly discussed that in the Lobby. Here is the implementation. Unfortunately, it does not combine with existing local thresholding algorithms yet. |
As a first step, @jbarth-ubhd could you please change your code to do each point in your matrix (i.e. noise columns, brightness rows) on the full image, and only tile it in the final summary image for visualisation? |
Yes, this is what I've done. I process the "full" 1238 × 388 px image (from PDF, 300 DPI, DIN A7) and extract the word "Wiener" for compact comparison.
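So the montage only re-tiles results that were already computed on the full images, roughly like this sketch (file names, grid and crop box are made up):

```python
import numpy as np
from skimage import io

noise_levels = [0, 1, 2, 3]
brightness = [64, 128, 192]
crop = (slice(120, 160), slice(300, 460))  # region containing the word "Wiener"

rows = []
for b in brightness:
    # each cell is binarized on the full image; only the crop goes into the grid
    cells = [io.imread(f"bin/b{b:03d}n{n}.png", as_gray=True)[crop]
             for n in noise_levels]
    rows.append(np.hstack(cells))
io.imsave("montage.png", (np.vstack(rows) * 255).astype(np.uint8))
```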
Oh, I see! Then I misunderstood you above. In that case you can scratch my point 1 entirely. So how about running |
https://digi.ub.uni-heidelberg.de/diglitData/v/various-levels-black-white-noise.tgz . Feel free to do anything with it. Sorry, no docs. "gen" generates the various bXXXwXXXnX.ppm files from Beethoven-testtext.pgm, downsampling to 25%. Sorry, width and height are hard-coded in gen.c++. Convert them to .tif. In methods/ run do.cmd, afterwards montage.pl. I could do it on Monday.
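Roughly, gen does something like the following sketch; the level and noise ranges and the exact naming are approximations of the real code, not a faithful port:

```python
import numpy as np
from skimage import io, transform, util

master = io.imread("Beethoven-testtext.pgm", as_gray=True)   # values in [0, 1]
small = transform.rescale(master, 0.25, anti_aliasing=True)  # downsample to 25 %

for black in (0, 32, 64, 96):          # assumed black levels
    for white in (160, 192, 224, 255): # assumed white levels
        for noise in range(4):         # assumed noise steps
            img = (black + small * (white - black)) / 255.0  # remap gray levels
            if noise:
                img = util.random_noise(img, mode="gaussian",
                                        var=(0.02 * noise) ** 2)
            io.imsave(f"b{black:03d}w{white:03d}n{noise}.tif",
                      (np.clip(img, 0, 1) * 255).astype(np.uint8))
```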
Thanks! I'll give it a shot soon. But first, I must correct myself significantly regarding my previous recommendations on binarization. Looking at a representative subset of pages from the Deutsches Textarchiv, I found that (contrary to what I said before)… A. re-binarization after cropping, or … may actually impair quality for most algorithms! Here is a sample image with heavy show-through: And this is (a section of) the result of binarization with 7 of Olena's algorithms, on …, where normalization is … As you can see: …
IMHO the explanation for this is in the above histogram: cropping to the page will also cut out the true black from the histogram, leaving foreground ink very close to show-through. Where do we go from here? How do we formalise this problem so we can include it in the artificial test set above, and possibly address it in the processor implementation already? Would Generalized Histogram Thresholding help? |
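A quick way to see that effect in numbers is to compare gray-level histograms of the full scan and the cropped page region, e.g. like this sketch (file name and crop box are made up):

```python
import numpy as np
from skimage import io

scan = io.imread("showthrough_scan.png", as_gray=True)  # hypothetical sample page
cropped = scan[200:-200, 150:-150]                      # rough page-only crop

for name, img in (("full", scan), ("cropped", cropped)):
    hist, _ = np.histogram(img, bins=16, range=(0.0, 1.0))
    print(name, np.round(hist / hist.sum(), 3))
```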
I think show-through can't be binarized correctly in all cases. What if this were a blank page and the whole reverse page showed through? Perhaps we could build some statistics over all pages of a book so we can estimate the average minimum (local) contrast, but what then when a page has weak ink... |
It would be interesting to see how …
I don't think this is really an issue in practice. We will need (and have) page classification anyway, and having a class "empty page" besides "title", "index", "content" (or whatever) should not be difficult.
Good idea, but robust heuristic binarization needs to be locally adaptive, so it might be difficult to go global even across the page. Perhaps some algorithms are more suited for this than others. And certainly quality estimation will build on such global statistics. |
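As a sketch of what such global statistics could look like: per-book aggregation of a crude local-contrast proxy. Paths, window size and the use of Sauvola as the local splitter here are all assumptions.

```python
import glob
import numpy as np
from skimage import io
from skimage.filters import threshold_sauvola

contrasts = []
for path in sorted(glob.glob("book/*.png")):            # hypothetical page images
    page = io.imread(path, as_gray=True)
    thresh = threshold_sauvola(page, window_size=301)   # coarse ink/paper split
    ink, paper = page[page < thresh], page[page >= thresh]
    if ink.size and paper.size:
        contrasts.append(float(np.median(paper) - np.median(ink)))

print("median page contrast over the book:", np.median(contrasts))
```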
Should be the same as … |
@jbarth-ubhd I'm not sure what you want to say with this. But here's a comparison of both wrappers for old ocropus-nlbin:
Didn't know if ocropus-nlbin is the same as cis-ocropy-binarize, so I tried to find out and found some lines that look very similar.
Oh, now I got it. (Sorry, have never found the time to summarise my ocropy/ocrolib changes and re-expose the old ocropus CLIs from it.) My question was actually whether the picture looks any different when you use the OCR-D wrapper. |
I am surprised to see the following in our current recommendations:
skimage binarize/denoise processors instead of Olena/Ocropy. EDIT (thanks @jbarth-ubhd for reminding me): also …
Do these choices have some empirical grounding (measuring quality and/or performance on GT)?
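For context, the kind of empirical grounding I have in mind could start as small as a pixel-level F-measure against binarized GT; the paths here are hypothetical, and a real evaluation would of course aggregate over many pages and also measure runtime:

```python
import numpy as np
from skimage import io

gt = io.imread("gt/page_0001.png", as_gray=True) < 0.5       # True = ink in the GT
result = io.imread("bin/page_0001.png", as_gray=True) < 0.5  # True = ink in the result

tp = np.logical_and(result, gt).sum()
precision = tp / max(result.sum(), 1)
recall = tp / max(gt.sum(), 1)
f1 = 2 * precision * recall / max(precision + recall, 1e-9)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```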