workflows.md, Step 7 #268
BTW, only ocrd-tesserocr-segment and ocrd-tesserocr-segment-region are recommended within step 7 ... really? I remember ocrd-pc-segmentation performing worst and ocrd-eynollah-segment best (but slow). What about giving the processors grades, estimated processing times and memory requirements?
PS: ocrd-tesserocr-segment* (recommended) are not in the »Best results for selected pages« workflow (see below).
I would remove the sentence »Alternatively, consider using the all-in-one capabilities of ocrd-tesserocr-segment and ocrd-tesserocr-recognize, which can do region segmentation and line segmentation (and optionally also text recognition) in one step by querying Tesseract’s internal iterator (accessing the more precise polygon outlines instead of just coarse bounding boxes with lots of hard-to-recover overlap).«: ocrd-tesserocr-segment and ocrd-tesserocr-recognize are already mentioned in the note above.
Best results for selected pages — workflow
Why? You'd only need that for region segmentation (page→regions). The two paragraphs above the one you quoted clearly explain that.
I agree – this information does not reflect the new or changed processors from the last 2 years. (I believe […]) See also #172.
Grades are too simplistic for the diversity of materials (from simple single-column books to multi-column ornamented/illustrated pages and title pages) and problems (region types, region shape complexity, region recursion, reading order, line segmentation in warped/straight imaging, in dense/floating typesetting, in tables). Processing times and memory requirements, too, may depend on the image resolution and content. But indeed, we should try to provide some guesstimate or experience. See also OCR-D/ocrd_all#112 and OCR-D/assets#75 (and OCR-D/core#607)
That sentence is part of the paragraph which explains the need for postprocessing when not using all-in-one segmentation or shrink_polygons with Tesseract – so it is necessary there. (No one without minute knowledge of Tesseract internals would understand that dependency.)
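As a starting point for the guesstimates mentioned above, here is a minimal sketch (not an existing OCR-D tool) of how wall-clock time and peak memory of a single processor call could be measured on a sample page. The helper name measure and the stand-in command are assumptions for illustration only. It relies on Python's resource module, so it only works on Unix, and ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.

```python
import resource
import subprocess
import sys
import time


def measure(cmd):
    """Run cmd as a child process.

    Returns (wall_seconds, peak_rss, returncode). peak_rss is the largest
    resident set size among all child processes waited for so far, so run
    one command per Python process to get a clean per-command figure.
    """
    start = time.perf_counter()
    proc = subprocess.run(cmd, check=False)
    wall = time.perf_counter() - start
    # Kilobytes on Linux, bytes on macOS
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return wall, peak, proc.returncode


if __name__ == "__main__":
    # Stand-in command (assumption): replace it with a real processor call
    # on a representative page of your material.
    wall, peak, rc = measure([sys.executable, "-c", "pass"])
    print(f"wall={wall:.2f}s peak_rss={peak} returncode={rc}")
```

Any numbers obtained this way only hold for the tested image resolution and content, which is exactly the caveat raised above.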
and and Sorry, I'm confused.
yes
the latter
in this paragraph (as in all of our documentation), recognition contrasts with segmentation (and preprocessing and postprocessing), so the latter
because this paragraph describes a multi-step processor that can include (text) recognition
I'm missing the word »region« in the parameters (regions→lines).