
add OCR alignment and difference view #13

Closed · bertsky opened this issue Oct 15, 2020 · 6 comments

bertsky (Contributor) commented Oct 15, 2020

This is clearly a desideratum here, but how do we approach it?

Considerations:

  1. The additional view would need 2 FileGroupSelectors instead of 1
  2. There are 2 cases:
    • A: equal segmentation but different recognition results: character alignment and difference highlighting within lines only
    • B: different segmentation and recognition results: textline alignment and difference highlighting within larger chunks
  3. The actual alignment code needs to be fast and reliable. The underlying problem of global sequence alignment (Needleman-Wunsch algorithm) has O(n²) complexity (or O(n³) under arbitrary weights). There are many packages for this on PyPI with various levels of features (including cost functions or weights) and efficiency (including C library backends). But not all of them are
    • suited for Unicode (or arbitrary lists of objects),
    • robust (both in terms of crashes and glitches on strange input, and in terms of heap/stack restrictions),
    • actually efficient (in terms of average or best-case complexity),
    • well maintained and packaged.
  4. For historical text specifically, one must treat grapheme clusters as single objects to compare, and probably normalize certain sequences (or at least reduce their distance/cost to the normalized equivalent), e.g. a + combining diaeresis vs ä, the ſt ligature vs ſt, or even ſ vs s (see the alignment sketch after this list).
  5. It would therefore seem natural to delegate to one of the existing OCR-D processors for OCR evaluation (or their backend library modules), i.e. ocrd-dinglehopper and ocrd-cor-asv-ann-evaluate, which have quite a few differences:
| ocrd-dinglehopper | ocrd-cor-asv-ann-evaluate |
| --- | --- |
| CER and WER and visualization | only CER (currently) |
| only single pages | aggregates over all pages |
| result is HTML with visual diff + JSON report | result is logging |
| alignment written in Python (slow) | difflib.SequenceMatcher (fast; I tried many libraries on lots of data for robustness and speed, and consequently decided to revert to that) |
| uniseg.graphemecluster to get alignment and distances on graphemes (lists of objects) | calculates alignment on codepoints (faster), but then post-processes to join combining sequences with their base character, so distances are almost always on graphemes as well |
| a set of normalizations that (roughly) target OCR-D GT transcription guidelines level 3 to level 2 (which is laudable) | offers plain Levenshtein for GT level 3, NFC/NFKC/NFKD/NFD for GT level 2, and a custom normalization (called historic_latin) that targets GT level 1 (because NFKC is both quite incomplete and too much already) |
| text alignment of complete page text concatenated (suitable for A or B) | text alignment on identical textlines (suitable for B only) |
| compares 1:1 | compares 1:N |
  6. Whatever module we choose, and whatever method to integrate its core functionality (without the actual OCR-D processor), we need to visualise the differences with Gtk facilities. For GtkSource.LanguageManager, an off-the-shelf highlighter that would lend itself is diff (coloring diff -u line output). But this does not colorize within lines (like git diff --word-diff, wdiff, dwdiff etc. do), which is the most important contribution IMHO. So perhaps we need to use some existing word-diff syntax and write our own highlighter after all (a Gtk-based sketch follows below). Or we integrate dinglehopper's HTML and display it via WebKit directly.
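A rough sketch of what points 3 and 4 suggest, not code from either evaluation tool: align grapheme clusters (rather than codepoints) after Unicode normalization. The use of difflib.SequenceMatcher and the optional uniseg fallback are choices of this sketch:

```python
import unicodedata
from difflib import SequenceMatcher

try:
    # uniseg splits text into grapheme clusters (assumed available here)
    from uniseg.graphemecluster import grapheme_clusters
except ImportError:
    grapheme_clusters = None

def clusters(text):
    # NFC folds combining sequences like "a" + U+0308 into "ä" where possible
    text = unicodedata.normalize('NFC', text)
    if grapheme_clusters is not None:
        return list(grapheme_clusters(text))
    return list(text)  # fallback: plain codepoints

def align(a, b):
    # SequenceMatcher accepts any sequences of hashable items, so lists of
    # grapheme-cluster strings work; autojunk=False disables heuristic skipping
    matcher = SequenceMatcher(None, clusters(a), clusters(b), autojunk=False)
    return matcher.get_opcodes()

print(align('Waſſer', 'Wasser'))
```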
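As an alternative to writing a GtkSource language definition for a word-diff syntax, sub-line differences could also be colorized directly with Gtk.TextTags. A minimal sketch, assuming PyGObject/Gtk 3; fill_diff_buffer and the (kind, text) segment format are hypothetical, not part of browse-ocrd:

```python
import gi
gi.require_version('Gtk', '3.0')
from gi.repository import Gtk

def fill_diff_buffer(buffer, segments):
    # segments: iterable of (kind, text), kind in {'equal', 'added', 'removed'};
    # tags are created once per buffer here
    added = buffer.create_tag('added', background='#d0f0d0')
    removed = buffer.create_tag('removed', background='#f0d0d0', strikethrough=True)
    for kind, text in segments:
        end = buffer.get_end_iter()
        if kind == 'added':
            buffer.insert_with_tags(end, text, added)
        elif kind == 'removed':
            buffer.insert_with_tags(end, text, removed)
        else:
            buffer.insert(end, text)
```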
bertsky (Contributor, Author) commented Nov 5, 2020

> Or we integrate dinglehopper's HTML and display it via WebKit directly.

…is what #25 brought. Still, creating comparisons on the fly (without the need to run ocrd-dinglehopper on the complete workspace) would be preferable IMHO. And when it is clear that both sides have the same line segmentation, a simple diff highlighter might still be better. So let's keep this open for discussion etc.

mikegerber commented

> Still, creating comparisons on the fly (without the need to run ocrd-dinglehopper on the complete workspace) would be preferable IMHO

I haven't tested it, but it should be possible to use -g to just process one page. I also have some speed improvements planned, so I guess that should help too.

bertsky (Contributor, Author) commented Jan 25, 2021

> I haven't tested it, but it should be possible to use -g to just process one page.

The problem is that we want to avoid creating new fileGrps just for viewing. We would need to re-load the workspace model (expensive), and the temporary fileGrps would have to be removed afterwards.

So we actually need some API or non-OCRD CLI integration here – independent of METS, perhaps in-memory altogether. Even if the alignment/diff-rendering is expensive, it could be cached (and perhaps calculated asynchronously, so the UI would not stall)...
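A minimal sketch of the caching-plus-asynchronicity idea, not the browse-ocrd API: memoize the expensive alignment per text pair and compute it off the UI thread. The helper names and the plain difflib backend are assumptions of this sketch:

```python
import functools
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher

_executor = ThreadPoolExecutor(max_workers=1)

@functools.lru_cache(maxsize=64)
def align_pair(text_a, text_b):
    # the expensive part, memoized per pair of page texts
    return SequenceMatcher(None, text_a, text_b, autojunk=False).get_opcodes()

def request_diff(text_a, text_b, on_done):
    # compute in a worker thread; in a Gtk app the callback would be
    # marshalled back to the main loop, e.g. via GLib.idle_add(on_done, ...)
    future = _executor.submit(align_pair, text_a, text_b)
    future.add_done_callback(lambda f: on_done(f.result()))
```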

hnesk (Owner) commented Feb 5, 2021

There is a proof of concept in the branch diff-view. For now it simply uses the built-in Python difflib.SequenceMatcher, without any notion of a possibly preexisting segmentation. The algorithm is really quite naive, but worksforme. It shouldn't be too hard to wrap other algorithms to return their results in a TaggedText class, but I'd really like to extend the TaggedText/TaggedString data model first to include some more information (especially the IDs of the TextNodes) before merging.
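For illustration, a hypothetical minimal version of that wrapping (not the diff-view branch code, whose TaggedText class carries more information): turning SequenceMatcher opcodes into tagged segments a view could colorize.

```python
from difflib import SequenceMatcher

def tagged_diff(a, b):
    # turn SequenceMatcher opcodes into (kind, text) segments a view can color
    segments = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag == 'equal':
            segments.append(('equal', a[i1:i2]))
        else:  # 'replace', 'delete' or 'insert'
            if i2 > i1:
                segments.append(('removed', a[i1:i2]))
            if j2 > j1:
                segments.append(('added', b[j1:j2]))
    return segments

print(tagged_diff('Waſſer', 'Wasser'))
```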

kba (Contributor) commented Feb 5, 2021

Very nice, here's how that looks, comparing calamari/tesseract output from ocrd-galley:

[screenshot: diff view comparing calamari and tesseract output]

hnesk (Owner) commented Jul 22, 2021

Closed by #29

hnesk closed this as completed Jul 22, 2021