
add OCR alignment and difference view #13

Closed · bertsky opened this issue Oct 15, 2020 · 6 comments

bertsky (Contributor) commented Oct 15, 2020

This is clearly a desideratum here, but how do we approach it?

Considerations:

  1. The additional view would need 2 FileGroupSelectors instead of 1
  2. There are 2 cases:
    • A: equal segmentation but different recognition results: character alignment and difference highlighting within lines only
    • B: different segmentation and recognition results: textline alignment and difference highlighting within larger chunks
  3. The actual alignment code needs to be fast and reliable. The underlying problem of global sequence alignment (Needleman-Wunsch algorithm) has O(n²) complexity (or O(n³) under arbitrary weights). There are many packages for this on PyPI with various levels of features (including cost functions or weights) and efficiency (including C library backends). But not all of them are
    • suited for Unicode (or arbitrary lists of objects),
    • robust (both in terms of crashes and glitches on strange input, and in terms of heap/stack restrictions),
    • actually efficient (in terms of average or best-case complexity),
    • well maintained and packaged.
  4. For historical text specifically, one must treat grapheme clusters as single objects to compare, and probably normalize certain sequences (or at least reduce their distance/cost to the normalized equivalent), e.g. a + combining diaeresis vs ä, the ſt ligature vs ſt, or even ſ vs s (see the alignment sketch after this list).
  5. It would therefore seem natural to delegate to one of the existing OCR-D processors for OCR evaluation (or their backend library modules), i.e. ocrd-dinglehopper and ocrd-cor-asv-ann-evaluate, which have quite a few differences:
| ocrd-dinglehopper | ocrd-cor-asv-ann-evaluate |
| --- | --- |
| CER and WER and visualization | only CER (currently) |
| only single pages | aggregates over all pages |
| result is HTML with visual diff + JSON report | result is logging |
| alignment written in Python (slow) | difflib.SequenceMatcher (fast; I tried many libraries on lots of data for robustness and speed, and consequently decided to revert to that) |
| uniseg.graphemecluster to get alignment and distances on graphemes (lists of objects) | calculates alignment on codepoints (faster), but then post-processes to join combining sequences with their base character, so distances are almost always on graphemes as well |
| a set of normalizations that (roughly) target OCR-D GT transcription guidelines level 3 to level 2 (which is laudable) | offers plain Levenshtein for GT level 3, NFC/NFKC/NFKD/NFD for GT level 2, and a custom normalization (called historic_latin) that targets GT level 1 (because NFKC is both quite incomplete and too much already) |
| text alignment of complete page text concatenated (suitable for A or B) | text alignment on identical textlines (suitable for B only) |
| compares 1:1 | compares 1:N |
  6. Whatever module we choose, and whatever method to integrate its core functionality (without the actual OCR-D processor), we need to visualise the differences with Gtk facilities. For GtkSource.LanguageManager, an off-the-shelf highlighter that would lend itself is diff (coloring diff -u line output). But this does not colorize within lines (like git diff --word-diff, wdiff, dwdiff etc. do), which is the most important contribution IMHO. So perhaps we need to use some existing word-diff syntax and write our own highlighter after all (a Gtk-based sketch follows below). Or we integrate dinglehopper's HTML and display it via WebKit directly.
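A rough sketch of what points 3 and 4 suggest, not code from either evaluation tool: align grapheme clusters (rather than codepoints) after Unicode normalization. The use of difflib.SequenceMatcher and the optional uniseg fallback are choices of this sketch:

```python
import unicodedata
from difflib import SequenceMatcher

try:
    # uniseg splits text into grapheme clusters (assumed available here)
    from uniseg.graphemecluster import grapheme_clusters
except ImportError:
    grapheme_clusters = None

def clusters(text):
    # NFC folds combining sequences like "a" + U+0308 into "ä" where possible
    text = unicodedata.normalize('NFC', text)
    if grapheme_clusters is not None:
        return list(grapheme_clusters(text))
    return list(text)  # fallback: plain codepoints

def align(a, b):
    # SequenceMatcher accepts any sequences of hashable items, so lists of
    # grapheme-cluster strings work; autojunk=False disables heuristic skipping
    matcher = SequenceMatcher(None, clusters(a), clusters(b), autojunk=False)
    return matcher.get_opcodes()

print(align('Waſſer', 'Wasser'))
```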
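As an alternative to writing a GtkSource language definition for a word-diff syntax, sub-line differences could also be colorized directly with Gtk.TextTags. A minimal sketch, assuming PyGObject/Gtk 3; fill_diff_buffer and the (kind, text) segment format are hypothetical, not part of browse-ocrd:

```python
import gi
gi.require_version('Gtk', '3.0')
from gi.repository import Gtk

def fill_diff_buffer(buffer, segments):
    # segments: iterable of (kind, text), kind in {'equal', 'added', 'removed'};
    # tags are created once per buffer here
    added = buffer.create_tag('added', background='#d0f0d0')
    removed = buffer.create_tag('removed', background='#f0d0d0', strikethrough=True)
    for kind, text in segments:
        end = buffer.get_end_iter()
        if kind == 'added':
            buffer.insert_with_tags(end, text, added)
        elif kind == 'removed':
            buffer.insert_with_tags(end, text, removed)
        else:
            buffer.insert(end, text)
```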
bertsky (Contributor, Author) commented Nov 5, 2020

> Or we integrate dinglehopper's HTML and display it via WebKit directly.

…is what #25 brought. Still, creating comparisons on the fly (without the need to run ocrd-dinglehopper on the complete workspace) would be preferable IMHO. And when it is clear that both sides have the same line segmentation, a simple diff highlighter might still be better. So let's keep this open for discussion etc.

mikegerber commented

> Still, creating comparisons on the fly (without the need to run ocrd-dinglehopper on the complete workspace) would be preferable IMHO

I haven't tested it, but it should be possible to use -g to just process one page. I also have some speed improvements planned, so I guess that should help too.

bertsky (Contributor, Author) commented Jan 25, 2021

> I haven't tested it, but it should be possible to use -g to just process one page.

The problem is that we want to avoid creating new fileGrps just for viewing. We would need to re-load the workspace model (expensive), and the temporary fileGrps would have to be removed afterwards.

So we actually need some API or non-OCRD CLI integration here – independent of METS, perhaps in-memory altogether. Even if the alignment/diff-rendering is expensive, it could be cached (and perhaps calculated asynchronously, so the UI would not stall)...
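A minimal sketch of the caching-plus-asynchronicity idea, not the browse-ocrd API: memoize the expensive alignment per text pair and compute it off the UI thread. The helper names and the plain difflib backend are assumptions of this sketch:

```python
import functools
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher

_executor = ThreadPoolExecutor(max_workers=1)

@functools.lru_cache(maxsize=64)
def align_pair(text_a, text_b):
    # the expensive part, memoized per pair of page texts
    return SequenceMatcher(None, text_a, text_b, autojunk=False).get_opcodes()

def request_diff(text_a, text_b, on_done):
    # compute in a worker thread; in a Gtk app the callback would be
    # marshalled back to the main loop, e.g. via GLib.idle_add(on_done, ...)
    future = _executor.submit(align_pair, text_a, text_b)
    future.add_done_callback(lambda f: on_done(f.result()))
```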

hnesk (Owner) commented Feb 5, 2021

There is a proof of concept in the branch diff-view. For now it simply uses the built-in Python difflib.SequenceMatcher, without any notion of a possibly preexisting segmentation. The algorithm is really quite naive, but worksforme. It shouldn't be too hard to wrap other algorithms to return their results in a TaggedText class, but I'd really like to extend the TaggedText/TaggedString data model first to include some more information (especially the IDs of the TextNodes) before merging.
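For illustration, a hypothetical minimal version of that wrapping (not the diff-view branch code, whose TaggedText class carries more information): turning SequenceMatcher opcodes into tagged segments a view could colorize.

```python
from difflib import SequenceMatcher

def tagged_diff(a, b):
    # turn SequenceMatcher opcodes into (kind, text) segments a view can color
    segments = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag == 'equal':
            segments.append(('equal', a[i1:i2]))
        else:  # 'replace', 'delete' or 'insert'
            if i2 > i1:
                segments.append(('removed', a[i1:i2]))
            if j2 > j1:
                segments.append(('added', b[j1:j2]))
    return segments

print(tagged_diff('Waſſer', 'Wasser'))
```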

kba (Contributor) commented Feb 5, 2021

Very nice, here's how that looks, comparing calamari/tesseract output from ocrd-galley:

[screenshot: diff view comparing calamari and tesseract output]

hnesk (Owner) commented Jul 22, 2021

Closed by #29

hnesk closed this as completed Jul 22, 2021