add OCR alignment and difference view #13
Comments
…is what #25 brought. Still, creating comparisons on the fly (without the need to run |
I haven't tested it, but it should be possible to use |
The problem is that we want to avoid creating new fileGrps just for viewing: we would need to reload the workspace model (expensive), and the temporary fileGrps would have to be removed afterwards. So we actually need some API or non-OCRD CLI integration here, independent of METS, perhaps entirely in memory. Even if the alignment/diff rendering is expensive, it could be cached (and perhaps calculated asynchronously, so the UI would not stall).
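To make the caching/asynchronous idea a bit more concrete, here is a minimal sketch, not taken from the browser's code. It assumes both texts are already available as plain strings and uses `difflib` as a placeholder aligner; the names `compute_opcodes` and `request_diff` are made up for illustration.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from difflib import SequenceMatcher
from functools import lru_cache

# One background worker, so an expensive alignment never blocks the caller.
_executor = ThreadPoolExecutor(max_workers=1)


@lru_cache(maxsize=128)
def compute_opcodes(left: str, right: str) -> tuple:
    """Align two texts purely in memory; results are cached per text pair."""
    matcher = SequenceMatcher(None, left, right, autojunk=False)
    return tuple(matcher.get_opcodes())


def request_diff(left: str, right: str, on_done) -> Future:
    """Schedule the alignment and call on_done(opcodes) once it is ready.

    Note: on_done may run in the worker thread (hypothetical helper).
    """
    future = _executor.submit(compute_opcodes, left, right)
    future.add_done_callback(lambda f: on_done(f.result()))
    return future
```

In a GTK application the callback would presumably have to hand the result back to the main loop (for example via `GLib.idle_add`) before touching any widgets. Nothing here reads or writes METS, so no temporary fileGrps are involved.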
There is a proof of concept in the branch diff-view. For now it simply uses Python's built-in difflib.SequenceMatcher, without any notion of a possibly pre-existing segmentation. The algorithm is really quite naive, but it works for me. It shouldn't be too hard to wrap other algorithms so that they return their results in a TaggedText class, but I'd really like to extend the TaggedText/TaggedString data model first to include some more information (especially the id of the TextNodes) before merging.
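For readers who have not looked at the branch, the general shape of a SequenceMatcher-based comparison can be sketched as follows. This is not the code from diff-view: `Segment` is a hypothetical, simplified stand-in for TaggedText/TaggedString, and `tag_differences` is an illustrative name.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List


@dataclass
class Segment:
    """Hypothetical stand-in for a tagged span (not the branch's TaggedText)."""
    tag: str    # 'equal', 'replace', 'delete' or 'insert'
    left: str   # span taken from the first text
    right: str  # span taken from the second text


def tag_differences(left: str, right: str) -> List[Segment]:
    """Character-level comparison, ignoring any pre-existing segmentation."""
    matcher = SequenceMatcher(None, left, right, autojunk=False)
    return [
        Segment(tag, left[i1:i2], right[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


for segment in tag_differences("Haus", "Hans"):
    print(segment)
```

Any other aligner could be wrapped the same way as long as it yields spans with a tag; carrying the TextNode id would then mean adding one more field to the segment type.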
Closed by #29
This is clearly a desideratum here, but how do we approach it?
Considerations:
[…] FileGroupSelectors instead of 1
[…] O(n²) (or O(n³) under arbitrary weights). There are many different packages for this on PyPI with various levels of features (including cost functions or weights) and efficiency (including C library backends). But not all of them are […] aͤ vs ä or ſt vs ſt or even ſ vs s.
[…] GtkSource.LanguageManager, an off-the-shelf highlighter that would lend itself is diff (coloring diff -u line output). But this does not colorize within the lines (like git diff --word-diff, wdiff, dwdiff etc.), which is the most important contribution IMHO. So perhaps we need to use some existing word-diff syntax and write our own highlighter after all. Or we integrate dinglehopper's HTML and display it via WebKit directly.