add OCR alignment and difference view
This is clearly a desideratum here, but how do we approach it?
Considerations:
- The additional view would need 2 `FileGroupSelector`s instead of 1
- There are 2 cases:
- A: equal segmentation but different recognition results: character alignment and difference highlighting within lines only
- B: different segmentation and recognition results: textline alignment and difference highlighting within larger chunks
- The actual alignment code needs to be fast and reliable. The underlying problem of global sequence alignment (Needleman–Wunsch algorithm) has O(n²) complexity (or O(n³) under arbitrary weights). There are many different packages for this on PyPI with various levels of features (including cost functions or weights) and efficiency (including C library backends). But not all of them are
  - suited for Unicode (or arbitrary lists of objects),
  - robust (both in terms of crashes and glitches on strange input and heap/stack restrictions),
  - actually efficient (in terms of average or best-case complexity),
  - well maintained and packaged.
- For historical text specifically, one must treat grapheme clusters as single objects to compare, and probably normalize certain sequences (or at least reduce their distance/cost to the normalized equivalent), e.g. `aͤ` vs `ä`, or `ſt` vs `ſt`, or even `ſ` vs `s`.
- It would therefore seem natural to delegate to one of the existing OCR-D processors for OCR evaluation (or its backend library modules), i.e. ocrd-dinglehopper and ocrd-cor-asv-ann-evaluate, which have quite a few differences:
  | ocrd-dinglehopper | ocrd-cor-asv-ann-evaluate |
  | --- | --- |
  | CER and WER and visualization | only CER (currently) |
  | only single pages | aggregates over all pages |
  | result is HTML with visual diff + JSON report | result is logging |
  | alignment written in Python (slow) | `difflib.SequenceMatcher` (fast; I tried many libraries on lots of data for robustness and speed, and decided to revert to that by consequence) |
  | `uniseg.graphemeclusters` to get alignment+distances on graphemes (lists of objects) | calculates alignment on codepoints (faster) but then post-processes to join combining sequences with their base character, so distances are almost always on graphemes as well |
  | a set of normalizations that (roughly) target OCR-D GT transcription guidelines level 3 to level 2 (which is laudable) | offers plain Levenshtein for GT level 3, NFC/NFKC/NFKD/NFD for GT level 2, and a custom normalization (called `historic_latin`) that targets GT level 1 (because NFKC is both quite incomplete and too much already) |
  | text alignment of complete page text concatenated (suitable for A or B) | text alignment on identical textlines (suitable for B only) |
  | compares 1:1 | compares 1:N |
- Whatever module we choose, and whatever method to integrate its core functionality (without the actual OCR-D processor), we need to visualise the difference with Gtk facilities. For `GtkSource.LanguageManager`, an off-the-shelf highlighter that would lend itself is `diff` (coloring `diff -u` line output). But this does not colorize within the lines (like `git diff --word-diff`, `wdiff`, `dwdiff` etc.), which is the most important contribution IMHO. So perhaps we need to use some existing word-diff syntax and write our own highlighter after all. Or we integrate dinglehopper's HTML and display it via WebKit directly.
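To make the grapheme-cluster requirement concrete, here is a minimal sketch (stdlib only, my assumption; the actual backends use `uniseg` or codepoint post-processing as described above) of aligning two lines on grapheme-like clusters with `difflib.SequenceMatcher`:

```python
# Minimal sketch, assuming stdlib only: align two OCR lines on grapheme-like
# clusters so that combining sequences are compared as single objects.
import unicodedata
from difflib import SequenceMatcher

def graphemes(text):
    """Naively attach combining marks to their preceding base character."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # join combining mark with its base
        else:
            clusters.append(ch)
    return clusters

def align(gt, ocr):
    """Yield (tag, gt_chunk, ocr_chunk) opcodes over grapheme clusters."""
    a, b = graphemes(gt), graphemes(ocr)
    matcher = SequenceMatcher(isjunk=None, a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        yield tag, ''.join(a[i1:i2]), ''.join(b[j1:j2])
```

For example, `list(align('Straße', 'Strasse'))` yields an equal run `Stra`, a replacement of `ß` by `ss`, and an equal `e`, while `graphemes('aͤ')` keeps base letter and combining mark as one comparison unit.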
> Or we integrate dinglehopper's HTML and display it via WebKit directly.

…is what #25 brought. Still, creating comparisons on the fly (without the need to run `ocrd-dinglehopper` on the complete workspace) would be preferable IMHO. And when it is clear that both sides have the same line segmentation, a simple diff highlighter might still be better. So let's keep this open for discussion etc.
> Still, creating comparisons on the fly (without the need to run `ocrd-dinglehopper` on the complete workspace) would be preferable IMHO

I haven't tested it, but it should be possible to use `-g` to just process one page. I also have some speed improvements planned, so I guess that should help too.
> I haven't tested it, but it should be possible to use `-g` to just process one page.
The problem is that we want to avoid creating new fileGrps just for viewing. We would need to re-load the workspace model (expensive), and the temporary fileGrps would have to be removed afterwards.
So we actually need some API or non-OCRD CLI integration here – independent of METS, perhaps in-memory altogether. Even if the alignment/diff-rendering is expensive, it could be cached (and perhaps calculated asynchronously, so the UI would not stall)...
There is a proof of concept in the diff-view branch. For now it simply uses the built-in Python `difflib.SequenceMatcher`, without any notion of a possibly preexisting segmentation. The algorithm is really quite naive, but works for me. It shouldn't be too hard to wrap other algorithms to return their results in a TaggedText class, but I'd really like to extend the TaggedText/TaggedString data model first to include some more information (especially the ids of the TextNodes) before merging.
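For readers who have not checked out the branch, the TaggedText/TaggedString idea could be sketched roughly like this (a hypothetical reconstruction, not the actual branch code; the `node_id` field stands for the planned TextNode ids):

```python
# Hypothetical sketch of a TaggedText/TaggedString data model: each segment
# carries the diff tag and, eventually, the id of the PAGE text node it came
# from, so a highlighter can colorize within lines.
from dataclasses import dataclass, field
from difflib import SequenceMatcher

@dataclass
class TaggedString:
    text: str
    tag: str           # 'equal' | 'replace' | 'insert' | 'delete'
    node_id: str = ''  # planned: id of the originating TextLine/Word

@dataclass
class TaggedText:
    segments: list = field(default_factory=list)

    @classmethod
    def from_diff(cls, a, b, node_id=''):
        """Build tagged segments from a naive SequenceMatcher alignment."""
        matcher = SequenceMatcher(None, a, b, autojunk=False)
        segments = [TaggedString(b[j1:j2] or a[i1:i2], tag, node_id)
                    for tag, i1, i2, j1, j2 in matcher.get_opcodes()]
        return cls(segments)
```

Other alignment algorithms would then only need a thin wrapper emitting the same segment tuples.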