hnesk/browse-ocrd

add OCR alignment and difference view

Closed this issue · 6 comments

This is clearly a desideratum here, but how do we approach it?

Considerations:

  1. The additional view would need 2 FileGroupSelectors instead of 1
  2. There are 2 cases:
    • A: equal segmentation but different recognition results: character alignment and difference highlighting within lines only
    • B: different segmentation and recognition results: textline alignment and difference highlighting within larger chunks
  3. The actual alignment code needs to be fast and reliable. The underlying problem of global sequence alignment (Needleman-Wunsch algorithm) has O(n²) complexity (or O(n³) under arbitrary weights). There are many packages for this on PyPI with various levels of features (including cost functions or weights) and efficiency (including C library backends). But not all of them are
    • suited for Unicode (or arbitrary lists of objects),
    • robust (both in terms of crashes and glitches on strange input and heap/stack restrictions),
    • actually efficient (in terms of average-case or best-case complexity),
    • well maintained and packaged.
  4. For historical text specifically, one must treat grapheme clusters as single objects to compare, and probably normalize certain sequences (or at least reduce their distance/cost to the normalized equivalent), e.g. a decomposed ä (a + combining diaeresis) vs the precomposed ä, or the ligature ﬅ vs ſt, or even ſ vs s (a minimal sketch follows after this list).
  5. It would therefore seem natural to delegate to one of the existing OCR-D processors for OCR evaluation (or their backend library modules), i.e. ocrd-dinglehopper and ocrd-cor-asv-ann-evaluate, which have quite a few differences:
| ocrd-dinglehopper | ocrd-cor-asv-ann-evaluate |
| --- | --- |
| CER and WER and visualization | only CER (currently) |
| only single pages | aggregates over all pages |
| result is HTML with visual diff + JSON report | result is logging |
| alignment written in Python (slow) | difflib.SequenceMatcher (fast; I tried many libraries on lots of data for robustness and speed, and consequently decided to revert to that) |
| uniseg.graphemecluster to get alignment + distances on graphemes (lists of objects) | calculates alignment on codepoints (faster), but then post-processes to join combining sequences with their base character, so distances are almost always on graphemes as well |
| a set of normalizations that (roughly) target OCR-D GT transcription guidelines level 3 to level 2 (which is laudable) | offers plain Levenshtein for GT level 3, NFC/NFKC/NFKD/NFD for GT level 2, and a custom normalization (called historic_latin) that targets GT level 1 (because NFKC is both quite incomplete and too much already) |
| text alignment of complete page text concatenated (suitable for A or B) | text alignment on identical textlines (suitable for B only) |
| compares 1:1 | compares 1:N |
  6. Whatever module we choose, and whatever method we use to integrate its core functionality (without the actual OCR-D processor), we need to visualise the difference with Gtk facilities. For GtkSource.LanguageManager, an off-the-shelf highlighter that lends itself to this is diff (coloring diff -u line output); see the sketch below. But this does not colorize within lines (like git diff --word-diff, wdiff, dwdiff etc. do), which is the most important contribution IMHO. So perhaps we need to use some existing word-diff syntax and write our own highlighter after all. Or we integrate dinglehopper's HTML and display it via WebKit directly.
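
To make point 4 concrete, here is a minimal sketch (not a proposal for the final API): alignment on grapheme clusters via the built-in difflib.SequenceMatcher, assuming the uniseg package for grapheme segmentation and plain NFC as a stand-in for the more elaborate normalizations discussed above.

```python
# Sketch only: grapheme-cluster alignment of two text lines.
import difflib
import unicodedata

from uniseg.graphemecluster import grapheme_clusters  # assumption: uniseg is installed


def align(gt_text, ocr_text):
    """Yield (tag, gt_chunk, ocr_chunk) triples describing the alignment."""
    # Compare grapheme clusters, not codepoints, so that a combining sequence
    # like 'a' + U+0308 counts as one symbol; NFC removes the trivial
    # precomposed/decomposed differences up front (swap in NFKC, historic_latin
    # etc. here as needed).
    gt = list(grapheme_clusters(unicodedata.normalize('NFC', gt_text)))
    ocr = list(grapheme_clusters(unicodedata.normalize('NFC', ocr_text)))
    matcher = difflib.SequenceMatcher(isjunk=None, a=gt, b=ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        yield tag, ''.join(gt[i1:i2]), ''.join(ocr[j1:j2])


for op in align('Waſſer fließt', 'Wasser flieht'):
    print(op)
```

SequenceMatcher has quadratic worst-case but near-linear best-case behaviour, which fits requirement 3 for line- or page-sized input.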
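And for point 6, a sketch of the off-the-shelf route: feed a unified diff into a GtkSource.Buffer with the stock diff language. The fileGrp names are placeholders, and GtkSource 3.0 is assumed to match the Gtk 3 stack.

```python
# Sketch only: line-level diff highlighting via GtkSourceView's 'diff' language.
import difflib

import gi
gi.require_version('GtkSource', '3.0')
from gi.repository import GtkSource


def diff_view(text_a, text_b):
    # Unified diff of the two page texts (fromfile/tofile are placeholder
    # fileGrp names, not anything prescribed by browse-ocrd).
    unified = '\n'.join(difflib.unified_diff(
        text_a.splitlines(), text_b.splitlines(),
        fromfile='OCR-D-OCR-TESS', tofile='OCR-D-OCR-CALAMARI', lineterm=''))
    language = GtkSource.LanguageManager.get_default().get_language('diff')
    buffer = GtkSource.Buffer(language=language)
    buffer.set_text(unified)
    return GtkSource.View(buffer=buffer, editable=False, monospace=True)
```

As said, this only colors whole lines; within-line highlighting would need a custom language definition or explicit Gtk text tags.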
> Or we integrate dinglehopper's HTML and display it via WebKit directly.

…is what #25 brought. Still, creating comparisons on the fly (without the need to run ocrd-dinglehopper on the complete workspace) would be preferable IMHO. And when it is clear that both sides have the same line segmentation, a simple diff highlighter might still be better. So let's keep this open for discussion etc.

> Still, creating comparisons on the fly (without the need to run ocrd-dinglehopper on the complete workspace) would be preferable IMHO

I haven't tested it, but it should be possible to use -g to just process one page. I also have some speed improvements planned, so I guess that should help too.

> I haven't tested it, but it should be possible to use -g to just process one page.

The problem is that we want to avoid creating new fileGrps just for viewing. We would need to re-load the workspace model (expensive), and the temporary fileGrps would have to be removed afterwards.

So we actually need some API or non-OCR-D CLI integration here – independent of METS, perhaps in-memory altogether. Even if the alignment/diff-rendering is expensive, it could be cached (and perhaps calculated asynchronously, so the UI would not stall)...
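
Roughly what "cached and calculated asynchronously" could look like, as a sketch: a worker thread plus GLib.idle_add, with memoization over the text pair. aligned_diff and update_diff_view are placeholders here, not existing browse-ocrd API.

```python
# Sketch only: compute diffs off the main thread and cache the results.
import difflib
import threading
from functools import lru_cache

from gi.repository import GLib


@lru_cache(maxsize=128)
def aligned_diff(text_a, text_b):
    # Stand-in for the real (expensive) alignment / diff rendering.
    return '\n'.join(difflib.unified_diff(
        text_a.splitlines(), text_b.splitlines(), lineterm=''))


def request_diff(text_a, text_b, update_diff_view):
    """Run the alignment in a worker thread, then hand the result to the Gtk main loop."""
    def worker():
        result = aligned_diff(text_a, text_b)  # cached across repeated requests
        GLib.idle_add(update_diff_view, result)

    threading.Thread(target=worker, daemon=True).start()
```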

hnesk commented

There is a proof of concept in the diff-view branch. For now it simply uses the built-in Python difflib.SequenceMatcher without any notion of a possibly preexisting segmentation. The algorithm is really quite naive, but worksforme. It shouldn't be too hard to wrap other algorithms to return their results in a TaggedText class, but I'd really like to extend the TaggedText/TaggedString data model first to include some more information (especially the IDs of the TextNodes) before merging.
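
For illustration only (the actual TaggedText/TaggedString model in the diff-view branch carries more information and may look quite different), the underlying idea of turning SequenceMatcher opcodes into per-side tagged spans is roughly this:

```python
# Sketch only: per-side (text, tag) spans derived from SequenceMatcher opcodes,
# the kind of structure a view could map to Gtk text tags.
import difflib


def tagged_spans(left, right):
    """Return two lists of (text, tag) spans, one for each side."""
    left_spans, right_spans = [], []
    matcher = difflib.SequenceMatcher(a=left, b=right, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if i2 > i1:
            left_spans.append((left[i1:i2], 'equal' if tag == 'equal' else 'deleted'))
        if j2 > j1:
            right_spans.append((right[j1:j2], 'equal' if tag == 'equal' else 'inserted'))
    return left_spans, right_spans


print(tagged_spans('Dokument', 'Document'))
```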

kba commented

Very nice, here's how that looks, comparing calamari/tesseract output from ocrd-galley:

[screenshot: diff view comparing Calamari and Tesseract output]

hnesk commented

Closed by #29