qurator-spk/dinglehopper

Support comparing line GT directories with line OCR directories

Closed this issue · 22 comments

In #62, @stweil's original problem was, as I understand it, to compare a directory of line GT text files with a directory of line OCR text files. For now I've created fake test data to implement this: fake-line-gt.zip. It looks like this:

% ls *
gt:
line001.gt.txt  line003.gt.txt  line005.gt.txt  line007.gt.txt  line009.gt.txt  line011.gt.txt
line002.gt.txt  line004.gt.txt  line006.gt.txt  line008.gt.txt  line010.gt.txt

some-ocr:
line001.some-ocr.txt  line003.some-ocr.txt  line005.some-ocr.txt  line007.some-ocr.txt  line009.some-ocr.txt  line011.some-ocr.txt
line002.some-ocr.txt  line004.some-ocr.txt  line006.some-ocr.txt  line008.some-ocr.txt  line010.some-ocr.txt

A first implementation should compare the text of pairs of files (matched by filename) and produce a report of metrics and differences over all of the lines. First idea for the CLI:

dinglehopper-lines gt/ --gt-suffix .gt.txt some-ocr/ --ocr-suffix .some-ocr.txt

I'm not sure if this will be the final CLI, but it's what seems necessary at first glance.
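To make the filename matching concrete, here is a minimal sketch of how pairs could be found when both suffixes are given explicitly. The function name `pair_line_files` is hypothetical and this is not dinglehopper's code; GT files without an OCR counterpart are simply skipped here.

```python
from pathlib import Path


def pair_line_files(gt_dir, gt_suffix, ocr_dir, ocr_suffix):
    """Return (gt_path, ocr_path) pairs that share the same base name."""
    gt_dir, ocr_dir = Path(gt_dir), Path(ocr_dir)
    pairs = []
    for gt_path in sorted(gt_dir.iterdir()):
        name = gt_path.name
        if not name.endswith(gt_suffix):
            continue  # not a GT line file
        # Slice by length so that an empty suffix also works
        base = name[: len(name) - len(gt_suffix)]
        ocr_path = ocr_dir / (base + ocr_suffix)
        if ocr_path.is_file():
            pairs.append((gt_path, ocr_path))
    return pairs
```

With the fake data above, `pair_line_files("gt", ".gt.txt", "some-ocr", ".some-ocr.txt")` would pair `line001.gt.txt` with `line001.some-ocr.txt`, and so on.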

What about an even simpler interface:

dinglehopper [OPTIONS] GTDIR OCRDIR [REPORT_PREFIX]

The existing dinglehopper could be extended to accept directory names for its GT and OCR arguments and then either strip all extensions when matching ground truth and OCR lines by default or use new optional --gt-suffix and --ocr-suffix options.

> What about an even simpler interface:
>
> dinglehopper [OPTIONS] GTDIR OCRDIR [REPORT_PREFIX]
>
> The existing dinglehopper could be extended to accept directory names for its GT and OCR arguments

For now, and until the interface is finalized, I'd like to keep the CLI separate; it will share the code anyway.

> and then either strip all extensions when matching ground truth and OCR lines by default or use new optional --gt-suffix and --ocr-suffix options.

For the stripping of all extensions to work we would need to assume that the common prefix for a pair does not contain a dot, and the explicit suffix options seemed saner.

But I think I'll start implementing this, CLI details can still be refined later.
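A small illustration of the dot problem mentioned above. The file names are made up for this example; both helper functions are hypothetical and only show the difference between the two matching strategies.

```python
def strip_all_extensions(filename):
    """Cut the name at the first dot, dropping every extension."""
    return filename.split(".", 1)[0]


def strip_suffix(filename, suffix):
    """Remove one explicit, known suffix from the end of the name."""
    if suffix and filename.endswith(suffix):
        return filename[: -len(suffix)]
    return filename


# Hypothetical file names whose common prefix contains a dot
names = ["page1.line001.gt.txt", "page1.line002.gt.txt"]

print([strip_all_extensions(n) for n in names])    # ['page1', 'page1']
print([strip_suffix(n, ".gt.txt") for n in names])  # ['page1.line001', 'page1.line002']
```

Stripping everything after the first dot collapses both lines to the same key, while the explicit suffix keeps them apart.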

> For the stripping of all extensions to work we would need to assume that the common prefix for a pair does not contain a dot, and the explicit suffix options seemed saner.

They will default to something useful: the longest common suffix, i.e.

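import itertools


def all_equal(iterable):
    """Return True if all elements are equal (also True for an empty iterable)."""
    g = itertools.groupby(iterable)
    return next(g, True) and not next(g, False)


def common_prefix(its):
    """Return the longest common prefix of the given iterables as a list."""
    return [p[0] for p in itertools.takewhile(all_equal, zip(*its))]


def common_suffix(its):
    """Return the longest common suffix of the given sequences as an iterator."""
    return reversed(common_prefix(reversed(it) for it in its))


# print("".join(common_prefix(["line001.gt.txt", "line02.gt.txt", "line3.gt.txt"])))
print("".join(common_suffix(["line001.gt.txt", "line02.gt.txt", "line3.gt.txt"])))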

(gives .gt.txt)
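For comparison, a shorter variant of the same idea, assuming the inputs are plain strings: the standard library's `os.path.commonprefix` compares character by character, so applying it to the reversed names yields the common suffix. This is only a sketch, not necessarily what the branch uses.

```python
import os.path


def common_suffix(names):
    """Longest string that every name in `names` ends with."""
    reversed_names = [name[::-1] for name in names]
    # commonprefix works on characters, not on path components
    return os.path.commonprefix(reversed_names)[::-1]


print(common_suffix(["line001.gt.txt", "line02.gt.txt", "line3.gt.txt"]))  # .gt.txt
print(common_suffix(["line001.gt.txt", "line011.gt.txt"]))  # 1.gt.txt
print(common_suffix(["demo.txt"]))  # demo.txt
```

The last two calls show edge cases of any longest-common-suffix default: the detected suffix can swallow trailing characters that the names happen to share, and with a single file it is the whole file name.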

dinglehopper-line-dirs gt some-ocr from the feat/compare-line-texts branch now compares the line texts from the gt and some-ocr directories. It auto-detects the file suffixes. It's WIP, but only WER and word differences are missing.

@stweil Could you test if this works for you?


The lines also line up perfectly, because each pair is put into its own <div class="row">!
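A rough sketch of the row-per-pair idea, to show why the lines stay aligned: each GT/OCR pair becomes one container with two cells. Only `class="row"` comes from the comment above; the `col` class and the function name `render_rows` are assumptions for illustration, not the actual report template.

```python
from html import escape


def render_rows(pairs):
    """Render (gt_text, ocr_text) pairs as one <div class="row"> each."""
    rows = []
    for gt_text, ocr_text in pairs:
        rows.append(
            '<div class="row">'
            f'<div class="col">{escape(gt_text)}</div>'   # "col" is assumed
            f'<div class="col">{escape(ocr_text)}</div>'
            "</div>"
        )
    return "\n".join(rows)


print(render_rows([("Anſtiege um 11 %", "Anſtiege um 11 %"), ("a < b", "a c b")]))
```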

My first test fails:

dinglehopper-line-dirs gt frak2021_1.069 frak2021_1.069
free(): invalid next size (fast)
Aborted

The crash happens in rapidfuzz-1.9.0-py3.9-linux-x86_64.egg/rapidfuzz/cpp_string_metric.cpython-39-x86_64-linux-gnu.so.

@maxbachmann, I now tried to debug the RapidFuzz code, but pip install . fails:

 src/cpp_common.hpp:4:10: fatal error: rapidfuzz/fuzz.hpp: No such file or directory

I can't reproduce with Python 3.9 and rapidfuzz-1.9.0-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl. Hmm. Could we have a look at the data (or a portion of it) that triggers this?

@stweil did you clone the repository including submodules?

git clone --recursive git@github.com:maxbachmann/RapidFuzz.git

As @mikegerber mentioned it would help if you could provide me with some data to reproduce this.

Minimal single line test case (found by bisecting the original large test set):

mkdir a b
echo "Vorjahres.“ (24 % gegenüber 42 %. Daneben auch Anſtiege um 11 %, 22 %, 34 %," >a/demo.txt
echo "PVorſahres.“ (24 0% gegenüber 42 95, Daneben auch Anſtiege um 11 % 22 % 34" >b/demo.txt
dinglehopper-line-dirs a b c
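To cross-check a result for this pair of lines independently of the compiled extension, a plain-Python edit distance can serve as a reference. This is a textbook Wagner-Fischer sketch working on code points without any Unicode normalization, so it is not necessarily the same number dinglehopper would report.

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b (Wagner-Fischer, two rows)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


gt = "Vorjahres.“ (24 % gegenüber 42 %. Daneben auch Anſtiege um 11 %, 22 %, 34 %,"
ocr = "PVorſahres.“ (24 0% gegenüber 42 95, Daneben auch Anſtiege um 11 % 22 % 34"
print(levenshtein(gt, ocr))
```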

> did you clone the repository including submodules?

No, I did not. The installation works after git submodule update --init. I suggest adding that information to the instructions in the README.

> Minimal single line test case (found by bisecting the original large test set):

Thanks, I could reproduce the crash. I will look into it.

Ouch, I had a typo in the edit distance calculation: rapidfuzz/rapidfuzz-cpp@103674d
I am honestly surprised that this never crashed on the input of a fuzz testing tool ...

I released a new version of RapidFuzz with the fix: https://github.com/maxbachmann/RapidFuzz/releases/tag/v1.9.1

Great that this bug is fixed. I've bumped the rapidfuzz dependency to >=1.9.1!

@stweil Could you try https://github.com/qurator-spk/dinglehopper/tree/feat/compare-line-texts again, after updating?

The feat/compare-line-texts branch now also computes WER and word differences. So, once it's tested, it's ready.
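For readers unfamiliar with the metric, a minimal sketch of WER: the word-level edit distance divided by the number of words in the reference. Splitting on whitespace is a simplifying assumption made here; a real implementation would likely use proper Unicode word segmentation and normalization.

```python
def word_error_rate(reference, compared):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()  # simplification: whitespace tokenization
    cmp_words = compared.split()
    previous = list(range(len(cmp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, cmp_word in enumerate(cmp_words, start=1):
            cost = 0 if ref_word == cmp_word else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    distance = previous[-1]
    if not ref_words:
        # No reference words: define WER as 0 if nothing was inserted
        return 0.0 if distance == 0 else float("inf")
    return distance / len(ref_words)


print(word_error_rate("the quick brown fox", "the quick brwn fox"))  # 0.25
```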

A new test with the latest code shows that the memory issue is fixed, but with the full test set I get a new error (an endless recursion in word_error_rate.py line 25, test data is available online):

$ dinglehopper-line-dirs a b c
Traceback (most recent call last):
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/bin/dinglehopper-line-dirs", line 11, in <module>
    load_entry_point('dinglehopper==0.0.0', 'console_scripts', 'dinglehopper-line-dirs')()
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/click-8.0.3-py3.9.egg/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/click-8.0.3-py3.9.egg/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/click-8.0.3-py3.9.egg/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/click-8.0.3-py3.9.egg/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/qurator/dinglehopper/cli_line_dirs.py", line 138, in main
    process(gt, ocr, report_prefix, metrics=metrics)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/qurator/dinglehopper/cli_line_dirs.py", line 67, in process
    l_wer, l_n_words = word_error_rate_n(gt_text, ocr_text)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/multimethod-1.3-py3.9.egg/multimethod.py", line 171, in __call__
    return self[tuple(map(self.get_type, args))](*args, **kwargs)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/qurator/dinglehopper/word_error_rate.py", line 76, in word_error_rate_n
    return word_error_rate_n(reference.text, compared.text)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/multimethod-1.3-py3.9.egg/multimethod.py", line 171, in __call__
    return self[tuple(map(self.get_type, args))](*args, **kwargs)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/qurator/dinglehopper/word_error_rate.py", line 68, in word_error_rate_n
    compared_seq = list(words_normalized(compared))
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/qurator/dinglehopper/word_error_rate.py", line 43, in words
    for word in uniseg.wordbreak.words(s):
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/uniseg-0.7.1.post2-py3.9.egg/uniseg/breaking.py", line 59, in break_units
    for j, bk in enumerate(breakables):
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/uniseg-0.7.1.post2-py3.9.egg/uniseg/wordbreak.py", line 185, in word_breakables
    primitive_boundaries = list(_preprocess_boundaries(s))
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/uniseg-0.7.1.post2-py3.9.egg/uniseg/wordbreak.py", line 153, in _preprocess_boundaries
    prop = word_break(c)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/qurator/dinglehopper/word_error_rate.py", line 25, in new_word_break
    return old_word_break(c, index)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/qurator/dinglehopper/word_error_rate.py", line 25, in new_word_break
    return old_word_break(c, index)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/qurator/dinglehopper/word_error_rate.py", line 25, in new_word_break
    return old_word_break(c, index)
  [Previous line repeated 975 more times]
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/uniseg-0.7.1.post2-py3.9.egg/uniseg/wordbreak.py", line 129, in word_break
    return _word_break(code_point(c, index))
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/uniseg-0.7.1.post2-py3.9.egg/uniseg/db.py", line 75, in word_break
    (ord(u),))
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/uniseg-0.7.1.post2-py3.9.egg/uniseg/codepoint.py", line 127, in ord
    return ord_impl(c, index)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/uniseg-0.7.1.post2-py3.9.egg/uniseg/codepoint.py", line 75, in ord_impl
    return _ord(c if index is None else c[index])
RecursionError: maximum recursion depth exceeded while calling a Python object

With commits cb2be96 and 5b39464 reverted (= no WER), my full data set is processed in 5 seconds (no crash).
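The traceback shows new_word_break calling old_word_break roughly a thousand times in a row. One way such a chain can arise (an assumption, not confirmed in this thread) is a monkey patch that is re-applied on every call, so each wrapper wraps the previous one. The sketch below uses a stand-in object `lib`, not the real uniseg module, and a hypothetical `_patched` marker for the guarded variant.

```python
import sys
import types

# Stand-in for a third-party module; this is not the real uniseg API
lib = types.SimpleNamespace(word_break=lambda c: "Other")


def patch_every_call():
    """Wrap lib.word_break again on every call; the chain keeps growing."""
    old_word_break = lib.word_break

    def new_word_break(c):
        return old_word_break(c)

    lib.word_break = new_word_break


def patch_once():
    """Wrap lib.word_break only if it has not been wrapped yet."""
    if getattr(lib.word_break, "_patched", False):
        return
    old_word_break = lib.word_break

    def new_word_break(c):
        return old_word_break(c)

    new_word_break._patched = True  # hypothetical marker attribute
    lib.word_break = new_word_break


# Simulate patching once per processed line, for many lines
for _ in range(sys.getrecursionlimit() + 100):
    patch_every_call()
try:
    lib.word_break("a")
except RecursionError:
    print("RecursionError after repeated patching")
```

This would also fit the observation that the single-line test case runs fine while the full data set fails.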

Great that half of it is working now! Unfortunately I'm on vacation now, so triaging the WER problem will have to wait until January. Thanks for the test data, this will help greatly!

I've found the problem and fixed it in 8a3f5e4! The feature is now merged.

% /usr/bin/time -f'%e %M' dinglehopper-line-dirs a b
2.19 54028

~ 2 seconds and max. 55MB memory for your example data! 🍾
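The same two numbers (%e is elapsed wall-clock seconds, %M is the maximum resident set size in kilobytes) can be approximated from Python on Unix-like systems. This is a sketch using only the standard library; the helper name `measure` is made up, and the demo command would be replaced by something like ["dinglehopper-line-dirs", "a", "b"].

```python
import resource
import subprocess
import sys
import time


def measure(cmd):
    """Run cmd; return (elapsed seconds, max RSS of child processes so far)."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    elapsed = time.perf_counter() - start
    # ru_maxrss is in kilobytes on Linux (bytes on macOS). It is the peak over
    # all waited-for children of this process, not only the command just run.
    max_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, max_rss


elapsed, max_rss = measure([sys.executable, "-c", "pass"])
print(f"{elapsed:.2f} {max_rss}")
```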

@stweil Let me know if that's working for you! I'll close this issue, feel free to re-open or open another issue if something's still wrong.

@stweil Did you run the latest version on your full data? Did it work?