qurator-spk/dinglehopper

dinglehopper keeps hanging and tests fail

whisere opened this issue · 10 comments

Running dinglehopper gt txt and dinglehopper-line-dirs both hang without any message, and pytest returns errors:

collected 62 items / 18 deselected / 44 selected                                                   

qurator/dinglehopper/tests/extracted_text_test.py .............                              [ 29%]
qurator/dinglehopper/tests/test_align.py .......F..                                          [ 52%]
qurator/dinglehopper/tests/test_character_error_rate.py ..                                   [ 56%]
qurator/dinglehopper/tests/test_edit_distance.py .                                           [ 59%]
qurator/dinglehopper/tests/test_editops.py ..                                                [ 63%]
qurator/dinglehopper/tests/test_ocr_files.py .............                                   [ 93%]
qurator/dinglehopper/tests/test_word_error_rate.py ...                                       [100%]

============================================= FAILURES =============================================
__________________________________ test_with_some_fake_ocr_errors __________________________________

    def test_with_some_fake_ocr_errors():
>       result = list(
            align(
                "Über die vielen Sorgen wegen desselben vergaß",
                "SomeJunk MoreJunk Übey die vielen Sorgen wegen AdditionalJunk deffelben vcrgab",
            )
        )

qurator/dinglehopper/tests/test_align.py:70: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

s1 = ['Ü', 'b', 'e', 'r', ' ', 'd', ...], s2 = ['S', 'o', 'm', 'e', 'J', 'u', ...]

    def seq_align(s1, s2):
        """Align general sequences."""
        s1 = list(s1)
        s2 = list(s2)
        ops = levenshtein_editops(s1, s2)
        i = 0
        j = 0
    
        while i < len(s1) or j < len(s2):
            o = None
            try:
                ot = ops[0]
                if ot[1] == i and ot[2] == j:
                    ops = ops[1:]
                    o = ot
            except IndexError:
                pass
    
            if o:
                if o[0] == "insert":
                    yield None, s2[j]
                    j += 1
                elif o[0] == "delete":
                    yield s1[i], None
                    i += 1
                elif o[0] == "replace":
                    yield s1[i], s2[j]
                    i += 1
                    j += 1
            else:
>               yield s1[i], s2[j]
E               IndexError: list index out of range

qurator/dinglehopper/align.py:42: IndexError
===================================== short test summary info ======================================
FAILED qurator/dinglehopper/tests/test_align.py::test_with_some_fake_ocr_errors - IndexError: lis...
=========================== 1 failed, 43 passed, 18 deselected in 30.24s ===========================
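The IndexError above is raised in the final else branch of seq_align, where no edit operation matches the current position and one of the two sequences is already exhausted. That suggests the operations returned by levenshtein_editops do not fully describe how s1 becomes s2. A minimal sketch of a sanity check follows, assuming the (operation, source position, destination position) tuple layout that seq_align reads; the helper names apply_editops and editops_are_consistent are hypothetical and not part of dinglehopper or rapidfuzz.

```python
def apply_editops(s1, s2, ops):
    """Replay (op, src, dst) edit operations on s1 and return the result.

    Assumes the tuple layout used by seq_align above: op is "insert",
    "delete" or "replace", src indexes s1 and dst indexes s2.
    """
    s1, s2 = list(s1), list(s2)
    out = []
    i = 0  # next position in s1 that has not been consumed yet
    for op, src, dst in ops:
        if src < i or src > len(s1):
            raise ValueError(f"source position out of order or range: {src}")
        out.extend(s1[i:src])  # copy the unchanged stretch before this op
        i = src
        if op == "insert":
            if not 0 <= dst < len(s2):
                raise ValueError(f"destination position out of range: {dst}")
            out.append(s2[dst])
        elif op == "delete":
            if src >= len(s1):
                raise ValueError(f"nothing to delete at position {src}")
            i += 1
        elif op == "replace":
            if src >= len(s1) or not 0 <= dst < len(s2):
                raise ValueError(f"replace out of range: {src}, {dst}")
            out.append(s2[dst])
            i += 1
        else:
            raise ValueError(f"unknown operation: {op!r}")
    out.extend(s1[i:])  # copy whatever is left after the last op
    return out


def editops_are_consistent(s1, s2, ops):
    """Return True if replaying ops on s1 yields exactly s2."""
    try:
        return apply_editops(s1, s2, ops) == list(s2)
    except ValueError:
        return False
```

Running a check like this on the output of levenshtein_editops before aligning would turn the late IndexError into an early, explicit failure that points at the edit operations rather than at seq_align.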

pytest also gets stuck at:
qurator/dinglehopper/tests/test_integ_table_extraction.py ..... [ 83%]
qurator/dinglehopper/tests/test_integ_word_error_rate_ocr.py ..

Python version 3.9.0. Thanks.

I also tried Python 3.10.0, 3.8.9 and 3.6.15; the result is the same on all of them.

I can't reproduce this; I tested a fresh install on Python 3.9. Could you please provide the full output of your pytest call? It would include more useful information, e.g. the platform.

There is another problem with rapidfuzz which leads to tests getting stuck on qurator/dinglehopper/tests/test_integ_ocrd_cli.py (with the pytest process consuming 100% CPU). Downgrading with pip install rapidfuzz==1.9.1 fixes this.

@maxbachmann Any idea how to debug this properly? To reproduce, use Python 3.9, install dinglehopper with rapidfuzz 2.0.4 (including both requirements*.txt) and run

% pytest -k test_integ_ocrd_cli.py          
==================================================================== test session starts ====================================================================
platform linux -- Python 3.9.10, pytest-7.0.1, pluggy-1.0.0
rootdir: /home/mike/devel/dinglehopper-github, configfile: pytest.ini
plugins: flake8-1.0.7, cov-3.0.0, mypy-0.9.1
collected 62 items / 61 deselected / 1 selected                                                                                                             

qurator/dinglehopper/tests/test_integ_ocrd_cli.py .                                                                                                   [100%]

============================================================= 1 passed, 61 deselected in 1.14s ==============================================================
% pip install -U rapidfuzz
Requirement already satisfied: rapidfuzz in /home/mike/.virtualenvs/dinglehopper-github/lib64/python3.9/site-packages (1.9.1)
Collecting rapidfuzz
  Using cached rapidfuzz-2.0.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.1 MB)
Installing collected packages: rapidfuzz
  Attempting uninstall: rapidfuzz
    Found existing installation: rapidfuzz 1.9.1
    Uninstalling rapidfuzz-1.9.1:
      Successfully uninstalled rapidfuzz-1.9.1
Successfully installed rapidfuzz-2.0.4
% pytest -k test_integ_ocrd_cli.py 
==================================================================== test session starts ====================================================================
platform linux -- Python 3.9.10, pytest-7.0.1, pluggy-1.0.0
rootdir: /home/mike/devel/dinglehopper-github, configfile: pytest.ini
plugins: flake8-1.0.7, cov-3.0.0, mypy-0.9.1
collected 62 items / 61 deselected / 1 selected                                                                                                             

qurator/dinglehopper/tests/test_integ_ocrd_cli.py ^Z
[1]  + 521125 suspended  pytest -k test_integ_ocrd_cli.py
% kill %1
[1]  + 521125 terminated  pytest -k test_integ_ocrd_cli.py

(First call using 1.9.1 runs fine, second using 2.0.4 hangs)
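Instead of suspending and killing the hung process by hand, a suspect call can be run in a child interpreter with a timeout so that a hang becomes a reportable result. This is only a sketch using the standard library; the function name run_snippet_with_timeout and its return values are made up for illustration.

```python
import subprocess
import sys


def run_snippet_with_timeout(code, timeout=10.0):
    """Run Python source in a child interpreter and classify the outcome.

    Returns "ok" if the child exits with status 0, "error" if it exits
    with a non-zero status, and "hang" if it does not finish within
    `timeout` seconds (subprocess.run kills the child in that case).
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "hang"
    return "ok" if proc.returncode == 0 else "error"
```

Passing the smallest snippet that still hangs to such a helper makes it easy to compare two installed versions of a dependency in a script or CI job.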

rapidfuzz had a new release 19 hours ago with a bugfix for the relevant code. Make sure you have rapidfuzz 2.0.4+!

% pip list | grep rapidfuzz
rapidfuzz              2.0.4

Sorry, downgrade! pip install rapidfuzz==1.9.1

@mikegerber I can reproduce the issue and will look into it.

I tracked down a small reproducing sample:

from rapidfuzz import string_metric

a = [2425437992138244740]
b = [-4086774168534702970]

string_metric.levenshtein_editops(a, b)

Apparently I replaced uint64_t with int64_t in one too many places, which led to signed integer overflows inside the hashmap implementation. The fix is rapidfuzz/rapidfuzz-cpp@fadfb75, included in v2.0.5.
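To illustrate why the signedness of a 64-bit key matters for a hash table, here is a small Python sketch of the general hazard. It is not rapidfuzz's hashmap: the slot count of 128 and all helper names are assumptions chosen for the example, and Python integers never overflow, so C semantics are emulated explicitly.

```python
MASK64 = (1 << 64) - 1


def as_uint64(x):
    """Reinterpret an integer as an unsigned 64-bit value."""
    return x & MASK64


def as_int64(x):
    """Reinterpret an integer as a signed 64-bit (two's complement) value."""
    x &= MASK64
    return x - (1 << 64) if x >= (1 << 63) else x


def c_remainder(a, b):
    """Emulate C's % for b > 0: the result takes the sign of a."""
    r = abs(a) % b
    return -r if a < 0 else r


SLOTS = 128  # hypothetical table size, for illustration only
key = -4086774168534702970  # the negative value from the reproducer above

signed_slot = c_remainder(as_int64(key), SLOTS)  # negative: not a valid index
unsigned_slot = as_uint64(key) % SLOTS           # always within 0..SLOTS-1
```

With a signed key, C-style remainder arithmetic can produce a negative slot, and arithmetic on large signed values can overflow, which is undefined behavior in C++. Treating the key as unsigned keeps every intermediate value well defined.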

dinglehopper gt ocr no longer hangs after running pip install rapidfuzz==2.0.5. Thanks!

pytest reported:
E ModuleNotFoundError: No module named 'qurator.dinglehopper.tests'
Hint: make sure your test modules/packages have valid Python names.
===================================== short test summary info ======================================
ERROR qurator/dinglehopper/tests/extracted_text_test.py
ERROR qurator/dinglehopper/tests/test_align.py
ERROR qurator/dinglehopper/tests/test_character_error_rate.py
ERROR qurator/dinglehopper/tests/test_edit_distance.py
ERROR qurator/dinglehopper/tests/test_editops.py
ERROR qurator/dinglehopper/tests/test_integ_align.py
ERROR qurator/dinglehopper/tests/test_integ_character_error_rate_ocr.py
ERROR qurator/dinglehopper/tests/test_integ_cli_valid_json.py
ERROR qurator/dinglehopper/tests/test_integ_edit_distance_ocr.py
ERROR qurator/dinglehopper/tests/test_integ_ocrd_cli.py
ERROR qurator/dinglehopper/tests/test_integ_table_extraction.py
ERROR qurator/dinglehopper/tests/test_integ_word_error_rate_ocr.py
ERROR qurator/dinglehopper/tests/test_ocr_files.py
ERROR qurator/dinglehopper/tests/test_word_error_rate.py
!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 14 errors during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!

This is under Python 3.9.0. I guess it doesn't matter, since dinglehopper itself is running okay? Thanks.

dinglehopper gt ocr no longer hangs after running pip install rapidfuzz==2.0.5. Thanks!

Great! I'm bumping the dependency to >= 2.0.5.
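For anyone who wants to verify their environment before the pinned dependency lands, the check can be sketched with the standard library alone. The helper names below are hypothetical, and the version parsing is deliberately simple (numeric dotted versions only); a real project would more likely rely on the packaging library or on pip's own resolver.

```python
from importlib import metadata
from itertools import takewhile

MIN_RAPIDFUZZ = (2, 0, 5)  # first release reported in this thread as fixed


def version_tuple(version):
    """Turn a dotted version string such as "2.0.5" into a tuple of ints.

    Only the leading digits of each component are used; parsing stops at
    the first component that does not start with a digit.
    """
    parts = []
    for piece in version.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_fixed_version(version):
    """Return True if the given version string is at least MIN_RAPIDFUZZ."""
    return version_tuple(version) >= MIN_RAPIDFUZZ


def installed_rapidfuzz_is_fixed():
    """Check the installed rapidfuzz; return None if it is not installed."""
    try:
        return is_fixed_version(metadata.version("rapidfuzz"))
    except metadata.PackageNotFoundError:
        return None
```

Note that 1.9.1, which also worked in this thread, is reported as not fixed by this check, because it only tests for the lower bound the maintainer chose.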

pytest reported: E ModuleNotFoundError: No module named 'qurator.dinglehopper.tests'

That's a different problem. Did you follow the instructions in README-DEV.txt?

Apparently I replaced uint64_t with int64_t in one too many places, which led to signed integer overflows inside the hashmap implementation. The fix is maxbachmann/rapidfuzz-cpp@fadfb75, included in v2.0.5.

This update also fixes my tests, great!