OCR-D/ocrd_calamari

Python 3.11: re.error: global flags not at the start of the expression at position 3


Possibly a problem only with Python 3.11:

FAILED test/test_recognize.py::test_recognize - re.error: global flags not at the start of the expression at position 3
FAILED test/test_recognize.py::test_recognize_should_warn_if_given_rgb_image_and_single_channel_model - re.error: global flags not at the start of the expression at position 3
FAILED test/test_recognize.py::test_word_segmentation - re.error: global flags not at the start of the expression at position 3
FAILED test/test_recognize.py::test_glyphs - re.error: global flags not at the start of the expression at position 3

Python 3.11 changelog seems to support this assumption.
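A minimal sketch to reproduce this outside the test suite, assuming only the standard library. The helper name `compiles` is made up for illustration. It tries one pattern with the `(?u)` flag at the end (as in the failing tests) and one with the flag at the start:

```python
import re
import sys
import warnings


def compiles(pattern: str) -> bool:
    """Return True if `pattern` compiles on the running interpreter."""
    with warnings.catch_warnings():
        # Older Pythons only emit a DeprecationWarning for misplaced flags.
        warnings.simplefilter("ignore", DeprecationWarning)
        try:
            re.compile(pattern)
        except re.error:
            return False
    return True


trailing = r"\s+(?u)"  # global flag at the end, as in the failing pattern
leading = r"(?u)\s+"   # same pattern with the flag moved to the start

print(sys.version_info[:2], "trailing flag compiles:", compiles(trailing))
print(sys.version_info[:2], "leading flag compiles:", compiles(leading))
```

On Python 3.11 and later the trailing variant should be rejected, while on earlier versions it is only deprecated.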

(A linter could have caught this early. → Opening another issue.)


  • Fix it here
  • Merge #78
  • Document the workaround here
  • Get a model update/fix procedure upstream

This does indeed not happen with Python 3.10.12.

It's a problem in calamari-ocr, not ocrd_calamari:

/home/b-mg106/devel/ocrd_calamari/ocrd_calamari/recognize.py:129: in process
    for line, line_coords, raw_results in zip(textlines, line_coordss, raw_results_all):
/home/b-mg106/.pyenv/versions/tmp.ocrd_calamari.issue-91/lib/python3.11/site-packages/calamari_ocr/ocr/predictor.py:250: in predict_raw
    for result in zip(*prediction):
/home/b-mg106/.pyenv/versions/tmp.ocrd_calamari.issue-91/lib/python3.11/site-packages/calamari_ocr/ocr/predictor.py:167: in predict_raw
    yield PredictionResult(p.decoded, codec=self.codec, text_postproc=self.text_postproc,
/home/b-mg106/.pyenv/versions/tmp.ocrd_calamari.issue-91/lib/python3.11/site-packages/calamari_ocr/ocr/predictor.py:37: in __init__
    self.sentence = self.text_postproc.apply("".join(self.chars))
/home/b-mg106/.pyenv/versions/tmp.ocrd_calamari.issue-91/lib/python3.11/site-packages/calamari_ocr/ocr/text_processing/text_processor.py:12: in apply
    return self._apply_single(txts)
/home/b-mg106/.pyenv/versions/tmp.ocrd_calamari.issue-91/lib/python3.11/site-packages/calamari_ocr/ocr/text_processing/text_processor.py:44: in _apply_single
    txt = proc._apply_single(txt)
/home/b-mg106/.pyenv/versions/tmp.ocrd_calamari.issue-91/lib/python3.11/site-packages/calamari_ocr/ocr/text_processing/text_regularizer.py:350: in _apply_single
    txt = re.sub(replacement.old, replacement.new, txt)
/home/b-mg106/.pyenv/versions/3.11.3/lib/python3.11/re/__init__.py:185: in sub
    return _compile(pattern, flags).sub(repl, string, count)
/home/b-mg106/.pyenv/versions/3.11.3/lib/python3.11/re/__init__.py:294: in _compile
    p = _compiler.compile(pattern, flags)
/home/b-mg106/.pyenv/versions/3.11.3/lib/python3.11/re/_compiler.py:743: in compile
    p = _parser.parse(p, flags)
/home/b-mg106/.pyenv/versions/3.11.3/lib/python3.11/re/_parser.py:980: in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
/home/b-mg106/.pyenv/versions/3.11.3/lib/python3.11/re/_parser.py:455: in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,

calamari-ocr 1.0.6

This took me a while to find. The problem lies in the regexen defined in the model!

E.g. 0.ckpt.json:

            {
              "old": "\\s+(?u)",
              "new": " ",
              "regex": true
            },
            {
              "old": "\\n(?u)",
              "regex": true
            },
            {
              "old": "^\\s+(?u)",
              "regex": true
            },
            {
              "old": "\\s+$(?u)",
              "regex": true
            }

Fixing the regexen in *.ckpt.json indeed fixes running on Python 3.11. I only tested make test for now, but this is promising.

Quick-and-dirty script to fix the model:

import re
import json
from glob import glob

for fn in glob("*.json"):
    with open(fn, "r") as fp:
        j = json.load(fp)

    for v in j["model"].values():
        if not isinstance(v, dict):
            continue
        for child in v.get("children", []):
            for replacement in child.get("replacements", []):
                # Move global flags in front
                replacement["old"] = re.sub(
                    r"^(.*)\(\?u\)$", r"(?u)\1", replacement["old"]
                )

    with open(fn, "w") as fp:
        json.dump(j, fp, indent=2)
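To see what the substitution in the script does without touching any files, here is a small sketch that applies the same `re.sub` call to the four patterns quoted from 0.ckpt.json. The function name `move_global_flag_to_front` is hypothetical and not part of the script or of Calamari:

```python
import re


def move_global_flag_to_front(pattern: str) -> str:
    """Move a trailing (?u) to the start; leave other patterns unchanged."""
    return re.sub(r"^(.*)\(\?u\)$", r"(?u)\1", pattern)


# The "old" values from the checkpoint excerpt, as Python strings after JSON decoding.
old_patterns = ["\\s+(?u)", "\\n(?u)", "^\\s+(?u)", "\\s+$(?u)"]

for old in old_patterns:
    fixed = move_global_flag_to_front(old)
    re.compile(fixed)  # raises re.error if the rewritten pattern were still invalid
    print(f"{old!r} -> {fixed!r}")
```

Patterns without a trailing `(?u)` pass through unchanged, so running the fix twice on the same model should be harmless.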

master now includes the above script as fix-calamari1-model:

❯ fix-calamari1-model ~/.local/share/ocrd-resources/ocrd-calamari-recognize/qurator-gt4histocr-1.0
 [ ... unrelated numpy warning ... ]
0.ckpt.json fixed.
1.ckpt.json fixed.
2.ckpt.json fixed.
3.ckpt.json fixed.
4.ckpt.json fixed.
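To check whether a checkpoint still needs fixing, something like the following could be used. This is only a sketch: it assumes the same JSON layout the script above walks (`model` → dict values → `children` → `replacements`), and both the function name `patterns_needing_fix` and the keys `version` and `text_postprocessor` in the example dict are made up for illustration:

```python
def patterns_needing_fix(checkpoint: dict) -> list:
    """Collect regex replacements whose pattern ends with a trailing (?u) flag."""
    found = []
    for v in checkpoint.get("model", {}).values():
        if not isinstance(v, dict):
            continue
        for child in v.get("children", []):
            for replacement in child.get("replacements", []):
                old = replacement.get("old", "")
                # A pattern consisting only of "(?u)" already has the flag at the start.
                if replacement.get("regex") and old.endswith("(?u)") and old != "(?u)":
                    found.append(old)
    return found


# Hypothetical in-memory stand-in for a loaded *.ckpt.json file.
example = {
    "model": {
        "version": 1,  # non-dict values are skipped
        "text_postprocessor": {
            "children": [
                {
                    "replacements": [
                        {"old": "\\s+(?u)", "new": " ", "regex": True},
                        {"old": "(?u)\\n", "new": "", "regex": True},
                    ]
                }
            ]
        },
    }
}

print(patterns_needing_fix(example))
```

An empty list would mean no pattern with a trailing `(?u)` was found in that checkpoint.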

This (or something equivalent) should probably go into Calamari's 1.0 branch.

ocrd-calamari-recognize also works with the fixed model.

I've opened an issue upstream: Calamari-OCR/calamari#348

And a PR against calamari/1.0 branch that fixes the issue: Calamari-OCR/calamari#349

* [ ]  Document the workaround here 
* [ ]  Get a model update/fix procedure upstream

This issue should be enough for documentation, especially since nobody else uses 3.11 yet (fingers crossed). The fix is merged upstream, and the new release is tracked in #94.