Python 3.11: re.error: global flags not at the start of the expression at position 3
Possibly a problem only with Python 3.11:
FAILED test/test_recognize.py::test_recognize - re.error: global flags not at the start of the expression at position 3
FAILED test/test_recognize.py::test_recognize_should_warn_if_given_rgb_image_and_single_channel_model - re.error: global flags not at the start of the expression at position 3
FAILED test/test_recognize.py::test_word_segmentation - re.error: global flags not at the start of the expression at position 3
FAILED test/test_recognize.py::test_glyphs - re.error: global flags not at the start of the expression at position 3
The Python 3.11 changelog supports this assumption: global inline flags such as (?u) are now only allowed at the start of a regular expression.
(A linter could have caught this early. → Opening another issue.)
- Fix it here
- Merge #78
- Document the workaround here
- Get a model update/fix procedure upstream
This does indeed not happen with Python 3.10.12.
It's a problem in calamari-ocr, not ocrd_calamari:
/home/b-mg106/devel/ocrd_calamari/ocrd_calamari/recognize.py:129: in process
for line, line_coords, raw_results in zip(textlines, line_coordss, raw_results_all):
/home/b-mg106/.pyenv/versions/tmp.ocrd_calamari.issue-91/lib/python3.11/site-packages/calamari_ocr/ocr/predictor.py:250: in predict_raw
for result in zip(*prediction):
/home/b-mg106/.pyenv/versions/tmp.ocrd_calamari.issue-91/lib/python3.11/site-packages/calamari_ocr/ocr/predictor.py:167: in predict_raw
yield PredictionResult(p.decoded, codec=self.codec, text_postproc=self.text_postproc,
/home/b-mg106/.pyenv/versions/tmp.ocrd_calamari.issue-91/lib/python3.11/site-packages/calamari_ocr/ocr/predictor.py:37: in __init__
self.sentence = self.text_postproc.apply("".join(self.chars))
/home/b-mg106/.pyenv/versions/tmp.ocrd_calamari.issue-91/lib/python3.11/site-packages/calamari_ocr/ocr/text_processing/text_processor.py:12: in apply
return self._apply_single(txts)
/home/b-mg106/.pyenv/versions/tmp.ocrd_calamari.issue-91/lib/python3.11/site-packages/calamari_ocr/ocr/text_processing/text_processor.py:44: in _apply_single
txt = proc._apply_single(txt)
/home/b-mg106/.pyenv/versions/tmp.ocrd_calamari.issue-91/lib/python3.11/site-packages/calamari_ocr/ocr/text_processing/text_regularizer.py:350: in _apply_single
txt = re.sub(replacement.old, replacement.new, txt)
/home/b-mg106/.pyenv/versions/3.11.3/lib/python3.11/re/__init__.py:185: in sub
return _compile(pattern, flags).sub(repl, string, count)
/home/b-mg106/.pyenv/versions/3.11.3/lib/python3.11/re/__init__.py:294: in _compile
p = _compiler.compile(pattern, flags)
/home/b-mg106/.pyenv/versions/3.11.3/lib/python3.11/re/_compiler.py:743: in compile
p = _parser.parse(p, flags)
/home/b-mg106/.pyenv/versions/3.11.3/lib/python3.11/re/_parser.py:980: in parse
p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
/home/b-mg106/.pyenv/versions/3.11.3/lib/python3.11/re/_parser.py:455: in _parse_sub
itemsappend(_parse(source, state, verbose, nested + 1,
This is with calamari-ocr 1.0.6.
This took me a while to find. The problem lies in the regexen defined in the model!
E.g. 0.ckpt.json:
{
  "old": "\\s+(?u)",
  "new": " ",
  "regex": true
},
{
  "old": "\\n(?u)",
  "regex": true
},
{
  "old": "^\\s+(?u)",
  "regex": true
},
{
  "old": "\\s+$(?u)",
  "regex": true
}
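A minimal sketch to reproduce the failure outside Calamari, using only the standard library. The helper name `compiles` is hypothetical, and the pattern list simply mirrors the four "old" values above after JSON decoding. On Python 3.11 and later the trailing `(?u)` raises `re.error`; older versions only emit a DeprecationWarning, which is silenced here.

```python
import re
import sys
import warnings


def compiles(pattern: str) -> bool:
    """Return True if `pattern` compiles under the running Python version."""
    with warnings.catch_warnings():
        # Python < 3.11 only warns about misplaced global flags
        warnings.simplefilter("ignore", DeprecationWarning)
        try:
            re.compile(pattern)
        except re.error:
            return False
    return True


# The four patterns as they appear in the model, after JSON decoding
broken = [r"\s+(?u)", r"\n(?u)", r"^\s+(?u)", r"\s+$(?u)"]

print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
for pattern in broken:
    # Same pattern with the global flag moved to the front
    fixed = "(?u)" + pattern[: -len("(?u)")]
    print(f"{pattern!r}: {compiles(pattern)}  {fixed!r}: {compiles(fixed)}")
```

With the flag at the front, all four patterns should compile on any supported Python version.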
Fixing the regexen in *.ckpt.json indeed fixes running on Python 3.11. I only tested make test for now, but this is promising.
Q&D script to fix the model:
import json
import re
from glob import glob

for fn in glob("*.json"):
    with open(fn, "r") as fp:
        j = json.load(fp)
    for v in j["model"].values():
        # Skip non-dict entries
        if not isinstance(v, dict):
            continue
        for child in v.get("children", []):
            for replacement in child.get("replacements", []):
                # Move a trailing global flag "(?u)" to the front of the pattern
                replacement["old"] = re.sub(
                    r"^(.*)\(\?u\)$", r"(?u)\1", replacement["old"]
                )
    with open(fn, "w") as fp:
        json.dump(j, fp, indent=2)
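The substitution at the heart of the script can be tried in isolation. This is only a sketch: the function name `move_flag_to_front` is hypothetical, and the inputs are the four patterns from the model as they look after JSON decoding.

```python
import re

# Same substitution as in the script: a "(?u)" at the very end moves to the front
TRAILING_FLAG = re.compile(r"^(.*)\(\?u\)$")


def move_flag_to_front(pattern: str) -> str:
    return TRAILING_FLAG.sub(r"(?u)\1", pattern)


examples = [r"\s+(?u)", r"\n(?u)", r"^\s+(?u)", r"\s+$(?u)"]
fixed = [move_flag_to_front(p) for p in examples]
print(fixed)

# Already-fixed patterns no longer end in "(?u)", so a second pass changes nothing
assert [move_flag_to_front(p) for p in fixed] == fixed
```

Because the rewrite is idempotent, running the script twice on the same model should be harmless. Note that it only handles a trailing `(?u)`; other misplaced flags would need their own rule.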
master now includes the above script as fix-calamari1-model:
❯ fix-calamari1-model ~/.local/share/ocrd-resources/ocrd-calamari-recognize/qurator-gt4histocr-1.0
[ ... unrelated numpy warning ... ]
0.ckpt.json fixed.
1.ckpt.json fixed.
2.ckpt.json fixed.
3.ckpt.json fixed.
4.ckpt.json fixed.
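To double-check a model after fixing it, one could list any regex replacements that still fail to compile. The sketch below assumes the same JSON layout the fix script walks (`model` → dict entries → `children` → `replacements`); the function name `uncompilable_patterns` and the sample data (including the `text_postprocessor` and `version` keys and the empty `"new"` value) are made up for illustration and are not taken from a real checkpoint.

```python
import json
import re
import warnings


def uncompilable_patterns(model_json: dict) -> list:
    """Collect regex replacements whose "old" pattern fails to compile."""
    bad = []
    for v in model_json["model"].values():
        if not isinstance(v, dict):
            continue
        for child in v.get("children", []):
            for replacement in child.get("replacements", []):
                if not replacement.get("regex"):
                    continue
                with warnings.catch_warnings():
                    # Treat the pre-3.11 DeprecationWarning like the 3.11 error
                    warnings.simplefilter("error", DeprecationWarning)
                    try:
                        re.compile(replacement["old"])
                    except (re.error, DeprecationWarning):
                        bad.append(replacement["old"])
    return bad


# Hypothetical sample: one unfixed and one fixed pattern
sample = json.loads(r'''
{"model": {"text_postprocessor": {"children": [{"replacements": [
  {"old": "\\s+(?u)", "new": " ", "regex": true},
  {"old": "(?u)^\\s+", "new": "", "regex": true}
]}]}, "version": 1}}
''')

print(uncompilable_patterns(sample))
```

For a real model, the dict would come from `json.load` on each `*.ckpt.json` file; an empty result means no pattern trips the Python 3.11 check.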
This (or something equivalent) should probably go into Calamari's 1.0 branch.
ocrd-calamari-recognize also works with the fixed model.
I've opened an issue upstream: Calamari-OCR/calamari#348
And a PR against the calamari/1.0 branch that fixes the issue: Calamari-OCR/calamari#349
- [ ] Document the workaround here
- [ ] Get a model update/fix procedure upstream
This issue should be enough for documentation, especially since nobody else uses 3.11 yet (fingers crossed). The fix is merged upstream, and a new release is tracked in #94.