qurator-spk/dinglehopper

UnorderedGroup

mikegerber opened this issue · 0 comments

@cneud reported problems with the ENP dataset. Example files:

example.zip

The GT file contains an UnorderedGroup in its reading order, which triggers a NotImplementedError:

% dinglehopper 00008061.gt.xml 00008061.eng.xml
Traceback (most recent call last):
  File "/home/mike/.virtualenvs/dinglehopper-github/bin/dinglehopper", line 11, in <module>
    load_entry_point('dinglehopper', 'console_scripts', 'dinglehopper')()
  File "/home/mike/.virtualenvs/dinglehopper-github/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/mike/.virtualenvs/dinglehopper-github/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/mike/.virtualenvs/dinglehopper-github/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mike/.virtualenvs/dinglehopper-github/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/mike/devel/dinglehopper-github/qurator/dinglehopper/cli.py", line 180, in main
    process(gt, ocr, report_prefix, metrics=metrics, textequiv_level=textequiv_level)
  File "/home/mike/devel/dinglehopper-github/qurator/dinglehopper/cli.py", line 93, in process
    gt_text = extract(gt, textequiv_level=textequiv_level)
  File "/home/mike/devel/dinglehopper-github/qurator/dinglehopper/ocr_files.py", line 155, in extract
    return page_extract(tree, textequiv_level=textequiv_level)
  File "/home/mike/devel/dinglehopper-github/qurator/dinglehopper/ocr_files.py", line 79, in page_extract
    raise NotImplementedError
NotImplementedError

To do:

  • Make this a warning and read the regions of an UnorderedGroup in XML (document) order
  • Check how other tools handle UnorderedGroup
  • Find a proper solution (hard!)
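
The first item could look roughly like the sketch below. It is not dinglehopper's actual code: the function names `local_name` and `region_ids_in_reading_order` are made up for illustration, and it uses only the standard library rather than whatever `page_extract` uses internally. It assumes PAGE-style reading-order elements (`OrderedGroup`, `UnorderedGroup`, `RegionRef`, `RegionRefIndexed` and the `...Indexed` group variants) with `regionRef` and `index` attributes. Ordered groups are sorted by `index`; unordered groups emit a warning and are read in XML order.

```python
import warnings
import xml.etree.ElementTree as ET

# Element names assumed to matter for reading order (PAGE XML style).
REFS = {"RegionRef", "RegionRefIndexed"}
ORDERED = {"OrderedGroup", "OrderedGroupIndexed"}
UNORDERED = {"UnorderedGroup", "UnorderedGroupIndexed"}


def local_name(element):
    """Return the tag name without its XML namespace."""
    return element.tag.rsplit("}", 1)[-1]


def region_ids_in_reading_order(group):
    """Yield the region IDs referenced by a reading-order element.

    Hypothetical helper: ordered groups are sorted by their children's
    index attribute, unordered groups fall back to XML order with a warning.
    """
    name = local_name(group)
    if name in REFS:
        yield group.get("regionRef")
        return

    # Ignore children that are not part of the reading order (e.g. metadata).
    children = [
        child for child in group
        if local_name(child) in REFS | ORDERED | UNORDERED
    ]
    if name in ORDERED:
        children.sort(key=lambda child: int(child.get("index", 0)))
    elif name in UNORDERED:
        warnings.warn(
            "UnorderedGroup has no defined order; reading regions in XML order"
        )

    for child in children:
        yield from region_ids_in_reading_order(child)
```

Calling `list(region_ids_in_reading_order(group))` on the top-level group below `ReadingOrder` gives the region IDs in the order their text would be concatenated. Whether XML order is a sensible fallback for the ENP files is exactly what the remaining two items would need to establish.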