wanghaisheng/awesome-ocr

Evaluating and post-processing the results of multiple OCR engines

wanghaisheng opened this issue · 2 comments

Process, enhance, and evaluate the output of multiple OCR engines.
https://github.com/UB-Mannheim/ocromore

  1. Parse all OCR output files into a database
    (This step only has to be done once)

  2. Pre-process the gathered information
    The results of the following processes can also
    be stored directly in the database

    • Line-matching all files
    • Unspacing words in each file
      Unspacing means deleting the whitespace in letter-spaced text
      (e.g. H e l l o => Hello)
    • Word-matching all files per line
  3. Combine file information

    • Different compare methods
      • Textdistance-Keying
        • Levenshtein
        • Damerau-Levenshtein
        • ...
      • Multi-Sequence-Alignment (MSA)
        • pivot-based
        • linewise/wordwise
        • Adjustable search-space-processor correction
          • Matching similar characters
          • Whitespace/Wildcard improvements
        • Adjustable decision parameters
          • Char confidence
          • Best-of-n
  4. The output can be stored in the database and/or
    as *.txt or *.hocr.

  5. Evaluate the output against ground-truth files or against each other and generate an accuracy report,
    or compare the files visually with diff tools.
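
Steps 2 and 3 above can be illustrated with a small Python sketch. This is not ocromore's code: the function names (`unspace`, `levenshtein`, `best_of_n`), the regex heuristic, and the sample lines are all assumptions made for illustration. It collapses letter-spaced runs, picks as pivot the line with the smallest total Levenshtein distance to the others, and then takes a best-of-n vote per word among lines that have the same word count as the pivot.

```python
import re
from collections import Counter

# Hypothetical heuristic: a run of three or more single characters, each
# separated by exactly one space (e.g. "H e l l o"). It will also merge
# genuine sequences of one-letter words, so treat it as a sketch only.
SPACED_RUN = re.compile(r"(?<!\S)(?:\S ){2,}\S(?!\S)")


def unspace(line: str) -> str:
    """Collapse letter-spaced runs such as 'H e l l o' into 'Hello'."""
    return SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), line)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def best_of_n(lines: list[str]) -> str:
    """Pivot-based word vote over several OCR readings of the same line."""
    # Pivot: the reading closest (in summed edit distance) to all readings.
    pivot_idx = min(range(len(lines)),
                    key=lambda i: sum(levenshtein(lines[i], o) for o in lines))
    tokens = [line.split() for line in lines]
    pivot = tokens[pivot_idx]
    # Only readings with the same word count are trivially word-aligned;
    # the pivot goes first so that it wins ties in the vote.
    voters = [pivot] + [t for i, t in enumerate(tokens)
                        if i != pivot_idx and len(t) == len(pivot)]
    return " ".join(Counter(column).most_common(1)[0][0]
                    for column in zip(*voters))


if __name__ == "__main__":
    print(unspace("H e l l o world"))
    print(best_of_n(["The quick brovvn fox",
                     "The qu1ck brown fox",
                     "The quick brown f0x"]))
```

A real multi-sequence alignment would also handle readings whose word counts differ, for example by inserting wildcard or gap tokens; this sketch simply leaves such readings out of the vote.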

https://github.com/KBNLresearch/ochre

Ochre

Ochre is a toolbox for OCR post-correction.

  • Overview of OCR post-correction data sets
  • Preprocess data sets
  • Train character-based language models/LSTMs for OCR post-correction
  • Do the post-correction
  • Assess the performance of OCR post-correction
  • Analyze OCR errors
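
To give an idea of what analyzing OCR errors can look like, here is a minimal sketch that uses only Python's standard `difflib`. It does not use Ochre's API; the function name `confusion_pairs` and the sample strings are hypothetical. It aligns an OCR string with its ground truth and counts which character sequences were confused with which.

```python
from collections import Counter
from difflib import SequenceMatcher


def confusion_pairs(ocr: str, truth: str) -> Counter:
    """Count (ocr_fragment, truth_fragment) pairs wherever the texts differ.

    An empty ocr_fragment means the OCR dropped characters (deletion);
    an empty truth_fragment means the OCR produced extra characters.
    """
    pairs = Counter()
    matcher = SequenceMatcher(None, ocr, truth, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            pairs[(ocr[i1:i2], truth[j1:j2])] += 1
    return pairs


if __name__ == "__main__":
    for (wrong, right), n in confusion_pairs("Tbe qu1ck", "The quick").items():
        print(f"{wrong!r} -> {right!r}: {n}")
```

Running the same count on the text before and after post-correction shows which confusions a correction step removes and which it introduces.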

Ochre contains ready-to-use data processing workflows based on CWL (the Common Workflow Language). The software also allows you to create your own OCR post-correction workflows. Examples of how to create these can be found in the notebooks directory; to use them, make sure Jupyter Notebook is installed. This directory also contains notebooks that show how results can be analyzed and visualized.