OCR-D Tool for Evaluation

This tool can be used to evaluate OCR-D processors with the Taverna workflow.

This Java library contains utility classes to extract all metadata of OCR-D workspaces.

Prerequisites

In order to build this library you'll need:

Java SE Development Kit 8 or higher
ocrd 2.7.1 or higher

Building Tool

To build the tool, go to the project folder and type:

# Build library
user@localhost:workflow_generator/$./gradlew clean build 
BUILD SUCCESSFUL in 11s
10 actionable tasks: 10 executed

As a result, a jar containing all the utility classes is created at 'build/libs/WorkflowTool-0.1.0.jar'.

First Test

To run the tool, go to the project folder and type:

# Run tool
user@localhost:workflow_generator/$java -jar ./build/libs/WorkflowTool-0.1.0.jar
Missing parameters!
Usage:
  test      path/to/workflow_configuration4permuation.txt [workflow_configuration_new.txt] 
  permutate path/to/workflow_configuration4permuation.txt [workflow_configuration_new.txt [max_no_of_processor_steps_per_file]] 
  eval      path/to/workspace [evaluation_results.csv] 
  listProv  path/to/workspace


Explanation:

Tool for creating workflow configuration file(s) used by taverna workflow.

test      - Compiles a workflow_configuration file with all active processors.
permutate - Compiles workflow_configuration file(s) with all active processors.
eval      - Evaluates the json files generated by dinglehopper.
listProv  - Evaluates the provenance file(s) and prints the input/output file groups of each processor.
user@localhost:workflow_generator/$

Task 'test'

This task may be used to test all active processors/parameters once, which is useful for detecting typos. The created workflow can be executed with the Taverna workflow and analyzed afterwards with the task 'listProv'.

Task 'permutate'

Creates permutations of all 'active' processors. It is recommended to restrict the number of processing steps per workflow configuration file: if too many files are listed inside the METS file, processing slows down and may fail due to memory issues. The appropriate limit depends on the hardware and the size of the manuscript. If there are more processing steps than allowed per file, the tool creates multiple files with an index at the end.
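The idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the tool's actual implementation: it assumes that a workflow picks exactly one active processor per step, and that the limit counts the total number of processing steps written to one file. The class name `PermutationSketch`, its methods, and the processor names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PermutationSketch {

    /** Builds every combination that picks exactly one processor per workflow step. */
    static List<List<String>> combine(List<List<String>> activePerStep) {
        List<List<String>> result = new ArrayList<>();
        result.add(new ArrayList<>()); // start with a single empty workflow
        for (List<String> candidates : activePerStep) {
            List<List<String>> next = new ArrayList<>();
            for (List<String> prefix : result) {
                for (String processor : candidates) {
                    List<String> extended = new ArrayList<>(prefix);
                    extended.add(processor);
                    next.add(extended);
                }
            }
            result = next;
        }
        return result;
    }

    /** Groups workflows so that each group holds at most maxStepsPerFile steps (assumed meaning of the limit). */
    static List<List<List<String>>> split(List<List<String>> workflows, int maxStepsPerFile) {
        List<List<List<String>>> files = new ArrayList<>();
        List<List<String>> current = new ArrayList<>();
        int steps = 0;
        for (List<String> workflow : workflows) {
            // Start a new file once the next workflow would exceed the limit.
            if (!current.isEmpty() && steps + workflow.size() > maxStepsPerFile) {
                files.add(current);
                current = new ArrayList<>();
                steps = 0;
            }
            current.add(workflow);
            steps += workflow.size();
        }
        if (!current.isEmpty()) {
            files.add(current);
        }
        return files;
    }

    public static void main(String[] args) {
        // Hypothetical processor names, three workflow steps.
        List<List<String>> steps = Arrays.asList(
                Arrays.asList("binarize-a", "binarize-b"),
                Arrays.asList("segment-a"),
                Arrays.asList("recognize-a", "recognize-b"));
        List<List<String>> workflows = combine(steps);
        List<List<List<String>>> files = split(workflows, 6);
        System.out.println(workflows.size() + " workflows in " + files.size() + " file(s)");
    }
}
```

With the example steps, `main` prints `4 workflows in 2 file(s)`: two choices for the first step, one for the second, and two for the third give four workflows of three steps each, and a limit of six steps fits two workflows per file.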

Task 'eval'

Evaluation of different workflows. It collects all JSON files created by 'dinglehopper' and the provenance file(s) to compile a CSV file with all relevant values. This CSV file can be analyzed with pivot tables.
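To give a rough idea of what such a collection step involves, here is a minimal sketch. It is not the tool's implementation: it ignores the provenance files, and it assumes that each report is a flat JSON object with numeric `cer` and `wer` fields. The class name `EvalSketch`, the CSV column layout, and the regex-based field extraction are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class EvalSketch {

    /** Extracts a numeric field such as "cer": 0.05 from a flat JSON report; returns "" if absent. */
    static String numberField(String json, String key) {
        Pattern p = Pattern.compile("\"" + Pattern.quote(key) + "\"\\s*:\\s*(-?[0-9.eE+-]+)");
        Matcher m = p.matcher(json);
        return m.find() ? m.group(1) : "";
    }

    /** Builds one CSV row: file name, CER, WER (assumed column layout). */
    static String toCsvRow(String fileName, String json) {
        return fileName + "," + numberField(json, "cer") + "," + numberField(json, "wer");
    }

    public static void main(String[] args) throws IOException {
        Path workspace = Paths.get(args.length > 0 ? args[0] : ".");
        List<String> rows = new ArrayList<>();
        rows.add("file,cer,wer");
        try (Stream<Path> paths = Files.walk(workspace)) {
            // Every regular *.json file below the workspace is treated as a report.
            List<Path> reports = paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.toString().endsWith(".json"))
                    .sorted()
                    .collect(Collectors.toList());
            for (Path report : reports) {
                String json = new String(Files.readAllBytes(report), StandardCharsets.UTF_8);
                rows.add(toCsvRow(workspace.relativize(report).toString(), json));
            }
        }
        rows.forEach(System.out::println);
    }
}
```

A real implementation would use a JSON parser and quote CSV fields properly. The regex is only meant to keep the sketch free of external dependencies.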

Task 'evalGT'

Evaluation of the same workflow on different manuscripts. It collects all JSON files created by 'dinglehopper' and compiles a CSV file with statistics for all manuscripts. This CSV file can be analyzed with pivot tables.

Task 'listProv'

Analyzes the provenance file created by the Taverna workflow. It prints all processors with their input file group and the output file groups they created. If no output file groups were created, check the stdout and stderr files for further information.

Template for workflow configuration file

You'll find a template for a workflow configuration file with working OCR-D processors at 'src/main/resources/workflow_configuration4permutation.txt'.

To test a processor, remove the leading '#' from its line. For each step, at least one processor has to be 'active' (no leading '#'). If a workflow step should be skipped, activate the dummy processor ('ocrd_dummy') for that step.
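The 'active' rule can be illustrated with a small sketch that filters out commented lines. It only demonstrates the leading-'#' convention described above; the class name `ActiveLinesSketch` and the example lines are hypothetical, and the real configuration file may contain additional structure that this sketch does not interpret.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ActiveLinesSketch {

    /** Returns the non-empty lines that do not start with '#', i.e. the 'active' ones. */
    static List<String> activeLines(List<String> lines) {
        List<String> active = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !trimmed.startsWith("#")) {
                active.add(trimmed);
            }
        }
        return active;
    }

    public static void main(String[] args) {
        // Made-up lines; the real template uses its own line format.
        List<String> lines = Arrays.asList(
                "#processor-a",
                "processor-b",
                "",
                "#processor-c");
        System.out.println(activeLines(lines)); // only processor-b is active
    }
}
```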

NOTE: Before starting, replace all occurrences of the string '${TAVERNA_INSTALL_DIR}' with the installation directory of Taverna (e.g. /home/user/ocrd/taverna).
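The replacement can be done in any text editor. As one possible way to script it, the following sketch reads a file, substitutes the placeholder literally, and writes the result to a new file. The class name `PlaceholderSketch` and the three command line arguments (input file, output file, installation directory) are assumptions made for this example; they are not part of the tool.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PlaceholderSketch {

    /** Replaces every literal occurrence of ${TAVERNA_INSTALL_DIR} with the given directory. */
    static String resolve(String content, String installDir) {
        // String.replace works on literal text, so '$' and braces need no escaping.
        return content.replace("${TAVERNA_INSTALL_DIR}", installDir);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("Usage: PlaceholderSketch <input> <output> <taverna_install_dir>");
            return;
        }
        String content = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        Files.write(Paths.get(args[1]), resolve(content, args[2]).getBytes(StandardCharsets.UTF_8));
    }
}
```

Writing to a separate output file keeps the original template untouched, so it can be reused for another installation.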

OCR-D/workflow_generator