This tool can be used to evaluate OCR-D processors with the Taverna workflow.
This Java library contains utility classes to extract all metadata of OCR-D workspaces.
In order to build this library you'll need:
- Java SE Development Kit 8 or higher
- ocrd 2.7.1 or higher
To build the tool, go to the project folder and type:
# Build library
user@localhost:workflow_generator/$./gradlew clean build
BUILD SUCCESSFUL in 11s
10 actionable tasks: 10 executed
As a result, a jar containing all the utility classes is created at 'build/libs/WorkflowTool-0.1.0.jar'.
To run the tool, go to the project folder and type:
# Run tool
user@localhost:workflow_generator/$java -jar ./build/libs/WorkflowTool-0.1.0.jar
Missing parameters!
Usage:
test path/to/workflow_configuration4permuation.txt [workflow_configuration_new.txt]
permutate path/to/workflow_configuration4permuation.txt [workflow_configuration_new.txt [max_no_of_processor_steps_per_file]]
eval path/to/workspace [evaluation_results.csv]
listProv path/to/workspace
Explanation:
Tool for creating workflow configuration file(s) used by taverna workflow.
test - Compiles a workflow_configuration file with all active processors.
permutate - Compiles workflow_configuration file(s) with all active processors.
eval - Evaluates the json files generated by dinglehopper.
listProv - Evaluates the provenance file(s) and prints the input/output file groups of each processor.
user@localhost:workflow_generator/$
The 'test' task may be used to test all active processors/parameters once. This is useful for detecting typos. The created workflow can be executed with the Taverna workflow and analyzed afterwards with the task 'listProv'.
The 'permutate' task creates a permutation of all 'active' processors. It is recommended to restrict the number of processing steps per workflow configuration file ('max_no_of_processor_steps_per_file'). If too many files are listed inside the METS file, processing slows down and may fail due to memory issues. A suitable value for this parameter depends on the hardware and the size of the manuscript. If there are more processing steps than allowed per file, the tool creates multiple files with an index at the end.
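The idea behind the permutation can be sketched in a few lines of Java. This is only an illustration, not the tool's actual implementation: the class `PermutationSketch`, its methods and the processor names in `main` are hypothetical. It assumes that exactly one active processor is picked per workflow step and that the limit counts the total number of processor steps written to one file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not part of WorkflowTool.
class PermutationSketch {

    /** Builds every workflow that picks exactly one processor per step. */
    public static List<List<String>> combine(List<List<String>> steps) {
        List<List<String>> result = new ArrayList<>();
        result.add(new ArrayList<>());
        for (List<String> step : steps) {
            List<List<String>> next = new ArrayList<>();
            for (List<String> prefix : result) {
                for (String processor : step) {
                    List<String> extended = new ArrayList<>(prefix);
                    extended.add(processor);
                    next.add(extended);
                }
            }
            result = next;
        }
        return result;
    }

    /**
     * Distributes workflows over several "files" (assumption: a file may hold
     * at most maxSteps processor steps in total; a single workflow is never split).
     */
    public static List<List<List<String>>> split(List<List<String>> workflows, int maxSteps) {
        List<List<List<String>>> files = new ArrayList<>();
        List<List<String>> current = new ArrayList<>();
        int stepsInCurrent = 0;
        for (List<String> workflow : workflows) {
            if (!current.isEmpty() && stepsInCurrent + workflow.size() > maxSteps) {
                files.add(current);
                current = new ArrayList<>();
                stepsInCurrent = 0;
            }
            current.add(workflow);
            stepsInCurrent += workflow.size();
        }
        if (!current.isEmpty()) {
            files.add(current);
        }
        return files;
    }

    public static void main(String[] args) {
        // Illustrative processor names; two steps offer two alternatives each.
        List<List<String>> steps = Arrays.asList(
                Arrays.asList("binarize_a", "binarize_b"),
                Arrays.asList("segment_a"),
                Arrays.asList("recognize_a", "recognize_b"));
        List<List<List<String>>> files = split(combine(steps), 6);
        for (int i = 0; i < files.size(); i++) {
            System.out.println("file " + (i + 1) + ": " + files.get(i));
        }
    }
}
```

The number of workflows grows with the product of the alternatives per step, which is why a limit per file quickly becomes relevant.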
The 'eval' task can evaluate different workflows. It collects all JSON files created by 'dinglehopper' and the provenance file(s) to compile a CSV file with all relevant values. This CSV file can be analyzed with pivot tables.
The 'eval' task can also evaluate the same workflow on different manuscripts. It collects all JSON files created by 'dinglehopper' and compiles a CSV file with statistics of all manuscripts. This CSV file can be analyzed with pivot tables.
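The collection step can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the code behind 'eval': it assumes that each dinglehopper report is a '.json' file containing numeric "cer" and "wer" fields, it treats every '.json' file below the workspace as a report, and it ignores the provenance file(s). The class `EvalCsvSketch` and the column names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch, not part of WorkflowTool.
class EvalCsvSketch {

    // Assumption: the reports contain numeric "cer" and "wer" fields.
    private static final Pattern CER = Pattern.compile("\"cer\"\\s*:\\s*([-+0-9.eE]+)");
    private static final Pattern WER = Pattern.compile("\"wer\"\\s*:\\s*([-+0-9.eE]+)");

    /** Returns the first value matched by the pattern, or an empty string. */
    public static String field(Pattern pattern, String json) {
        Matcher matcher = pattern.matcher(json);
        return matcher.find() ? matcher.group(1) : "";
    }

    /** Collects all .json files below the workspace and returns CSV rows. */
    public static List<String> toCsv(Path workspace) throws IOException {
        List<Path> reports;
        try (Stream<Path> paths = Files.walk(workspace)) {
            reports = paths.filter(p -> p.toString().endsWith(".json"))
                    .sorted()
                    .collect(Collectors.toList());
        }
        List<String> rows = new ArrayList<>();
        rows.add("file,cer,wer");
        for (Path report : reports) {
            String json = new String(Files.readAllBytes(report), StandardCharsets.UTF_8);
            // File names containing commas are not escaped in this sketch.
            rows.add(workspace.relativize(report) + "," + field(CER, json) + "," + field(WER, json));
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: EvalCsvSketch path/to/workspace");
            return;
        }
        for (String row : toCsv(java.nio.file.Paths.get(args[0]))) {
            System.out.println(row);
        }
    }
}
```

A real implementation would use a JSON parser instead of regular expressions; the regular expressions only keep the sketch free of external dependencies.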
The 'listProv' task analyzes the provenance file created by the Taverna workflow. It prints all processors with their input file group and the output file groups they created. If no output file groups were created, the stdout and stderr files should be checked for further information.
You'll find a template for a workflow configuration file with working OCR-D processors at 'src/main/resources/workflow_configuration4permutation.txt'.
To test a processor, the leading '#' has to be removed. For each step, at least one processor has to be 'active' (no leading '#'). If a workflow step should be skipped, activate the dummy processor ('ocrd_dummy').
NOTE: Before starting, replace all occurrences of the string '${TAVERNA_INSTALL_DIR}' with the installation directory of Taverna (e.g. /home/user/ocrd/taverna).
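The two preparation steps for the template (replacing the placeholder and checking which lines are active) could be scripted along these lines. The sketch is an illustration only: the class `TemplateSketch` is hypothetical, and the sample lines in `main` are invented, since the real template may use a different line format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not part of WorkflowTool.
class TemplateSketch {

    static final String PLACEHOLDER = "${TAVERNA_INSTALL_DIR}";

    /** Replaces the placeholder literally; String.replace does not use regular expressions. */
    public static List<String> fillIn(List<String> lines, String tavernaDir) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            result.add(line.replace(PLACEHOLDER, tavernaDir));
        }
        return result;
    }

    /** Returns the 'active' lines, i.e. non-empty lines without a leading '#'. */
    public static List<String> activeLines(List<String> lines) {
        List<String> active = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !trimmed.startsWith("#")) {
                active.add(trimmed);
            }
        }
        return active;
    }

    public static void main(String[] args) {
        // Invented sample lines; the real template may look different.
        List<String> template = Arrays.asList(
                "#processor_a ${TAVERNA_INSTALL_DIR}/param/a.json",
                "processor_b ${TAVERNA_INSTALL_DIR}/param/b.json",
                "#ocrd_dummy");
        List<String> prepared = fillIn(template, "/home/user/ocrd/taverna");
        for (String line : activeLines(prepared)) {
            System.out.println(line);
        }
    }
}
```

Listing the active lines before running 'test' is a quick way to confirm that every step has at least one processor without a leading '#'.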
See scripts and helper files in 'tools' directory.
The library is licensed under the Apache License, Version 2.0.