ASVLeipzig/cor-asv-ann

ann-mark: normalization parameter does not work

#!/bin/bash
set -x
sudo rm -rf OCR-D-* mets.xml
set -e
docker-ocrd ocrd-import -R '\.png$' .
docker-ocrd ocrd-tesserocr-recognize -I OCR-D-IMG -O OCR-D-OCR -P segmentation_level region -P textequiv_level word -P model deu
docker-ocrd ocrd-cor-asv-ann-mark -I OCR-D-OCR -O OCR-D-COR -P command 'hunspell -i utf-8 -d de_DE -w' -P format NOTFOUND -P normalization "{'ſ': 's', 'a†0364‡': 'ä'}"

→
Exception: Invalid parameters ['[normalization] "{\'ſ\': \'s\', \'aͤ\': \'ä\'}" is not of type \'object\'']

PS: Hex numbers between † and ‡ are Unicode code points (here U+0364, COMBINING LATIN SMALL LETTER E).

The problem is that you must pass all parameters to OCR-D processors as JSON. For simple scalar literals, there is on-the-fly conversion, but mappings (in JSON parlance: "objects") are passed as is, so they must already be valid JSON.

JSON does not allow single quotes, so the help string of the normalization parameter is misleading. It should read:

{"ſ": "s", "aͤ": "ä"}
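To check a value before handing it to a processor, something like the following could be used. This is only a sketch: `is_json_object` is a hypothetical helper (not part of OCR-D), and it assumes `python3` is on the PATH so that its standard `json` module can do the parsing.

```shell
#!/bin/bash
# Hypothetical helper (not part of OCR-D): succeeds only if $1 parses as a JSON object.
# Assumption: python3 is available on PATH.
is_json_object() {
  python3 -c 'import json, sys
try:
    value = json.loads(sys.argv[1])
except ValueError:
    sys.exit(1)  # not valid JSON at all
sys.exit(0 if isinstance(value, dict) else 1)' "$1"
}

# Single-quoted keys are rejected, double-quoted keys are accepted.
for candidate in "{'ſ': 's'}" '{"ſ": "s"}'; do
  if is_json_object "$candidate"; then
    echo "valid JSON object: $candidate"
  else
    echo "rejected: $candidate"
  fi
done
```

A check like this only tells you whether the string is a JSON object; whether the processor accepts the keys and values is still decided by its own parameter validation.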

Note that the shell applies its own quote processing before the processor ever sees the argument, so inside double outer quotes you'll have to escape the inner quotes on the CLI:

ocrd-cor-asv-ann-mark -I OCR-D-OCR -O OCR-D-COR -P command 'hunspell -i utf-8 -d de_DE -w' -P format NOTFOUND -P normalization "{\"ſ\": \"s\", \"a$(echo -e \\u0364)\": \"ä\"}"

Alternatively, you can use single outer quotes…

ocrd-cor-asv-ann-mark -I OCR-D-OCR -O OCR-D-COR -P command 'hunspell -i utf-8 -d de_DE -w' -P format NOTFOUND -P normalization '{"ſ": "s", "aͤ": "ä"}'
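To see what the processor actually receives, you can print the argument after the shell has processed it. A minimal sketch in plain bash follows; the variable names are made up for illustration, and U+0364 is written as raw UTF-8 bytes so that the result does not depend on the locale.

```shell
#!/bin/bash
# U+0364 (COMBINING LATIN SMALL LETTER E) as its two UTF-8 bytes, 0xCD 0xA4
e_above=$(printf '\315\244')

# Double outer quotes: every inner double quote must be backslash-escaped
escaped="{\"ſ\": \"s\", \"a${e_above}\": \"ä\"}"

# Single outer quotes: the shell leaves everything inside untouched
literal='{"ſ": "s", "aͤ": "ä"}'

# Unescaped inner quotes merely end and restart the quoted string,
# so the shell removes them and the result is no longer JSON
broken="{"ſ": "s"}"

printf '%s\n' "$escaped" "$literal" "$broken"
```

The first two variables hold the same string with the double quotes intact, whereas the third has lost all of its quotes.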

…or just use a JSON parameter file…

ocrd-cor-asv-ann-mark -I OCR-D-OCR -O OCR-D-COR -p mark-hunspell.json

…with mark-hunspell.json containing:

{
  "command": "hunspell -i utf-8 -d de_DE -w",
  "format": "NOTFOUND",
  "normalization": {
    "ſ": "s", "aͤ": "ä"
  }
}
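If the parameter file should be created from within a script such as the one in the report, a here-document with a quoted delimiter avoids the shell's quote handling entirely. This is just one possible way to do it, sketched here with the same file name and contents as above:

```shell
#!/bin/bash
# Quoting the delimiter ('EOF') disables all shell expansion inside the here-document,
# so the JSON is written to the file exactly as typed.
cat > mark-hunspell.json <<'EOF'
{
  "command": "hunspell -i utf-8 -d de_DE -w",
  "format": "NOTFOUND",
  "normalization": {
    "ſ": "s", "aͤ": "ä"
  }
}
EOF
```

The file can then be passed with -p as shown above.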

The description of the parameter now uses double quotes.