ann-mark: normalization parameter does not work
Closed this issue · 2 comments
jbarth-ubhd commented
#!/bin/bash
set -x
sudo rm -rf OCR-D-* mets.xml
set -e
docker-ocrd ocrd-import -R '\.png$' .
docker-ocrd ocrd-tesserocr-recognize -I OCR-D-IMG -O OCR-D-OCR -P segmentation_level region -P textequiv_level word -P model deu
docker-ocrd ocrd-cor-asv-ann-mark -I OCR-D-OCR -O OCR-D-COR -P command 'hunspell -i utf-8 -d de_DE -w' -P format NOTFOUND -P normalization "{'ſ': 's', 'a†0364‡': 'ä'}"
→
Exception: Invalid parameters ['[normalization] "{\'ſ\': \'s\', \'aͤ\': \'ä\'}" is not of type \'object\'']
PS: Hex numbers between † and ‡ are Unicode code points.
bertsky commented
The problem is that all parameters must be passed to OCR-D processors as JSON. Simply typed literals are converted on the fly, but mappings (in JSON parlance: "objects") are passed as is.
And JSON does not allow single quotes, so the help string of the normalization parameter is misleading. It should be
{"ſ": "s", "aͤ": "ä"}
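A quick way to see the difference is to run both spellings through a JSON parser before handing them to the processor. The following is only a minimal sketch, assuming `python3` is on the PATH; the helper name `is_json` is made up for illustration and is not part of OCR-D.

```shell
#!/bin/sh
# is_json (hypothetical helper): succeeds only if its argument parses as JSON.
# Assumes python3 is installed; its standard json module does the parsing.
is_json() {
    printf '%s' "$1" |
        python3 -c 'import json, sys; json.load(sys.stdin.buffer)' 2>/dev/null
}

# Single-quoted keys and values are rejected by the parser ...
if is_json "{'ſ': 's'}"; then
    echo "single quotes: valid JSON"
else
    echo "single quotes: not valid JSON"
fi

# ... whereas the double-quoted spelling is accepted.
if is_json '{"ſ": "s"}'; then
    echo "double quotes: valid JSON"
else
    echo "double quotes: not valid JSON"
fi
```

This only checks JSON syntax. Whether the value also satisfies the processor's parameter schema (here: type `object`) is still decided by the processor itself.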
Note that the shell does its own quote processing first, so on the CLI you'll have to escape the inner double quotes:
ocrd-cor-asv-ann-mark -I OCR-D-OCR -O OCR-D-COR -P command 'hunspell -i utf-8 -d de_DE -w' -P format NOTFOUND -P normalization "{\"ſ\": \"s\", \"a$(echo -e \\u0364)\": \"ä\"}"
Alternatively, you can use single outer quotes…
ocrd-cor-asv-ann-mark -I OCR-D-OCR -O OCR-D-COR -P command 'hunspell -i utf-8 -d de_DE -w' -P format NOTFOUND -P normalization '{"ſ": "s", "aͤ": "ä"}'
…or just use a JSON parameter file…
ocrd-cor-asv-ann-mark -I OCR-D-OCR -O OCR-D-COR -p mark-hunspell.json
with mark-hunspell.json containing:
{
  "command": "hunspell -i utf-8 -d de_DE -w",
  "format": "NOTFOUND",
  "normalization": {
    "ſ": "s", "aͤ": "ä"
  }
}
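To check what the processor will actually receive, it can help to compare the quoting variants in the shell itself. The sketch below assumes a POSIX shell. The variable names are illustrative only, and the combining character U+0364 is built from its UTF-8 bytes (octal 315 244) so that it survives copy and paste. The last step writes the parameter file through a quoted here-document, which the shell copies verbatim.

```shell
#!/bin/sh
# U+0364 (combining Latin small letter e), built from its UTF-8 bytes CD A4.
comb=$(printf '\315\244')

# Variant 1: outer double quotes, inner double quotes escaped.
escaped="{\"ſ\": \"s\", \"a${comb}\": \"ä\"}"

# Variant 2: outer single quotes, so the inner double quotes need no escaping.
# The single-quoted string is interrupted only to splice in $comb.
single='{"ſ": "s", "a'"$comb"'": "ä"}'

# Both variants should produce exactly the same argument text.
[ "$escaped" = "$single" ] && echo "both variants yield: $single"

# Variant 3: a parameter file. Quoting the delimiter ('EOF') disables all
# shell expansion inside the here-document.
cat > mark-hunspell.json <<'EOF'
{
  "command": "hunspell -i utf-8 -d de_DE -w",
  "format": "NOTFOUND",
  "normalization": {"ſ": "s", "aͤ": "ä"}
}
EOF
```

The resulting file can then be passed with `-p mark-hunspell.json`, which sidesteps shell quoting altogether.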
bertsky commented
The description of the parameter now uses double quotes.