Ignore BOM
Closed this issue · 2 comments
mikegerber commented
If I create an empty file gt.txt (0 bytes) and a file ocr.txt that only contains a BOM (3 bytes), dinglehopper computes a CER of Infinity. It should ignore the BOM.
❯ ./reproduce
+ echo -ne ''
+ echo -ne '\xEF\xBB\xBF'
+ ls -l gt.txt ocr-just-bom.txt
-rw-r--r-- 1 b-mg106 b-mg106 0 Apr 20 19:58 gt.txt
-rw-r--r-- 1 b-mg106 b-mg106 3 Apr 20 19:58 ocr-just-bom.txt
+ dinglehopper gt.txt ocr-just-bom.txt
+ grep cer report.json
"cer": Infinity,
See also #79.
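A minimal sketch of why the result is plausibly Infinity, assuming CER is computed as edit distance divided by the length of the ground truth. The functions `levenshtein` and `illustrative_cer` are hypothetical helpers written for this illustration, not dinglehopper's actual code, which may normalize text or count grapheme clusters differently.

```python
import math

# The three bytes written by the reproducer: the UTF-8 encoded BOM.
BOM_BYTES = b"\xef\xbb\xbf"


def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance over code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def illustrative_cer(gt: str, ocr: str) -> float:
    """Assumed definition: edit distance / len(gt); not dinglehopper's API."""
    distance = levenshtein(gt, ocr)
    if len(gt) == 0:
        # Any error against an empty ground truth has no finite rate.
        return math.inf if distance > 0 else 0.0
    return distance / len(gt)


# Decoding with plain "utf-8" keeps the BOM as one character, U+FEFF.
ocr_text = BOM_BYTES.decode("utf-8")
gt_text = b"".decode("utf-8")
cer = illustrative_cer(gt_text, ocr_text)
```

Under these assumptions the BOM counts as one inserted character against zero ground truth characters, which would explain the Infinity in report.json.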
mikegerber commented
Tested with Python 3.11.3 on Debian under Windows WSL.
mikegerber commented
Reproducer:
❯ cat reproduce
#!/bin/bash
# Must be run in bash (not sh) for "echo -ne" to work.
set -x
echo -ne "This is a test." > gt.txt
echo -ne "\xEF\xBB\xBFThis is a test." > ocr-just-bom.txt
ls -l *.txt
dinglehopper gt.txt ocr-just-bom.txt
grep cer report.json # should be 0
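One possible way to get the expected result, sketched under the assumption that the input files are UTF-8 text: Python's standard `utf-8-sig` codec decodes UTF-8 and drops a leading BOM if one is present. The names `read_text_ignoring_bom`, `strip_leading_bom` and `demo_reproducer` are hypothetical and only illustrate the idea; this is not how dinglehopper necessarily reads its inputs.

```python
import tempfile
from pathlib import Path


def read_text_ignoring_bom(path) -> str:
    """Read a UTF-8 file; the "utf-8-sig" codec skips a leading BOM."""
    return Path(path).read_text(encoding="utf-8-sig")


def strip_leading_bom(text: str) -> str:
    """For text that is already decoded: drop one leading U+FEFF."""
    return text[1:] if text.startswith("\ufeff") else text


def demo_reproducer():
    """Recreate the two files from the reproducer and read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        gt_path = Path(tmp) / "gt.txt"
        ocr_path = Path(tmp) / "ocr-just-bom.txt"
        gt_path.write_bytes(b"This is a test.")
        ocr_path.write_bytes(b"\xef\xbb\xbfThis is a test.")
        return read_text_ignoring_bom(gt_path), read_text_ignoring_bom(ocr_path)
```

With both files read this way, the two strings compare equal, so an edit-distance based CER over them would be 0.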