qurator-spk/dinglehopper

Ignore BOM

Closed this issue · 2 comments

If I create an empty file gt.txt (0 bytes) and a file ocr-just-bom.txt that contains only a UTF-8 BOM (3 bytes), dinglehopper computes a CER of Infinity. It should ignore the BOM.

❯ ./reproduce
+ echo -ne ''
+ echo -ne '\xEF\xBB\xBF'
+ ls -l gt.txt ocr-just-bom.txt
-rw-r--r-- 1 b-mg106 b-mg106 0 Apr 20 19:58 gt.txt
-rw-r--r-- 1 b-mg106 b-mg106 3 Apr 20 19:58 ocr-just-bom.txt
+ dinglehopper gt.txt ocr-just-bom.txt
+ grep cer report.json
    "cer": Infinity,
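
A likely explanation for the Infinity, sketched in Python. This is not dinglehopper's code: `levenshtein` and `sketch_cer` are hypothetical helpers, and the assumption is that the CER is the edit distance divided by the length of the ground truth. With plain `utf-8` decoding the BOM survives as the single character U+FEFF, so there is one edit against a reference of length zero.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def sketch_cer(gt: str, ocr: str) -> float:
    """Hypothetical CER: edit distance divided by reference length."""
    d = levenshtein(gt, ocr)
    if d == 0:
        return 0.0
    if len(gt) == 0:
        # Errors against an empty reference: the ratio is unbounded.
        return float("inf")
    return d / len(gt)


gt = b"".decode("utf-8")               # contents of gt.txt
ocr = b"\xef\xbb\xbf".decode("utf-8")  # contents of ocr-just-bom.txt

print(len(ocr))             # 1, the BOM decodes to U+FEFF
print(sketch_cer(gt, ocr))  # inf
```

If the BOM were dropped before comparison, both strings would be empty and the sketch would return 0.0.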

See also #79.

Tested on Python 3.11.3, Debian under WSL on Windows.

Reproducer (a variant with non-empty text, where the CER should be 0):

❯ cat reproduce
#!/bin/bash

# Must be run with bash, not sh, for "echo -ne" to work.

set -x
echo -ne "This is a test." > gt.txt
echo -ne "\xEF\xBB\xBFThis is a test." > ocr-just-bom.txt
ls -l *.txt
dinglehopper gt.txt ocr-just-bom.txt
grep cer report.json   # should be 0
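
One way to get the expected result above is to decode the input with Python's `utf-8-sig` codec, which drops a leading UTF-8 BOM if present and otherwise behaves like `utf-8`. The following is only a sketch: `read_text_ignoring_bom` is a hypothetical helper, and the issue does not say how dinglehopper reads its input files or how the fix was implemented.

```python
import tempfile
from pathlib import Path


def read_text_ignoring_bom(path: Path) -> str:
    """Hypothetical helper: decode as UTF-8, dropping a leading BOM if present."""
    return path.read_text(encoding="utf-8-sig")


with tempfile.TemporaryDirectory() as tmp:
    # Same file contents as in the bash reproducer.
    gt = Path(tmp) / "gt.txt"
    ocr = Path(tmp) / "ocr-just-bom.txt"
    gt.write_bytes(b"This is a test.")
    ocr.write_bytes(b"\xef\xbb\xbfThis is a test.")

    # Plain utf-8 keeps the BOM as U+FEFF, so the texts differ.
    plain_equal = ocr.read_text(encoding="utf-8") == gt.read_text(encoding="utf-8")
    # utf-8-sig strips it, so the texts are identical.
    sig_equal = read_text_ignoring_bom(ocr) == read_text_ignoring_bom(gt)

print(plain_equal)  # False
print(sig_equal)    # True
```

Identical texts have an edit distance of 0, which is the CER the reproducer expects.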