qurator-spk/dinglehopper

Dinglehopper seems to act not coherently when it comes to empty files

imlabormitlea-code opened this issue · 15 comments

Hey!
I noticed that dinglehopper seems to produce three different results when both the GT and the OCR are empty:

  1. CER = inf
  2. CER = 0.0
  3. CER = NaN

Could anyone tell me when dinglehopper picks which of these values, and how they could be made consistent?
Kind regards!

That's an interesting question! The CER here is normalized by len(GT) (see also Rice's dissertation for the definition of character accuracy; CER is 1 minus the character accuracy), so it should be

  • CER = 0 if there is no error (which roughly means OCR==GT)
  • CER = inf or NaN if len(GT)==0
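
To make the two bullets concrete, here is a minimal sketch of a CER normalized by len(GT). The names `levenshtein` and `cer_ieee` are made up for illustration and this is not dinglehopper's code. It deliberately follows plain IEEE 754 division semantics (n/0 gives inf, 0/0 gives NaN) to show where all three values from the question can come from.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer_ieee(gt: str, ocr: str) -> float:
    """Hypothetical CER = edit errors / len(GT), with IEEE-style results for len(GT) == 0."""
    errors = levenshtein(gt, ocr)
    if len(gt) == 0:
        # Python raises ZeroDivisionError for floats, so spell out the IEEE results.
        return math.nan if errors == 0 else math.inf
    return errors / len(gt)
```

With this naive definition, an empty GT yields NaN when the OCR is empty too and inf as soon as the OCR contains anything; a tool that wants 0 for the both-empty case has to special-case it.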

I would need to actually check what happens exactly when len(GT)==0 and make sure that's consistent. If you have any such examples - especially ones that show inconsistencies - please upload them here (preferably text files), that would help!

(There have also been discussions about redefining CER in terms of probabilities, but I didn't make time to think about that more.)

Note to self: check JSON report, I remember having problems with NaN not being valid JSON.
Nah, that was not it: The problem was with infinity.

@imlabormitlea-code I checked the code, it should never produce NaN as the CER, and if it does, it may very well be a bug. I'd need the files used to reproduce it and fix it!

Potential problem: reading the non-standard JSON could make a NaN from the "Infinity".
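
For context, a small sketch of how Python's standard `json` module treats infinity. This only shows the standard library's behaviour; which reader (if any) turned "Infinity" into NaN in the reported case is not established in this thread.

```python
import json
import math

report = {"cer": math.inf}  # hypothetical report content

# By default, json.dumps emits the bare token Infinity, which is not valid JSON.
text = json.dumps(report)

# Python's own parser accepts that extension and returns float("inf") again.
roundtrip = json.loads(text)

# With allow_nan=False the writer refuses to produce non-standard output.
try:
    json.dumps(report, allow_nan=False)
    strict_ok = True
except ValueError:
    strict_ok = False
```

Strict JSON parsers in other tools may reject the `Infinity` token or map it to something else, so a report containing it is only safe to read with a parser known to accept it.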

Hei!
Unfortunately I can't make my examples public here, but I clearly have inconsistencies regarding 0/inf when using the same GT document. I could show you the files in a video call if you want. Please write me an e-mail if you are interested.
I'm beginning to figure out what my problem is:
I have files with len(GT) = 0 and want to distinguish whether the OCR software gives empty text or sees artefacts. Therefore I can't have the same measure (inf) for both. It seems that CER = inf with WER = 0 is the first case and both being inf is the second one. Nevertheless, I still have cases where dinglehopper gives 0 for len(GT) = 0.
I want a mean CER/WER value to compare different OCR programs. A single inf among the values makes the mean inf. Leaving out the inf values is problematic, because that excludes the quality of blank-page detection.
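
The averaging problem can be reproduced with a few lines. The page data below is invented, and the pooled ("micro") average is only one possible alternative, not something dinglehopper is known to provide.

```python
import math

# Hypothetical per-page results: (edit errors, GT length).
# The last page has an empty GT, but the OCR produced 3 artefact characters.
pages = [(2, 100), (0, 50), (3, 0)]


def page_cer(errors: int, n: int) -> float:
    """Per-page CER with 0 for 'both empty' and inf for 'empty GT, non-empty OCR'."""
    if n == 0:
        return 0.0 if errors == 0 else math.inf
    return errors / n


# Mean of per-page CERs: one inf page makes the whole mean inf.
per_page = [page_cer(e, n) for e, n in pages]
macro = sum(per_page) / len(per_page)

# Pooled alternative: total errors over total GT characters.
total_errors = sum(e for e, _ in pages)
total_chars = sum(n for _, n in pages)
micro = page_cer(total_errors, total_chars)
```

Pooling keeps the artefact characters on blank pages in the numerator, so they still count against the OCR engine, but the result stays finite as long as at least one page has a non-empty GT.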
Kind regards.

(two screenshots)
Same picture and GT, different OCR software. Both OCR programs produced empty files.

This is really hard to understand/reproduce without any files. I'll test a few more examples I can think of, but especially in these edge cases it's really good to have some real-world examples. The CER and WER also depend on the extracted text, and this extraction could go wrong in various ways, too [1].

If the GT is empty (len(GT)=0) there are two cases possible:

  • OCR also gives empty results (and empty means len(OCR)=0), the CER should be 0. WER would also be 0 here.
  • If there is OCR output, even if it is only whitespace (len(OCR)>0), dinglehopper should produce CER=Inf. WER would be 0 or Inf, depending on whether any "words" can be extracted from the OCR text.
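
The two cases can be written down as a small sketch. `rate` and `empty_gt_scores` are hypothetical helpers, and `str.split()` is only a stand-in for word extraction; dinglehopper's real word segmentation may behave differently.

```python
import math


def rate(errors: int, n: int) -> float:
    """Error rate with the convention described above for n == 0."""
    if n == 0:
        return 0.0 if errors == 0 else math.inf
    return errors / n


def empty_gt_scores(ocr: str) -> tuple[float, float]:
    """(CER, WER) for an empty GT: every OCR character or word is an insertion."""
    cer = rate(len(ocr), 0)          # whitespace still counts as characters
    wer = rate(len(ocr.split()), 0)  # assumption: whitespace-separated tokens are words
    return cer, wer
```

Under these assumptions, `empty_gt_scores("")` gives `(0.0, 0.0)`, whitespace-only output such as `" \n"` gives `(inf, 0.0)`, and any output containing a word gives `(inf, inf)`.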

It may also very well be that dinglehopper is not the right tool for evaluating the case where the GT is empty (i.e. the page is empty).

From your screenshots I suspect that the first OCR output is not empty but contains whitespace. That would explain the CER being Inf and the WER being 0 (because we can't extract any words from whitespace, so GT words == OCR words and thus WER = 0). So as far as I can see, with the information I have, this could be valid output.

Footnotes

  1. While I am relatively confident in the text extraction, I am also sure that there are edge cases I haven't considered or implemented correctly, especially when whitespace is involved, which I suspect is the problem here.

@imlabormitlea-code If it helps: I don't need the images. ALTO/PAGE/text would be enough.

Thanks for the reply. I checked for whitespace; there is none. But while generating an example for this, I realized that there may be just line breaks. I will check that and report back.

Okay well, this is funny...
In my ABBYY output (XML converted to txt with dinglehopper), some process gave me seemingly empty files that contain 'U+FEFF' characters. I didn't see them until now.
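
Invisible characters like this are easy to miss. A small helper along the following lines (the name `describe_chars` is made up) lists every code point in a string, and the checks in the comments show why a plain whitespace test does not catch U+FEFF in Python.

```python
import unicodedata


def describe_chars(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' entry per character, including invisible ones."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


bom_only = "\ufeff"

# U+FEFF is a format character, not whitespace, so these checks do not flag it:
is_space = bom_only.isspace()  # False
stripped = bom_only.strip()    # still contains U+FEFF
```

Calling `describe_chars` on the decoded content of a suspicious file shows whether it is truly empty or contains characters such as ZERO WIDTH NO-BREAK SPACE.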

Ah, a BOM :) I'll have a look!

Ah, so this is a whitespace problem:

❯ ls -l
total 0
-rw-r--r-- 1 b-mg106 b-mg106 0 Apr 20 14:33 gt.txt
-rw-r--r-- 1 b-mg106 b-mg106 3 Apr 20 14:32 ocr1.txt
-rw-r--r-- 1 b-mg106 b-mg106 1 Apr 20 14:33 ocr2.txt

(Files the same as yours above, just renamed).

  • ocr1.txt contains a UTF-8 BOM (0xefbbbf, 3 bytes)
  • ocr2.txt contains a single newline character (0x0a, 1 byte)

Interestingly, I get CER=Infinity and WER=0 for both files 🤪 (Debian 11 on Windows WSL, Python 3.11.3)

  • For ocr1.txt, this is clearly a bug, as the BOM is not a character and should be ignored
  • For ocr2.txt, this is a matter of interpretation: currently, we count LF as a character, which sometimes also makes sense (e.g. you'd want a missing newline to count as an error), and so Infinity is the correct CER under the definition of the CER we currently use.

The WER being 0 makes sense: there are no words in either GT or OCR, so there are no errors and thus the WER is 0.
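
The difference between the two files can be reproduced from their raw bytes. This sketch only uses Python's standard codecs; it illustrates the effect and does not claim to show how dinglehopper reads files or how the BOM handling was fixed.

```python
ocr1 = b"\xef\xbb\xbf"  # UTF-8 BOM, 3 bytes
ocr2 = b"\x0a"          # single newline, 1 byte

# Plain UTF-8 decoding keeps the BOM as the character U+FEFF.
ocr1_plain = ocr1.decode("utf-8")

# The "utf-8-sig" codec drops a leading BOM, leaving an empty string.
ocr1_sig = ocr1.decode("utf-8-sig")

# The newline survives either way and still counts as one character.
ocr2_sig = ocr2.decode("utf-8-sig")

lengths = (len(ocr1_plain), len(ocr1_sig), len(ocr2_sig))
```

With plain UTF-8 decoding, the BOM file has length 1, which is enough to turn the CER for an empty GT into Infinity; with the BOM stripped it has length 0.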

@imlabormitlea-code I can't tell which text file belongs to which of your results. If I had to guess, I'd say I used the same order as in your results, and maybe your platform handled the BOM differently and yielded CER=0? Could you check which CER you get for the 3-byte BOM file and tell me which platform (Windows? Mac? Python version?) you're running it on?

@imlabormitlea-code dinglehopper now handles the BOM properly, so that's one problem down. I have to re-read this issue to look for remaining problems.