gxrxrdx/tesseract-ocr

When creating searchable pdf, file contents are not flushed and file handle is not released

What steps will reproduce the problem?
1. Use tesseract 3.04
2. In file api/tesseractmain.cpp add a sleep before the program exits. For 
example:
  ....
  fprintf(stdout, "DONE\n");
  sleep(60);
  fprintf(stdout, "EXITING\n");

  PERF_COUNT_END
  return 0;                      // Normal exit
}

3. Run tesseract to create a searchable pdf. On a different console, monitor 
the result. For example:
> tail -f result.pdf


What is the expected output? What do you see instead?
After DONE is printed, only some of the contents of the searchable pdf have 
been written to file result.pdf. The expected result is that the whole pdf 
content up to the last EOF is written to the file and the file is properly 
closed. However, this happens only after EXITING is printed, when the program 
finally exits.
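
The effect described above can be reproduced without tesseract. The following is a minimal sketch, assuming a typical stdio implementation that fully buffers regular files; ReadBack, VisibleBytes and WriteAndObserve are hypothetical names used only for illustration, and the file name in the usage line is arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

// Returns the bytes another process (for example `tail -f`) could read right now.
std::string ReadBack(const std::string& path) {
  std::ifstream in(path, std::ios::binary);
  return std::string(std::istreambuf_iterator<char>(in),
                     std::istreambuf_iterator<char>());
}

// How many bytes were visible on disk before and after fclose().
struct VisibleBytes {
  std::size_t before_close;
  std::size_t after_close;
};

// Hypothetical helper: writes a short header through stdio and observes the file.
VisibleBytes WriteAndObserve(const std::string& path) {
  FILE* out = std::fopen(path.c_str(), "wb");
  assert(out != nullptr);
  std::fputs("%PDF-1.5\n", out);  // normally lands in the stdio buffer first
  VisibleBytes seen;
  seen.before_close = ReadBack(path).size();
  std::fclose(out);               // flushes the buffer and releases the handle
  seen.after_close = ReadBack(path).size();
  return seen;
}
```

Calling WriteAndObserve("flush_sketch.tmp") should report fewer visible bytes before the close than after it. Calling fflush(out) instead of fclose(out) would make the bytes visible while keeping the handle open.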

Original issue reported on code.google.com by gpapadop73 on 27 Jul 2015 at 10:15

a) what platform is this, Linux?
b) is this streaming to stdout, e.g. tesseract input.tif - pdf > output.pdf
c) if yes, do you also get it with other formats, e.g. tesseract input.tif - 
hocr > output.hocr

It's quite possible we can "fix" this by closing the stdout stream when
we finish writing. This will have the benefit of making it impossible
for someone to accidentally stream multiple output formats to stdout
and cause silent data corruption.

Not sure where the code is, let me spend a couple minutes checking.

PS. Just curious, how did you even notice this?


Original comment by breidenb...@gmail.com on 28 Jul 2015 at 7:01

Yeah, it's right here.

https://github.com/tesseract-ocr/tesseract/blob/master/api/renderer.cpp#L33

The original idea of not closing stdout after we finish with it was 
introduced by Zdenko back in Dec 23, 2012. I don't know why. Zdenko, 
do you remember what you were thinking about?

https://github.com/tesseract-ocr/tesseract/commit/4812fac33e25f0b384d473b597e93508725ce058

Original comment by breidenb...@gmail.com on 28 Jul 2015 at 7:17

IMO reporter does not use stdout ("On a different console, monitor the 
result")...

Regarding closing stdout: AFAIK if we call fclose(stdout) (especially 
outside of main), the program will no longer be able to write to stdout 
(e.g. warnings, some info) and will crash. So fclose(stdout) is not 
considered a wise action.
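
One way to address both concerns is to flush stdout instead of closing it, and to close only streams the renderer opened itself. This is a minimal sketch of that idea, not the code in renderer.cpp; FinishOutput is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper: finish an output stream without ever closing stdout.
// Returns true when the buffered bytes were handed to the OS without error.
bool FinishOutput(FILE* out) {
  assert(out != nullptr);
  if (out == stdout) {
    return std::fflush(out) == 0;  // stdout stays open for any later output
  }
  return std::fclose(out) == 0;    // fclose flushes, then releases the handle
}
```

With this split, a file opened by the program is complete on disk as soon as FinishOutput returns, while stdout remains usable until the process exits.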

Original comment by zde...@gmail.com on 29 Jul 2015 at 6:35

a) I tried this on Linux. Originally I found this on Windows with tess4j java 
wrapper. But in order to confirm that the problem is not on the wrapper, I 
tried it on Linux.
b) No, I am writing to a file. Here is my command:
tesseract tesseract-3.04.00/testing/eurotext.png result --tessdata-dir 
tesseract-ocr -c tessedit_create_pdf=true
It produces file result.pdf
c) I am only interested in pdf, so I have not tried other formats.

I have an application where the user can work on several images. We want to 
provide the ability to create a searchable pdf from an image. The problem 
becomes obvious because the user creates one pdf and if he tries to open it, it 
fails. The produced pdf can be correctly opened only when the user closes the 
application.

I now see that the problem is that the renderer's destructor is called when the 
main function is about to return. 

Original comment by gpapadop73 on 29 Jul 2015 at 7:09

The problem happened because in the java code there was no call to 
TessDeleteResultRenderer. But the tricky part was that adding this call did not 
solve the problem. The reason was the 
delete[] renderer;
instead of 
delete renderer;
which you fixed in file api/capi.cpp
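
To illustrate why the delete call matters: when the output file is closed only in the destructor, the file stays incomplete until the object is destroyed correctly. An object created with scalar new must be released with scalar delete; using delete[] on it is undefined behavior. The sketch below uses hypothetical names (SketchRenderer, SketchRendererCreate, SketchRendererDelete, Slurp) and is not the tesseract API.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical stand-in for a result renderer that owns its output file.
class SketchRenderer {
 public:
  explicit SketchRenderer(const std::string& path)
      : out_(std::fopen(path.c_str(), "wb")) {}
  // The destructor is the only place the file is flushed and closed.
  ~SketchRenderer() {
    if (out_ != nullptr) std::fclose(out_);
  }
  SketchRenderer(const SketchRenderer&) = delete;
  SketchRenderer& operator=(const SketchRenderer&) = delete;

  void Append(const char* text) {
    if (out_ != nullptr) std::fputs(text, out_);
  }

 private:
  FILE* out_;
};

// Hypothetical C-style entry points, the kind a language wrapper would bind to.
SketchRenderer* SketchRendererCreate(const char* path) {
  assert(path != nullptr);
  return new SketchRenderer(path);  // scalar new
}

void SketchRendererDelete(SketchRenderer* renderer) {
  delete renderer;  // scalar delete matches scalar new and runs the destructor
}

// Reads the whole file back, as a pdf viewer opening it would.
std::string Slurp(const std::string& path) {
  std::ifstream in(path, std::ios::binary);
  return std::string(std::istreambuf_iterator<char>(in),
                     std::istreambuf_iterator<char>());
}
```

A wrapper that calls SketchRendererCreate, appends its output and then calls SketchRendererDelete gets a complete file immediately, without waiting for the process to exit.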

So after getting your source from label 3.04.01dev and fixing the java wrapper, 
it works fine.

Thank you

Original comment by gpapadop73 on 29 Jul 2015 at 2:00

Original comment by zde...@gmail.com on 29 Jul 2015 at 4:01

  • Changed state: Fixed

Regarding comment #4, I sure hope warnings or info go to stderr, not stdout.

But since gpapadop73 is happy, then I am too.

Original comment by breidenb...@gmail.com on 29 Jul 2015 at 9:56