When creating searchable pdf, file contents are not flushed and file handle is not released
Closed this issue · 7 comments
GoogleCodeExporter commented
What steps will reproduce the problem?
1. Use tesseract 3.04
2. In file api/tesseractmain.cpp add a sleep before the program exits. For
example:
....
fprintf(stdout, "DONE\n");
sleep(60);
fprintf(stdout, "EXITING\n");
PERF_COUNT_END
return 0; // Normal exit
}
3. Run tesseract to create a searchable pdf. On a different console, monitor
the result. For example:
> tail -f result.pdf
What is the expected output? What do you see instead?
After DONE is printed, only part of the searchable pdf has been written to
result.pdf. The expected result is that the whole pdf content, up to the final
EOF marker, is written to the file and the file is properly closed. However,
this only happens after EXITING is printed, when the program finally exits.
Original issue reported on code.google.com by gpapadop73
on 27 Jul 2015 at 10:15
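The symptom described above matches ordinary stdio buffering: data written to a FILE* can sit in the stream's buffer until the stream is flushed or closed. The following is a minimal sketch of that effect, not Tesseract code. The names VisibleSize, Observation and WriteAndObserve are hypothetical, and the explicit setvbuf call is an assumption added only to make the buffering deterministic.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Size of the file as seen by a second, independent reader
// (roughly what `tail -f` on another console would observe).
long VisibleSize(const std::string& path) {
  FILE* in = std::fopen(path.c_str(), "rb");
  if (in == nullptr) return -1;
  std::fseek(in, 0, SEEK_END);
  long size = std::ftell(in);
  std::fclose(in);
  return size;
}

struct Observation {
  long before_close;  // bytes visible while the writer is still open
  long after_close;   // bytes visible after fclose()
};

// Writes a 9-byte header and measures the file before and after closing.
Observation WriteAndObserve(const std::string& path) {
  assert(!path.empty());
  FILE* out = std::fopen(path.c_str(), "wb");
  if (out == nullptr) return {-1, -1};

  // Assumption for the demo: force full buffering with a known buffer size.
  char buffer[4096];
  std::setvbuf(out, buffer, _IOFBF, sizeof(buffer));

  std::fputs("%PDF-1.5\n", out);  // small write, stays in the buffer
  long before = VisibleSize(path);

  std::fclose(out);               // flushes the buffer and releases the handle
  long after = VisibleSize(path);
  return {before, after};
}
```

Calling WriteAndObserve("flush_demo.tmp") should report nothing visible before the close and the full 9 bytes afterwards, which is the same pattern as a pdf that only becomes complete once the writer is closed.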
GoogleCodeExporter commented
a) what platform is this, Linux?
b) is this streaming to stdout, e.g. tesseract input.tif - pdf > output.pdf
c) if yes, do you also get it with other formats, e.g.
tesseract input.tif - hocr > output.hocr
It's quite possible we can "fix" this by closing the stdout stream when
we finish writing. This will have the benefit of making it impossible
for someone to accidentally stream multiple output formats to stdout
and cause silent data corruption.
Not sure where the code is, let me spend a couple minutes checking.
PS. Just curious, how did you even notice this?
Original comment by breidenb...@gmail.com
on 28 Jul 2015 at 7:01
GoogleCodeExporter commented
Yeah, it's right here.
https://github.com/tesseract-ocr/tesseract/blob/master/api/renderer.cpp#L33
The original idea of not closing stdout after we finish with it was
introduced by Zdenko back in Dec 23, 2012. I don't know why. Zdenko,
do you remember what you were thinking about?
https://github.com/tesseract-ocr/tesseract/commit/4812fac33e25f0b384d473b597e93508725ce058
Original comment by breidenb...@gmail.com
on 28 Jul 2015 at 7:17
GoogleCodeExporter commented
IMO reporter does not use stdout ("On a different console, monitor the
result")...
Regarding closing stdout - AFAIK if we perform fclose(stdout) (especially
outside of main), the program will no longer be able to write to stdout
(e.g. warnings, some info) and it will crash. So fclose(stdout) is not
considered a wise action.
Original comment by zde...@gmail.com
on 29 Jul 2015 at 6:35
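One way to reconcile the two positions above is to close regular files but only flush stdout, so later writes to stdout keep working. This is a hedged sketch of that idea, not the code in api/renderer.cpp. FinishOutput is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper: finish writing to `stream`.
// Regular files are closed (flush + release handle); stdout is only
// flushed so the rest of the program can still print to it.
// Returns true on success.
bool FinishOutput(FILE* stream) {
  if (stream == nullptr) return false;
  if (stream == stdout) {
    return std::fflush(stream) == 0;  // keep stdout open
  }
  return std::fclose(stream) == 0;
}
```

With this split, a file-backed output is complete as soon as FinishOutput returns, while a stdout-backed output is pushed out without taking stdout away from the rest of the process.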
GoogleCodeExporter commented
a) I tried this on Linux. I originally found it on Windows with the tess4j Java
wrapper, but to confirm that the problem is not in the wrapper, I tried it on
Linux.
b) No, I am writing to a file. Here is my command:
tesseract tesseract-3.04.00/testing/eurotext.png result --tessdata-dir tesseract-ocr -c tessedit_create_pdf=true
It produces the file result.pdf.
c) I am only interested in pdf, so I have not tried other formats.
I have an application where the user can work on several images. We want to
provide the ability to create a searchable pdf from an image. The problem
becomes obvious because the user creates one pdf and if he tries to open it, it
fails. The produced pdf can be correctly opened only when the user closes the
application.
I now see that the problem is that the renderer's destructor is only called
when the main function is about to return.
Original comment by gpapadop73
on 29 Jul 2015 at 7:09
GoogleCodeExporter commented
The problem happened because in the java code there was no call to
TessDeleteResultRenderer. But the tricky part was that adding this call did not
solve the problem. The reason was the
delete[] renderer;
instead of
delete renderer;
which you fixed in file api/capi.cpp
So after getting your source from label 3.04.01dev and fixing the java wrapper,
it works fine.
Thank you
Original comment by gpapadop73
on 29 Jul 2015 at 2:00
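The delete[] versus delete point can be illustrated with a small sketch. An object allocated with a single `new` must be released with `delete`; using `delete[]` on it is undefined behavior, so the destructor that closes the file may never run correctly. The code below is not the content of api/capi.cpp. DemoRenderer, DemoRendererCreate and DemoRendererDelete are hypothetical names standing in for a renderer and its C-style wrapper functions.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical renderer that owns its output file.
class DemoRenderer {
 public:
  explicit DemoRenderer(const char* path) : out_(std::fopen(path, "wb")) {}
  // The destructor is what flushes the output and releases the handle.
  ~DemoRenderer() {
    if (out_ != nullptr) std::fclose(out_);
  }
  DemoRenderer(const DemoRenderer&) = delete;
  DemoRenderer& operator=(const DemoRenderer&) = delete;

  bool AddText(const char* text) {
    return out_ != nullptr && std::fputs(text, out_) >= 0;
  }

 private:
  FILE* out_;
};

// Hypothetical wrapper functions, as a language binding might call them.
DemoRenderer* DemoRendererCreate(const char* path) {
  return new DemoRenderer(path);  // single object, allocated with `new`
}

void DemoRendererDelete(DemoRenderer* renderer) {
  delete renderer;  // must be `delete`, not `delete[]`
}
```

A caller that pairs every DemoRendererCreate with a DemoRendererDelete gets a complete file on disk as soon as the delete returns, without waiting for the process to exit.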
GoogleCodeExporter commented
Original comment by zde...@gmail.com
on 29 Jul 2015 at 4:01
- Changed state: Fixed
GoogleCodeExporter commented
Regarding comment #4, I sure hope warnings or info go to stderr, not stdout.
But since gpapadop73 is happy, then I am too.
Original comment by breidenb...@gmail.com
on 29 Jul 2015 at 9:56