
PDFTextStripper - parsing incorrectness

fungc opened this issue · 5 comments

fungc commented


I am using PDFTextStripper, from the PDFBox library, to extract the text from a PDF generated from HTML using openhtmltopdf.

Code for parsing:
final PDDocument document = PDDocument.load(pdfBytes);
final PDFTextStripper pdfTextStripper = new PDFTextStripper();
return pdfTextStripper.getText(document);

However, I am seeing a few problems:

  1. Invisible, redundant text
    Sometimes the PDF will have invisible text in front of the actual text.


line2 (<--- invisible)

This happens even when you just open the PDF and select / copy the text.

  2. Commas are placed in the wrong position when parsed
    Commas show up correctly in the PDF, but in the parsed text they appear in the wrong position.
    Rendered: hello, my name, is
    Parsed: ,,hello my name is

NOTE: this does not happen when you open the PDF and select / copy the text.

  3. Interestingly, the comma problem goes away when I parse like this:
    final PDDocument document = PDDocument.load(pdfBytes);
    final PDFTextStripper pdfTextStripper = new PDFTextStripper();
    return pdfTextStripper.getText(document);

However, all superscripts / subscripts then get messed up in the output,
e.g. receptiońs becomes receptións.
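
To make the example above easier to inspect, here is a minimal sketch using only the JDK (no PDFBox calls). It decomposes both strings with `java.text.Normalizer` and reports where the combining accent sits. The class and method names (`AccentPosition`, `combiningMarkIndex`) are made up for illustration, and this only shows what changed in the output, not why the stripper produces it.

```java
import java.text.Normalizer;

class AccentPosition {
    // Decomposes precomposed letters (e.g. U+0144) into base letter + combining mark.
    public static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Index of the first combining mark in the decomposed string, or -1 if there is none.
    public static int combiningMarkIndex(String s) {
        String d = decompose(s);
        for (int i = 0; i < d.length(); i++) {
            if (Character.getType(d.charAt(i)) == Character.NON_SPACING_MARK) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String rendered = "receptio\u0144s";  // n with acute, as it appears in the PDF
        String extracted = "recepti\u00F3ns"; // o with acute, as it appears in the parsed text
        System.out.println("rendered:  accent at index " + combiningMarkIndex(rendered));
        System.out.println("extracted: accent at index " + combiningMarkIndex(extracted));
    }
}
```

In decomposed form both strings contain the same characters, but the accent sits one position earlier in the extracted text, so it ends up attached to the preceding letter.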

Do you know why these happen?

Thank you!

Number 1 may be a serious bug in this library, so I'd love to get the HTML to reproduce it.

Number 2 and 3, I'm not sure. Does this happen with other PDFs or just ones produced by this library?
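
For comparing output across different PDFs, a small check along these lines might help flag the comma problem automatically. This is only a sketch under stated assumptions: it uses plain JDK string handling, `CommaDriftCheck` and its methods are hypothetical names, and the expected text has to be supplied by hand.

```java
class CommaDriftCheck {
    // Removes commas and collapses whitespace so that only the remaining text is compared.
    public static String withoutCommas(String s) {
        return s.replace(",", "").trim().replaceAll("\\s+", " ");
    }

    // True when the strings differ, but become equal once commas are ignored.
    public static boolean differsOnlyInCommas(String expected, String extracted) {
        return !expected.equals(extracted)
                && withoutCommas(expected).equals(withoutCommas(extracted));
    }

    public static void main(String[] args) {
        String expected = "hello, my name, is";   // text as rendered in the PDF
        String extracted = ",,hello my name is";  // text as returned by the stripper
        System.out.println(differsOnlyInCommas(expected, extracted));
    }
}
```

Running the same check on text extracted from a PDF produced by another tool would show whether the misplacement is specific to one producer.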

fungc commented


I can't get you the HTML at the moment, but here is an output PDF.
I think (1) has to do with paging; it always happens at the end or at the beginning of a page.

(2) and (3) do not happen with other PDFs; I was testing with Apache FOP.

Do you have an email where we can chat?

fungc commented


Found another bug. For extra-long strings, the end of the string becomes invisible but can still be copied.

@fungc could you please provide the HTML code for these issues?

@fungc, I know it has been a while, but I was able to reproduce this, though only with ordered list items. Was that your experience?

Anyway, I will try to debug.