ffalt/pdf.js-extract

Mismatch between number of divs in PDF.js and version 0.2.0

akash-agr opened this issue · 3 comments

Hi,

Thanks a lot for creating this library. I have been using it for more than a year now. I was working with version 0.1.5, and everything was working fine: the number of divs matched the number created by PDF.js in the web browser.

But after upgrading to version 0.2.0, I noticed that the number of divs generated is significantly higher than with 0.1.5, so it no longer matches the divs generated by PDF.js in the web browser. Please check whether this is a bug or whether I am missing something.

ffalt commented

Please provide a demo PDF so I can check whether there is something I can do, or whether this is just new behaviour in the more recent pdf.js version. This library is currently using pdf.js 2.14.110. Which version of pdf.js is your browser using?

Hi, thanks for replying.

PDF host - https://aqua-kelley-58.tiiny.site/
I am using "pdfjs-dist": "2.0.338" in the React application.

I can't understand why two separate versions output a different number of divs. Please have a look. Thank you.

Also please don't deprecate the older version.

PS. I have downgraded pdf.js-extract to 0.1.5, and it is working fine. I just wanted to highlight the issue.
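One way to narrow down a count mismatch like the one described above is to compare how many extracted text items carry visible text under each version. The sketch below is only a diagnostic aid, not a confirmed explanation: it assumes each extracted item exposes its text in a `str` field, and the `TextItem` type and `countItems` helper are hypothetical names introduced for illustration. Whether whitespace-only items account for the difference between 0.1.5 and 0.2.0 is an untested assumption.

```typescript
// Hypothetical minimal shape: only the `str` field of an extracted item is used.
interface TextItem {
  str: string;
}

// Count all items and, separately, the items that contain visible text.
function countItems(items: TextItem[]): { total: number; nonEmpty: number } {
  const nonEmpty = items.filter((item) => item.str.trim().length > 0).length;
  return { total: items.length, nonEmpty };
}

// Example with made-up items: two carry text, two are empty or whitespace-only.
const sampleItems: TextItem[] = [
  { str: "Hello" },
  { str: " " },
  { str: "" },
  { str: "world" },
];
console.log(countItems(sampleItems));
```

If `nonEmpty` matches across versions while `total` does not, the extra items are empty or whitespace-only; if both differ, the cause lies elsewhere.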

I'm not sure if this is related, but the normalizeWhitespace option has been removed from the pdf.js API. Unfortunately, this makes the output unusable for me, so I will also be staying on the old version. If there is no way to maintain the whitespace from pdf.js, I will have to find something else to extract text.
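A possible workaround for the missing normalizeWhitespace option is to post-process the extracted items. The sketch below approximates the option's documented effect (replacing every whitespace character with a standard space, 0x20) and makes no claim to reproduce the old output exactly. It assumes each extracted item exposes its text in a `str` field; the `TextItem` type and the `normalizeItemWhitespace` helper are hypothetical names introduced for illustration.

```typescript
// Hypothetical minimal shape: only the `str` field of an extracted item is used.
interface TextItem {
  str: string;
}

// Return copies of the items with every whitespace character (tab,
// non-breaking space, line break, ...) replaced by a standard space (U+0020).
// All other fields on each item are carried over unchanged.
function normalizeItemWhitespace<T extends TextItem>(items: T[]): T[] {
  return items.map((item) => ({ ...item, str: item.str.replace(/\s/g, " ") }));
}

// Example with a made-up item containing a non-breaking space and a tab.
const rawItems = [{ str: "a\u00a0b\tc", x: 10, y: 20 }];
console.log(normalizeItemWhitespace(rawItems));
```

The helper returns new objects rather than mutating the input, so the original extraction result stays available for comparison.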