Mismatch between no. of divs in PDF.js and 0.2.0 version

Question

akash-agr opened this issue 2 years ago · 3 comments

Hi,

Thanks a lot for creating this library. I have been using it for more than a year now. I was working with version 0.1.5, and everything was working fine, as I could match the number of divs created by PDF.js in the web browser.

But after upgrading to version 0.2.0, I noticed that the number of divs generated is significantly higher than with version 0.1.5. Hence, they no longer match the divs generated by PDF.js in the web browser. Please check whether this is a bug or whether I am missing something.

Answer 1 · 2022-05-22T09:49:02.000Z

Please provide a demo PDF so I can check whether there is something I can do, or whether this is just new behaviour of the more recent pdf.js version. This library currently uses pdf.js 2.14.110. Which version of pdf.js is your browser using?

Answer 2 · 2022-05-24T12:53:27.000Z

Hi, thanks for replying.

PDF host - https://aqua-kelley-58.tiiny.site/
I am using "pdfjs-dist": "2.0.338" in the React application.

I can't understand why the two versions output a different number of divs. Please have a look. Thank you.

Also please don't deprecate the older version.

PS. I have downgraded pdf.js-extract to 0.1.5, and it is working fine. I just wanted to highlight the issue.
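
For anyone trying to pin down where the counts diverge, a per-page comparison of the two outputs may help. The following is only a rough sketch: it assumes each extraction result has a `pages` array whose entries carry a `content` array of text items with a `str` field, and the names `ExtractResult`, `itemsPerPage` and `diffCounts` are hypothetical helpers, not part of pdf.js-extract or pdf.js.

```typescript
// Assumed (not confirmed in this thread) minimal shape of an extraction result.
interface ExtractedPage {
  content: { str: string }[];
}

interface ExtractResult {
  pages: ExtractedPage[];
}

// Hypothetical record describing one page whose counts differ.
interface PageDiff {
  page: number; // 1-based page number
  a: number;    // item count in the first result
  b: number;    // item count in the second result
}

// Count the text items on each page.
function itemsPerPage(result: ExtractResult): number[] {
  return result.pages.map((page) => page.content.length);
}

// List the pages whose item counts differ between two results.
// A page missing from one result is treated as having zero items.
function diffCounts(a: ExtractResult, b: ExtractResult): PageDiff[] {
  const countsA = itemsPerPage(a);
  const countsB = itemsPerPage(b);
  const pageCount = Math.max(countsA.length, countsB.length);
  const diffs: PageDiff[] = [];
  for (let i = 0; i < pageCount; i++) {
    const ca = countsA[i] ?? 0;
    const cb = countsB[i] ?? 0;
    if (ca !== cb) {
      diffs.push({ page: i + 1, a: ca, b: cb });
    }
  }
  return diffs;
}

// Made-up sample data standing in for the output of two library versions.
const sampleOld: ExtractResult = {
  pages: [{ content: [{ str: "Hello world" }] }],
};
const sampleNew: ExtractResult = {
  pages: [{ content: [{ str: "Hello" }, { str: " " }, { str: "world" }] }],
};

console.log(diffCounts(sampleOld, sampleNew));
```

Saving the output of each version as JSON and feeding both into a helper like this would show which pages differ, which makes it easier to share a small reproducible case.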

Answer 3 · 2022-06-26T13:21:55.000Z

I'm not sure if this is related, but the normalizeWhitespace option has been removed from the pdf.js API. This unfortunately makes the output unusable for me, so I will also be remaining on the old version. I will have to find something else to extract text if there is no way to maintain the whitespace from pdf.js.