Parsing text from PDF

Question

nunofgs opened this issue 5 years ago · 3 comments

Hi @Hopding, thank you for the great lib.

Apologies if this is a newbie question, but I can't seem to find a way to parse text out of an existing PDF. I'm looking to retrieve a string from a PDF in order to determine which page it's on.

Any idea how I could accomplish this?

Answer 1 · 2019-07-16T11:39:49.000Z

I'm personally looking to find some text and replace the contents of the "field" it's in.

Answer 2 · 2019-07-20T23:04:25.000Z

Hello @nunofgs!

It is not currently possible to parse plain text out of a document with pdf-lib (but you can extract the content of acroform fields). I'd suggest you consider using PDF.js to parse/extract text.

Of course, this isn't an ideal solution since it requires two different libraries for a seemingly simple task. But it's the best approach I know of for now, until pdf-lib gains support for text parsing.
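For anyone landing here, a rough sketch of the PDF.js route for the original question (finding which page a string is on) might look like the following. It is not part of pdf-lib. The helper names `itemsToText` and `findPagesContaining` and the `*Like` interfaces are made up for illustration; they only describe the parts of a PDF.js document that the code relies on (`numPages`, `getPage`, `getTextContent`, and the `str` property of text items).

```typescript
// Structural types covering only the parts of the PDF.js API used below.
// Marked-content items in PDF.js carry no `str`, hence the optional field.
interface TextItemLike { str?: string }
interface PageLike { getTextContent(): Promise<{ items: TextItemLike[] }> }
interface DocLike { numPages: number; getPage(pageNumber: number): Promise<PageLike> }

// Hypothetical helper: flatten one page's text items into a single string.
function itemsToText(items: TextItemLike[]): string {
  return items.map((item) => item.str ?? "").join(" ");
}

// Hypothetical helper: return the 1-based numbers of all pages whose text contains `needle`.
async function findPagesContaining(doc: DocLike, needle: string): Promise<number[]> {
  const hits: number[] = [];
  for (let n = 1; n <= doc.numPages; n++) {
    const page = await doc.getPage(n); // PDF.js page numbers start at 1
    const { items } = await page.getTextContent();
    if (itemsToText(items).includes(needle)) hits.push(n);
  }
  return hits;
}
```

With `pdfjs-dist`, the `doc` argument would typically come from `await pdfjsLib.getDocument({ data: bytes }).promise`. Because the items are joined with spaces, a search string that PDF.js splits across items mid-word may not match, so treat this as a starting point rather than a robust search.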

Answer 3 · 2019-07-20T23:08:35.000Z

@dasilvacontin Is the field you are working with just plain text? Or is it an acroform field? If it is raw text, I'm afraid pdf-lib doesn't have the necessary features to parse it (but as I mentioned above, you could use PDF.js instead).

However, if it's in an acroform, pdf-lib should be able to do what you need. pdf-lib's acroform support isn't currently well documented, so I'd suggest taking a look at some of the existing acroform issues. Please let me know if you have any questions!
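For the acroform case, here is a hedged sketch of a find-and-replace over text field values. It assumes a pdf-lib release newer than this thread, one that exposes a form API with text fields offering `getName()`, `getText()`, and `setText()`. The helper `replaceInTextFields` and the `TextFieldLike` interface are hypothetical and only describe the methods the code needs.

```typescript
// Structural type covering only the text-field methods used below.
interface TextFieldLike {
  getName(): string;
  getText(): string | undefined;
  setText(text: string | undefined): void;
}

// Hypothetical helper: replace every occurrence of `search` inside each text
// field's value and return the names of the fields that were changed.
function replaceInTextFields(
  fields: TextFieldLike[],
  search: string,
  replacement: string,
): string[] {
  const changed: string[] = [];
  for (const field of fields) {
    const value = field.getText();
    if (value !== undefined && value.includes(search)) {
      // split/join replaces all occurrences without regex escaping concerns
      field.setText(value.split(search).join(replacement));
      changed.push(field.getName());
    }
  }
  return changed;
}
```

In a pdf-lib version that has the form API, the `fields` array could be built with something like `pdfDoc.getForm().getFields().filter((f) => f instanceof PDFTextField)`. This only touches form field values; plain page text is still out of reach, as noted above.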