Parsing text from PDF
nunofgs opened this issue · 3 comments
Hi @Hopding, thank you for the great lib.
Apologies if this is a newbie question, but I can't seem to find a way to parse text out of an existing PDF. I'm looking to retrieve a string from a PDF in order to determine which page it's on.
Any idea how I could accomplish this?
I'm personally looking to find some text and replace the "field"'s contents.
Hello @nunofgs!
pdf-lib cannot currently parse plain text out of a document, although it can extract the contents of AcroForm fields. I'd suggest using PDF.js to parse/extract the text instead.
Of course, this isn't an ideal solution since it requires two different libraries for a seemingly simple task. But it's the best approach I know of for now, until pdf-lib gains support for text parsing.
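As a rough illustration of the PDF.js route, here is a minimal sketch for finding which pages contain a given string. It is written against small stand-in interfaces (`PageLike` and `DocumentLike`, both hypothetical names) that mirror the `numPages`, `getPage` and `getTextContent` members of the PDF.js document and page proxies, so the matching logic can be read and tested without loading a real file. The helper names `pageContains` and `findPagesContaining` are assumptions made for this example, not part of either library.

```typescript
// Stand-ins for the parts of PDF.js used below (assumed shapes, modelled on
// the PDF.js document and page proxy objects).
interface PageLike {
  getTextContent(): Promise<{ items: object[] }>;
}
interface DocumentLike {
  numPages: number;
  getPage(pageNumber: number): Promise<PageLike>;
}

// Joins the `str` of every text item on a page and checks for the needle.
// Items without a string `str` (e.g. marked-content markers) are skipped.
function pageContains(items: object[], needle: string): boolean {
  const text = items
    .map((item) => {
      const str = (item as { str?: unknown }).str;
      return typeof str === "string" ? str : "";
    })
    .join(" ");
  return text.includes(needle);
}

// Returns the 1-based numbers of all pages whose text contains the needle.
async function findPagesContaining(
  doc: DocumentLike,
  needle: string
): Promise<number[]> {
  const matches: number[] = [];
  for (let pageNumber = 1; pageNumber <= doc.numPages; pageNumber++) {
    const page = await doc.getPage(pageNumber);
    const { items } = await page.getTextContent();
    if (pageContains(items, needle)) matches.push(pageNumber);
  }
  return matches;
}
```

With PDF.js, `doc` would typically be the result of `await getDocument({ data }).promise` from the `pdfjs-dist` package, though a type cast may be needed to satisfy the stand-in interfaces. PDF text is often stored in small fragments, so a phrase that is split across items may not match with this simple join; treat the result as approximate.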
@dasilvacontin Is the field you are working with just plain text, or is it an AcroForm field? If it is plain text, I'm afraid pdf-lib doesn't have the necessary features to parse it (but as I mentioned above, you could use PDF.js instead).
However, if it's an AcroForm field, pdf-lib should be able to do what you need. pdf-lib's AcroForm support isn't currently well documented, so I'd suggest taking a look at some of the existing AcroForm issues. Please let me know if you have any questions!
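For the AcroForm case, a find-and-replace over text field values could look roughly like the sketch below. It assumes a pdf-lib release that provides the form API (`PDFDocument.getForm`), and it is written against a hypothetical `TextFieldLike` interface that mirrors the `getName`, `getText` and `setText` methods of pdf-lib's `PDFTextField`. The function name `replaceInTextFields` is made up for this example.

```typescript
// Stand-in for the parts of pdf-lib's PDFTextField used below (assumed shape).
interface TextFieldLike {
  getName(): string;
  getText(): string | undefined;
  setText(text: string | undefined): void;
}

// Replaces every occurrence of `search` in each field's current value and
// returns the names of the fields that were changed.
function replaceInTextFields(
  fields: TextFieldLike[],
  search: string,
  replacement: string
): string[] {
  const changed: string[] = [];
  if (search === "") return changed; // nothing sensible to search for

  for (const field of fields) {
    const current = field.getText();
    // Skip empty fields and fields that do not contain the search string.
    if (current === undefined || !current.includes(search)) continue;
    field.setText(current.split(search).join(replacement));
    changed.push(field.getName());
  }
  return changed;
}
```

With pdf-lib itself, the fields could be collected after `PDFDocument.load(bytes)` by filtering `pdfDoc.getForm().getFields()` down to instances of `PDFTextField`, and the modified document written out with `pdfDoc.save()`. This only touches form field values; plain page text is still out of reach, as described above.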