Reading contents of a PDF
santiagomed opened this issue · 9 comments
Is there an example of how to simply read the contents of a PDF? I tried looking into read.rs, but it seems to be outdated, so I can't run it. Is there any way to read a PDF?
What content do you want?
There is a lot in there.
- Content stream? You can get that from the page object.
- Text? See the pdf_render and pdf_text crates.
You can use the pdf crate in two versions:
- from crates.io, then use the example that matches it: https://github.com/pdf-rs/pdf/tree/a6e2abc96b23b64aa1051966bb000aabf1275d9f
- master with the latest fixes.
The pdf_render and pdf_text crates only work with the latest master.
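For anyone unsure how to pick between the two, here is a rough sketch of what the Cargo.toml could look like. The version number and the pdf_render / pdf_text repository URLs are assumptions on my part (only the pdf repository URL appears in this thread), so check them before copying.

```toml
[dependencies]
# Option 1: the released crate from crates.io.
# "0.9" is a placeholder; use whatever version crates.io currently lists,
# and pair it with the examples from the matching commit.
# pdf = "0.9"

# Option 2: track master for the latest fixes.
# This is the variant pdf_render and pdf_text need.
pdf = { git = "https://github.com/pdf-rs/pdf" }

# Not on crates.io, so they have to come from git.
# Repository locations are assumed, not confirmed in this thread.
pdf_render = { git = "https://github.com/pdf-rs/pdf_render" }
pdf_text = { git = "https://github.com/pdf-rs/pdf_text" }
```

Only one of the two `pdf` lines should be active at a time. If you only need the content stream from the page object, the `pdf` dependency alone should be enough.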
What are the pdf_render and pdf_text crates? crates.io doesn't know anything about them.
They are not on crates.io because they do not meet my stability requirements for publishing there.
pdf_render … renders pdfs.
pdf_text extracts text.
The pdf-extract crate exists, but depends on lopdf, not pdf. This video benchmarks it against poppler, a C library.
I'd be curious to see a C/Rust comparison but with poppler against pdf_text.