kanzure/pdfparanoia

JSTOR

kanzure opened this issue · 0 comments

How to strip the JSTOR watermark is not immediately obvious. Here are my scratch notes:

import pdfparanoia
pdf = pdfparanoia.parser.parse_content(open("sample.pdf", "rb").read())
text = pdf.catalog["Pages"].resolve()["Kids"][0].resolve()["Contents"].resolve().data

list(pdf.get_pages())[0].contents[0].decode
text2 = list(pdf.get_pages())[0].contents[0].data

"This content downloaded on" appears in both the "text" and "text2" variables. Each of them seems to hold the content stream for the entire page, so deleting the whole object won't solve the problem: it would remove the page's real text along with the watermark.
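
One possible direction, since the watermark shares a stream with the rest of the page: edit the stream data and drop only the text object that carries the marker. The sketch below is an untested idea, not what pdfparanoia does. It assumes the watermark sits in its own BT ... ET text object and that the marker appears as a plain literal string; neither has been checked against a real JSTOR file. `strip_watermark` and the sample bytes are made up for illustration.

```python
import re

# Marker string observed in the page content stream.
WATERMARK = b"This content downloaded on"

# In a PDF content stream, a text object is delimited by the BT and ET operators.
# Assumption: no literal string inside the block contains a standalone "ET" word.
TEXT_OBJECT = re.compile(rb"\bBT\b.*?\bET\b", re.DOTALL)


def strip_watermark(stream_data, marker=WATERMARK):
    """Return stream_data with every BT...ET block containing marker removed."""

    def drop_if_marked(match):
        block = match.group(0)
        # Keep ordinary text objects untouched; blank out the marked ones.
        return b"" if marker in block else block

    return TEXT_OBJECT.sub(drop_if_marked, stream_data)


# Hypothetical, hand-written stream data standing in for the decoded page content.
sample = (
    b"BT /F1 12 Tf 72 720 Td (Article body text) Tj ET\n"
    b"BT /F2 8 Tf 72 20 Td (This content downloaded on some date) Tj ET\n"
)
cleaned = strip_watermark(sample)
```

If the watermark turns out to share a text object with real content, or is drawn glyph by glyph, this regex approach would not be enough and the stream would need to be tokenized properly.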