JSTOR
kanzure opened this issue · 0 comments
kanzure commented
How to handle JSTOR PDFs is not immediately obvious. Here are my scratch notes:
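import pdfparanoia
# Read the sample PDF and parse it; the file handle is closed afterwards.
with open("sample.pdf", "rb") as handle:
    pdf = pdfparanoia.parser.parse_content(handle.read())
# Content stream data of the first page, reached by walking the catalog.
text = pdf.catalog["Pages"].resolve()["Kids"][0].resolve()["Contents"].resolve().data
list(pdf.get_pages())[0].contents[0].decode
# Alternative route to the first page's contents, via get_pages().
text2 = list(pdf.get_pages())[0].contents[0].data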
The string "This content downloaded on" appears in both the "text" and "text2" variables. That data seems to be the content for the entire page, so deleting the entire object won't solve the problem: it would remove the page text along with the watermark.
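
One possible direction, as a rough sketch only: instead of deleting the whole object, filter the page data and drop just the text object that carries the string. This assumes the decoded page data is an ordinary PDF content stream in which the watermark sits inside its own BT ... ET text object as a literal string; that is not confirmed for JSTOR files in general. The names `strip_watermark_text`, `TEXT_OBJECT` and the `sample` bytes are made up for illustration and are not part of pdfparanoia.

```python
import re

# Marker string observed in the page data (from the notes above).
WATERMARK = b"This content downloaded on"

# In a PDF content stream, a text object is delimited by the BT and ET
# operators. This pattern is naive: it does not handle "ET" appearing
# inside a string literal, so treat it as a sketch.
TEXT_OBJECT = re.compile(rb"\bBT\b.*?\bET\b", re.DOTALL)


def strip_watermark_text(content, marker=WATERMARK):
    """Return content with every BT..ET block containing marker removed."""
    def keep_or_drop(match):
        block = match.group(0)
        # Drop only the blocks that mention the marker; keep the rest.
        return b"" if marker in block else block

    return TEXT_OBJECT.sub(keep_or_drop, content)


# Hypothetical, hand-written content stream standing in for real page data.
sample = (
    b"BT /F1 12 Tf 72 720 Td (Actual article text) Tj ET\n"
    b"BT /F2 8 Tf 72 20 Td (This content downloaded on some date) Tj ET\n"
)

cleaned = strip_watermark_text(sample)
```

With the hypothetical sample, `cleaned` keeps the first text object and loses the second. Whether the real stream is laid out this way still needs checking against sample.pdf, and the filtered bytes would still have to be written back into the stream object.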