Problems when parsing older papers in PDF format
XueliPan opened this issue · 2 comments
XueliPan commented
Hi, thanks for this great toolkit!
I tried papermage with several PDF files. It works really well with recent papers, but when I tried to parse some papers published in 1980 or 1989, papermage failed to segment the sentences: nearly every word came back as its own sentence.
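from papermage.recipes import CoreRecipe
recipe = CoreRecipe()  # assumed setup, as in the papermage README
doc = recipe.run("1980.pdf")
# Each predicted sentence is printed on its own line.
for sen in doc.sentences:
    print(sen.text)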
'''
output:
Received
January
1978;
revised
October
1979;
accepted
December 1979
References
1.
Avery,
K.
R.
,
and
Avery,
C.
A.
Design
and
development
of an interactive
statistical
system
(SIPS).
Proc.
Comptr.
Sci.
and
Statistics: 8th
Ann.
Symp.
on
'''
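In case it helps with triage, here is a minimal sketch of how the fragmentation could be quantified. It is not part of papermage: `fragmentation_report` and its `short_len` threshold are hypothetical names made up for illustration, and the function only looks at plain sentence strings.

```python
def fragmentation_report(sentences, short_len=2):
    """Summarise how fragmented a list of sentence strings is.

    `short_len` is an arbitrary, assumed threshold: sentences with at most
    this many whitespace-separated words are counted as "short".
    """
    # Count words per non-empty sentence.
    word_counts = [len(s.split()) for s in sentences if s.strip()]
    if not word_counts:
        return {"n_sentences": 0, "mean_words": 0.0, "short_fraction": 0.0}
    short = sum(1 for n in word_counts if n <= short_len)
    return {
        "n_sentences": len(word_counts),
        "mean_words": sum(word_counts) / len(word_counts),
        "short_fraction": short / len(word_counts),
    }


# First few "sentences" from the output above.
sample = ["Received", "January", "1978;", "revised",
          "October", "1979;", "accepted", "December 1979"]
report = fragmentation_report(sample)
print(report)
```

With a parsed document, the input would presumably be `[sen.text for sen in doc.sentences]`. A short-sentence fraction close to 1 on a whole paper would suggest that sentence segmentation has broken down, rather than that the paper really consists of one-word sentences.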
kyleclo commented
Interesting! Could you send me the PDF so I can have a look at it? Older PDFs are not something we have really investigated much.