Problems when parsing older papers in PDF format

Question

XueliPan opened this issue a year ago · 2 comments

Hi, thanks for this great toolkit!
I tried papermage with several PDF files. It works really well with recent papers, but when I tried to parse some papers published in 1980 and 1989, papermage failed to segment the sentences: almost every word comes back as its own "sentence".

from papermage.recipes import CoreRecipe
recipe = CoreRecipe()
doc = recipe.run("1980.pdf")
for sen in doc.sentences:
    print(sen.text)
'''
output:
Received
January
1978;
revised
October
1979;
accepted
December 1979
References
1.
Avery,
K.
R.
,
and
Avery,
C.
A.
Design
and
development
of an interactive
statistical
system
(SIPS).
Proc.
Comptr.
Sci.
and
Statistics: 8th
Ann.
Symp.
on
'''
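
To put a number on the fragmentation seen in the output above, here is a minimal sketch. It is not part of papermage: `fragmentation_ratio` and `max_words` are hypothetical names chosen for illustration, and the sample list is copied from the first lines of the output. It measures the fraction of "sentences" that contain at most a couple of words.

```python
def fragmentation_ratio(sentences, max_words=2):
    """Return the fraction of non-empty sentences with at most `max_words` words.

    Hypothetical helper for illustration; not a papermage API.
    """
    texts = [s.strip() for s in sentences if s.strip()]
    if not texts:
        return 0.0
    short = sum(1 for t in texts if len(t.split()) <= max_words)
    return short / len(texts)


# Sample taken from the fragmented output of the 1980 paper.
sample = ["Received", "January", "1978;", "revised",
          "October", "1979;", "accepted", "December 1979"]
print(fragmentation_ratio(sample))
```

With a parsed document, the same check could presumably be run as `fragmentation_ratio([sen.text for sen in doc.sentences])`. A value close to 1 would suggest the sentence segmentation has collapsed into single words.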

Answer 1 · 2023-12-19T01:10:40.000Z

Interesting! Could you send me the PDF so I can have a look at it? Older PDFs are not something we have really investigated much.

Answer 2 · 2023-12-19T11:38:03.000Z

1980.pdf
1989.pdf
These are the two PDF files that I have tested. Thanks!