Failed to identify title from JMF
eliotlencelot opened this issue · 2 comments
Hello metebalci,
I am not able to use pdftitle -p PDF to extract the title of scientific articles from the Journal of Medicinal Food.
For example, this file does not produce a title:
woo2019.pdf
Is it possible to adjust the algorithm slightly for this kind of article?
I have tried the new option pdftitle -a max2 -p PDF without success. I do not see a list of the values that can be passed to -a in the README, so to the best of my knowledge, from reading this GitHub repository, the only options are -a max2 and -a default. If there are others, please note that I have not tried them.
Thank you!
I also get a Python error raised by pdfminer when adding the verbose option -v:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/pdfminer/pdffont.py", line 593, in to_unichr
    return self.cid2unicode[cid]
KeyError: 2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/pdftitle.py", line 589, in run
    title = get_title_from_file(args.pdf)
  File "/usr/local/lib/python3.7/dist-packages/pdftitle.py", line 523, in get_title_from_file
    return get_title_from_io(raw_file)
  File "/usr/local/lib/python3.7/dist-packages/pdftitle.py", line 444, in get_title_from_io
    interpreter.process_page(page)
  File "/usr/local/lib/python3.7/dist-packages/pdfminer/pdfinterp.py", line 895, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/usr/local/lib/python3.7/dist-packages/pdfminer/pdfinterp.py", line 908, in render_contents
    self.execute(list_value(streams))
  File "/usr/local/lib/python3.7/dist-packages/pdfminer/pdfinterp.py", line 933, in execute
    func(*args)
  File "/usr/local/lib/python3.7/dist-packages/pdftitle.py", line 291, in do_Tj
    self.do_TJ([s])
  File "/usr/local/lib/python3.7/dist-packages/pdftitle.py", line 323, in do_TJ
    self.device.process_string(self.mpts, seq)
  File "/usr/local/lib/python3.7/dist-packages/pdftitle.py", line 374, in process_string
    self.draw_cid(ts, cid)
  File "/usr/local/lib/python3.7/dist-packages/pdftitle.py", line 394, in draw_cid
    unichar = ts.Tf.to_unichr(cid)
  File "/usr/local/lib/python3.7/dist-packages/pdfminer/pdffont.py", line 595, in to_unichr
    raise PDFUnicodeNotDefined(None, cid)
pdfminer.pdffont.PDFUnicodeNotDefined: (None, 2)
Hello and thanks for raising this issue.
First, the error you mentioned in the second comment occurs because the PDF contains a character that does not exist in the font. To work around this, you can use the --replace-missing-char option (e.g. use ' ' to replace missing characters with a space). I was silently ignoring these exceptions in normal (non-verbose) mode; I have changed this behavior in the new version, 0.9.
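To illustrate the idea behind replacing missing characters, here is a minimal sketch. It is not the pdftitle or pdfminer code: the UnicodeNotDefined class is only a stand-in for pdfminer's PDFUnicodeNotDefined, and the helper names to_unichr and decode_cids, as well as the sample mapping, are assumptions made for this example.

```python
from typing import Dict, Iterable, Optional


class UnicodeNotDefined(Exception):
    """Stand-in for pdfminer's PDFUnicodeNotDefined (sketch only)."""


def to_unichr(cid2unicode: Dict[int, str], cid: int) -> str:
    """Look up a character id; mirrors the failure seen in the traceback."""
    try:
        return cid2unicode[cid]
    except KeyError:
        raise UnicodeNotDefined(None, cid)


def decode_cids(
    cid2unicode: Dict[int, str],
    cids: Iterable[int],
    replacement: Optional[str] = None,
) -> str:
    """Decode character ids, substituting `replacement` for unmapped ones.

    With replacement=None the lookup error propagates to the caller.
    """
    chars = []
    for cid in cids:
        try:
            chars.append(to_unichr(cid2unicode, cid))
        except UnicodeNotDefined:
            if replacement is None:
                raise
            chars.append(replacement)
    return "".join(chars)


# Hypothetical font mapping in which cid 2 has no Unicode entry.
sample_map = {0: "P", 1: "M"}
decoded = decode_cids(sample_map, [0, 1, 2], replacement=" ")
```

With a replacement given, the unmapped cid 2 becomes a space instead of aborting the whole extraction; without one, the exception still reaches the caller.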
I checked the PDF you linked, and the problem was that the first letter (A) of the first paragraph, not the title, was set in the largest font on the page. So I implemented another algorithm, more general than the original, called eliot, in version 0.9. With this algorithm, you can select which font size to use, not by its absolute value but by its rank in size order (e.g. 0 is the largest, 1 is the second largest).
Now the result is:
$ pdftitle -a eliot --eliot-tfs 1 -p woo2019.pdf --replace-missing-char ' '
Lactobacillus HY2782 and Bifidobacterium HY8002 Decrease Airway Hyperresponsiveness Induced by Chronic PM2.5 Inhalation in Mice
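The rank-based font size selection described above can be sketched roughly as follows. This is not pdftitle's actual implementation; the function name title_by_font_rank, the (font size, text) tuple layout, and the sample sizes and strings are all assumptions made for illustration.

```python
from typing import List, Tuple


def title_by_font_rank(blocks: List[Tuple[float, str]], rank: int = 0) -> str:
    """Join the text of all blocks whose font size is the rank-th largest.

    rank 0 selects the largest font size, rank 1 the second largest, etc.
    """
    # Distinct font sizes on the page, largest first.
    sizes = sorted({size for size, _ in blocks}, reverse=True)
    if not 0 <= rank < len(sizes):
        raise ValueError(f"rank {rank} out of range for {len(sizes)} font sizes")
    target = sizes[rank]
    # Keep the blocks in page order and join those with the chosen size.
    return " ".join(text for size, text in blocks if size == target)


# Made-up page: a drop cap ("A") is larger than the two title lines.
sample_blocks = [
    (9.0, "Journal header"),
    (18.0, "Sample Title First Line"),
    (18.0, "Second Line"),
    (30.0, "A"),
    (10.0, "body text of the first paragraph"),
]

largest = title_by_font_rank(sample_blocks, 0)
second_largest = title_by_font_rank(sample_blocks, 1)
```

On this made-up page, rank 0 picks up only the drop cap, while rank 1 returns the two title lines joined together, which is the situation that --eliot-tfs 1 addresses in the command above.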