pdf-extract instead of pdfreader
retrography opened this issue · 2 comments
I don't know if you have noticed this project or not: https://github.com/blusquare/pdfextract
It is an abandoned CrossRef project, but this fork still works well for extracting references. The gem does structural analysis on the PDF file, and thus needs no input from the user in order to detect the references. It is probably a better match than using the raw pdfreader.
The project is MIT-licensed, and doesn't impose restrictions on derivative work.
pdf-extract extract --references glas.pdf > glas.xml
sed -r -e 's/<[^<>]+>//ig' -e 's/^ +//' glas.xml > glas.txt
anystyle -f bib find glas.txt
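
For anyone who wants to try the tag-stripping step without a PDF at hand, here is a minimal sketch of what the sed line does. The sample XML is made up: the `<pdf>` and `<reference>` tag names and the two references are assumptions for illustration, not output copied from pdf-extract. The `strip_tags` helper is hypothetical as well, and unlike the command above it also drops the lines left empty once the tags are removed.

```shell
#!/bin/sh
# Hypothetical helper: remove anything that looks like an XML tag,
# trim leading spaces, and delete lines that end up empty.
strip_tags() {
  sed -E -e 's/<[^<>]+>//g' -e 's/^ +//' -e '/^$/d'
}

# Made-up sample; the real tag names in pdf-extract output may differ.
sample='<?xml version="1.0"?>
<pdf>
  <reference>Doe, J. (2001). A title. Journal, 1(2), 3-4.</reference>
  <reference>Roe, R. (2010). Another title. Press.</reference>
</pdf>'

# Prints one plain-text reference per line.
printf '%s\n' "$sample" | strip_tags
```

Note that this leaves XML entities such as `&amp;` untouched, so references containing them may need an extra decoding step before being handed to anystyle.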
I was not aware of that project, no. You're right, the finder component currently works on plain text, so we're losing a lot of valuable information (font styles, metrics, exact positioning). Using rich text information was supposed to be a next step (if necessary at all). I'd have to look at pdfextract more closely but, yes, it may be a great fit. Meanwhile, it's cool that you can still plug it into the parser module as above.
The naming is a little confusing, but this one is actually pdf-extract, not pdfextract (I mean the gem name, not the repo name). It doesn't give you rich text, but it automatically detects the reference pages and eliminates the margins with no user intervention. The output is an XML file, with each reference enclosed as plain text in an XML tag. Anystyle really likes the output!