perrette/papers

DOI parsing fails in a few cases

perrette opened this issue · 2 comments

The current method for retrieving the DOI searches the first two pages with regular expressions and keeps the first match that appears.

Accepted prefixes are (lower or upper case):

'doi:', 'doi: ', 'doi ', 'dx\.doi\.org/', 'doi/'

The DOI itself is matched with:

r"10\.\d\d\d\d/.*?"

and is expected to end with:

r"[, \n]"
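For illustration, the three pieces above can be assembled into a single pattern. This is only a minimal sketch based on the description in this issue, not necessarily the code used in the package; the names `PREFIXES`, `DOI_RE` and `find_doi` are made up for the example.

```python
import re

# Prefixes listed above. The longer 'doi: ' comes before 'doi:' so that the
# optional space is consumed by the prefix rather than left in front of the DOI.
PREFIXES = r"(?:doi: |doi:|doi |dx\.doi\.org/|doi/)"

# Prefix, then a lazy DOI body, then one of the terminating characters.
# re.IGNORECASE accepts lower- or upper-case prefixes.
DOI_RE = re.compile(PREFIXES + r"(10\.\d\d\d\d/.*?)[, \n]", re.IGNORECASE)


def find_doi(text):
    """Return the first DOI-like match in `text`, or None if nothing matches."""
    match = DOI_RE.search(text)
    return match.group(1) if match else None
```

With this sketch, `find_doi("DOI 10.1234/xyz\n")` returns the captured group without the prefix or the terminating newline.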

The method fails in a few cases:

  • when the DOI spans two lines (e.g. here)
  • when other DOIs appear before the paper's own DOI (e.g. here)
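Both failure modes can be reproduced on made-up text. The snippet below is a hedged sketch: the pattern is reconstructed from the description in this issue, and the sample strings and DOIs are invented for the demonstration.

```python
import re

# Pattern reconstructed from this issue's description (an assumption, not
# necessarily the package's actual code): prefix, lazy DOI body, terminator.
DOI_RE = re.compile(
    r"(?:doi: |doi:|doi |dx\.doi\.org/|doi/)(10\.\d\d\d\d/.*?)[, \n]",
    re.IGNORECASE,
)


def first_doi(text):
    """Return the first DOI-like match in `text`, or None."""
    match = DOI_RE.search(text)
    return match.group(1) if match else None


# Case 1: the DOI is broken across two lines, so the newline ends the match early.
wrapped = "doi:10.1234/journal.\n2015.001 \n"
truncated = first_doi(wrapped)  # only the part before the line break

# Case 2: a cited DOI appears before the paper's own DOI, and the first one wins.
cited_first = "See doi:10.1111/other.paper, and this article doi:10.2222/this.paper \n"
wrong = first_doi(cited_first)  # the cited DOI, not the article's own

print(truncated)
print(wrong)
```

In the first case the match stops at the line break; in the second, the earlier (cited) DOI is returned instead of the article's.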

These could be solved by more permissive DOI parsing, but the parsing stays conservative for now, until a good solution is found.

Nevertheless, the fixes already in place include:

  • pdftotext sometimes converts an underscore into a space, so we also detect an ending with any space followed by a digit. This solves at least one case.

https://github.com/MicheleCotrufo/pdf2doi might be a candidate, just to continue the conversation from #28

Yes. Thanks for the suggestion. If you end up exploring that sort of thing (with your large PDF database to test with!), I'd be glad if you could report back on what works best. And who knows, perhaps someone will show up who feels like merging all the good tools into @MicheleCotrufo's pdf2doi (if practical) or another stand-alone package. That should be a library with Python bindings, ideally not a verbose command-line tool (so that it can be used in other command-line tools), though one could certainly call it with subprocess (as is already done here with poppler utils).
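As a rough illustration of the subprocess approach mentioned above, the first two pages of a PDF can be extracted by calling poppler's `pdftotext`. This is a sketch, not the code used in this repository: the helper names are hypothetical, and it assumes `pdftotext` is installed and on the `PATH`.

```python
import subprocess


def pdftotext_command(pdf_path, first_page=1, last_page=2):
    """Build a pdftotext command that writes the given page range to stdout."""
    # -f / -l select the first and last page; "-" sends the text to stdout.
    return ["pdftotext", "-f", str(first_page), "-l", str(last_page), pdf_path, "-"]


def extract_text(pdf_path, first_page=1, last_page=2):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        pdftotext_command(pdf_path, first_page, last_page),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

A library wrapping another tool this way stays quiet unless something fails, which is what makes it usable from other command-line tools.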