Option for disabling document text method
m0dd0 opened this issue · 1 comments
First, thanks for this very helpful library!
For many of the papers I read, your algorithm works fine and finds the correct DOI.
But as you already mention in the README, for some papers the document_text
method returns a wrong DOI because the DOIs of other papers appear first.
Unfortunately, this happens very often for papers from certain conferences I read regularly, as they contain arXiv IDs in the references and do not contain their own DOI anywhere else in the text. At the same time, when I comment out the document_text method, I get pretty good results with the fourth method.
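A minimal sketch of this failure mode, assuming the text-based method returns whichever identifier appears first in the extracted text. The regexes, the `first_identifier` helper, and the sample text are simplified stand-ins for illustration, not the library's actual code:

```python
import re

# Simplified patterns for illustration only; not the library's real regexes.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
ARXIV_RE = re.compile(r"arXiv:(\d{4}\.\d{4,5})", re.IGNORECASE)


def first_identifier(text):
    """Return (kind, value) for the earliest DOI or arXiv ID in text, or None."""
    candidates = []
    for kind, pattern in (("doi", DOI_RE), ("arxiv", ARXIV_RE)):
        match = pattern.search(text)
        if match:
            value = match.group(1) if kind == "arxiv" else match.group(0)
            candidates.append((match.start(), kind, value))
    if not candidates:
        return None
    # The earliest match wins, regardless of which paper it belongs to.
    _, kind, value = min(candidates)
    return kind, value


# Made-up text: the paper has no DOI of its own, but a reference has an arXiv ID.
paper_text = """A Conference Paper Without Its Own DOI
Abstract. We study something interesting.
References
[1] A. Author. Some cited preprint. arXiv:0000.00000, 2017.
"""

print(first_identifier(paper_text))
```

Here the only identifier in the text belongs to a cited work, so a first-match strategy reports it as the paper's own.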
I am wondering if one of the following features might help to reduce this type of error:
- only using the first pages to look for the DOI in the text
- having an option to disable certain steps in the search process
- being able to customize the order of the search methods
Do you think one of these options (or something else) is something the library would benefit from and that can be implemented with reasonable effort? If so, I can see if I find the time to turn my current "comment-out workaround" into a mergeable feature.
All options you suggested are possible.
I would avoid the first option, because it would be tricky to come up with the "right amount of pages to look into". In many journals, the DOI is at the end of the paper.
Options 2 and 3 are relatively easy, although they would make the command line more verbose. You are more than welcome to do a PR with your code!
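For anyone picking this up, options 2 and 3 could be sketched roughly as follows. This is only an illustration under stated assumptions: `find_identifier`, the `FINDERS` table, and every method name except `document_text` are hypothetical, and the finder functions are stubs that return made-up values instead of reading a PDF.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# Hypothetical finder signature: takes a PDF path, returns an identifier or None.
Finder = Callable[[str], Optional[str]]


def find_identifier(
    path: str,
    finders: Dict[str, Finder],
    order: Optional[Sequence[str]] = None,
    disabled: Sequence[str] = (),
) -> Optional[Tuple[str, str]]:
    """Try each enabled finder in order; return (method_name, identifier)."""
    names = list(order) if order is not None else list(finders)
    unknown = [name for name in names if name not in finders]
    if unknown:
        raise ValueError(f"unknown search methods: {unknown}")
    for name in names:
        if name in disabled:
            continue  # option 2: skip methods the user switched off
        result = finders[name](path)
        if result is not None:
            return name, result
    return None


# Stub finders with made-up return values, standing in for the real methods.
FINDERS: Dict[str, Finder] = {
    "document_infos": lambda path: None,
    "document_text": lambda path: "10.0000/cited-paper",  # wrong: from a reference
    "title_search": lambda path: "10.0000/this-paper",
}

# Default order: document_text wins and returns the wrong identifier.
print(find_identifier("paper.pdf", FINDERS))
# Option 2: disable document_text.
print(find_identifier("paper.pdf", FINDERS, disabled=["document_text"]))
# Option 3: custom order.
print(find_identifier("paper.pdf", FINDERS, order=["title_search", "document_text"]))
```

On the command line, the same two parameters would presumably map to one flag for the disabled methods and one for the order, which is where the extra verbosity mentioned above would come from.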