Sanity check updated results
JensHeinrich opened this issue · 7 comments
Currently the following entry
@misc{BillBaker,
author = {Baker, Bill},
title = {Untitled Talk},
date = {2011/2012},
}
gets updated to
@misc{BillBaker,
author = {Ellams, Inua},
title = {Untitled},
date = {2011/2012},
doi = {10.5040/9781350210301.00000006},
source = {Crossref},
url = {https://doi.org/10.5040/9781350210301.00000006},
publisher = {Oberon Books},
year = {2010},
}
I wouldn't know how to improve this. Imagine the entry was
@misc{BillBaker,
author = {Ela, Ia},
title = {Untitled Talk},
date = {2011/2012},
}
-- what's the correct entry? The boundary is blurry, and there is no clear place to draw the line. Ultimately, the user has to decide. Right now, betterbib makes that choice. If you have a better suggestion, feel free to open a PR.
What do you think about raising an exception or warning when both the title and the author would be changed and the match is not based on a fixed identifier, e.g. an ISBN or DOI?
The message could say:
f"The best match would change author from {author_orig} to {author_new} AND title from {title_orig} to {title_new}. If you want this, please fix one of the entries manually."
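A minimal sketch of such a check, assuming entries are plain dicts of BibTeX fields. The function name `sanity_check` and the `matched_by_identifier` flag are hypothetical and not part of betterbib; the comparison is strict string equality, which is a simplification.

```python
from typing import Optional


def sanity_check(
    orig: dict, new: dict, matched_by_identifier: bool = False
) -> Optional[str]:
    """Return a warning message if an update looks suspicious, else None.

    Hypothetical helper: an update is treated as suspicious when both the
    author and the title would change and the match was not made through a
    fixed identifier such as a DOI or ISBN.
    """
    if matched_by_identifier:
        # A DOI/ISBN match is trusted even if author and title both differ.
        return None

    author_orig, author_new = orig.get("author"), new.get("author")
    title_orig, title_new = orig.get("title"), new.get("title")

    # Strict equality; any difference at all counts as a change.
    if author_orig != author_new and title_orig != title_new:
        return (
            f"The best match would change author from {author_orig} to "
            f"{author_new} AND title from {title_orig} to {title_new}. "
            "If you want this, please fix one of the entries manually."
        )
    return None


# The example from this issue: both author and title would change.
orig_entry = {"author": "Baker, Bill", "title": "Untitled Talk"}
new_entry = {"author": "Ellams, Inua", "title": "Untitled"}
message = sanity_check(orig_entry, new_entry)
if message is not None:
    print(message)
```

Because the comparison is exact, a slightly misspelled author combined with a shortened title would also trigger the warning, which matches the intent of leaving borderline cases to the user.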
Another idea would be to add a parameter, minimum-match-score or something like that, that excludes matches below that score.
There now is minimum_score. Set it to something greater than 0 to exclude certain results.
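To illustrate the idea of a score threshold, here is a rough sketch, not betterbib's actual implementation. It assumes each candidate is a dict with a numeric "score" key, as in Crossref API results; the helper name `best_match` is hypothetical.

```python
from typing import Optional


def best_match(results: list, minimum_score: float = 0.0) -> Optional[dict]:
    """Return the highest-scoring result at or above minimum_score, or None.

    Hypothetical helper: results without a "score" key are treated as 0.0.
    """
    candidates = [r for r in results if r.get("score", 0.0) >= minimum_score]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.get("score", 0.0))


# Made-up scores, purely for illustration.
results = [
    {"title": "Untitled", "score": 12.5},
    {"title": "Untitled Talk Notes", "score": 31.0},
]
print(best_match(results, minimum_score=20.0))  # keeps the second result
print(best_match(results, minimum_score=50.0))  # nothing passes the threshold
```

With a threshold of 0 every result passes, so the entry is always updated to the top hit; raising the threshold trades missed updates for fewer wrong ones.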
@nschloe, where would I find documentation on how the score is defined, so I can set it to a "no false positives" kind of mode?
Thanks, Nico.
Unfortunately for us, CrossRef is a huge project of which only some parts are FOSS. I couldn't find any information on sane defaults to minimize wrongful retrievals.
If betterbib passes the CrossRef score through unchanged, the following snippet might help others empirically assess which scores are useful (requires crossref-commons).
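from crossref_commons.iteration import iterate_publications_as_json

# See http://api.crossref.org/swagger-ui/index.html#/Works/get_works.
# Placeholder filter and query; replace them with real keys and values.
filters = {"key": "value"}  # named "filters" to avoid shadowing the built-in filter()
queries = {"query.key": "query_value"}
# Print the relevance score of each of the first 10 results.
for p in iterate_publications_as_json(max_results=10, filter=filters, queries=queries):
    print(p["score"])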