texworld/betterbib

Sanity check updated results

JensHeinrich opened this issue · 7 comments

Currently the following entry

@misc{BillBaker,
        author = {Baker, Bill},
        title = {Untitled Talk},
        date = {2011/2012},
}

gets updated to

@misc{BillBaker,
        author = {Ellams, Inua},
        title = {Untitled},
        date = {2011/2012},
        doi = {10.5040/9781350210301.00000006},
        source = {Crossref},
        url = {https://doi.org/10.5040/9781350210301.00000006},
        publisher = {Oberon Books},
        year = {2010},
}

I wouldn't know how to improve this. Imagine the entry was

@misc{BillBaker,
        author = {Ela, Ia},
        title = {Untitled Talk},
        date = {2011/2012},
}

-- what's the correct entry then? The boundary between fixing a typo and replacing the entry with a different work is blurry, and there is no clear place to draw it. Eventually, the user will have to decide. Right now, betterbib does this. If you have a better suggestion, feel free to open a PR.

What do you think about raising an exception/warning when both the title and the author would be changed and the match is not based on an exact identifier, e.g. ISBN or DOI?
The message could say:

f"The best match would change author from {author_orig} to {author_new} AND title from {title_orig} to {title_new}. If you want this, please fix one of the entries manually."
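A minimal sketch of such a check, assuming entries are plain dicts of BibTeX fields. The function name `is_safe_update` and the `identifier_keys` parameter are hypothetical and not part of betterbib:

```python
import warnings


def is_safe_update(original, updated, identifier_keys=("doi", "isbn")):
    """Return False (and warn) if an update replaces both author and title.

    Hypothetical helper for illustration; not betterbib's actual API.
    """
    # If the original entry already carried an identifier that the new
    # entry confirms, treat the match as exact and accept it.
    for key in identifier_keys:
        if key in original and original[key] == updated.get(key):
            return True

    author_changed = original.get("author") != updated.get("author")
    title_changed = original.get("title") != updated.get("title")
    if author_changed and title_changed:
        warnings.warn(
            f"The best match would change author from {original.get('author')} "
            f"to {updated.get('author')} AND title from {original.get('title')} "
            f"to {updated.get('title')}. If you want this, please fix one of "
            "the entries manually."
        )
        return False
    return True


orig = {"author": "Baker, Bill", "title": "Untitled Talk", "date": "2011/2012"}
new = {
    "author": "Ellams, Inua",
    "title": "Untitled",
    "doi": "10.5040/9781350210301.00000006",
}
safe = is_safe_update(orig, new)  # both fields differ, no identifier in orig
```

With the entry from this issue, `safe` is `False`, so the caller could keep the original entry instead of overwriting it.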

Another idea would be to add a parameter, minimum-match-score or something like that, that excludes matches below that match score.

There now is minimum_score. Set it to something greater than 0 to exclude certain results.
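As a rough illustration of what a minimum-score cutoff does, here is a sketch that assumes each candidate is a dict with a numeric `"score"` key, as in Crossref's JSON results. The function `best_match` and the sample scores are made up for the example, and whether betterbib compares with `>=` or `>` is an assumption:

```python
def best_match(candidates, minimum_score=0.0):
    """Return the highest-scoring candidate at or above minimum_score, else None.

    Hypothetical helper for illustration; not betterbib's actual code.
    """
    eligible = [
        c for c in candidates if "score" in c and c["score"] >= minimum_score
    ]
    # default=None covers the case where every candidate was filtered out.
    return max(eligible, key=lambda c: c["score"], default=None)


# Illustrative values only, not real Crossref scores.
candidates = [
    {"title": "Untitled", "score": 12.5},
    {"title": "Untitled Talk", "score": 40.0},
]
picked = best_match(candidates, minimum_score=20.0)
rejected = best_match(candidates, minimum_score=50.0)
```

Returning `None` when nothing clears the threshold lets the caller leave the entry untouched rather than accept a weak match.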

@nschloe, where would I find documentation on how the score is defined, so that I can set it to a "no false positives" kind of mode?

@vitorsr This is Crossref's "relevance score". I have no idea how exactly it's defined.

Thanks, Nico.

Unfortunately for us, Crossref is a huge project with only some parts FOSS. I couldn't find any information on sane defaults to minimize wrongful retrievals.

If the score is used unchanged, the following snippet might help others empirically assess which score thresholds are useful (requires crossref-commons).

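from crossref_commons.iteration import iterate_publications_as_json

# See http://api.crossref.org/swagger-ui/index.html#/Works/get_works.
# The keys and values below are placeholders; replace them with real ones.
filters = {"key": "value"}  # renamed to avoid shadowing the built-in filter()
queries = {"query.key": "query_value"}
# Print the relevance score of up to 10 matching works.
for p in iterate_publications_as_json(max_results=10, filter=filters, queries=queries):
    print(p["score"])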