Sanity check updated results

Question

JensHeinrich opened this issue 3 years ago · 7 comments

Currently the following entry

@misc{BillBaker,
        author = {Baker, Bill},
        title = {Untitled Talk},
        date = {2011/2012},
}

gets updated to

@misc{BillBaker,
        author = {Ellams, Inua},
        title = {Untitled},
        date = {2011/2012},
        doi = {10.5040/9781350210301.00000006},
        source = {Crossref},
        url = {https://doi.org/10.5040/9781350210301.00000006},
        publisher = {Oberon Books},
        year = {2010},
}

Answer 1 · 2021-09-15T11:18:36.000Z

I wouldn't know how to improve this. Imagine the entry was

@misc{BillBaker,
        author = {Ela, Ia},
        title = {Untitled Talk},
        date = {2011/2012},
}

-- what's the correct entry? The boundary between a legitimate correction and a wrong match is blurry, and it's impossible to draw a clear line. Eventually, the user will have to decide. Right now, betterbib does this. If you have a better suggestion, feel free to open a PR.

Answer 2 · 2021-09-15T13:58:53.000Z

What do you think about raising an exception or warning when both the title and the author would be changed and the match is not based on a fixed identifier, e.g. ISBN or DOI?
The message could say:

f"The best match would change author from {author_orig} to {author_new} AND title from {title_orig} to {title_new}. If you want this, please fix one of the entries manually."
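
A minimal sketch of such a check, assuming entries are plain dicts of BibTeX fields. The function name `warn_if_suspicious` and the dict layout are hypothetical and not part of betterbib's API.

```python
import warnings


def warn_if_suspicious(orig: dict, new: dict) -> bool:
    """Warn and return True if author AND title would both change
    while the original entry carries no fixed identifier (DOI/ISBN)."""
    # Assumption: a DOI or ISBN in the original entry makes the match trustworthy.
    has_fixed_id = any(orig.get(key) for key in ("doi", "isbn"))
    author_changed = orig.get("author") != new.get("author")
    title_changed = orig.get("title") != new.get("title")

    if author_changed and title_changed and not has_fixed_id:
        warnings.warn(
            f"The best match would change author from {orig.get('author')} "
            f"to {new.get('author')} AND title from {orig.get('title')} "
            f"to {new.get('title')}. If you want this, please fix one of "
            "the entries manually."
        )
        return True
    return False
```

With the BillBaker entry from the issue, both fields change and there is no DOI or ISBN in the original, so this sketch would warn.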

Answer 3 · 2021-09-16T06:55:07.000Z

Another idea would be to add a parameter, minimum-match-score or something like that, that excludes matches below that match score.

Answer 4 · 2022-04-02T14:40:20.000Z

There is now a minimum_score parameter. Set it to something greater than 0 to exclude results that score below it.
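
To illustrate the idea, here is a rough sketch of score-based filtering, assuming each candidate is a dict with a numeric "score" key. The function `best_match` is hypothetical and is not how betterbib necessarily implements minimum_score.

```python
from typing import Optional


def best_match(candidates: list, minimum_score: float = 0.0) -> Optional[dict]:
    """Return the highest-scoring candidate whose score is at least
    minimum_score, or None if no candidate qualifies."""
    # Candidates without a "score" key are treated as scoring 0.0 (an assumption).
    eligible = [c for c in candidates if c.get("score", 0.0) >= minimum_score]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c.get("score", 0.0))
```

Returning None leaves the original entry untouched when nothing scores high enough, which is the behavior the thread asks for.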

Answer 5 · 2023-01-02T04:33:58.000Z

@nschloe, where would I find documentation on how the score is defined, so I can set it to a "no false positives" kind of mode?

Answer 6 · 2023-01-02T13:56:08.000Z

@vitorsr This is Crossref's "relevance score". I don't know exactly how it's defined.

Answer 7 · 2023-01-02T16:08:39.000Z

Thanks, Nico.

Unfortunately for us, Crossref is a huge project with only some parts being FOSS, and I couldn't find any information on sane defaults to minimize wrongful retrievals.

If betterbib uses that score unchanged, the following snippet might help others empirically assess which scores are useful (requires crossref-commons).

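from crossref_commons.iteration import iterate_publications_as_json

# The filter and query keys below are placeholders; for valid ones, see
# http://api.crossref.org/swagger-ui/index.html#/Works/get_works.
filters = {"key": "value"}  # not named `filter`, which would shadow the builtin
queries = {"query.key": "query_value"}

# Print Crossref's relevance score for each of the first 10 results.
for p in iterate_publications_as_json(max_results=10, filter=filters, queries=queries):
    print(p["score"])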