diagnostic tool to find out contradicting authorities
I can easily put the query for Nanotyrannus into the advanced tab, but I'd prefer to make it a bit more generally useful first.
There is no query for all synonyms yet. For Tyrannosaurus, I just ran the query manually for all synonyms and removed all entries without conflicts by hand.
Here is a more general query:
################################################################################
# #
# Note: This query ONLY works with the treatment.ld.plazi.org sparql endpoint! #
# #
################################################################################
PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
PREFIX dwcFP: <http://filteredpush.org/ontologies/oa/dwcFP#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX treat: <http://plazi.org/vocab/treatment#>
SELECT DISTINCT ("--" AS ?simple) ?name ?authority (GROUP_CONCAT(?treatment; separator=",") AS ?treatments) WHERE {
?tc treat:hasTaxonName ?name .
?tc dwc:scientificNameAuthorship ?authority1 .
GRAPH ?treatment {
?tc dwc:scientificNameAuthorship ?authority .
}
FILTER(?authority1 != ?authority)
}
GROUP BY ?name ?authority
ORDER BY ?name
LIMIT 100
Running a similar query reveals that there are 26467 names with multiple authorities in the data, so fixing them all manually would be quite an effort.
For a given taxon name, the following query lists all treatments for it together with their authority and some useful metadata to help decide which one is correct:
################################################################################
# #
# Note: This query ONLY works with the treatment.ld.plazi.org sparql endpoint! #
# #
################################################################################
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX treat: <http://plazi.org/vocab/treatment#>
SELECT DISTINCT ?name ?authority ?year ?treatment ?authors ?title WHERE {
# Replace Name here as relevant
BIND(<http://taxon-name.plazi.org/id/Animalia/Laelaps_incrassatus> AS ?name)
?tc treat:hasTaxonName ?name .
GRAPH ?treatment {
?tc dwc:scientificNameAuthorship ?authority .
}
# Only rewrite the scheme at the start of the URI, not any later "https"
BIND(IRI(REPLACE(STR(?treatment), "^https:", "http:")) AS ?treatment_http)
?treatment_http dc:creator ?authors ;
dc:title ?title ;
treat:publishedIn/dc:date ?year .
}
ORDER BY ?year
For example, for Laelaps incrassatus, it gives
which to me indicates that the latter two treatments are probably wrong and should be fixed with
- Authority: Cope, 1876
- base-authority: None
A quick glance at the list provided by the first query above shows that most "disagreements" are `(Name, 1234)` vs `Name, 1234` (i.e. `Name` as baseAuthority vs as authority).
These cannot be fixed easily "after-the-fact" and require a human to check if it is supposed to be base- or non-base-authority.
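To at least separate these cases from the rest, the parentheses-only disagreements can be flagged automatically and queued for human review. The following is a minimal sketch, assuming the authority strings are available as plain Python strings; the function names are hypothetical and not part of gg2rdf or Synospecies.

```python
def strip_outer_parens(authority: str) -> str:
    """Return the authority without one enclosing pair of parentheses."""
    s = authority.strip()
    if s.startswith("(") and s.endswith(")"):
        return s[1:-1].strip()
    return s


def is_base_authority_conflict(a: str, b: str) -> bool:
    """True if a and b differ only in being wrapped in parentheses.

    Such pairs need a human to decide whether the name is a
    base authority or a plain authority.
    """
    a, b = a.strip(), b.strip()
    return a != b and strip_outer_parens(a) == strip_outer_parens(b)
```

For instance, `is_base_authority_conflict("(Cope, 1876)", "Cope, 1876")` is true, while two unrelated authorities are not flagged.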
However, I have found a handful of cases that could be "fixed" as follows:
- Fixable in GG2RDF with better normalization:
  - Differences in `&` vs `,` to separate names
  - Random special characters like double commas or stray quotation marks
  - First-name initials sometimes given (`A. Name` or `A. A. Name` should be normalized to `Name`)
  - `L.` → `Linnaeus`
  - Names starting with species or subspecies → gg2rdf should emit a warning and remove the Latin name
  - Names containing parentheses → gg2rdf should emit a warning and remove trailing parenthesized parts
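The purely string-level rules from this list could look roughly like the sketch below. It is written in Python for illustration only and is not gg2rdf's actual code; `normalize_authority` is a hypothetical name, and the output is meant as a comparison key rather than a display form. The two rules that should emit warnings (leading Latin names, trailing parenthesized parts) are deliberately left out.

```python
import re

# One or more first-name initials ("A. " or "A. A. ") directly before a name.
_INITIALS = re.compile(r"\b(?:[A-Z]\.\s*)+(?=[^\W\d_]{2})")
# A standalone "L." used as an author, i.e. followed by ",", ")" or the end.
_LINNAEUS = re.compile(r"\bL\.(?=\s*(?:[,)]|$))")


def normalize_authority(raw: str) -> str:
    """Return a normalized comparison key for an authority string."""
    s = raw.strip()
    s = re.sub(r'["“”„]', "", s)             # stray (double) quotation marks
    s = _INITIALS.sub("", s)                 # "A. A. Name" -> "Name"
    s = re.sub(r"\s*&\s*", ", ", s)          # "&" and "," both separate names
    s = re.sub(r"\s*(?:,\s*)+", ", ", s)     # double commas, uneven spacing
    s = _LINNAEUS.sub("Linnaeus", s)         # "L." -> "Linnaeus"
    s = re.sub(r"\s+", " ", s).strip(" ,")   # leftover whitespace and commas
    return s
```

With this, `Shelley & Hollis, 2003` and `Shelley, Hollis, 2003` map to the same key, so they no longer show up as a disagreement. Apostrophes are kept on purpose, since they occur in real names.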
In other cases, some variants are redundant shorter versions of others. These could be hidden in Synospecies and moved into a small (i)-popup with an "Authority also given as:" notice:
- e.g. "Shelley, McAllister & Hollis, 2003" and "Shelley & Hollis, 2003" -> hide second variant
- "Name et al" is redundant if longer authority "Name, ..." exists
- "Name" is redundant if longer authority "Name, 1234" exists
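The three redundancy rules above can be sketched as follows. This is only an illustration in Python under simplifying assumptions: authorities are split naively on `,` and `&`, a four-digit number is taken to be the year, and a parenthesized authority is never treated as redundant with an unparenthesized one. All names (`parse`, `is_redundant`, `split_redundant`) are hypothetical and not existing Synospecies code.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Parsed(NamedTuple):
    authors: Tuple[str, ...]
    et_al: bool
    year: Optional[str]
    in_parens: bool


def parse(authority: str) -> Parsed:
    """Split an authority string into authors, 'et al' flag and year."""
    year_match = re.search(r"\b\d{4}\b", authority)
    year = year_match.group(0) if year_match else None
    rest = re.sub(r"\b\d{4}\b", "", authority)
    et_al = bool(re.search(r"\bet\s+al\b\.?", rest))
    rest = re.sub(r"\bet\s+al\b\.?", "", rest)
    parts = (p.strip(" ()") for p in re.split(r"[,&]", rest))
    authors = tuple(p for p in parts if p)
    return Parsed(authors, et_al, year, authority.strip().startswith("("))


def _is_subsequence(short: Tuple[str, ...], long: Tuple[str, ...]) -> bool:
    it = iter(long)
    return all(name in it for name in short)


def is_redundant(short: str, long: str) -> bool:
    """True if `short` carries no information beyond what `long` has."""
    s, l = parse(short), parse(long)
    if s == l or s.in_parens != l.in_parens or not s.authors:
        return False
    if s.year is not None and s.year != l.year:
        return False
    n = len(s.authors)
    if s.et_al:
        # "Name et al" is covered by a longer list starting with "Name".
        return l.authors[:n] == s.authors and (l.et_al or len(l.authors) > n)
    return not l.et_al and _is_subsequence(s.authors, l.authors)


def split_redundant(authorities: List[str]) -> Tuple[List[str], List[str]]:
    """Return (visible, hidden) lists; hidden ones go into the popup."""
    visible, hidden = [], []
    for a in authorities:
        others = (o for o in authorities if o != a)
        (hidden if any(is_redundant(a, o) for o in others) else visible).append(a)
    return visible, hidden
```

Variants that differ only in separators parse to the same value and are deliberately not reported here, since those belong to the normalization step rather than to hiding.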