factsmission/synospecies

diagnostic tool to find out contradicting authorities

@nleanba can you please add the SPARQL query for finding conflicting authorities to the canned queries in "advanced"? See the one you did for T. rex.

I can easily put the query for Nanotyrannus into the advanced tab, but I'd prefer to make it a bit more generally useful first.

There is no query for all synonyms yet. For Tyrannosaurus, I just ran the query manually for each synonym and removed all entries without conflicts by hand.

Here is a more general query:

################################################################################
#                                                                              #
# Note: This query ONLY works with the treatment.ld.plazi.org sparql endpoint! #
#                                                                              #
################################################################################
PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
PREFIX dwcFP: <http://filteredpush.org/ontologies/oa/dwcFP#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX treat: <http://plazi.org/vocab/treatment#>
SELECT DISTINCT ("--" AS ?simple) ?name ?authority (GROUP_CONCAT(?treatment; separator=",") AS ?treatments) WHERE {
  ?tc treat:hasTaxonName ?name .
  ?tc dwc:scientificNameAuthorship ?authority1 .

  GRAPH ?treatment {
    ?tc dwc:scientificNameAuthorship ?authority .
  }
  FILTER(?authority1 != ?authority)
}
GROUP BY ?name ?authority
ORDER BY ?name
LIMIT 100

Running a similar query shows that there are 26467 names with multiple authorities in the data, so fixing them manually would be quite an effort.

For a given taxon name, the following query lists all treatments for it together with their authority and some useful metadata to help decide which one is correct:

################################################################################
#                                                                              #
# Note: This query ONLY works with the treatment.ld.plazi.org sparql endpoint! #
#                                                                              #
################################################################################
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX treat: <http://plazi.org/vocab/treatment#>
SELECT DISTINCT ?name ?authority ?year ?treatment ?authors ?title WHERE {

  # Replace Name here as relevant
  BIND(<http://taxon-name.plazi.org/id/Animalia/Laelaps_incrassatus> AS ?name)


  ?tc treat:hasTaxonName ?name .
  GRAPH ?treatment {
    ?tc dwc:scientificNameAuthorship ?authority .
  }
  
  BIND(IRI(REPLACE(STR(?treatment), "https", "http")) AS ?treatment_http)
  
  ?treatment_http dc:creator ?authors ;
             dc:title ?title ;
             treat:publishedIn/dc:date ?year .
}
ORDER BY ?year
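
If the name needs to be swapped often, the query could be generated instead of edited by hand. A minimal sketch in Python, assuming the query text above is used unchanged apart from the BIND line; `build_authority_query` and the IRI check are hypothetical helpers, not part of Synospecies:

```python
import re
from string import Template

# Query adapted from the comment above; only the taxon-name IRI is a parameter.
# Like the original, it is meant for the treatment.ld.plazi.org SPARQL endpoint.
QUERY_TEMPLATE = Template("""\
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
PREFIX treat: <http://plazi.org/vocab/treatment#>
SELECT DISTINCT ?name ?authority ?year ?treatment ?authors ?title WHERE {
  BIND(<$name_iri> AS ?name)
  ?tc treat:hasTaxonName ?name .
  GRAPH ?treatment {
    ?tc dwc:scientificNameAuthorship ?authority .
  }
  BIND(IRI(REPLACE(STR(?treatment), "https", "http")) AS ?treatment_http)
  ?treatment_http dc:creator ?authors ;
                  dc:title ?title ;
                  treat:publishedIn/dc:date ?year .
}
ORDER BY ?year
""")

# Characters that may not appear inside a SPARQL IRI reference (<...>).
FORBIDDEN_IN_IRI = re.compile(r'[<>"{}|^`\\\x00-\x20]')


def build_authority_query(name_iri: str) -> str:
    """Return the per-name query with the given taxon-name IRI bound to ?name."""
    if not name_iri or FORBIDDEN_IN_IRI.search(name_iri):
        raise ValueError(f"not usable as an IRI: {name_iri!r}")
    return QUERY_TEMPLATE.substitute(name_iri=name_iri)
```

The function takes the full taxon-name IRI rather than building it from a genus and species, so it does not have to guess how those IRIs are constructed.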

For example, for Laelaps incrassatus, it gives:
[image: query results for Laelaps incrassatus]
To me, this indicates that the latter two treatments are probably wrong and should be fixed with:

  • Authority: Cope, 1876
  • base-authority: None

A quick glance at the list provided by the first query above shows that most "disagreements" are (Name, 1234) vs Name, 1234 (i.e. Name as baseAuthority vs as authority).

These cannot be fixed easily "after-the-fact" and require a human to check if it is supposed to be base- or non-base-authority.
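
Such pairs can at least be flagged automatically so the human check starts from a shorter list. A rough sketch in Python, assuming the authority strings come back from the query literally as "(Name, 1234)" and "Name, 1234"; both function names are made up for illustration:

```python
def strip_enclosing_parentheses(authority: str) -> str:
    """Remove one pair of parentheses wrapping the whole string, if present."""
    s = authority.strip()
    if s.startswith("(") and s.endswith(")"):
        return s[1:-1].strip()
    return s


def differs_only_by_parentheses(a: str, b: str) -> bool:
    """True if a and b are different strings that match once parentheses are dropped."""
    a, b = a.strip(), b.strip()
    return a != b and strip_enclosing_parentheses(a) == strip_enclosing_parentheses(b)
```

This only identifies the pattern; it does not decide which of the two variants is correct.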

However, I have found a handful of cases that could be "fixed" as follows:

  • Fixable in GG2RDF with better normalization:
    • Differences in & vs , to separate Names
    • Random special characters like double commas or stray quotation marks
    • First-name initials are sometimes given (A. Name or A. A. Name should be normalized to Name)
    • L. vs Linnaeus
    • Names starting with the species or subspecies → gg2rdf should emit a warning and remove the Latin name
    • Names containing parentheses → gg2rdf should emit a warning and remove trailing parenthesized parts
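
To make the rules in this list concrete, here is a rough sketch in Python (gg2rdf itself is not written in Python, so this is only an illustration of the logic, not a patch). It assumes that "&" and "," are interchangeable as separators, that a lone "L." means Linnaeus, and that the caller passes in the species and subspecies epithets of the taxon; `normalize_authority` and its `epithets` parameter are hypothetical:

```python
import re
import warnings
from typing import Sequence


def normalize_authority(raw: str, epithets: Sequence[str] = ()) -> str:
    """Return a normalized form of an authority string for comparison."""
    # Stray double quotation marks (apostrophes are kept, e.g. d'Orbigny).
    s = re.sub(r'["“”]', "", raw).strip()

    # Leading Latin name parts: warn and drop them.
    tokens = s.split()
    while tokens and tokens[0] in epithets:
        warnings.warn(f"authority starts with Latin name: {raw!r}")
        tokens.pop(0)
    s = " ".join(tokens)

    # Trailing parenthesized parts: warn and drop them. A string that is
    # wrapped in parentheses as a whole is left alone.
    while True:
        m = re.fullmatch(r"(.*\S)\s*\([^()]*\)", s)
        if not m:
            break
        warnings.warn(f"authority contains parentheses: {raw!r}")
        s = m.group(1)

    # "&" and "," both separate names; splitting also drops double commas.
    parts = []
    for part in s.replace("&", ",").split(","):
        part = part.strip()
        if not part:
            continue
        if part == "L.":
            part = "Linnaeus"
        else:
            # Drop leading initials such as "A. " or "A. A. " or "J.-P. ".
            part = re.sub(r"^(?:[A-Z]\.[\s-]*)+(?=\S)", "", part)
        parts.append(part)
    return ", ".join(parts)
```

The result is meant as a comparison key: two authority strings that normalize to the same value would no longer count as a conflict. Whether the normalized string should also be what gets written to the RDF is a separate decision.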

In other cases, some variants are redundant shorter versions of others, so Synospecies could hide them (and put them into a small (i)-popup with an "Authority also given as:" notice):

  • e.g. "Shelley, McAllister & Hollis, 2003" and "Shelley & Hollis, 2003" -> hide second variant
  • "Name et al" is redundant if longer authority "Name, ..." exists
  • "Name" is redundant if longer authority "Name, 1234" exists
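
The three redundancy rules above could be checked roughly like this. A sketch in Python under a few assumptions: a variant only counts as redundant if its year, when given, matches the longer variant; names must appear in the same order; and "et al" requires the longer variant to list more names. All function names here are hypothetical, not existing Synospecies code:

```python
import re


def parse_authority(authority):
    """Split an authority string into (names, year, has_et_al)."""
    s = authority.replace("&", ",")
    has_et_al = bool(re.search(r"\bet al\.?", s))
    s = re.sub(r"\bet al\.?", "", s)
    parts = [p.strip() for p in s.split(",") if p.strip()]
    year = None
    if parts and re.fullmatch(r"\d{4}", parts[-1]):
        year = parts.pop()
    return parts, year, has_et_al


def is_subsequence(short, long):
    """True if all items of short occur in long, in the same order."""
    remaining = iter(long)
    return all(item in remaining for item in short)


def is_redundant(short, long):
    """True if `short` carries no information beyond what `long` already has."""
    s_names, s_year, s_et_al = parse_authority(short)
    l_names, l_year, _ = parse_authority(long)
    if not s_names:
        return False
    if s_year is not None and s_year != l_year:
        return False
    if s_et_al:
        # "Name et al" vs "Name, Other, ...": same start, more names listed.
        return len(l_names) > len(s_names) and l_names[: len(s_names)] == s_names
    if not is_subsequence(s_names, l_names):
        return False
    return len(l_names) > len(s_names) or (s_year is None and l_year is not None)


def split_visible_hidden(authorities):
    """Return (visible, hidden); hidden ones would go into the (i)-popup."""
    hidden = [a for a in authorities
              if any(is_redundant(a, b) for b in authorities if b != a)]
    visible = [a for a in authorities if a not in hidden]
    return visible, hidden
```

For the example above, `split_visible_hidden` would keep "Shelley, McAllister & Hollis, 2003" visible and move "Shelley & Hollis, 2003" to the hidden list.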