Predicate to point to taxon-names from a Dataset

Question

Opened this issue 2 months ago · 7 comments

Datasets in our information system hold information about the associated taxon-names (for example this dataset - where taxa are listed under taxterms).
We would like to point to these taxon-names via a predicate (distinct from predicates such as https://schema.org/keywords).

Is it possible to describe such a predicate within the Aphia ontologies?

Answer 1 · 2024-10-28T09:34:06.000Z

Hi @laurianvm, it is not very clear to me what you want to do. DCAT's Dataset class usually does not go into the content of a dataset. So what would you like to do here? Say that this dataset includes some taxon-names? Why? BTW: let's use the DCAT keyword property first, and also schema.org...

Answer 2 · 2024-10-28T09:52:39.000Z

We would indeed like to be able to say that a dataset includes some taxon-names
and by extension also information on the taxonomic rank of those taxon-names (e.g. family, order, genus, ...)

we are currently doing this via schema:keywords
but we would like to separate the taxon information from other keywords (to be able to construct more complex SPARQL queries, where general keywords can be specified in combination with certain taxon-names), hence we thought a separate predicate could be used
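A minimal sketch of the kind of combined lookup this separation would allow. It uses a toy in-memory triple set instead of a real triple store and SPARQL, and the predicate ex:taxonName, the dataset identifiers and the keyword values are all hypothetical placeholders, not part of any existing ontology.

```python
# Toy triple store: (subject, predicate, object).
# "ex:taxonName" is a HYPOTHETICAL predicate; the datasets and values are made up.
TRIPLES = {
    ("ex:dataset1", "schema:keywords", "benthos"),
    ("ex:dataset1", "ex:taxonName", "Polychaeta"),
    ("ex:dataset2", "schema:keywords", "benthos"),
    ("ex:dataset2", "schema:keywords", "Polychaeta"),  # taxon mixed in with general keywords
    ("ex:dataset3", "schema:keywords", "plankton"),
    ("ex:dataset3", "ex:taxonName", "Copepoda"),
}


def subjects_with(predicate, obj, triples=TRIPLES):
    """Return all subjects that have the given predicate/object pair."""
    return {s for (s, p, o) in triples if p == predicate and o == obj}


def find_datasets(keyword, taxon_name, triples=TRIPLES):
    """Datasets carrying the general keyword AND the taxon name on its own predicate."""
    by_keyword = subjects_with("schema:keywords", keyword, triples)
    by_taxon = subjects_with("ex:taxonName", taxon_name, triples)
    return by_keyword & by_taxon


print(sorted(find_datasets("benthos", "Polychaeta")))  # ['ex:dataset1']
```

Here ex:dataset2 is not returned, because its taxon name is only present as a general keyword and so cannot be told apart from other keywords.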

Answer 3 · 2024-10-28T16:03:38.000Z

I really do not like mixing general descriptive metadata (such as DCAT, which is a meta-layer whose purpose is not to go into the details of the dataset) with the content of the dataset. BTW: are you planning to produce one dataset or many datasets? Because I do not see the sense in listing all possible scientific names and saying that the dataset includes all of them. It makes sense to say that the dataset includes taxon names in general, meaning all of those that can be found in WoRMS. Having said that, I will try to understand whether there is an elegant semantic solution for that.

Answer 4 · 2024-10-29T12:00:04.000Z

@giorgialodi I think the ambition here is to leverage "making the dataset discoverable based on hints of its content" in a day and age where that full content is not yet available in the same semantic fashion and space (i.e. when we would have a fully connected knowledge graph spanning all triples from metadata down to data --> actually I think we are all hoping for a future where this 'mix' of levels really becomes linked and queryable)

Since that vision remains to be hoped for, we are all on a road of chipping away, step by step, at bridging the 'dataset-to-data-content-gap', aiming to make discovery of datasets more and more feasible based on actual relevant data found elsewhere(*). And, in fact, I would argue that this is totally in line with other data-reduction tricks happening already:

- the full range of date-time-stamps in a dataset --> (gets reduced to) --> min-max range published through dct:temporal
- the full range of geospatial data in the set --> (gets reduced to) --> some bounding-box or geoname published through dct:spatial
- (with some stretch) the full textual or domain understanding embedded in the set --> (gets reduced to) --> keywords, themes, ...

so ... we are out to suggest a similar approach for referencing relevant bio-taxnames (on the level of the dataset):

- the detailed species available in a dataset --> (gets reduced to) --> a limited number of higher-level taxa (like order or phylum), though for very focused datasets it could be at quite a detailed level too

does that make sense?
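To make the proposed reduction concrete, here is a rough sketch that rolls a list of species names up to a chosen higher rank. The LINEAGE table is a tiny hand-written illustration (in practice it would presumably come from a taxonomic register such as WoRMS), and the function names are made up for this example.

```python
# Illustrative lineage table: taxon name -> (rank, parent name).
# Hand-written for this sketch; a real system would look this up in a taxonomic register.
LINEAGE = {
    "Mytilus edulis": ("species", "Mytilus"),
    "Mytilus galloprovincialis": ("species", "Mytilus"),
    "Mytilus": ("genus", "Mytilidae"),
    "Mytilidae": ("family", None),
    "Ostrea edulis": ("species", "Ostrea"),
    "Ostrea": ("genus", "Ostreidae"),
    "Ostreidae": ("family", None),
}


def roll_up(name, target_rank, lineage=LINEAGE):
    """Walk up the parent chain until a taxon of target_rank is found (or None)."""
    current = name
    while current is not None:
        rank, parent = lineage[current]
        if rank == target_rank:
            return current
        current = parent
    return None


def summarise(names, target_rank, lineage=LINEAGE):
    """Reduce a list of taxon names to the distinct taxa at target_rank."""
    rolled = (roll_up(n, target_rank, lineage) for n in names)
    return sorted({t for t in rolled if t is not None})


species = ["Mytilus edulis", "Mytilus galloprovincialis", "Ostrea edulis"]
print(summarise(species, "family"))  # ['Mytilidae', 'Ostreidae']
```

Choosing the target rank is the judgment call: a broad dataset might be summarised at order or phylum, a very focused one at genus.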

(*) Totally on the side:
Of course we should be sensitive to the limitations of these techniques. Finding matching values based on reduced ranges of their presence is a challenge in combination with the nature of triple-stores (which typically assume exact matches). The point is that the data-reductions make for very bad links, so they depend on extra techniques (e.g. smart indexes and SPARQL query extensions) to actually allow finding relevant datasets that match whatever lookup value you found elsewhere.

Examples:

- for any date to fall in the min-max range (temporal coverage), one will need to tune the SPARQL query to compare the start-end dates
- for any geo-point to fall in a bounding box or region, one will need to rely on geospatial indexing + special search statements
- for any free text to match keywords, one typically relies on some full-text index and associated close-match statements in the query
- similarly, for finding a match between two species (the one being a sub-taxon of the other), special attention will be needed to smartly navigate the relations between them
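The non-exact matching in these examples can be sketched as plain comparison functions. This is only an illustration of the logic, not of how a triple store or SPARQL extension would implement it; the PARENT table and all function names are assumptions made for this sketch, and the bounding-box check deliberately ignores boxes that cross the antimeridian.

```python
from datetime import date


def in_temporal_coverage(day, start, end):
    """True if a single date falls inside a dataset's min-max temporal range."""
    return start <= day <= end


def in_bounding_box(lon, lat, west, south, east, north):
    """True if a point falls inside a simple bounding box (no antimeridian handling)."""
    return west <= lon <= east and south <= lat <= north


# Illustrative child -> parent relations between taxon names (hand-written).
PARENT = {"Mytilus edulis": "Mytilus", "Mytilus": "Mytilidae", "Mytilidae": None}


def ancestors(name, parent=PARENT):
    """List all higher-level taxa above the given name."""
    chain = []
    current = parent.get(name)
    while current is not None:
        chain.append(current)
        current = parent.get(current)
    return chain


def taxon_matches(searched, declared, parent=PARENT):
    """True if the searched taxon equals the declared one or sits below it."""
    return searched == declared or declared in ancestors(searched, parent)


print(in_temporal_coverage(date(2020, 6, 1), date(2020, 1, 1), date(2020, 12, 31)))  # True
print(taxon_matches("Mytilus edulis", "Mytilidae"))  # True
print(taxon_matches("Mytilidae", "Mytilus edulis"))  # False
```

The last call shows the asymmetry: a search on a higher-level taxon does not match a dataset that declared only a more specific one, unless the lookup also walks down the hierarchy.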

Answer 5 · 2024-11-11T13:58:05.000Z

I am thinking about it. I understand your point and tend to agree from some perspectives, but there is something that does not fully convince me. For example, if you have to include just a simplification, it does not help that much from a metadata perspective. Then one can also ask you why only TaxonName and not others, Vernacular for example (unless you want to create a separate dataset for that, which is feasible).
This is why I am not entirely convinced.

Anyway a possible solution:
One way could be to use a property we already have in the top level thanks to the use of foundational ontologies (see, they are helpful!), which is dul:isMemberOf (inverse of dul:hasMember). In this case, we can select some TaxonName or potentially all of them and say that a TaxonName isMemberOf a dcat:Dataset. This implies that a reasoning engine would infer that the dcat:Dataset in question is also a dul:Collection (because dul:isMemberOf is a property with domain dul:Entity and range dul:Collection).
Checking DCAT, I do not see specific alignments to DUL (there are other W3C ontologies aligned with it, but not DCAT), but the definition of Dataset says "A collection of data". So I guess there are no issues with the soundness of the semantics in doing that :)

Did you follow me?
BTW: you can do it in the data without any changes in the ontology of taxon-name, because we do not need restrictions. You can use dul:isMemberOf directly in the data, from taxon-name:TaxonName to dcat:Dataset
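A small sketch of what that inference amounts to, using plain tuples rather than an actual reasoner. The domain and range statements are the ones given above for dul:isMemberOf; the instance identifiers ex:taxon1 and ex:dataset1 are hypothetical, and only the rdfs:domain/rdfs:range rules are imitated here.

```python
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"
RDFS_RANGE = "rdfs:range"

# Schema statements for dul:isMemberOf as described in the thread.
SCHEMA = {
    ("dul:isMemberOf", RDFS_DOMAIN, "dul:Entity"),
    ("dul:isMemberOf", RDFS_RANGE, "dul:Collection"),
}

# Instance data; ex:taxon1 and ex:dataset1 are made-up identifiers.
DATA = {
    ("ex:taxon1", RDF_TYPE, "taxon-name:TaxonName"),
    ("ex:dataset1", RDF_TYPE, "dcat:Dataset"),
    ("ex:taxon1", "dul:isMemberOf", "ex:dataset1"),
}


def infer_types(data, schema):
    """Apply the rdfs:domain and rdfs:range rules once; return only new triples."""
    inferred = set()
    for s, p, o in data:
        for prop, kind, cls in schema:
            if prop != p:
                continue
            if kind == RDFS_DOMAIN:
                inferred.add((s, RDF_TYPE, cls))  # subject gets the domain class
            elif kind == RDFS_RANGE:
                inferred.add((o, RDF_TYPE, cls))  # object gets the range class
    return inferred - data


for triple in sorted(infer_types(DATA, SCHEMA)):
    print(triple)
```

Note that the inferred type lands on the individual dataset (ex:dataset1), which ends up typed as both dcat:Dataset and dul:Collection; nothing is asserted about the dcat:Dataset class as a whole.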

In any case, there was a case in Italy of a public administration, followed by a colleague of mine, that wanted something similar. I will pick up that Italian work again and get back to you on this, because they acted at the level of the domain ontology (not at the level of DCAT, obviously).

Answer 6 · 2024-12-02T08:46:03.000Z

Hi
Just to add to Marc's reasoning: we are talking here about metadata that describe datasets in a catalogue, where one wants to be able to search that catalogue by taxonomic classification. Just as people describe the measurements in the datasets, they need to be able to describe the biology in the dataset. And you are right -- if there are 500 species names in the dataset, you are never going to list them all in the catalogue metadata. The way to deal with that is not to exclude them totally, but to have an intelligent way to group the species by genus (or family), so cutting down 500 to 30 -- we follow this approach with our catalogue metadata, which are exported in EML, which has a specific category for taxonomic coverage.

Answer 7 · 2024-12-06T14:17:35.000Z

@giorgialodi, sorry for returning late to this.

Your point about vernaculars is indeed relevant --> again, like how keywords are used, if we want our dataset to pop up in these tax-based catalogue searches when people are looking for a popular, relevant vernacular, we would like to be able to add that too --> so we are looking for some predicate that would allow precisely that kind of use case
The proposed suggestion with dul:isMemberOf could work, but to me it feels a bit unnatural and not entirely fitting the bill?
- "unnatural" in that it takes the opposite direction (from taxname to dataset) -- i.e. as if one would not say "these are the keywords describing this dataset" but rather opt to "list words that are key to find that dataset"
- "not fitting the bill" in that this kind of implies membership, grouping, composition, all notions that suggest stronger relations than what is really going on here: a simple association. We are looking more for "chosen reducing descriptions" rather than "factual identifiable members" (would you say a keyword or geospatial bounds are "members"?) -- but this is just my feeling

We simply want, at a high level, to associate a given dataset "DSx" with some taxnames "TN1, TN2, .." (and yes, some could be vernaculars) -- The envisioned use case is to give people looking for data on any of the TNx a direct link to DSx -- this as a shortcut to actually going through all content of all datasets (so limited, like keywords) and thus also as a first-order, best-effort attempt, i.e. we accept in this approach false positives (you received a hit on DSx but then get disappointed because there is very little relevant TNx data in there for you) and false negatives (you missed out on hitting this crucial DSx that is relevant to you because you searched on another taxname level than the one we declared)

Maybe we need to invent something like "key-tax-names" ?