Option to filter on validation status
Opened this issue · 8 comments
Subsequent question: how to find/interpret this data in DwC.
(on the data publishing side, we should make sure all data providers use the same field and vocabulary. What about data providers that are not part of Riparias?)
Nico, indeed. We should provide guidelines to make this field more or less a controlled-vocabulary field. However, as you say, there are datasets with relevant observations published outside RIPARIAS. Of course, the user should be able to filter out these datasets. Still, I am afraid we will need a human-controlled mapping of this field's values for the whole project, and thus a decision about what we consider a validated observation and what we do not. You can definitely assign me to this task, if we agree to do so.
Hi, I still think providing a filter on this is a good idea. Could we explore the values across the different datasets a bit? Also, I guess many do not have that field filled in; in case they come from INBO, we could probably consider them validated and feed that field at dataset level? We know quite well how we mapped the validation status of wnm.be data (validated based on evidence, based on probability, ...), so that filter would already be useful, as this concerns one of the biggest and most regularly republished datasets, and the one most relevant to the alerts. Note that iNaturalist only pushes validated records to GBIF, so no problem there.
Hi everyone, I agree showing/filtering per validation status would be nice, and there's no technical issue at all with that. But we need a clear-cut rule so the system can decide whether a given observation is validated or not. The rule can be moderately complex; the important point is that it's unambiguous and that it provides decent results for all target occurrences (to avoid confusing the users: misleading information is probably worse than no information). Here is a first draft based on the discussion above; please improve it. Once we have a consensus, it can be implemented in the alert tool:
if dataset is "iNaturalist"
=> mark all occurrences as "validated"
else if data provider is "INBO"
=> mark all occurrences as "validated"
else if dwc:identificationVerificationStatus is "verified" or "1" # we need a consensus on a criterion like this; this is just an example
=> mark the occurrence as validated
else (by default)
=> mark the occurrence as "non-validated"
In other words, I'd be happy to implement something like that, but there are questions to be solved on the "data front" before it can be done. Tell me what you think!
else if data provider is "DEMNA"
=> mark all occurrences as "validated"
else if dwc:identificationVerificationStatus is "Approved by expert judgement" or "Approved by autovalidation" or "Approved on photographic evidence"
=> mark the occurrence as validated
if dwc:identificationVerificationStatus is "Unverified"
=> mark the occurrence as "non-validated"
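Merging the two rule drafts above, a minimal Python sketch of the decision could look like the following. The dataset, provider, and status strings are taken from this thread and still need to be agreed on; the function and variable names are hypothetical:

```python
# Status values that count as validated; assumptions from the discussion above.
VALIDATED_STATUSES = {
    "Approved by expert judgement",
    "Approved by autovalidation",
    "Approved on photographic evidence",
}

def is_validated(dataset_name, data_provider, verification_status):
    """Return True if an occurrence should be shown as 'validated'."""
    if dataset_name == "iNaturalist":
        return True  # iNaturalist only pushes validated records to GBIF
    if data_provider in ("INBO", "DEMNA"):
        return True  # trusted providers: consider validated at dataset level
    if verification_status in VALIDATED_STATUSES:
        return True
    # default: everything else, including "Unverified" and missing values
    return False
```

The order of the checks mirrors the if/else-if chain above, so the dataset-level rules take precedence over the per-occurrence field.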
Maybe we should explore a matrix of datasetName x identificationVerificationStatus, so we can see all current values (and NAs).
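Such a matrix could be explored with pandas, for example. The column names follow the DwC terms mentioned above; the tiny in-memory sample is a hypothetical stand-in for the real GBIF download:

```python
import pandas as pd

# Hypothetical stand-in for the GBIF occurrence download
df = pd.DataFrame({
    "datasetName": ["wnm.be", "wnm.be", "iNaturalist", "DEMNA"],
    "identificationVerificationStatus": [
        "approved on photographic evidence", None, "verified", "unverified",
    ],
})

# Cross-tabulate dataset against status; fillna keeps missing values visible
matrix = pd.crosstab(
    df["datasetName"],
    df["identificationVerificationStatus"].fillna("NA"),
)
print(matrix)
```

Each cell then holds the number of occurrences of that dataset/status combination, which should make stray or dataset-specific vocabulary easy to spot.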
@timadriaens, my 2 cents:
We typically have 3 types of validation for the DEMNA-based databases (on evidence, expertise, or reliability). I tried to get a homogeneous vocabulary among our GBIF datasets, but:
• For RIPARIAS we also include fresh records without validation. Otherwise the warning would not be early at all
• It does not concern data collected by other Walloon partners
• It is possible for me to adapt the vocabulary if we agree on controlled terms. That would maybe make the info uptake and future data sharing easier.
I extract here the most interesting part of the comment of @timadriaens in #270 (duplicate):
this should be done in such a way that it can be selected per dataset (i.e. user gets option to select from a list based on what is in that field for the datasets selected).
So, he proposes a dynamic filter which restricts the values of identificationVerificationStatus
based on the values present in the selected datasets.
@niconoe: I think this is quite a huge change in the filter mechanism.
My idea: keep it as simple as possible.
First of all: explore! We should first get an idea of what we are talking about. So, here you are: these are the values of identificationVerificationStatus
for the data shown, i.e. the data downloaded from GBIF with valid coordinates.
identificationVerificationStatus | n |
---|---|
approved on knowledge rules | 173075 |
NA | 109812 |
approved on photographic evidence | 56510 |
unverified | 34959 |
approved on expert judgement | 22100 |
validated with document | 6035 |
not validated | 4011 |
validated on the basis of rules | 1940 |
validated without document | 1151 |
verified by experts | 766 |
verified | 367 |
validated on the basis of a document | 213 |
validated without a document in support (expertise or additional informations) | 100 |
under validation | 50 |
validated on the basis of likelihood | 18 |
Accepted | 5 |
I have the feeling that we can group things quite easily:
identificationVerificationStatus | value to show in filter |
---|---|
starts with `approved`, `validated` or `accepted` | verified |
NA | not available (NA is also possible) |
`unverified`, `not validated` | unverified |
any other value (at the moment no other values present) | other |
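The grouping proposed above could be sketched in Python as follows; the case-insensitive matching and the use of None for NA are assumptions, and the function name is hypothetical:

```python
def filter_group(status):
    """Map a raw identificationVerificationStatus to the proposed filter group."""
    if status is None:
        return "not available"  # NA in the source data
    s = status.strip().lower()
    if s.startswith(("approved", "validated", "accepted")):
        return "verified"
    if s in ("unverified", "not validated"):
        return "unverified"
    # catch-all: also flags any new values appearing in future downloads
    return "other"
```

A non-empty "other" group after a new GBIF download would then signal that an unmapped status value has appeared.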
Notice that the more rules we add (e.g. see the comments starting from #43 (comment)), the more difficult they are to maintain in the long term. In particular, mapping the verification status based on the dataset an occurrence belongs to can be dangerous, as such data could change in the future.
My proposal is:
- easy to understand
- easy to document
- easy to detect if new values of `identificationVerificationStatus` are present: selecting `other`, we get more than zero occurrences back
- easy to expand: we could even avoid the grouping I proposed and show all 16 options (+ the `other` option for possible new values in the future) after all, but I prefer not to, as it could be overwhelming for the typical user
I agree with this of course