Option to filter on validation status
Opened this issue · 8 comments
Subsequent question: how to find/interpret this data in DwC.
(on the data publishing side, we should make sure all data providers use the same field and vocabulary. What about data providers that are not part of Riparias?)
Nico, indeed. We should provide guidelines to make this field more or less a controlled-vocabulary field. However, as you say, there are datasets with relevant observations published outside RIPARIAS. Of course, the user should be able to filter out these datasets. Still, I am afraid we will need a human-controlled mapping of this field's values for the whole project, and thus a decision about what we consider a validated observation and what we do not. You can definitely assign me to this task, if we agree to do so.
Hi, I still think providing a filter on this is a good idea. Could we explore the values across the different datasets a bit? Also, I guess many do not have that field filled in; in case they come from INBO, we could probably consider them validated and feed that field at dataset level? We know quite well how we mapped the validation status of wnm.be data (validated based on evidence, based on probability, ...), so that filter would already be useful, as this concerns one of the biggest and most regularly republished datasets, and the one most relevant to the alerts. Note that iNaturalist only pushes validated records to GBIF, so no problem there.
Hi everyone, I agree showing/filtering per validation status would be nice, and there's no technical issue at all with that. But we need a clear-cut rule so the system can decide whether a given observation is validated or not. The rule can be moderately complex; the important point is that it's unambiguous and that it provides decent results for all target occurrences (to avoid confusing the users: misleading information is probably worse than no information). Here is a first draft based on the discussion above; please improve it. Once we have a consensus, it can be implemented in the alert tool:
if dataset is "iNaturalist"
=> mark all occurrences as "validated"
else if data provider is "INBO"
=> mark all occurrences as "validated"
else if dwc:identificationVerificationStatus is "verified" or "1" # we need a consensus on a criterion like this; this is just an example
=> mark the occurrence as validated
else (by default)
=> mark the occurrence as "non-validated"
In other words, I'd be happy to implement something like that, but there are questions to be solved on the "data front" before it can be done. Tell me what you think!
else if data provider is "DEMNA"
=> mark all occurrences as "validated"
else if dwc:identificationVerificationStatus is "Approved by expert judgement" or "Approved by autovalidation" or "Approved on photographic evidence"
=> mark the occurrence as validated
if dwc:identificationVerificationStatus is "Unverified"
=> mark the occurrence as "non-validated"
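Merging the two rule drafts above, a minimal Python sketch of the decision could look like the following. The dataset, provider, and status strings are taken from this thread and still need to be agreed on; the function and variable names are hypothetical:

```python
# Status values that count as validated; assumptions from the discussion above.
VALIDATED_STATUSES = {
    "Approved by expert judgement",
    "Approved by autovalidation",
    "Approved on photographic evidence",
}

def is_validated(dataset_name, data_provider, verification_status):
    """Return True if an occurrence should be shown as 'validated'."""
    if dataset_name == "iNaturalist":
        return True  # iNaturalist only pushes validated records to GBIF
    if data_provider in ("INBO", "DEMNA"):
        return True  # trusted providers: consider validated at dataset level
    if verification_status in VALIDATED_STATUSES:
        return True
    # default: everything else, including "Unverified" and missing values
    return False
```

The order of the checks mirrors the if/else-if chain above, so the dataset-level rules take precedence over the per-occurrence field.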
Maybe we should explore a matrix of datasetName x identificationVerificationStatus, so we can see all current values (and NAs).
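Such a matrix could be explored with pandas, for example. The column names follow the DwC terms mentioned above; the tiny in-memory sample is a hypothetical stand-in for the real GBIF download:

```python
import pandas as pd

# Hypothetical stand-in for the GBIF occurrence download
df = pd.DataFrame({
    "datasetName": ["wnm.be", "wnm.be", "iNaturalist", "DEMNA"],
    "identificationVerificationStatus": [
        "approved on photographic evidence", None, "verified", "unverified",
    ],
})

# Cross-tabulate dataset against status; fillna keeps missing values visible
matrix = pd.crosstab(
    df["datasetName"],
    df["identificationVerificationStatus"].fillna("NA"),
)
print(matrix)
```

Each cell then holds the number of occurrences of that dataset/status combination, which should make stray or dataset-specific vocabulary easy to spot.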
@timadriaens, my 2 cents:
We typically have 3 types of validation for the DEMNA-based databases (on evidence, expertise, or reliability). I tried to get a homogeneous vocabulary among our GBIF datasets, but:
• For RIPARIAS we also include fresh records without validation. Otherwise the warning would not be early at all
• It does not concern data collected by other Walloon partners
• It is possible for me to adapt the vocabulary if we agree on controlled terms. That would maybe make the info uptake and future data sharing easier.
I extract here the most interesting part of the comment of @timadriaens in #270 (duplicate):
this should be done in such a way that it can be selected per dataset (i.e. user gets option to select from a list based on what is in that field for the datasets selected).
So, he proposes a dynamic filter which restricts the values of identificationVerificationStatus
based on the values present in the selected datasets.
@niconoe: I think this is quite a huge change in the filter mechanism.
My idea: keep it as simple as possible.
First of all: explore! We should first get an idea of what we are talking about. So, here you are: these are the values of identificationVerificationStatus
for the data shown, i.e. the data downloaded from GBIF with valid coordinates.
identificationVerificationStatus | n |
---|---|
approved on knowledge rules | 173075 |
NA | 109812 |
approved on photographic evidence | 56510 |
unverified | 34959 |
approved on expert judgement | 22100 |
validated with document | 6035 |
not validated | 4011 |
validated on the basis of rules | 1940 |
validated without document | 1151 |
verified by experts | 766 |
verified | 367 |
validated on the basis of a document | 213 |
validated without a document in support (expertise or additional informations) | 100 |
under validation | 50 |
validated on the basis of likelihood | 18 |
Accepted | 5 |
I have the feeling that we can group things quite easily:
identificationVerificationStatus | value to show in filter |
---|---|
starts with `approved`, `validated` or `accepted` | verified |
NA | not available (NA is also possible) |
`unverified`, `not validated` | unverified |
any other value (at the moment no other values present) | other |
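The grouping proposed above could be sketched in Python as follows; the case-insensitive matching and the use of None for NA are assumptions, and the function name is hypothetical:

```python
def filter_group(status):
    """Map a raw identificationVerificationStatus to the proposed filter group."""
    if status is None:
        return "not available"  # NA in the source data
    s = status.strip().lower()
    if s.startswith(("approved", "validated", "accepted")):
        return "verified"
    if s in ("unverified", "not validated"):
        return "unverified"
    # catch-all: also flags any new values appearing in future downloads
    return "other"
```

A non-empty "other" group after a new GBIF download would then signal that an unmapped status value has appeared.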
Notice that the more rules we add (e.g. see the comments starting from #43 (comment)), the more difficult they are to maintain in the long term. In particular, mapping the verification status based on the dataset an occurrence belongs to can be dangerous, as such data could change in the future.
My proposal is:
- easy to understand
- easy to document
- easy to detect if new values of `identificationVerificationStatus` are present: selecting `other`, we get more than zero occurrences back
- easy to expand: we could even avoid the grouping I proposed and show all 16 options (+ the `other` option for possible new values in the future) after all, but I prefer not to, as it could be overwhelming for the typical user
I agree with this of course