inbo/reporting-rshiny-grofwildjacht

[BUG] Male records are retained when filtering `type_comp == 'Onbekend'` in countEmbryos()

mvarewyck opened this issue · 2 comments

Describe the bug
When countEmbryos() filters the data, it retains some records that have type_comp == 'Onbekend' and geslacht_comp == "Mannelijk". The issue only occurs for records with unknown type; for all other types we filter on the female types.

To Reproduce

    library(reportingGrofwild)
    ecoData <- loadRawData(type = "eco")
    
    plotData <- countEmbryos(data = ecoData[ecoData$wildsoort == "Damhert", ], type = "Onbekend")$data
    sum(plotData$Freq)
    # [1] 19

    filterData <- ecoData[ecoData$type_comp == "Onbekend" & ecoData$wildsoort == "Damhert", ]
    nrow(filterData)
    # [1] 19

    table(filterData$geslacht_comp)
    # 
    # Vrouwelijk  Mannelijk   Onbekend 
    #          8          1         10 

Expected behavior
Exclude the male records within countEmbryos()

Git SHA (after 0.3.1)
#7568c97e249da29bc34f3581c2c549d45a14777f

@SanderDevisscher How do we exclude the males? The question is mostly about records which have unknown type_comp. For the other types we automatically select the females.

(1) Retain records with geslacht_comp != "Mannelijk" -> records with unknown gender that are actually males can still be retained, so we might have too many records with type_comp "Onbekend" in the countEmbryos plot.

    ecoData <- loadRawData(type = "eco")
    allSpecies <- unique(ecoData$wildsoort)

    sapply(allSpecies, function(iSpecies) {

      filterData <- ecoData[ecoData$type_comp == "Onbekend" & 
          ecoData$wildsoort == iSpecies & 
          ecoData$geslacht_comp != "Mannelijk", ]
      table(filterData$geslacht_comp)

    })
    #            Wild zwijn Edelhert Damhert Ree
    # Vrouwelijk         28        0       8 133
    # Mannelijk           0        0       0   0
    # Onbekend          615        1      10 873

(2) Retain records with geslacht_comp == "Vrouwelijk" -> we exclude far too many records, because many records with unknown gender still have a known type.

    table(droplevels(ecoData$type_comp[ecoData$geslacht_comp == "Onbekend"]))
    #     Smalree Jaarlingbok     Reegeit      Reebok    Onbekend 
    #          10           6         166          58        1499 

(3) Exclude records with geslacht_comp == "Mannelijk" OR (geslacht_comp == "Onbekend" & type_comp == "Onbekend"). We might then exclude some female records, so there could be too few records with type_comp "Onbekend" in the countEmbryos plot.
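
A minimal sketch of what the filter for option (3) could look like, run on a small made-up data frame rather than on the real ecoData. Only the column names and level values are taken from this issue; the toy rows and the helper names isMale and isFullyUnknown are hypothetical, and this is not necessarily how countEmbryos() would implement it.

```r
# Toy data: column names and values follow the issue, rows are invented
toy <- data.frame(
  type_comp     = c("Onbekend", "Onbekend", "Onbekend", "Reegeit", "Reegeit", "Reebok"),
  geslacht_comp = c("Vrouwelijk", "Mannelijk", "Onbekend", "Onbekend", "Vrouwelijk", "Mannelijk"),
  stringsAsFactors = FALSE
)

# %in% returns FALSE for NA, so missing values never produce NA row indices
isMale <- toy$geslacht_comp %in% "Mannelijk"
isFullyUnknown <- toy$geslacht_comp %in% "Onbekend" & toy$type_comp %in% "Onbekend"

# Option (3): drop explicit males and records with neither sex nor type
kept <- toy[!(isMale | isFullyUnknown), ]
table(kept$type_comp, kept$geslacht_comp)
```

Records with unknown sex but a known type (here the "Reegeit" row) are kept, which is the difference with option (2).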

So I think the decision is between (1) and (3), depending on whether you want to retain or exclude the records for which you know neither gender nor type. Or am I missing something?

I would go for the 3rd option. Explicit male individuals and fully unknown records (no sex and no type) should be excluded.

Option 2 indicates we need to add some logic to check whether these records are in fact correct and, if so, reverse engineer the sex based on the type in the Backoffice.
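
A rough sketch of what such reverse engineering could look like, on made-up rows. The lookup sexByType and the column geslacht_fixed are hypothetical, and the type-to-sex mapping is an assumption that would need to be verified before anything like this goes into the Backoffice.

```r
# Hypothetical lookup: the sex each type is assumed to imply (to be verified)
sexByType <- c(Reegeit = "Vrouwelijk", Smalree = "Vrouwelijk",
               Reebok = "Mannelijk", Jaarlingbok = "Mannelijk")

# Toy data: column names follow the issue, rows are invented
toy <- data.frame(
  type_comp     = c("Reegeit", "Reebok", "Onbekend", "Smalree"),
  geslacht_comp = c("Onbekend", "Onbekend", "Onbekend", "Vrouwelijk"),
  stringsAsFactors = FALSE
)

# Look up the implied sex; types missing from the lookup give NA
inferred <- unname(sexByType[toy$type_comp])

# Only fill in the sex where it is unknown and the type implies one
fill <- toy$geslacht_comp == "Onbekend" & !is.na(inferred)
toy$geslacht_fixed <- ifelse(fill, inferred, toy$geslacht_comp)
toy
```

Records with both sex and type unknown stay "Onbekend" and would still be excluded under option 3.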