Feature Request: single_value()
billdenney opened this issue · 6 comments
This is a function that would consider some values to be missing, but for all non-missing values, it would ensure that they have the same value.
I often work with datasets where I need to combine information for subjects in clinical trials. For that, I need to ensure that I have the same information from each of the different sources. For example, I may have multiple sources for the age of a subject when they start the study.
When I combine those data sets, I need the age to be the same across all of them. A paradigm I often use is below. Would that be of interest?
library(tidyverse)
library(bsd.report)
my_data_good <-
  tibble(
    Subject=rep(1:2, each=2),
    Age=c(1, NA, 2, NA)
  ) %>%
  group_by(Subject) %>%
  mutate(
    Age=single_value(Age)
  )
my_data_good
#> # A tibble: 4 x 2
#> # Groups:   Subject [2]
#>   Subject   Age
#>     <int> <dbl>
#> 1       1     1
#> 2       1     1
#> 3       2     2
#> 4       2     2
my_data_bad <-
  tibble(
    Subject=rep(1:2, each=2),
    Age=c(1, NA, 2, 3)
  ) %>%
  group_by(Subject) %>%
  mutate(
    Age=single_value(Age)
  )
#> Error: Problem with `mutate()` input `Age`.
#> x More than one (2) value found (2, 3)
#> i Input `Age` is `single_value(Age)`.
#> i The error occurred in group 2: Subject = 2.
Created on 2021-02-04 by the reprex package (v1.0.0)
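For readers without bsd.report installed, here is a minimal base-R sketch of the behavior described above. It is only an illustration: the name single_value_sketch, the missing argument, and the handling of the all-missing case are assumptions, not the actual bsd.report code.

```r
# Hypothetical sketch; not the bsd.report implementation.
single_value_sketch <- function(x, missing = NA) {
  # Distinct values after dropping anything treated as missing
  non_missing <- unique(x[!(x %in% missing)])
  if (length(non_missing) > 1) {
    stop(
      "More than one (", length(non_missing), ") value found (",
      paste(non_missing, collapse = ", "), ")"
    )
  }
  if (length(non_missing) == 0) {
    # Assumption: with nothing to fill from, return all NA of the same type
    return(rep(x[NA_integer_], length(x)))
  }
  # Exactly one distinct value: repeat it across the whole vector
  rep(non_missing, length(x))
}

# Apply per subject with base R's ave()
good <- data.frame(Subject = rep(1:2, each = 2), Age = c(1, NA, 2, NA))
good$Age <- ave(good$Age, good$Subject, FUN = single_value_sketch)
```

Because it takes and returns a plain vector, the same function could also be called inside group_by() and mutate() as in the reprex.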
@sfirke, if I make a PR for this, do you think it would be of interest? (And no worries if you think it's out of scope.)
I like this! It's in scope IMO. It's related to #18. There, I wanted a function for finding records like the one in my_data_bad above. I think we can address the situation more broadly. In my issue above I wanted a diagnostic function; in your example the function behaves like tidyr::fill except it also includes a check against more than one distinct value, which is kind of diagnostic.
Do you have thoughts on whether it's doable, and on the most elegant way to offer both the diagnostic functionality and the convenience wrapper for fill? One idea, not sure if this is the best: have the function succeed with the fill if there are no invalid combinations, and if there are invalid combos then it would error and (??) return the bad records in a data.frame. It feels kinda clunky to squish that into one function, but maybe there's a way to both error and return the bad records? Or have the user specify?
Or maybe it should be two functions and you run the diagnostic one first, then the one you have above. That's probably more tidy-API style.
It would be nice if the diagnostic function could easily be used in an assertr call so that folks can throw a check in a pipeline to be sure there aren't multiple values lurking.
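As a rough illustration of the diagnostic half of the two-function idea, the sketch below returns the rows of every group that has more than one distinct non-missing value. The name find_multi_value_groups and its arguments are hypothetical; nothing here is an existing janitor or assertr function.

```r
# Hypothetical diagnostic helper; not an existing janitor function.
find_multi_value_groups <- function(data, group_col, value_col) {
  # Count distinct non-missing values within each group
  n_distinct_non_missing <- tapply(
    data[[value_col]], data[[group_col]],
    function(v) length(unique(v[!is.na(v)]))
  )
  # Groups that violate the "single value" expectation
  bad_groups <- names(n_distinct_non_missing)[n_distinct_non_missing > 1]
  # Return every row belonging to an offending group
  data[as.character(data[[group_col]]) %in% bad_groups, , drop = FALSE]
}

my_data_bad <- data.frame(Subject = rep(1:2, each = 2), Age = c(1, NA, 2, 3))
bad_records <- find_multi_value_groups(my_data_bad, "Subject", "Age")
```

A result with zero rows means no conflicts, so a pipeline check could be as simple as stopifnot(nrow(bad_records) == 0).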
Hmm. I don't tend to use fill() because I don't often need LOCF-style (last observation carried forward) imputation. But maybe the right solution is to suggest a new .direction argument for fill() of "single". Let's hold off here since that seems like an overall-better solution. If they don't like it for tidyr, then let's revisit it here.
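For context on why a new direction would be needed, the sketch below runs tidyr's existing fill() with .direction = "downup" on data shaped like my_data_bad. It assumes dplyr and tidyr are installed; the proposed "single" value is not shown because it is only a suggestion in this thread.

```r
library(dplyr)
library(tidyr)

my_data_bad <- tibble(
  Subject = rep(1:2, each = 2),
  Age = c(1, NA, 2, 3)
)

# fill() replaces the NA for Subject 1 but does not check that
# Subject 2 has a single distinct value, so 2 and 3 both survive.
filled <- my_data_bad %>%
  group_by(Subject) %>%
  fill(Age, .direction = "downup") %>%
  ungroup()
```

The conflicting ages for Subject 2 pass through silently, which is the gap the check in single_value() is meant to close.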
If included here in janitor, I think that the diagnostic and fill functions would be separate. I wouldn't want code to accidentally expect the fill result and get the diagnostic result. For my work, the diagnosis is the error in the example above.
We got an answer back that it's not a good fit for tidyr. I'll work up a PR.
I agree that the error thrown by single_value serves the diagnostic function. It sort of has assertr-type functionality built in: you can call single_value without expecting it to do anything but serve as a check against mismatches.