`detect_missing` does not work with integer columns
chengchou opened this issue · 4 comments
I was trying to use detect_missing
to clean the missing data in a dataset, of which a few columns are integers. detect_missing
cannot correctly label the missing values.
Consider the following dataset rd1
:
rd1 <- tibble(
x1 = haven::labelled(x = c(32L, 996L, 40L),
labels = c("Refused to answer" = 996), label = "x1 variable (integer)"),
x2 = haven::labelled(x = c(32, 996, 40),
labels = c("Refused to answer" = 996), label = "x1 variable (double)")
)
# Here is the output of `rd1`:
# A tibble: 3 x 2
# x1 x2
# <int+lbl> <dbl+lbl>
# 32 32
# 996 [Refused to answer] 996 [Refused to answer]
# 40 40
The only difference between x1
and x2
is that x1
has only integers. Applying detect_missing
will only affect x2
, but 996
in x1
remains unchanged.
detect_missing(rd1, missing = c(996))
# # A tibble: 3 x 2
# x1 x2
# <int+lbl> <dbl+lbl>
# 32 32
# 996 [Refused to answer] NA(a) [[996] Refused to answer]
# 40 40
# Warning message:
# In detect_missing(rd1, missing = c(996)) :
# Cannot label missings for integers in variable x1
I looked into the codes of detect_missing
and found that the problem was that the function haven::tagged_na
does not work with vectors of integers. So you include the condition is.double
in a few if
statements.
If I modified these these lines (below) by removing the check for is.double
, detect_missing
will work for columns of integers by converting these columns of integers to column of double. I understand that converting integer
to double
could cause problems later, but it might not be a bad idea to add an option of letting users allow for the conversion so that missing values can be labelled correctly for integer columns.
detect_missing2
function below is a simple modification of the current detect_missing
by adding an extra option force_integer = TRUE
or FALSE
. (The changes are highlighted.) When force_integer = TRUE
, the integer columns will be converted to double and missing values will be labelled.
detect_missing2 <- function (data, only_labelled = TRUE, negative_values_are_missing = TRUE,
ninety_nine_problems = TRUE, learn_from_labels = TRUE, missing = c(),
non_missing = c(), vars = names(data), use_labelled_spss = FALSE, force_integer = FALSE)
{
for (i in seq_along(vars)) {
var <- vars[i]
if (is.numeric(data[[var]]) && any(!is.na(data[[var]]))) {
potential_missing_values <- c()
if (negative_values_are_missing) {
potential_missing_values <- unique(data[[var]][data[[var]] <
0])
}
labels <- attributes(data[[var]])$labels
if (learn_from_labels && length(labels)) {
numeric_representations <- as.numeric(stringr::str_match(names(labels),
"\\[([0-9-]+)\\]")[, 2])
potentially_untagged <- numeric_representations[is.na(labels)]
potential_tags <- labels[is.na(labels)]
if (is.double(data[[var]]) && !all(is.na(haven::na_tag(data[[var]]))) &&
length(intersect(potentially_untagged, data[[var]]))) {
# For integer vectors, their missing values cannot be tagged,
# so we don't need to modify the above if condition for
# integer vectors.
warning("Missing values were already tagged in ",
var, ". Although", "there were further potential missing values as indicated by",
"missing labels, this was not changed.")
} else {
for (e in seq_along(potentially_untagged)) {
pot <- potentially_untagged[e]
data[[var]][data[[var]] == pot] <- potential_tags[e]
}
}
}
if (ninety_nine_problems) {
if (any(!is.na(data[[var]])) && (stats::median(data[[var]],
na.rm = TRUE) + stats::mad(data[[var]], na.rm = TRUE) *
5) < 99) {
potential_missing_values <- c(potential_missing_values,
99)
}
if (any(!is.na(data[[var]])) && (stats::median(data[[var]],
na.rm = TRUE) + stats::mad(data[[var]], na.rm = TRUE) *
5) < 999) {
potential_missing_values <- c(potential_missing_values,
999)
}
}
potential_missing_values <- union(setdiff(potential_missing_values,
non_missing), missing)
if ((!only_labelled || haven::is.labelled(data[[var]])) &&
length(potential_missing_values) > 0) {
if (only_labelled) {
potential_missing_values <- potential_missing_values[potential_missing_values %in%
labels]
potential_missing_values <- union(potential_missing_values,
setdiff(labels[is.na(labels)], data[[var]]))
}
potential_missing_values <- sort(potential_missing_values)
with_tagged_na <- data[[var]]
if (is.double(data[[var]])) {
free_na_tags <- setdiff(letters, haven::na_tag(with_tagged_na))
} else {
free_na_tags <- letters
}
for (i in seq_along(potential_missing_values)) {
miss <- potential_missing_values[i]
if (!use_labelled_spss && !all(potential_missing_values %in%
free_na_tags)) {
new_miss <- free_na_tags[i]
} else {
new_miss <- potential_missing_values[i]
}
that_label <- which(labels == miss)
################################################################################
# I replaced `is.double(data[[var]])` with `(force_integer |
# is.double(data[[var]]))` below
if (length(which(with_tagged_na == miss)) &&
(force_integer | is.double(data[[var]])) && !use_labelled_spss) {
with_tagged_na[which(with_tagged_na == miss)] <- haven::tagged_na(new_miss)
} else if (!force_integer & is.integer(data[[var]])) {
warning("Cannot label missings for integers in variable ",
var, " let force_integer = TRUE if you want to label misssings for integers.")
}
if ((force_integer | is.double(data[[var]])) &&
length(that_label) && !use_labelled_spss) {
labels[that_label] <- haven::tagged_na(new_miss)
names(labels)[that_label] <- paste0("[",
potential_missing_values[i], "] ", names(labels)[that_label])
}
################################################################################
}
if (use_labelled_spss) {
labels <- attributes(data[[var]])$labels
if (is.null(labels)) {
labels <- potential_missing_values
names(labels) <- "autodetected unlabelled missing"
}
data[[var]] <- haven::labelled_spss(data[[var]],
label = attr(data[[var]], "label", TRUE),
labels = labels, na_values = potential_missing_values,
na_range = attr(data[[var]], "na_range",
TRUE))
} else if (haven::is.labelled(data[[var]])) {
data[[var]] <- haven::labelled(with_tagged_na,
label = attr(data[[var]], "label", TRUE),
labels = labels)
} else {
data[[var]] <- with_tagged_na
}
}
}
}
data
}
hi @chengchou Here's some background on why labelled missing values don't work for integers. I haven't checked your code yet, but I would name the argument something like coerce_integer_to_double
and I would have to add tests that integers without labelled missing values don't get coerced.
I don't consider it very urgent though, as people could simply convert integers beforehand. Have you ever sent a pull request?
@rubenarslan "tests that integers without labelled missing values don't get coerced" is a good point. coerce_integer_to_double
is self-explaining---good name. No I haven't sent a pull request before.
@rubenarslan Sorry, where does detect_missing
locate with R directory?