narnia aims to make it easy to summarise, visualise, and manipulate missing data with minimal deviations from the workflows in ggplot2 and tidy data.
Currently it provides:
- Data structures for missing data
as_shadow()
bind_shadow()
gather_shadow()
is_na()
- Visualisation methods:
geom_missing_point()
gg_miss_var()
gg_miss_case()
gg_miss_which()
- Numerical summaries:
n_miss()
n_complete()
miss_case_pct()
miss_case_summary()
miss_case_table()
miss_var_pct()
miss_var_summary()
miss_var_table()
miss_df_pct()
For details on how to use each of these functions, you can read the vignette "Getting Started with Narnia".
Why narnia?
narnia was previously named ggmissing and initially provided a ggplot geom and some visual summaries. It was renamed to narnia to reflect the fact that this package is going to be bigger in scope, and is not just related to ggplot2. Specifically, the package is designed to provide a suite of tools for visualising missing values and imputations, and for manipulating and summarising missing data.
...But why narnia?
Well, I think it is useful to think of missing values in data as being like another dimension, perhaps like C.S. Lewis's Narnia: a different world, hidden away. You go inside, and sometimes it seems like you've spent no time in there but time has passed very quickly, or the opposite. Also, NArnia = na in r, and if you so desire, narnia may sound like "noneoya" in an nz/aussie accent. Full credit to @MilesMcBain for the name.
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
The structure of missing data is represented using the shadow matrix, introduced in Swayne and Buja. The shadow matrix has the same dimensions as the data and consists of binary indicators of missingness, where missing is represented as "NA" and not missing as "!NA"; these may equally be represented as 1 and 0, respectively. This representation can be seen in the figure below, where the suffix "_NA" is added to the variable names. The structure can also be extended with additional factor levels. For example, 0 indicates data presence, 1 indicates a missing value, 2 indicates an imputed value, and 3 might indicate a particular type or class of missingness, where the reasons for missingness are known or inferred.
The data matrix can also be augmented to include the shadow matrix, which facilitates univariate and bivariate visualisations of missing data. Another format is the long form, which facilitates heatmap-style visualisations. This approach can be very helpful for giving an overview of which variables contain the most missingness. Methods can also be applied to rearrange rows and columns to find clusters and identify other interesting features of the data that may previously have been hidden or unclear.
Illustration of data structures for facilitating visualisation of missings and not missings
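As a rough illustration of the shadow matrix idea, here is a minimal base R sketch. It is not narnia's implementation, and the helper name `make_shadow` is hypothetical; it only shows how each column can be turned into a two-level "!NA"/"NA" factor with the "_NA" suffix and then placed next to the data.

```r
# Hypothetical helper, for illustration only (not part of narnia):
# build a shadow matrix for a data frame using base R.
make_shadow <- function(data) {
  shadow <- lapply(data, function(column) {
    # FALSE (value present) becomes "!NA", TRUE (value missing) becomes "NA"
    factor(is.na(column), levels = c(FALSE, TRUE), labels = c("!NA", "NA"))
  })
  shadow <- as.data.frame(shadow)
  names(shadow) <- paste0(names(data), "_NA")
  shadow
}

aq_shadow_sketch <- make_shadow(airquality)

# The shadow matrix has the same dimensions as the data
dim(aq_shadow_sketch)

# Augment the data with its shadow, side by side
aq_augmented <- cbind(airquality, aq_shadow_sketch)
names(aq_augmented)
```

Keeping the indicators as factors rather than 0/1 leaves room for the extra levels described above, such as an "imputed" level.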
Visualising missing data might sound a little strange: how do you visualise something that is not there? One approach to visualising missing data comes from ggobi and manet, where we replace "NA" values with values 10% lower than the minimum value in that variable. This is provided with the geom_missing_point() ggplot2 geom, which we can illustrate by exploring the relationship between Ozone and Solar radiation from the airquality dataset.
library(ggplot2)
ggplot(data = airquality,
aes(x = Ozone,
y = Solar.R)) +
geom_point()
#> Warning: Removed 42 rows containing missing values (geom_point).
ggplot2 does not handle these missing values, and we get a warning message about the missing values.
We can instead use geom_missing_point() to display the missing data:
library(narnia)
ggplot(data = airquality,
aes(x = Ozone,
y = Solar.R)) +
geom_missing_point()
geom_missing_point() has shifted the missing values to now be 10% below the minimum value. The missing values are a different colour so that missingness becomes pre-attentive.
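The shifting step can be sketched in base R as follows. This is an illustration under stated assumptions, not the code used by geom_missing_point(): the helper name `shift_missing` is hypothetical, and "10% below the minimum" is assumed here to mean the minimum minus 10% of the observed range.

```r
# Hypothetical helper, for illustration only (not part of narnia):
# replace NAs with a value below the observed minimum so they can be plotted.
shift_missing <- function(x, prop_below = 0.1) {
  observed_range <- range(x, na.rm = TRUE)
  # Assumption: the shift is measured relative to the observed range
  shifted_value <- observed_range[1] - prop_below * diff(observed_range)
  ifelse(is.na(x), shifted_value, x)
}

ozone_shifted <- shift_missing(airquality$Ozone)

# Observed range, then the new minimum once missing values are shifted
range(airquality$Ozone, na.rm = TRUE)
min(ozone_shifted)
```

Because every shifted value sits below all observed values, the missing cases form a separate band along the axis instead of being dropped from the plot.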
This plays nicely with other parts of ggplot, like adding transparency:
ggplot(data = airquality,
aes(x = Ozone,
y = Solar.R)) +
geom_missing_point(alpha = 0.5)
Thanks to Luke Smith for making this pull request.
We can also add features such as faceting, just like any regular ggplot plot.
For example, we can split the facet by month:
p1 <-
ggplot(data = airquality,
aes(x = Ozone,
y = Solar.R)) +
geom_missing_point() +
facet_wrap(~Month, ncol = 2) +
theme(legend.position = "bottom")
p1
And then change the theme, just like you do with any other ggplot graphic
p1 + theme_bw()
You can also look at the proportion of missings in each variable with gg_missing_var:
gg_missing_var(airquality)
You can also explore the whole dataset of missings using the vis_miss function, which is exported from the visdat package.
vis_miss(airquality)
Another approach is to use univariate plots split by missingness. We can do this using the bind_shadow() function to place the data and shadow side by side. This allows us to examine univariate distributions according to the presence or absence of another variable.
aq_shadow <- bind_shadow(airquality)
aq_shadow
#> # A tibble: 153 x 12
#> Ozone Solar.R Wind Temp Month Day Ozone_NA Solar.R_NA Wind_NA
#> <int> <int> <dbl> <int> <int> <int> <fctr> <fctr> <fctr>
#> 1 41 190 7.4 67 5 1 !NA !NA !NA
#> 2 36 118 8.0 72 5 2 !NA !NA !NA
#> 3 12 149 12.6 74 5 3 !NA !NA !NA
#> 4 18 313 11.5 62 5 4 !NA !NA !NA
#> 5 NA NA 14.3 56 5 5 NA NA !NA
#> 6 28 NA 14.9 66 5 6 !NA NA !NA
#> 7 23 299 8.6 65 5 7 !NA !NA !NA
#> 8 19 99 13.8 59 5 8 !NA !NA !NA
#> 9 8 19 20.1 61 5 9 !NA !NA !NA
#> 10 NA 194 8.6 69 5 10 NA !NA !NA
#> # ... with 143 more rows, and 3 more variables: Temp_NA <fctr>,
#> # Month_NA <fctr>, Day_NA <fctr>
The plot below shows the values of temperature when ozone is present and when it is missing. On the left is a faceted histogram, and on the right is an overlaid density.
library(ggplot2)
p1 <- ggplot(data = aq_shadow,
aes(x = Temp)) +
geom_histogram() +
facet_wrap(~Ozone_NA,
ncol = 1)
p2 <- ggplot(data = aq_shadow,
aes(x = Temp,
colour = Ozone_NA)) +
geom_density()
gridExtra::grid.arrange(p1, p2, ncol = 2)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
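The same comparison can be made numerically. As a small base R sketch (independent of narnia), temperature can be split by whether Ozone is missing and each group summarised:

```r
# TRUE where Ozone is missing, FALSE where it is recorded
ozone_missing <- is.na(airquality$Ozone)

# Split temperature into the two groups and summarise each one
temp_by_missing <- split(airquality$Temp, ozone_missing)
lapply(temp_by_missing, summary)
```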
narnia provides numerical summaries of missing data. For variables, cases, and dataframes there are the function families miss_var_*, miss_case_*, and miss_df_*. To find the percent of missing variables, cases, and dataframes:
# Percentage of variables that contain any missing values
miss_var_pct(airquality)
#> [1] 33.33333
# Percentage of cases that contain any missing values
miss_case_pct(airquality)
#> [1] 27.45098
# Percentage of elements in the dataset that are missing
miss_df_pct(airquality)
#> [1] 4.793028
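To make explicit what each of these three percentages measures, here is a base R sketch of the same quantities. It is only an illustration of the definitions, not narnia's code.

```r
# Logical matrix: TRUE wherever a value is missing
is_missing <- is.na(airquality)

# Percent of variables (columns) with at least one missing value
pct_vars_with_missing <- mean(colSums(is_missing) > 0) * 100

# Percent of cases (rows) with at least one missing value
pct_cases_with_missing <- mean(rowSums(is_missing) > 0) * 100

# Percent of all cells in the dataset that are missing
pct_cells_missing <- mean(is_missing) * 100

c(pct_vars_with_missing, pct_cases_with_missing, pct_cells_missing)
```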
We can also look at the number and percent of missings in each case and variable with miss_var_summary() and miss_case_summary().
miss_var_summary(airquality)
#> # A tibble: 6 x 3
#> variable n_missing percent
#> <chr> <int> <dbl>
#> 1 Ozone 37 24.183007
#> 2 Solar.R 7 4.575163
#> 3 Wind 0 0.000000
#> 4 Temp 0 0.000000
#> 5 Month 0 0.000000
#> 6 Day 0 0.000000
miss_case_summary(airquality)
#> # A tibble: 153 x 3
#> case n_missing percent
#> <int> <int> <dbl>
#> 1 1 0 0.00000
#> 2 2 0 0.00000
#> 3 3 0 0.00000
#> 4 4 0 0.00000
#> 5 5 2 33.33333
#> 6 6 1 16.66667
#> 7 7 0 0.00000
#> 8 8 0 0.00000
#> 9 9 0 0.00000
#> 10 10 1 16.66667
#> # ... with 143 more rows
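For comparison, a per-variable summary with the same three columns can be sketched in base R. The object name `var_summary` is just for illustration, and the result is a plain data.frame rather than the tibble returned above.

```r
# Count missing values in each column
n_missing <- colSums(is.na(airquality))

# One row per variable: count and percent of missing values
var_summary <- data.frame(
  variable = names(n_missing),
  n_missing = as.integer(n_missing),
  percent = as.numeric(n_missing) / nrow(airquality) * 100
)
var_summary
```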
Tabulations of the number of missings in each case or variable can be calculated with miss_var_table() and miss_case_table().
miss_var_table(airquality)
#> # A tibble: 3 x 3
#> n_missing_in_var n_vars percent
#> <int> <int> <dbl>
#> 1 0 4 66.66667
#> 2 7 1 16.66667
#> 3 37 1 16.66667
miss_case_table(airquality)
#> # A tibble: 3 x 3
#> n_missing_in_case n_cases percent
#> <int> <int> <dbl>
#> 1 0 111 72.54902
#> 2 1 40 26.14379
#> 3 2 2 1.30719
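The case tabulation amounts to counting how many rows have 0, 1, 2, ... missing values. A base R sketch of that idea (an illustration only, not narnia's code) is:

```r
# Number of missing values in each row (case)
missings_per_case <- rowSums(is.na(airquality))

# How many cases have 0, 1, 2, ... missing values
case_table <- table(missings_per_case)
case_table

# The same tabulation expressed as percentages of all cases
prop.table(case_table) * 100
```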
All functions can be called at once using miss_summary(), which takes a data.frame and returns a nested dataframe containing the percentages of missing data, and lists of dataframes containing tally and summary information for the variables and cases.
s_miss <- miss_summary(airquality)
s_miss
#> # A tibble: 1 x 7
#> miss_df_pct miss_var_pct miss_case_pct miss_case_table miss_var_table
#> <dbl> <dbl> <dbl> <list> <list>
#> 1 4.793028 33.33333 27.45098 <tibble [3 x 3]> <tibble [3 x 3]>
#> # ... with 2 more variables: miss_var_summary <list>,
#> # miss_case_summary <list>
# overall % missing data
s_miss$miss_df_pct
#> [1] 4.793028
# % of variables that contain missing data
s_miss$miss_var_pct
#> [1] 33.33333
# % of cases that contain missing data
s_miss$miss_case_pct
#> [1] 27.45098
# tabulations of missing data across cases
s_miss$miss_case_table
# tabulations of missing data across variables
s_miss$miss_var_table
# summary information (counts, percentages) of missing data for variables and cases
s_miss$miss_var_summary
s_miss$miss_case_summary
gg_missing_var(airquality)
gg_missing_case(airquality)
The next plot shows whether a given variable contains any missing values. In this case grey = missing. Think of it as if you are shading the cell in, if it contains data.
gg_missing_which(airquality)
There are plans to extend the geom_missing_ family to include:
- Categorical variables
- Bivariate plots: Scatterplots, Density overlays.
Naming credit (once again!) goes to @MilesMcBain. Also thank you to @dicook and @hadley for putting up with my various questions and concerns, mainly around the name.