inspectdf
is collection of utilities for columnwise summary,
comparison and visualisation of data frames. Functions are provided to
summarise missingness, categorical levels, numeric distribution,
correlation, column types and memory usage.
The package has three aims:
- to speed up repetitive checking and exploratory tasks for data frames
- to make it easier to compare data frames for differences and inconsistencies
- to support quick visualisation of data frames
inspect_types()
summary of column typesinspect_mem()
summary of memory usage of columnsinspect_na()
columnwise prevalence of missing valuesinspect_cor()
correlation coefficients of numeric columnsinspect_imb()
feature imbalance of categorical columnsinspect_num()
summaries of numeric columnsinspect_cat()
summaries of categorical columns
To install the development version of the package, use
devtools::install_github("alastairrushworth/inspectdf")
# load the package
library(inspectdf)
The examples below make use of the starwars
and storms
data from the
dplyr
package
# some example data
data(starwars, package = "dplyr")
data(storms, package = "dplyr")
For illustrating comparisons of dataframes, use the starwars
data and
produce two new dataframes star_1
and star_2
that randomly sample
the rows of the original and drop a couple of columns.
library(dplyr)
star_1 <- starwars %>% sample_n(50)
star_2 <- starwars %>% sample_n(50) %>% select(-1, -2)
To explore the column types in a data frame, use the function
inspect_types()
. The command returns a tibble
summarising the counts
and percentages of columns with particular types.
# return tibble showing columns types
inspect_types(starwars)
## # A tibble: 4 x 4
## type cnt pcnt col_name
## <chr> <int> <dbl> <list>
## 1 character 7 53.8 <chr [7]>
## 2 list 3 23.1 <chr [3]>
## 3 numeric 2 15.4 <chr [2]>
## 4 integer 1 7.69 <chr [1]>
A barplot can be produced by passing the result to show_plot()
:
# print visualisation of column types
inspect_types(starwars) %>% show_plot()
When a second dataframe is provided, inspect_types()
will create a
dataframe comparing the count and percentage of each column type for
each of the input dataframes. The summaries for the first and second
dataframes are show in columns with names appended with _1
and _2
,
respectively.
inspect_types(star_1, star_2)
## # A tibble: 4 x 5
## type cnt_1 pcnt_1 cnt_2 pcnt_2
## <chr> <int> <dbl> <dbl> <dbl>
## 1 character 7 53.8 6 54.5
## 2 list 3 23.1 3 27.3
## 3 numeric 2 15.4 2 18.2
## 4 integer 1 7.69 0 0
# print visualisation of column type comparison
inspect_types(star_1, star_2) %>% show_plot()
To explore the memory usage of the columns in a data frame, use
inspect_mem()
. The command returns a tibble
containing the size of
each column in the dataframe.
inspect_mem(starwars)
## # A tibble: 13 x 3
## col_name size pcnt
## <chr> <chr> <dbl>
## 1 films 19.54 Kb 36.5
## 2 starships 7.27 Kb 13.6
## 3 name 6.13 Kb 11.5
## 4 vehicles 5.8 Kb 10.8
## 5 homeworld 3.52 Kb 6.58
## 6 species 2.88 Kb 5.39
## 7 skin_color 2.59 Kb 4.85
## 8 eye_color 1.57 Kb 2.93
## 9 hair_color 1.41 Kb 2.63
## 10 gender 976 bytes 1.78
## 11 mass 744 bytes 1.36
## 12 birth_year 744 bytes 1.36
## 13 height 400 bytes 0.730
A barplot can be produced by passing the result to show_plot()
:
inspect_mem(starwars) %>% show_plot()
When a second dataframe is provided, inspect_mem()
will create a
dataframe comparing the size of each column for both input dataframes.
The summaries for the first and second dataframes are show in columns
with names appended with _1
and _2
, respectively.
inspect_mem(star_1, star_2)
## # A tibble: 13 x 5
## col_name size_1 size_2 pcnt_1 pcnt_2
## <chr> <chr> <chr> <dbl> <dbl>
## 1 films 12.08 Kb 12.19 Kb 37.5 41.7
## 2 starships 4.15 Kb 4.37 Kb 12.9 15.0
## 3 name 3.5 Kb <NA> 10.9 NA
## 4 vehicles 3.47 Kb 3.31 Kb 10.8 11.3
## 5 homeworld 2.09 Kb 2.19 Kb 6.47 7.49
## 6 skin_color 1.77 Kb 1.77 Kb 5.48 6.05
## 7 species 1.72 Kb 1.79 Kb 5.33 6.13
## 8 hair_color 904 bytes 1.06 Kb 2.74 3.64
## 9 eye_color 896 bytes 1.05 Kb 2.71 3.58
## 10 gender 616 bytes 624 bytes 1.87 2.09
## 11 mass 448 bytes 448 bytes 1.36 1.50
## 12 birth_year 448 bytes 448 bytes 1.36 1.50
## 13 height 248 bytes <NA> 0.751 NA
inspect_mem(star_1, star_2) %>% show_plot()
inspect_na()
summarises the prevalence of missing values by each
column in a data frame. A tibble containing the count (cnt
) and the
overall percentage (pcnt
) of missing values is returned.
inspect_na(starwars)
## # A tibble: 13 x 3
## col_name cnt pcnt
## <chr> <dbl> <dbl>
## 1 birth_year 44 50.6
## 2 mass 28 32.2
## 3 homeworld 10 11.5
## 4 height 6 6.90
## 5 hair_color 5 5.75
## 6 species 5 5.75
## 7 gender 3 3.45
## 8 name 0 0
## 9 skin_color 0 0
## 10 eye_color 0 0
## 11 films 0 0
## 12 vehicles 0 0
## 13 starships 0 0
A barplot can be produced by passing the result to show_plot()
:
inspect_na(starwars) %>% show_plot()
When a second dataframe is provided, inspect_na()
returns a tibble
containing counts and percentage missingness by column, with summaries
for the first and second data frames are show in columns with names
appended with _1
and _2
, respectively. In addition, a (p)-value is
calculated which provides a measure of evidence of whether the
difference in missing values is significantly different.
inspect_na(star_1, star_2)
## # A tibble: 13 x 6
## col_name cnt_1 pcnt_1 cnt_2 pcnt_2 p_value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 birth_year 21 42 24 48 0.688
## 2 mass 15 30 16 32 1
## 3 homeworld 7 14 4 8 0.523
## 4 height 5 10 NA NA NA
## 5 hair_color 3 6 5 10 0.712
## 6 gender 3 6 3 6 1
## 7 species 2 4 2 4 1
## 8 name 0 0 NA NA NA
## 9 skin_color 0 0 0 0 NA
## 10 eye_color 0 0 0 0 NA
## 11 films 0 0 0 0 NA
## 12 vehicles 0 0 0 0 NA
## 13 starships 0 0 0 0 NA
inspect_na(star_1, star_2) %>% show_plot()
Notes:
- Smaller (p)-values indicate stronger evidence of a difference in the missingness rate for a single column
- If a column appears in one data frame and not the other - for
example
height
appears instar_1
but norstar_2
, then the correspondingpcnt_
,cnt_
andp_value
columns will containNA
- Where the missingness is identically 0, the
p_value
isNA
. - The visualisation illustrates the significance of the difference
using a coloured bar overlay. Orange bars indicate evidence of
equality or missingness, while blue bars indicate inequality. If a
p_value
cannot be calculated, no coloured bar is shown. - The significance level can be specified using the
alpha
argument toinspect_na()
. The default isalpha = 0.05
.
inspect_cor()
returns a tibble containing Pearson’s correlation
coefficient, confidence intervals and (p)-values for pairs of numeric
columns . The function combines the functionality of cor()
and
cor.test()
in a more convenient wrapper.
inspect_cor(storms)
## Column pair (42/45): hu_diameter & wind Column pair (43/45): ts_diameter &
## pressure Column pair (44/45): hu_diameter & pressure Column pair (45/45):
## hu_diameter & ts_diameter
## # A tibble: 45 x 6
## col_1 col_2 corr p_value lower upper
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 pressure wind -0.942 0. -0.945 -0.940
## 2 hu_diameter pressure -0.842 0. -0.853 -0.831
## 3 hu_diameter wind 0.774 0. 0.758 0.788
## 4 hu_diameter ts_diameter 0.684 0. 0.663 0.704
## 5 ts_diameter pressure -0.683 0. -0.703 -0.663
## 6 ts_diameter wind 0.640 0. 0.617 0.662
## 7 ts_diameter lat 0.301 1.25e-73 0.266 0.335
## 8 day month -0.183 3.59e-76 -0.205 -0.161
## 9 hu_diameter lat 0.164 1.59e-22 0.127 0.201
## 10 ts_diameter month 0.139 1.67e-16 0.102 0.176
## # … with 35 more rows
A plot showing point estimate and confidence intervals is printed when
using the show_plot()
function. Note that intervals that straddle the
null value of 0 are shown in gray:
inspect_cor(storms) %>% show_plot()
## Column pair (41/45): ts_diameter & wind Column pair (42/45): hu_diameter
## & wind Column pair (43/45): ts_diameter & pressure Column pair (44/45):
## hu_diameter & pressure Column pair (45/45): hu_diameter & ts_diameter
Notes:
- The tibble is sorted in descending order of the absolute coefficient (|\rho|).
inspect_cor
drops missing values prior to calculation of each correlation coefficient.- The
p_value
is associated with the null hypothesis (H_0: \rho = 0).
When a second dataframe is provided, inspect_cor()
returns a tibble
that compares correlation coefficients of the first dataframe to those
in the second. The p_value
column contains a measure of evidence for
whether the two correlation coefficients are equal or not.
inspect_cor(storms, storms[-c(1:200), ])
## Column pair (40/45): pressure & wind Column pair (41/45): ts_diameter
## & wind Column pair (42/45): hu_diameter & wind Column pair (43/45):
## ts_diameter & pressure Column pair (44/45): hu_diameter & pressure Column
## pair (45/45): hu_diameter & ts_diameter Column pair (29/45): ts_diameter
## & hour Column pair (30/45): hu_diameter & hour Column pair (31/45): long
## & lat Column pair (32/45): wind & lat Column pair (33/45): pressure & lat
## Column pair (34/45): ts_diameter & lat Column pair (35/45): hu_diameter &
## lat Column pair (36/45): wind & long Column pair (37/45): pressure & long
## Column pair (38/45): ts_diameter & long Column pair (39/45): hu_diameter &
## long Column pair (40/45): pressure & wind Column pair (41/45): ts_diameter
## & wind Column pair (42/45): hu_diameter & wind Column pair (43/45):
## ts_diameter & pressure Column pair (44/45): hu_diameter & pressure Column
## pair (45/45): hu_diameter & ts_diameter
## # A tibble: 45 x 5
## col_1 col_2 corr_1 corr_2 p_value
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 pressure wind -0.942 -0.942 0.929
## 2 hu_diameter pressure -0.842 -0.842 1
## 3 hu_diameter wind 0.774 0.774 1
## 4 hu_diameter ts_diameter 0.684 0.684 1
## 5 ts_diameter pressure -0.683 -0.683 1
## 6 ts_diameter wind 0.640 0.640 1
## 7 ts_diameter lat 0.301 0.301 1
## 8 day month -0.183 -0.178 0.729
## 9 hu_diameter lat 0.164 0.164 1
## 10 ts_diameter month 0.139 0.139 1
## # … with 35 more rows
To plot the comparison of the top 20 correlation coefficients:
inspect_cor(storms, storms[-c(1:200), ]) %>%
slice(1:20) %>%
show_plot()
## Column pair (35/45): hu_diameter & lat Column pair (36/45): wind & long
## Column pair (37/45): pressure & long Column pair (38/45): ts_diameter &
## long Column pair (39/45): hu_diameter & long Column pair (40/45): pressure
## & wind Column pair (41/45): ts_diameter & wind Column pair (42/45):
## hu_diameter & wind Column pair (43/45): ts_diameter & pressure Column
## pair (44/45): hu_diameter & pressure Column pair (45/45): hu_diameter &
## ts_diameter
Notes:
- Smaller
p_value
indicates stronger evidence against the null hypothesis (H_0: \rho_1 = \rho_2) and an indication that the true correlation coefficients differ. - The visualisation illustrates the significance of the difference using a coloured bar underlay. Coloured bars indicate evidence of inequality of correlations, while gray bars indicate equality.
- For a pair of features, if either coefficient is
NA
, the comparison is omitted from the visualisation. - The significance level can be specified using the
alpha
argument toinspect_cor()
. The default isalpha = 0.05
.
Understanding categorical columns that are dominated by a single level
can be useful. inspect_imb()
returns a tibble containing categorical
column names (col_name
); the most frequently occurring categorical
level in each column (value
) and pctn
& cnt
the percentage and
count which the value occurs. The tibble is sorted in descending order
of pcnt
.
inspect_imb(starwars)
## # A tibble: 7 x 4
## col_name value pcnt cnt
## <chr> <chr> <dbl> <int>
## 1 gender male 71.3 62
## 2 hair_color none 42.5 37
## 3 species Human 40.2 35
## 4 eye_color brown 24.1 21
## 5 skin_color fair 19.5 17
## 6 homeworld Naboo 12.6 11
## 7 name Ackbar 1.15 1
A barplot is printed by passing the result to the show_plot()
function:
inspect_imb(starwars) %>% show_plot()
When a second dataframe is provided, inspect_imb()
returns a tibble
that compares the frequency of the most common categorical values of the
first dataframe to those in the second. The p_value
column contains a
measure of evidence for whether the true frequencies are equal or not.
inspect_imb(star_1, star_2)
## # A tibble: 7 x 7
## col_name value pcnt_1 cnt_1 pcnt_2 cnt_2 p_value
## <chr> <chr> <dbl> <int> <dbl> <int> <dbl>
## 1 gender male 74 37 68 34 0.659
## 2 species Human 42 21 40 20 1.000
## 3 hair_color none 36 18 38 19 1
## 4 eye_color brown 32 16 NA NA NA
## 5 skin_color fair 16 8 16 8 1
## 6 homeworld Naboo 14. 7 14. 7 1.000
## 7 name Ackbar 2 1 NA NA NA
inspect_imb(star_1, star_2) %>% show_plot()
- Smaller
p_value
indicates stronger evidence against the null hypothesis that the true frequency of the most common values is the same. - The visualisation illustrates the significance of the difference
using a coloured bar overlay. Orange bars indicate evidence of
equality of the imbalance, while blue bars indicate inequality. If a
p_value
cannot be calculated, no coloured bar is shown. - The significance level can be specified using the
alpha
argument toinspect_imb()
. The default isalpha = 0.05
.
inspect_num()
combining some of the functionality of summary()
and
hist()
by returning summaries of numeric columns. inspect_num()
returns standard numerical summaries (min
, q1
, mean
,
median
,q3
, max
, sd
), but also the percentage of missing entries
(pcnt_na
) and a simple histogram (hist
).
inspect_num(storms, breaks = 10)
## # A tibble: 10 x 10
## col_name min q1 median mean q3 max sd pcnt_na hist
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lis>
## 1 day 1 8 16 15.9 24 31 9.01 0 <tib…
## 2 hour 0 6 12 9.11 18 23 6.73 0 <tib…
## 3 hu_diam… 0 0 0 21.4 28.8 345. 41.3 65.2 <tib…
## 4 lat 7.2 17.5 24.4 24.8 31.3 51.9 8.54 0 <tib…
## 5 long -109. -80.7 -64.5 -64.2 -48.6 -6 19.6 0 <tib…
## 6 month 1 8 9 8.78 9 12 1.24 0 <tib…
## 7 pressure 882 985 999 992. 1006 1022 19.5 0 <tib…
## 8 ts_diam… 0 69.0 138. 167. 242. 1001. 141. 65.2 <tib…
## 9 wind 10 30 45 53.5 65 160 26.2 0 <tib…
## 10 year 1975 1990 1999 1998. 2006 2015 10.3 0 <tib…
The hist
column is a list whose elements are tibbles each containing
the relative frequencies of bins for each feature. These tibbles are
used to generate the histograms when show_plot = TRUE
. For example,
the histogram for starwars$birth_year
is
inspect_num(storms)$hist$pressure
## # A tibble: 31 x 2
## value prop
## <chr> <dbl>
## 1 [-Inf, 880) 0
## 2 [880, 885) 0.0000999
## 3 [885, 890) 0.000200
## 4 [890, 895) 0.000400
## 5 [895, 900) 0.000300
## 6 [900, 905) 0.000300
## 7 [905, 910) 0.000699
## 8 [910, 915) 0.000999
## 9 [915, 920) 0.00130
## 10 [920, 925) 0.00390
## # … with 21 more rows
A histogram is generated for each numeric feature by passing the result
to the show_plot()
function:
inspect_num(storms, breaks = 10) %>%
show_plot()
When comparing a pair of dataframes using inspect_num()
, the
histograms of common numeric features are calculated, using identical
bins. The list columns hist_1
and hist_2
contain the histograms of
the features in the first and second dataframes. A formal statistical
comparison of each pair of histograms is calculated using Fisher’s exact
test, the resulting p value is reported in the column fisher_p
.
When show_plot = TRUE
, heat plot comparisons are returned for each
numeric column in each dataframe. Where a column is present in only one
of the dataframes, grey cells are shown in the comparison. The
significance of Fisher’s test is illustrated by coloured vertical bands
around each plot: if the colour is grey, no p value could be
calculated, if blue, the histograms are not found to be significantly
different otherwise the bands are red.
inspect_num(storms, storms[-c(1:10), -1])
## # A tibble: 10 x 5
## col_name hist_1 hist_2 jsd fisher_p
## <chr> <list> <list> <dbl> <dbl>
## 1 day <tibble [18 × 2]> <tibble [18 × 2]> 0.00000119 NA
## 2 hour <tibble [25 × 2]> <tibble [25 × 2]> 0.000539 NA
## 3 hu_diameter <tibble [20 × 2]> <tibble [20 × 2]> 0 1.000
## 4 lat <tibble [25 × 2]> <tibble [25 × 2]> 0.000000460 NA
## 5 long <tibble [23 × 2]> <tibble [23 × 2]> 0.00000174 NA
## 6 month <tibble [24 × 2]> <tibble [24 × 2]> 0.0125 NA
## 7 pressure <tibble [31 × 2]> <tibble [31 × 2]> 0.00000102 NA
## 8 ts_diameter <tibble [23 × 2]> <tibble [23 × 2]> 0 1.000
## 9 wind <tibble [17 × 2]> <tibble [17 × 2]> 0.000104 NA
## 10 year <tibble [23 × 2]> <tibble [23 × 2]> 0.0000221 NA
inspect_num(storms, storms[-c(1:10), -1]) %>%
show_plot()
inspect_cat()
returns a tibble summarising categorical features in a
data frame, combining the functionality of the inspect_imb()
and
table()
functions. The tibble generated contains the columns
col_name
name of each categorical columncnt
the number of unique levels in the featurecommon
the most common level (see alsoinspect_imb()
)common_pcnt
the percentage occurrence of the most dominant levellevels
a list of tibbles each containing frequency tabulations of all levels
inspect_cat(starwars)
## # A tibble: 7 x 5
## col_name cnt common common_pcnt levels
## <chr> <int> <chr> <dbl> <list>
## 1 eye_color 15 brown 24.1 <tibble [15 × 3]>
## 2 gender 5 male 71.3 <tibble [5 × 3]>
## 3 hair_color 13 none 42.5 <tibble [13 × 3]>
## 4 homeworld 49 Naboo 12.6 <tibble [49 × 3]>
## 5 name 87 Ackbar 1.15 <tibble [87 × 3]>
## 6 skin_color 31 fair 19.5 <tibble [31 × 3]>
## 7 species 38 Human 40.2 <tibble [38 × 3]>
For example, the levels for the hair_color
column are
inspect_cat(starwars)$levels$hair_color
## # A tibble: 13 x 3
## value prop cnt
## <chr> <dbl> <int>
## 1 none 0.425 37
## 2 brown 0.207 18
## 3 black 0.149 13
## 4 <NA> 0.0575 5
## 5 white 0.0460 4
## 6 blond 0.0345 3
## 7 auburn 0.0115 1
## 8 auburn, grey 0.0115 1
## 9 auburn, white 0.0115 1
## 10 blonde 0.0115 1
## 11 brown, grey 0.0115 1
## 12 grey 0.0115 1
## 13 unknown 0.0115 1
Note that by default, if NA
values are present, they are counted as a
distinct categorical level. A barplot is printed showing the relative
split when passing the result to show_plot()
:
inspect_cat(starwars) %>% show_plot()
The argument high_cardinality
in the show_plot()
function can be
used to bundle together categories that occur only a small number of
times. For example, to combine categories only occurring once, use:
inspect_cat(starwars) %>%
show_plot(high_cardinality = 1)
The resulting bundles are shown in purple.
When two dataframes are compared using inspect_cat()
, list columns are
returned for categorical columns common to both: lvls_1
and lvl2_2
.
In addition, the Jensen-Shannon divergence (jsd
) and p values
associated with Fisher’s exact test (fisher_p
) are returned to enable
comparison of the distribution of levels in each pair of columns.
inspect_cat(star_1, star_2)
## # A tibble: 7 x 5
## col_name jsd fisher_p lvls_1 lvls_2
## <chr> <dbl> <dbl> <list> <list>
## 1 eye_color 0.0667 0.709 <tibble [8 × 3]> <tibble [11 × 3]>
## 2 gender 0.0240 0.877 <tibble [4 × 3]> <tibble [4 × 3]>
## 3 hair_color 0.0816 0.978 <tibble [9 × 3]> <tibble [12 × 3]>
## 4 homeworld 0.254 0.986 <tibble [29 × 3]> <tibble [31 × 3]>
## 5 name NA NA <tibble [50 × 3]> <NULL>
## 6 skin_color 0.105 0.978 <tibble [23 × 3]> <tibble [22 × 3]>
## 7 species 0.191 1.000 <tibble [23 × 3]> <tibble [24 × 3]>
inspect_cat(star_1, star_2) %>% show_plot()