skimr
skimr provides a frictionless approach to summary statistics which conforms to the principle of least surprise, displaying summary statistics the user can skim quickly to understand their data. It handles different data types and returns a skim_df object which can be included in a pipeline or displayed nicely for the human reader.
Installation
The current released version of skimr can be installed from CRAN. If you wish to install the current build of the next release you can do so using the following:
# install.packages("devtools")
devtools::install_github("ropenscilabs/skimr")
The APIs for this branch should be considered reasonably stable but still subject to change if an issue is discovered.
To install the version with the most recent changes, which have not yet been incorporated into the master branch (and may never be):
devtools::install_github("ropenscilabs/skimr", ref = "develop")
Do not rely on APIs from the develop branch.
Skim statistics in the console
skimr:
- provides a larger set of statistics than summary(), including missing, complete, n, and sd
- reports each data type separately
- handles dates, logicals, and a variety of other types
- supports spark-bar and spark-line based on Hadley Wickham's pillar package.
Separates variables by class:
skim(chickwts)
## Skim summary statistics
## n obs: 71
## n variables: 2
##
## ── Variable type:factor ─────────────────────────────────────────────────────────────
## variable missing complete n n_unique top_counts ordered
## feed 0 71 71 6 soy: 14, cas: 12, lin: 12, sun: 12 FALSE
##
## ── Variable type:numeric ────────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## weight 0 71 71 261.31 78.07 108 204.5 258 323.5 423 ▃▅▅▇▃▇▂▂
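As a rough cross-check, the numeric row above can be reproduced with base R alone. This is a minimal sketch, not skimr's internal implementation: the helper name `numeric_stats` is hypothetical, and it assumes that the default method of `quantile()` matches what skimr reports for this data.

```r
# Hypothetical helper (not part of skimr): base R versions of the
# statistics shown in the numeric row for chickwts$weight.
numeric_stats <- function(x) {
  c(
    missing  = sum(is.na(x)),
    complete = sum(!is.na(x)),
    n        = length(x),
    mean     = mean(x, na.rm = TRUE),
    sd       = sd(x, na.rm = TRUE),
    # Quantiles at 0, 25, 50, 75 and 100 percent (default type 7)
    setNames(
      quantile(x, probs = c(0, .25, .5, .75, 1), na.rm = TRUE, names = FALSE),
      c("p0", "p25", "p50", "p75", "p100")
    )
  )
}

weight_stats <- numeric_stats(chickwts$weight)
round(weight_stats, 2)
```

Comparing the rounded values with the printed row is a quick way to confirm what each column means.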
Presentation is in a compact horizontal format:
skim(iris)
## Skim summary statistics
## n obs: 150
## n variables: 5
##
## ── Variable type:factor ─────────────────────────────────────────────────────────────
## variable missing complete n n_unique top_counts ordered
## Species 0 150 150 3 set: 50, ver: 50, vir: 50, NA: 0 FALSE
##
## ── Variable type:numeric ────────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## Petal.Length 0 150 150 3.76 1.77 1 1.6 4.35 5.1 6.9 ▇▁▁▂▅▅▃▁
## Petal.Width 0 150 150 1.2 0.76 0.1 0.3 1.3 1.8 2.5 ▇▁▁▅▃▃▂▂
## Sepal.Length 0 150 150 5.84 0.83 4.3 5.1 5.8 6.4 7.9 ▂▇▅▇▆▅▂▂
## Sepal.Width 0 150 150 3.06 0.44 2 2.8 3 3.3 4.4 ▁▂▅▇▃▂▁▁
Built-in support for strings, lists and other column classes
skim(dplyr::starwars)
## Skim summary statistics
## n obs: 87
## n variables: 13
##
## ── Variable type:character ──────────────────────────────────────────────────────────
## variable missing complete n min max empty n_unique
## eye_color 0 87 87 3 13 0 15
## gender 3 84 87 4 13 0 4
## hair_color 5 82 87 4 13 0 12
## homeworld 10 77 87 4 14 0 48
## name 0 87 87 3 21 0 87
## skin_color 0 87 87 3 19 0 31
## species 5 82 87 3 14 0 37
##
## ── Variable type:integer ────────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## height 6 81 87 174.36 34.77 66 167 180 191 264 ▁▁▁▂▇▃▁▁
##
## ── Variable type:list ───────────────────────────────────────────────────────────────
## variable missing complete n n_unique min_length median_length max_length
## films 0 87 87 24 1 1 7
## starships 0 87 87 17 0 0 5
## vehicles 0 87 87 11 0 0 2
##
## ── Variable type:numeric ────────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## birth_year 44 43 87 87.57 154.69 8 35 52 72 896 ▇▁▁▁▁▁▁▁
## mass 28 59 87 97.31 169.46 15 55.6 79 84.5 1358 ▇▁▁▁▁▁▁▁
Has a useful summary function
skim(iris) %>% summary()
## A skim object
##
## Name: iris
## Number of Rows: 150
## Number of Columns: 5
##
## Column type frequency
## factor: 1
## numeric: 4
Individual columns can be selected using tidyverse-style selectors
skim(iris, Sepal.Length, Petal.Length)
## Skim summary statistics
## n obs: 150
## n variables: 5
##
## ── Variable type:numeric ────────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## Petal.Length 0 150 150 3.76 1.77 1 1.6 4.35 5.1 6.9 ▇▁▁▂▅▅▃▁
## Sepal.Length 0 150 150 5.84 0.83 4.3 5.1 5.8 6.4 7.9 ▂▇▅▇▆▅▂▂
Handles grouped data
skim() can handle data that has been grouped using dplyr::group_by.
iris %>% dplyr::group_by(Species) %>% skim()
## Skim summary statistics
## n obs: 150
## n variables: 5
## group variables: Species
##
## ── Variable type:numeric ────────────────────────────────────────────────────────────
## Species variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## setosa Petal.Length 0 50 50 1.46 0.17 1 1.4 1.5 1.58 1.9 ▁▁▅▇▇▅▂▁
## setosa Petal.Width 0 50 50 0.25 0.11 0.1 0.2 0.2 0.3 0.6 ▂▇▁▂▂▁▁▁
## setosa Sepal.Length 0 50 50 5.01 0.35 4.3 4.8 5 5.2 5.8 ▂▃▅▇▇▃▁▂
## setosa Sepal.Width 0 50 50 3.43 0.38 2.3 3.2 3.4 3.68 4.4 ▁▁▃▅▇▃▂▁
## versicolor Petal.Length 0 50 50 4.26 0.47 3 4 4.35 4.6 5.1 ▁▃▂▆▆▇▇▃
## versicolor Petal.Width 0 50 50 1.33 0.2 1 1.2 1.3 1.5 1.8 ▆▃▇▅▆▂▁▁
## versicolor Sepal.Length 0 50 50 5.94 0.52 4.9 5.6 5.9 6.3 7 ▃▂▇▇▇▃▅▂
## versicolor Sepal.Width 0 50 50 2.77 0.31 2 2.52 2.8 3 3.4 ▁▂▃▅▃▇▃▁
## virginica Petal.Length 0 50 50 5.55 0.55 4.5 5.1 5.55 5.88 6.9 ▂▇▃▇▅▂▁▂
## virginica Petal.Width 0 50 50 2.03 0.27 1.4 1.8 2 2.3 2.5 ▂▁▇▃▃▆▅▃
## virginica Sepal.Length 0 50 50 6.59 0.64 4.9 6.23 6.5 6.9 7.9 ▁▁▃▇▅▃▂▃
## virginica Sepal.Width 0 50 50 2.97 0.32 2.2 2.8 3 3.18 3.8 ▁▃▇▇▅▃▁▂
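For comparison, the per-group mean and sd of a single column can be computed in base R with `tapply()`. This sketch only illustrates what the grouped rows contain; it is not how skimr computes them, and the variable names `group_mean` and `group_sd` are chosen for illustration.

```r
# Base R sketch: per-group mean and sd of Sepal.Length, the same
# quantities as the "mean" and "sd" columns in the grouped output.
group_mean <- tapply(iris$Sepal.Length, iris$Species, mean)
group_sd   <- tapply(iris$Sepal.Length, iris$Species, sd)

# One row per species, rounded to two decimals like the printed table
round(cbind(mean = group_mean, sd = group_sd), 2)
```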
Knitted results
Simply skimming a data frame will produce the horizontal print layout shown above. When knitting you can also use enhanced rendering with kable and pander implementations.
Options for kable and pander
Enhanced print options are available by piping to kable() or pander(). These build on the pander package and the kable function of the knitr package. These examples show how the enhanced options should appear after knitting; however, your results may differ (see vignettes for details).
Option for kable.
Note that the results='asis' chunk option is used and that kable() is called with the skimr:: namespace to prevent it from being replaced by knitr::kable (which would result in the long skim_df object being printed).
skim(iris) %>% skimr::kable()
Skim summary statistics
n obs: 150
n variables: 5
Variable type: factor
variable | missing | complete | n | n_unique | top_counts | ordered |
---|---|---|---|---|---|---|
Species | 0 | 150 | 150 | 3 | set: 50, ver: 50, vir: 50, NA: 0 | FALSE |
Variable type: numeric
variable | missing | complete | n | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|
Petal.Length | 0 | 150 | 150 | 3.76 | 1.77 | 1 | 1.6 | 4.35 | 5.1 | 6.9 | ▇▁▁▂▅▅▃▁ |
Petal.Width | 0 | 150 | 150 | 1.2 | 0.76 | 0.1 | 0.3 | 1.3 | 1.8 | 2.5 | ▇▁▁▅▃▃▂▂ |
Sepal.Length | 0 | 150 | 150 | 5.84 | 0.83 | 4.3 | 5.1 | 5.8 | 6.4 | 7.9 | ▂▇▅▇▆▅▂▂ |
Sepal.Width | 0 | 150 | 150 | 3.06 | 0.44 | 2 | 2.8 | 3 | 3.3 | 4.4 | ▁▂▅▇▃▂▁▁ |
Options for pander
At times you may need panderOptions('knitr.auto.asis', FALSE).
skim(iris) %>% pander()
Skim summary statistics
n obs: 150
n variables: 5
variable | missing | complete | n | n_unique |
---|---|---|---|---|
Species | 0 | 150 | 150 | 3 |
top_counts | ordered |
---|---|
set: 50, ver: 50, vir: 50, NA: 0 | FALSE |
variable | missing | complete | n | mean | sd | p0 | p25 | p50 | p75 |
---|---|---|---|---|---|---|---|---|---|
Petal.Length | 0 | 150 | 150 | 3.76 | 1.77 | 1 | 1.6 | 4.35 | 5.1 |
Petal.Width | 0 | 150 | 150 | 1.2 | 0.76 | 0.1 | 0.3 | 1.3 | 1.8 |
Sepal.Length | 0 | 150 | 150 | 5.84 | 0.83 | 4.3 | 5.1 | 5.8 | 6.4 |
Sepal.Width | 0 | 150 | 150 | 3.06 | 0.44 | 2 | 2.8 | 3 | 3.3 |
p100 | hist |
---|---|
6.9 | ▇▁▁▂▅▅▃▁ |
2.5 | ▇▁▁▅▃▃▂▂ |
7.9 | ▂▇▅▇▆▅▂▂ |
4.4 | ▁▂▅▇▃▂▁▁ |
skim_df object (long format)
By default skim() prints beautifully in the console, but it also produces a long, tidy-format skim_df object that can be computed on.
a <- skim(chickwts)
dim(a)
## [1] 23 6
print.data.frame(skim(chickwts))
## variable type stat level value formatted
## 1 weight numeric missing .all 0.0000 0
## 2 weight numeric complete .all 71.0000 71
## 3 weight numeric n .all 71.0000 71
## 4 weight numeric mean .all 261.3099 261.31
## 5 weight numeric sd .all 78.0737 78.07
## 6 weight numeric p0 .all 108.0000 108
## 7 weight numeric p25 .all 204.5000 204.5
## 8 weight numeric p50 .all 258.0000 258
## 9 weight numeric p75 .all 323.5000 323.5
## 10 weight numeric p100 .all 423.0000 423
## 11 weight numeric hist .all NA ▃▅▅▇▃▇▂▂
## 12 feed factor missing .all 0.0000 0
## 13 feed factor complete .all 71.0000 71
## 14 feed factor n .all 71.0000 71
## 15 feed factor n_unique .all 6.0000 6
## 16 feed factor top_counts soybean 14.0000 soy: 14
## 17 feed factor top_counts casein 12.0000 cas: 12
## 18 feed factor top_counts linseed 12.0000 lin: 12
## 19 feed factor top_counts sunflower 12.0000 sun: 12
## 20 feed factor top_counts meatmeal 11.0000 mea: 11
## 21 feed factor top_counts horsebean 10.0000 hor: 10
## 22 feed factor top_counts <NA> 0.0000 NA: 0
## 23 feed factor ordered .all 0.0000 FALSE
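Because the long format stores one statistic per row, ordinary data frame tools apply to it. The sketch below does not call skimr: it builds a small data frame by hand that imitates the column layout printed above (the values are copied from the weight rows), then filters it and looks up a statistic with base R. The names `long` and `stats` are chosen for illustration.

```r
# Rows copied from the printed chickwts output; this data frame only
# imitates the column layout of the long skim_df and is built by hand.
long <- data.frame(
  variable = "weight",
  type     = "numeric",
  stat     = c("missing", "complete", "n", "mean", "sd"),
  level    = ".all",
  value    = c(0, 71, 71, 261.3099, 78.0737),
  stringsAsFactors = FALSE
)

# Keep only selected statistics, as one might with dplyr::filter()
subset(long, stat %in% c("mean", "sd"))

# Turn the stat/value pairs into a named vector for quick lookup
stats <- setNames(long$value, long$stat)
stats[["mean"]]
```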
Compute on the full skim_df object
skim(mtcars) %>% dplyr::filter(stat=="hist")
## # A tibble: 11 x 6
## variable type stat level value formatted
## <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 mpg numeric hist .all NA ▃▇▇▇▃▂▂▂
## 2 cyl numeric hist .all NA ▆▁▁▃▁▁▁▇
## 3 disp numeric hist .all NA ▇▆▁▂▅▃▁▂
## 4 hp numeric hist .all NA ▃▇▃▅▂▃▁▁
## 5 drat numeric hist .all NA ▃▇▁▅▇▂▁▁
## 6 wt numeric hist .all NA ▃▃▃▇▆▁▁▂
## 7 qsec numeric hist .all NA ▃▂▇▆▃▃▁▁
## 8 vs numeric hist .all NA ▇▁▁▁▁▁▁▆
## 9 am numeric hist .all NA ▇▁▁▁▁▁▁▆
## 10 gear numeric hist .all NA ▇▁▁▆▁▁▁▂
## 11 carb numeric hist .all NA ▆▇▂▇▁▁▁▁
Customizing skimr
Although skimr provides opinionated defaults, it is highly customizable. Users can specify their own statistics, change the formatting of results, create statistics for new classes and develop skimmers for data structures that are not data frames.
Specify your own statistics and classes
Users can specify their own statistics using a list combined with the skim_with() function. This can support any named class found in your data.
funs <- list(
iqr = IQR,
quantile = purrr::partial(quantile, probs = .99)
)
skim_with(numeric = funs, append = FALSE)
skim(iris, Sepal.Length)
## Skim summary statistics
## n obs: 150
## n variables: 5
##
## ── Variable type:numeric ────────────────────────────────────────────────────────────
## variable iqr quantile
## Sepal.Length 1.3 7.7
# Restore defaults
skim_with_defaults()
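To see what the two functions in the list compute, independent of skimr, they can be applied directly to the column. In this sketch an anonymous function stands in for `purrr::partial(quantile, probs = .99)` so that only base R is needed; the name `custom_stats` is chosen for illustration.

```r
# Base R stand-ins for the two custom statistics; the anonymous function
# plays the role of purrr::partial(quantile, probs = .99).
funs <- list(
  iqr      = IQR,
  quantile = function(x) quantile(x, probs = .99, names = FALSE)
)

# Apply each function to the column and collect the results by name
custom_stats <- vapply(funs, function(f) f(iris$Sepal.Length), numeric(1))
round(custom_stats, 1)
```

Each function takes a vector and returns a single value, which is the shape a list of statistics like the one above needs.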
Change formatting
skimr provides a set of default formats that allow decimals in columns to be aligned, a reasonable number of decimal places for numeric data, and a representation of dates. Users can view these with show_formats() and modify them with skim_format().
Skimming other objects
Procedures for developing skim functions for other objects are described in the vignette Supporting additional objects.
Limitations of current version
We are aware that there are issues with rendering the inline histograms and line charts in various contexts, some of which are described below.
Support for spark histograms
There are known issues with printing the spark-histogram characters when printing a data frame. For example, "▂▅▇" is printed as "<U+2582><U+2585><U+2587>". This longstanding problem originates in the low-level code for printing data frames. While some cases have been addressed, there are, for example, reports of this issue in Emacs ESS.
This means that while skimr can render the histograms to the console and in kable(), it cannot in other circumstances. This includes:
- rendering a skimr data frame within pander()
- converting a skimr data frame to a vanilla R data frame (tibbles, however, render correctly)
One workaround for showing these characters in Windows is to set the CTYPE part of your locale to Chinese/Japanese/Korean with Sys.setlocale("LC_CTYPE", "Chinese"). These values do show up by default when printing a data frame created by skim() as a list (as.list()) or as a matrix (as.matrix()).
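To check how a given session handles these characters, it can help to build a spark-style string by hand and print it. The sketch below is not skimr's implementation: `spark_hist` is a hypothetical helper that maps equal-width bin counts onto the Unicode block elements U+2581 to U+2588, and it assumes the input has at least two distinct values.

```r
# Hypothetical helper (not skimr's implementation): build an 8-bin
# histogram string from the Unicode block elements U+2581..U+2588.
spark_hist <- function(x, bins = 8) {
  x <- x[!is.na(x)]
  bars <- intToUtf8(0x2581:0x2588, multiple = TRUE)
  # Equal-width bins spanning the range of the data (assumes min < max)
  breaks <- seq(min(x), max(x), length.out = bins + 1)
  counts <- hist(x, breaks = breaks, plot = FALSE)$counts
  # Scale counts to bar heights 1..8; empty bins get the lowest bar
  height <- pmax(ceiling(counts / max(counts) * length(bars)), 1)
  paste(bars[height], collapse = "")
}

spark <- spark_hist(chickwts$weight)

# TRUE when the session uses a UTF-8 locale
l10n_info()[["UTF-8"]]

# cat() writes the string directly, bypassing data frame printing
cat(spark, "\n")
```

If `cat()` shows the bars but printing a data frame that contains `spark` shows `<U+...>` escapes, the problem lies in data frame printing as described above rather than in the font or terminal.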
Printing spark histograms and line graphs in knitted documents
Spark-bar and spark-line work in the console, but may not work when you knit them to a specific document format. The same session that produces a correctly rendered HTML document may produce an incorrectly rendered PDF, for example. This issue can generally be addressed by changing to a font with good building block (for histograms) and Braille (for line graphs) support. For example, the open font "DejaVu Sans" from the extrafont package supports these. You may also want to try wrapping your results in knitr::kable(). Please see the vignette on using fonts for details.
Displays in documents of different types will vary. For example, one user found that the font "Yu Gothic UI Semilight" produced consistent results for Microsoft Word and LibreOffice Writer.
Contributing
We welcome issue reports and pull requests, including potentially adding support for commonly used variable classes. However, in general, we encourage users to take advantage of skimr's flexibility to add their own customized classes. Please see the contributing and conduct documents.