
dpurifyr: A grammar of data preprocessing

Primary language: R. License: MIT.

dpurifyr


Overview

The dpurifyr package offers a practical approach to data preprocessing: a consistent set of verbs that help you solve the most common data preprocessing challenges.

Installation

dpurifyr is not yet available on CRAN. Until it is, install the development version from GitHub:

# Not yet
# install.packages("dpurifyr")

# Or the development version from GitHub:
# install.packages("devtools")
devtools::install_github("teramonagi/dpurifyr")

Simple Example

The following example uses dpurifyr to solve a fairly realistic problem: applying different types of preprocessing (scale_standard and scale_minmax) to different columns (dplyr::starts_with("Sepal") and Petal.Width). Columns are selected in the same way as in other tidyverse packages.

library("dpurifyr")
df <- head(iris)
# Build a preprocessing chain; the preprocessing of df is carried out at the same time.
pp <- dpurifyr::scale_standard(df, dplyr::starts_with("Sepal")) %>% 
  dpurifyr::scale_minmax(Petal.Width) 

# The preprocessing object (pp) behaves like a data.frame
head(pp)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1    0.5206576   0.3401105          1.4           0  setosa
#> 2   -0.1735525  -1.1175060          1.4           0  setosa
#> 3   -0.8677627  -0.5344594          1.3           0  setosa
#> 4   -1.2148677  -0.8259827          1.5           0  setosa
#> 5    0.1735525   0.6316338          1.4           0  setosa
#> 6    1.5619728   1.5062037          1.7           1  setosa
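To make the two transformations concrete, here is a minimal base R sketch that computes the same values by hand. It is not dpurifyr's implementation. The helper names `standardize` and `minmax` are hypothetical, and the sketch assumes that standard scaling uses the sample standard deviation (as `sd()` does), which is consistent with the output shown above.

```r
# Base R sketch of the two transformations (not dpurifyr's own code).
df <- head(iris)

# Standard scaling: subtract the column mean, divide by the sample standard deviation.
standardize <- function(x) (x - mean(x)) / sd(x)

# Min-max scaling: map the column onto the interval [0, 1].
minmax <- function(x) (x - min(x)) / (max(x) - min(x))

manual <- df
# Select the columns whose names start with "Sepal".
sepal_cols <- grep("^Sepal", names(df), value = TRUE)
manual[sepal_cols] <- lapply(df[sepal_cols], standardize)
manual$Petal.Width <- minmax(df$Petal.Width)

head(manual)
```

Petal.Length is left untouched because no verb selected it, just as in the dpurifyr output.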

Once you have a preprocessing object, you can apply it to other data. The same preprocessing steps are then applied with their parameters fixed at the values estimated from the original data.

# You can apply the same preprocessing to different data.frame
pp <- dpurifyr::scale_standard(head(iris), Sepal.Width, Petal.Length) %>% 
  dpurifyr::scale_standard(Sepal.Length) 
# Reuse the parameters estimated in the pp object (`use_param = TRUE`).
pp2 <- dpurifyr::apply(head(iris, 10), pp, TRUE) 
pp2
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1     0.5206576  0.34011052   -0.3627381         0.2  setosa
#> 2    -0.1735525 -1.11750599   -0.3627381         0.2  setosa
#> 3    -0.8677627 -0.53445939   -1.0882144         0.2  setosa
#> 4    -1.2148677 -0.82598269    0.3627381         0.2  setosa
#> 5     0.1735525  0.63163382   -0.3627381         0.2  setosa
#> 6     1.5619728  1.50620373    1.8136906         0.4  setosa
#> 7    -1.2148677  0.04858722   -0.3627381         0.3  setosa
#> 8     0.1735525  0.04858722    0.3627381         0.2  setosa
#> 9    -1.9090779 -1.40902929   -0.3627381         0.2  setosa
#> 10   -0.1735525 -0.82598269    0.3627381         0.1  setosa
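The idea of reusing fixed parameters can be sketched in base R as follows. This only illustrates the concept and is not how dpurifyr stores or applies its parameters. The names `train`, `new_data`, `params`, and `scaled` are hypothetical.

```r
# Base R sketch: estimate scaling parameters on one data set, reuse them on another.
train <- head(iris)
new_data <- head(iris, 10)
cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length")

# Estimate the parameters once, on the original data only.
params <- list(
  center = vapply(train[cols], mean, numeric(1)),
  scale  = vapply(train[cols], sd, numeric(1))
)

# Apply the stored parameters to the new data, column by column.
scaled <- new_data
scaled[cols] <- Map(
  function(x, m, s) (x - m) / s,
  new_data[cols], params$center, params$scale
)

scaled
```

Because the mean and standard deviation come from the first six rows only, rows 7 to 10 are scaled relative to that original sample rather than to the full ten rows.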

You can find more information in vignettes.

Contribution

  • If you encounter a clear bug, please file an issue with a minimal reproducible example on GitHub.

Citation

To cite package dpurifyr in publications use:

Nagi Teramo, Shinichi Takayanagi (2017). dpurifyr: A grammar of data preprocessing. R package version 0.1.0. https://github.com/teramonagi/dpurifyr

A BibTeX entry for LaTeX users is:

@Manual{,
  title = {dpurifyr: A grammar of data preprocessing},
  author = {Nagi Teramo and Shinichi Takayanagi},
  year = {2017},
  note = {R package version 0.1.0},
  url = {https://github.com/teramonagi/dpurifyr},
}