Akshi Chaudhary
: akshi8Tina Qian
: TinaQian2017Xinbin Huang
: xinbinhuang
- Date : March 18, 2018
- Release : v4
The tidyplusR
package is an essential data cleaning package with features like missing value treatment, data manipulation and displaying data as markdown table for documents. The package adds a few additional functionality on the existing data wrangling packages in popular statistical software like R. The objective of this package is to provide a few specific functions to solve some of the pressing issues in data cleaning.
You can install tidyplusR from github with:
# install.packages("devtools")
devtools::install_github("UBC-MDS/tidyplusR")
Three main parts include different functions in tidyplusR
-
Data Type Cleansing :
-
typemix
* The function helps to find the columns containing different types of data, like character and numeric. The input of the function is a data frame, and the output of the function will be a list of 3 data frames reporting details about the mixture of data types. The first data frame in the list is the same as the input data frame, the second one tells you the location and types of data in the columns where there is type mixture. The third data frame is a summary of the second data frame. -
cleanmix
* The function helps to clean our data frame. After knowing the location of discrepancy of data types, one can use this function to keep a type of data in certain columns. * Here, the input will be the output bytypemix
function, name of the column (a vector of the name of columns) that they want to clean, the type of data they want to work on, and if we want to keep or delete the certain type. The output will be a data frame like the original type but with specified data type in certain columns deleted.
-
-
Missing Value Treatment : Basic Imputation using
impute
-
Imputation: replace missing values in a column of a dataframe, or multiple columns of dataframe based on the
method
of imputation -
(Method = 'Mean')
replace using mean -
(Method = 'Median')
replace using median -
(Method = 'Mode')
replace using mode
-
-
Markdown Table:
md_new()
: This function creates a bare bone for generating a markdown table. Alignments, and size of the table can be input by users. - Input: the size of table (number of rows and number of columns) - Output: a character vector of the source code.md_data()
: This function converts a dataframe or matrix into a markdown table format. - Input: a matrix or dataframe - Output: a character vector of the source code.
This is a basic example which shows you how to solve a common problem:
The section has two functions, typemix and cleanmix.
-
The input for typemix function is a
data frame
, and the output is a list of 3 data frames. The first one is the same as the input data frame, the second one tells you the location and types of data in the columns where there is type mixture. The third data frame is a summary of the second data frame. -
The input for cleanmix function is the result from typemix function, the column(s) you want to work on, the type(s) of data you want to keep/delete, and if you want to keep/delete the instances specified.
library(tidyplusR)
dat<-data.frame(x1=c(1,2,3,"1.2.3"),
x2=c("test","test",1,TRUE),
x3=c(TRUE,TRUE,FALSE,FALSE))
typemix(dat) #
## [[1]]
## x1 x2 x3
## 1 1 test TRUE
## 2 2 test TRUE
## 3 3 1 FALSE
## 4 1.2.3 TRUE FALSE
##
## [[2]]
## x1 x2 x3
## 1 number character NA
## 2 number character NA
## 3 number number NA
## 4 character logical NA
##
## [[3]]
## Column_ID number character logical
## 1 1 3 1 0
## 2 2 1 2 1
cleanmix(typemix(dat),column=c(1,2),type=c("number","character"))
## x1 x2 x3
## 1 1 test TRUE
## 2 2 test TRUE
## 3 3 <NA> FALSE
## 4 <NA> <NA> FALSE
- This function requires a
dataframe
as an input for missing value treatment using mean/median/mode
library(tidyverse)
## ── Attaching packages ────────
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.8.0 ✔ stringr 1.3.0
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ─────────────────
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
# Dummy dataframe
dat <- data.frame(x=sample(letters[1:3],20,TRUE),
y=sample(letters[1:3],20,TRUE),
w=as.numeric(sample(0:50,20,TRUE)),
z=sample(letters[1:3],20,TRUE),
b = as.logical(sample(0:1,20,TRUE)),
a=sample(0:100,20,TRUE),
stringsAsFactors=FALSE)
dat[c(5,10,15),1] <- NA
dat[c(3,7),2] <- NA
dat[c(1,3,5),3] <- NA
dat[c(4,5,9),4] <- NA
dat[c(4,5,9),5] <- NA
dat[,4] <- factor(dat[,4] )
dat[c(4,5,9),6] <- NA
# Calling impute function
# method can be replaced by median and mean as well
impute(dat,method = "mode") %>% head()
## x y w z b a
## 1 a b 34.6 a TRUE 40.00
## 2 b c 33.0 c FALSE 1.00
## 3 c c 34.6 a TRUE 38.00
## 4 c c 15.0 a FALSE 23.53
## 5 b b 34.6 a FALSE 23.53
## 6 c c 22.0 b FALSE 37.00
md_new()
can create an empty markdown table by specifying the number of columns and number of rows.
## default: ncol = 2 and nrow = 2, alignment = "l"
md_new()
##
## | | |
## |:---|:---|
## | | |
## | | |
## 3 by 3 table
md_new(nrow = 3, ncol = 3)
##
## | | | |
## |:---|:---|:---|
## | | | |
## | | | |
## | | | |
## different alignments:
md_new(nrow = 1, align = "c")
##
## | | |
## |:--:|:--:|
## | | |
md_new(nrow = 1, align = "r")
##
## | | |
## |---:|---:|
## | | |
## providing header
h <- c("foo", "boo")
md_new(header = h)
##
## | foo| boo|
## |:---|:---|
## | | |
## | | |
md_data()
can create an markdown table given input as matrix of data frame.
md_data(mtcars, row.index = 1:3, col.index = 1:4)
##
## | |mpg|cyl|disp|hp|
## |:---|---:|---:|---:|---:|
## |Mazda RX4|21.0|6|160|110|
## |Mazda RX4 Wag|21.0|6|160|110|
## |Datsun 710|22.8|4|108|93|
## alignment to right
md_data(mtcars, row.index = 1:3, col.index = 1:4, align = "r")
##
## | |mpg|cyl|disp|hp|
## |:---|---:|---:|---:|---:|
## |Mazda RX4|21.0|6|160|110|
## |Mazda RX4 Wag|21.0|6|160|110|
## |Datsun 710|22.8|4|108|93|
## provide header
md_data(mtcars, row.index = 1:3, col.index = 1:4, header = c("a","b","c","d"))
##
## | |a|b|c|d|
## |:---|---:|---:|---:|---:|
## |Mazda RX4|21.0|6|160|110|
## |Mazda RX4 Wag|21.0|6|160|110|
## |Datsun 710|22.8|4|108|93|
## not include row names
md_data(mtcars, row.index = 1:3, col.index = 1:4, row.names = F)
##
## |mpg|cyl|disp|hp|
## |---:|---:|---:|---:|
## |21|6|160|110|
## |21|6|160|110|
## |22.8|4|108|93|
- Users can use the package when they want to clean and wrangle their data. For example, if the data has not been cleaned yet, users can use function
typemix
to check where data is not clean and usecleanmix
to clean data. Based on personal work experience, the mix of number and character is usually seen in the data collected from the survey. After clean data is ready, one can use theimpute()
to deal with missing data with mean/median/mode. The resulting data frame can be output to markdown syntax withmd_data()
. Your can also usemd_new()
to create a empty markdown table.
-
Data Type Cleansing
-
R does not have functions that can explicitly perform string processing/ data type conversion without affecting the overall column type.
-
Missing Value treatment
-
R doesn't have imputation methods which use
Mode
for missing value treatment, which can be useful for categorical and numeric variables MICE package in R do provide limited imputation using mean, median, etc. -
Markdown table in R
-
R has library
Kable
which can output a dataset in the form of a markdown table but withtidyplus
user will have more freedom with data types and formatting.
This is an open source project. Please follow the guidelines below for contribution. - Open an issue for any feedback and suggestions. - For contributing to the project, please refer to Contributing for details.