
Clean your data frame in one readable function

sternclean aims to simplify cleaning data frames by performing multiple cleaning steps in a single function call.

For example, in a few lines of clean code you can change column types, impute NAs in one set of columns with a fixed value, impute NAs in another set of columns with a group mean, and impute infinite values in a third set of columns with another fixed value.

Here is the order of operations under the hood:

  • Change the types
  • Remove columns
  • Impute NAs
  • Impute infinites

This fixed order lets multiple cleaning processes happen in one function call.

Simple Examples

We will start with simple one-step cleaning examples. Later we will take on more complex situations.

Rickle and Mortan Dataset

people         original_person  intelligence  evil_rank
Rickle         Rickle           Inf           5
Mortan         Mortan           9             2.75
Jerry          Jerry            0.1           2
Pickle Rickle  Rickle           Inf           NA
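
To follow along, the dataset above can be rebuilt in base R. This is only a sketch: the object name `df` is an assumption, and the text columns are created as factors because the class-change example below reports a "factor" class before conversion.

```r
# Hypothetical reconstruction of the example data (the name `df` is assumed).
df <- data.frame(
  people          = c("Rickle", "Mortan", "Jerry", "Pickle Rickle"),
  original_person = c("Rickle", "Mortan", "Jerry", "Rickle"),
  intelligence    = c(Inf, 9, 0.1, Inf),
  evil_rank       = c(5, 2.75, 2, NA),
  stringsAsFactors = TRUE  # assumption: text columns start out as factors
)

# Inspect column classes and values
str(df)
```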

Class Change Parameters

#> [1] "factor"

           class_to_strng = "people")

#> [1] "character"
#> [1] "character"

           class_to_numer = "intelligence")

#> [1] "numeric"
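
For comparison, here is roughly what these class changes look like in plain base R. This is a sketch of the idea, not sternclean's internals; the data frame `df` and the choice to store `intelligence` as a factor are assumptions made for illustration.

```r
# Hypothetical data: both columns start as factors (an assumption).
df <- data.frame(
  people       = factor(c("Rickle", "Mortan", "Jerry", "Pickle Rickle")),
  intelligence = factor(c("Inf", "9", "0.1", "Inf"))
)

class(df$people)
#> [1] "factor"

# Factor to string
df$people <- as.character(df$people)

# Factor to numeric: convert to character first, because as.numeric()
# applied directly to a factor returns the internal level codes.
df$intelligence <- as.numeric(as.character(df$intelligence))

class(df$people)
#> [1] "character"
class(df$intelligence)
#> [1] "numeric"
```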

Column/Row Removal Parameters

           remove_columns = "intelligence")
people         original_person  evil_rank
Rickle         Rickle           5
Mortan         Mortan           2.75
Jerry          Jerry            2
Pickle Rickle  Rickle           NA
           remove_na_rows =  "evil_rank")
people  original_person  intelligence  evil_rank
Rickle  Rickle           Inf           5
Mortan  Mortan           9             2.75
Jerry   Jerry            0.1           2
           removeby_regex = "pe")
intelligence  evil_rank
Inf           5
9             2.75
0.1           2
Inf           NA
           remove_all_nas = TRUE)
people  original_person  intelligence  evil_rank
Rickle  Rickle           Inf           5
Mortan  Mortan           9             2.75
Jerry   Jerry            0.1           2
           remove_non_num = TRUE)
intelligence  evil_rank
Inf           5
9             2.75
0.1           2
Inf           NA
           remove_all_exc = c("people", "evil_rank"))
people         evil_rank
Rickle         5
Mortan         2.75
Jerry          2
Pickle Rickle  NA
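
The removal parameters above each correspond to a familiar base R idiom. The sketch below shows one plausible equivalent per parameter so the expected outputs can be checked by hand; it is not sternclean's implementation, and `df` plus all result names are assumptions.

```r
# Hypothetical reconstruction of the example data (names are assumed).
df <- data.frame(
  people          = c("Rickle", "Mortan", "Jerry", "Pickle Rickle"),
  original_person = c("Rickle", "Mortan", "Jerry", "Rickle"),
  intelligence    = c(Inf, 9, 0.1, Inf),
  evil_rank       = c(5, 2.75, 2, NA),
  stringsAsFactors = FALSE
)

# Drop a named column (compare remove_columns = "intelligence")
no_intel <- df[, setdiff(names(df), "intelligence"), drop = FALSE]

# Drop rows with NA in one column (compare remove_na_rows = "evil_rank")
no_na_rank <- df[!is.na(df$evil_rank), , drop = FALSE]

# Drop columns whose names match a regex (compare removeby_regex = "pe")
no_pe <- df[, !grepl("pe", names(df)), drop = FALSE]

# Drop every row containing any NA (compare remove_all_nas = TRUE)
complete_rows <- df[complete.cases(df), , drop = FALSE]

# Keep only numeric columns (compare remove_non_num = TRUE)
numeric_only <- df[, vapply(df, is.numeric, logical(1)), drop = FALSE]

# Keep only the listed columns (compare remove_all_exc = c("people", "evil_rank"))
only_two <- df[, c("people", "evil_rank"), drop = FALSE]
```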

Impute Parameters

           impute_na2mean = "evil_rank")
people         original_person  intelligence  evil_rank
Rickle         Rickle           Inf           5
Mortan         Mortan           9             2.75
Jerry          Jerry            0.1           2
Pickle Rickle  Rickle           Inf           3.25
           impute_na_cols = "evil_rank",
           impute_na_with = 1738)
people         original_person  intelligence  evil_rank
Rickle         Rickle           Inf           5
Mortan         Mortan           9             2.75
Jerry          Jerry            0.1           2
Pickle Rickle  Rickle           Inf           1738
           impute_grpmean = "evil_rank",
           impute_grpwith = "original_person")
original_person  people         intelligence  evil_rank
Jerry            Jerry          0.1           2
Mortan           Mortan         9             2.75
Rickle           Rickle         Inf           5
Rickle           Pickle Rickle  Inf           5
           impute_inf_col = "intelligence",
           impute_inf_wit = 1738)
people         original_person  intelligence  evil_rank
Rickle         Rickle           1738          5
Mortan         Mortan           9             2.75
Jerry          Jerry            0.1           2
Pickle Rickle  Rickle           1738          NA
           impute_cust_cl = "evil_rank",
           impute_cust_fn = quantile,
           probs = .25,
           na.rm = TRUE
people         original_person  intelligence  evil_rank
Rickle         Rickle           Inf           5
Mortan         Mortan           9             2.75
Jerry          Jerry            0.1           2
Pickle Rickle  Rickle           Inf           2.375
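
The imputed values in the tables above can be verified with a few lines of base R. The following is a sketch of equivalent computations, not the package's own code; `df` and the result vector names are assumptions. Note that `ave()` keeps the original row order, whereas the group-mean output above is sorted by group.

```r
# Hypothetical reconstruction of the example data (names are assumed).
df <- data.frame(
  people          = c("Rickle", "Mortan", "Jerry", "Pickle Rickle"),
  original_person = c("Rickle", "Mortan", "Jerry", "Rickle"),
  intelligence    = c(Inf, 9, 0.1, Inf),
  evil_rank       = c(5, 2.75, 2, NA),
  stringsAsFactors = FALSE
)
is_na <- is.na(df$evil_rank)

# NA -> column mean (compare impute_na2mean): (5 + 2.75 + 2) / 3 = 3.25
rank_mean <- replace(df$evil_rank, is_na, mean(df$evil_rank, na.rm = TRUE))

# NA -> fixed value (compare impute_na_cols / impute_na_with)
rank_fixed <- replace(df$evil_rank, is_na, 1738)

# NA -> mean of the row's group (compare impute_grpmean / impute_grpwith)
rank_group <- ave(
  df$evil_rank, df$original_person,
  FUN = function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
)

# Inf -> fixed value (compare impute_inf_col / impute_inf_wit)
intel_fixed <- replace(df$intelligence, is.infinite(df$intelligence), 1738)

# NA -> result of a custom function (compare impute_cust_cl / impute_cust_fn);
# quantile() uses its default type 7 interpolation here.
q25 <- unname(quantile(df$evil_rank, probs = 0.25, na.rm = TRUE))
rank_q25 <- replace(df$evil_rank, is_na, q25)
```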

More Complex Example

Here we:

  • change the people column's class to string
  • change the intelligence column's class to numeric
  • remove the original_person column
  • impute the NAs in the evil_rank column with the column's mean
  • impute the infinite values in the intelligence column with 1738
           class_to_strng = "people",
           class_to_numer = "intelligence",
           remove_columns = "original_person",
           impute_na2mean = "evil_rank",
           impute_inf_col = "intelligence",
           impute_inf_wit = 1738
people         intelligence  evil_rank
Rickle         1738          5
Mortan         9             2.75
Jerry          0.1           2
Pickle Rickle  1738          3.25
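
For readers who want to see what the combined call saves, here is a step-by-step base R sketch that follows the documented order (types, column removal, NAs, infinites) and yields the same table. It is an illustration under assumptions (the name `df`, factor text columns), not a description of how sternclean is implemented.

```r
# Hypothetical reconstruction of the example data (names are assumed).
df <- data.frame(
  people          = c("Rickle", "Mortan", "Jerry", "Pickle Rickle"),
  original_person = c("Rickle", "Mortan", "Jerry", "Rickle"),
  intelligence    = c(Inf, 9, 0.1, Inf),
  evil_rank       = c(5, 2.75, 2, NA),
  stringsAsFactors = TRUE
)

# 1. Change the types
df$people       <- as.character(df$people)
df$intelligence <- as.numeric(df$intelligence)  # already numeric here; a no-op

# 2. Remove columns
df$original_person <- NULL

# 3. Impute NAs with the column mean
df$evil_rank[is.na(df$evil_rank)] <- mean(df$evil_rank, na.rm = TRUE)

# 4. Impute infinite values with a fixed value
df$intelligence[is.infinite(df$intelligence)] <- 1738

df
```

Each step is a separate statement that has to be kept in the right order by hand, which is the bookkeeping the single-call approach is meant to remove.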

Compared to Original Data Frame

people         original_person  intelligence  evil_rank
Rickle         Rickle           Inf           5
Mortan         Mortan           9             2.75
Jerry          Jerry            0.1           2
Pickle Rickle  Rickle           Inf           NA