tidymodels/recipes

Remove duplicate non-numeric columns

EmilHvitfeldt opened this issue · 0 comments

In general we don't want perfectly identical features in our model, because 1) a duplicate adds no information, and 2) it can break some model fits, since duplicated columns are exact linear combinations of each other.

Right now we only have step_lincomb() for exact linear combinations in numeric data, and step_corr() for numeric columns that are highly correlated. We don't have anything for non-numeric columns.

In the example below the duplicated factor has identical levels, but the problem would remain even if the levels differed.

library(recipes)

data(ames, package = "modeldata")

ames <- ames[c(2, 3)]
ames$MS_Zoning_copy <- ames$MS_Zoning
ames$Lot_Frontage_copy <- ames$Lot_Frontage

ames
#> # A tibble: 2,930 × 4
#>    MS_Zoning                Lot_Frontage MS_Zoning_copy        Lot_Frontage_copy
#>    <fct>                           <dbl> <fct>                             <dbl>
#>  1 Residential_Low_Density           141 Residential_Low_Dens…               141
#>  2 Residential_High_Density           80 Residential_High_Den…                80
#>  3 Residential_Low_Density            81 Residential_Low_Dens…                81
#>  4 Residential_Low_Density            93 Residential_Low_Dens…                93
#>  5 Residential_Low_Density            74 Residential_Low_Dens…                74
#>  6 Residential_Low_Density            78 Residential_Low_Dens…                78
#>  7 Residential_Low_Density            41 Residential_Low_Dens…                41
#>  8 Residential_Low_Density            43 Residential_Low_Dens…                43
#>  9 Residential_Low_Density            39 Residential_Low_Dens…                39
#> 10 Residential_Low_Density            60 Residential_Low_Dens…                60
#> # ℹ 2,920 more rows

recipe(~ ., data = ames) |>
  step_corr(all_numeric_predictors()) |>
  prep() |>
  bake(NULL)
#> # A tibble: 2,930 × 3
#>    MS_Zoning                MS_Zoning_copy           Lot_Frontage_copy
#>    <fct>                    <fct>                                <dbl>
#>  1 Residential_Low_Density  Residential_Low_Density                141
#>  2 Residential_High_Density Residential_High_Density                80
#>  3 Residential_Low_Density  Residential_Low_Density                 81
#>  4 Residential_Low_Density  Residential_Low_Density                 93
#>  5 Residential_Low_Density  Residential_Low_Density                 74
#>  6 Residential_Low_Density  Residential_Low_Density                 78
#>  7 Residential_Low_Density  Residential_Low_Density                 41
#>  8 Residential_Low_Density  Residential_Low_Density                 43
#>  9 Residential_Low_Density  Residential_Low_Density                 39
#> 10 Residential_Low_Density  Residential_Low_Density                 60
#> # ℹ 2,920 more rows

Created on 2024-08-07 with reprex v2.1.0
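
As a rough illustration of what detecting such duplicates could involve, here is a minimal base R sketch. `find_duplicate_nominal()` and the `toy` data are hypothetical and not part of recipes. The sketch assumes that two non-numeric columns count as duplicates when they split the rows into the same groups, regardless of how the levels are labelled, which is one way to read the "even if they are not" case above.

```r
# Hypothetical helper, not a recipes function: return the names of
# non-numeric columns that duplicate an earlier non-numeric column.
find_duplicate_nominal <- function(data) {
  nominal <- names(data)[!vapply(data, is.numeric, logical(1))]

  # Recode every column by order of first appearance, so a relabelled
  # copy ("u", "v", "u") gets the same integer codes as ("x", "y", "x").
  codes <- lapply(data[nominal], function(x) {
    x <- as.character(x)
    match(x, unique(x))
  })

  # duplicated() on a list flags elements identical to an earlier one.
  nominal[duplicated(codes)]
}

# Made-up data for illustration only.
toy <- data.frame(
  a            = factor(c("x", "y", "x", "z")),
  a_copy       = factor(c("x", "y", "x", "z")),
  a_relabelled = factor(c("u", "v", "u", "w")),
  b            = factor(c("x", "x", "y", "z")),
  n            = c(1, 2, 3, 4)
)

find_duplicate_nominal(toy)
# returns c("a_copy", "a_relabelled")
```

The returned names could then, for example, be dropped by hand with step_rm(). Whether a dedicated step should treat relabelled copies as duplicates, or only exact copies, is a design choice this sketch does not settle.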