Remove duplicate non-numeric columns
EmilHvitfeldt opened this issue · 0 comments
In general we don't want perfectly identical features in our model, because 1) the duplicate doesn't provide any value, and 2) it makes the columns linearly dependent (one column is an exact linear combination of another), which breaks some model fits.
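To make point 2 concrete, here is a minimal base R sketch. The toy data frame and its column names are made up for illustration. It dummy-encodes a factor together with an exact copy and compares the number of columns in the design matrix with its rank.

```r
# Toy data (hypothetical): one factor and an exact copy of it.
df <- data.frame(zone = factor(c("a", "b", "a", "c", "b", "c")))
df$zone_copy <- df$zone

# Dummy-encode both columns with treatment contrasts.
X <- model.matrix(~ zone + zone_copy, data = df)

n_cols <- ncol(X)   # intercept + 2 dummies for zone + 2 dummies for zone_copy
x_rank <- qr(X)$rank

# The rank is lower than the number of columns, so the design matrix is
# rank-deficient: the zone_copy dummies repeat the zone dummies exactly.
n_cols
x_rank
```

Here the design matrix has 5 columns but rank 3, which is the kind of exact linear dependence that some model fits cannot handle.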
Right now we only have step_lincomb() for numeric data, and step_corr() for numeric data that is close to identical (highly correlated). We don't have anything for non-numeric columns.
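In the example below the levels of the copied factor are identical to the original's, but even if the levels were named differently, a column that maps one-to-one onto another would still be an issue.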
library(recipes)
data(ames, package = "modeldata")
ames <- ames[c(2, 3)]
ames$MS_Zoning_copy <- ames$MS_Zoning
ames$Lot_Frontage_copy <- ames$Lot_Frontage
ames
#> # A tibble: 2,930 × 4
#> MS_Zoning Lot_Frontage MS_Zoning_copy Lot_Frontage_copy
#> <fct> <dbl> <fct> <dbl>
#> 1 Residential_Low_Density 141 Residential_Low_Dens… 141
#> 2 Residential_High_Density 80 Residential_High_Den… 80
#> 3 Residential_Low_Density 81 Residential_Low_Dens… 81
#> 4 Residential_Low_Density 93 Residential_Low_Dens… 93
#> 5 Residential_Low_Density 74 Residential_Low_Dens… 74
#> 6 Residential_Low_Density 78 Residential_Low_Dens… 78
#> 7 Residential_Low_Density 41 Residential_Low_Dens… 41
#> 8 Residential_Low_Density 43 Residential_Low_Dens… 43
#> 9 Residential_Low_Density 39 Residential_Low_Dens… 39
#> 10 Residential_Low_Density 60 Residential_Low_Dens… 60
#> # ℹ 2,920 more rows
recipe(~ ., data = ames) |>
  step_corr(all_numeric_predictors()) |>
  prep() |>
  bake(NULL)
#> # A tibble: 2,930 × 3
#> MS_Zoning MS_Zoning_copy Lot_Frontage_copy
#> <fct> <fct> <dbl>
#> 1 Residential_Low_Density Residential_Low_Density 141
#> 2 Residential_High_Density Residential_High_Density 80
#> 3 Residential_Low_Density Residential_Low_Density 81
#> 4 Residential_Low_Density Residential_Low_Density 93
#> 5 Residential_Low_Density Residential_Low_Density 74
#> 6 Residential_Low_Density Residential_Low_Density 78
#> 7 Residential_Low_Density Residential_Low_Density 41
#> 8 Residential_Low_Density Residential_Low_Density 43
#> 9 Residential_Low_Density Residential_Low_Density 39
#> 10 Residential_Low_Density Residential_Low_Density 60
#> # ℹ 2,920 more rows
Created on 2024-08-07 with reprex v2.1.0
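As a rough idea of what the detection could look like, here is a sketch in base R. find_duplicate_nominal() is a hypothetical helper, not an existing recipes function, and the toy data is made up. It recodes each non-numeric column by order of first appearance, so two columns that group the rows identically get the same integer codes even when their levels are named differently.

```r
# Hypothetical helper, not part of recipes: returns the names of non-numeric
# columns that duplicate an earlier column, up to a relabelling of the levels.
find_duplicate_nominal <- function(data) {
  is_nominal <- vapply(
    data,
    function(x) is.factor(x) || is.character(x),
    logical(1)
  )
  nominal <- data[is_nominal]

  # Recode every column by order of first appearance. Columns that induce the
  # same grouping of rows end up with identical integer vectors.
  codes <- lapply(nominal, function(x) match(x, unique(x)))

  # duplicated() on a list flags elements identical to an earlier element.
  names(nominal)[duplicated(codes)]
}

# Toy data (hypothetical column names).
df <- data.frame(
  zone            = factor(c("a", "b", "a", "c")),
  zone_copy       = factor(c("a", "b", "a", "c")),
  zone_relabelled = factor(c("x", "y", "x", "z")),
  other           = factor(c("a", "a", "b", "c")),
  size            = c(1, 2, 3, 4)
)

dups <- find_duplicate_nominal(df)
dups
```

This flags zone_copy and zone_relabelled and leaves zone, other, and the numeric size column alone. The first column of each duplicate group is the one that is kept, which is an arbitrary choice made for this sketch. A real step would presumably compute the names in prep() and drop them in bake().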