step_pca + prep changes non-predictor column names when the names end with ... followed by a number
ilaria-kode opened this issue · 1 comments
The problem
When prepping a recipe that includes a step_pca step, if some of the columns in the dataset (not used by the PCA) have names that end with the "...[:digit:]" pattern (which is what you usually get, for example, when loading the dataset using .name_repair = "unique"), their names are changed after the prep function runs.
I have noticed that this only happens if the number in an affected column's name is not "aligned" with the column's position in the dataframe (i.e. the column is named foo...6 but is in position 1 in the resulting dataframe, see example).
Is there a way to change this behaviour and force the recipe to keep the column names untouched?
Reproducible example
library(tidyverse)
# when the columns affected by name repair have names that are aligned with
# their position in the dataset, the names are kept the same
sample_data <- tibble::tibble(
  x1 = runif(10),
  x2 = runif(10),
  x3 = runif(10),
  x4 = runif(10),
  foo = runif(10),
  foo = runif(10),
  .name_repair = "unique"
)
#> New names:
#> • `foo` -> `foo...5`
#> • `foo` -> `foo...6`
sample_data
#> # A tibble: 10 × 6
#> x1 x2 x3 x4 foo...5 foo...6
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.0328 0.796 0.0740 0.543 0.812 0.000651
#> 2 0.313 0.752 0.837 0.428 0.803 0.942
#> 3 0.451 0.758 0.864 0.991 0.737 0.403
#> 4 0.0677 0.636 0.937 0.758 0.826 0.787
#> 5 0.103 0.852 0.682 0.801 0.314 0.530
#> 6 0.522 0.120 0.233 0.708 0.650 0.266
#> 7 0.110 0.737 0.605 0.389 0.617 0.356
#> 8 0.199 0.471 0.684 0.735 0.664 0.324
#> 9 0.659 0.106 0.536 0.555 0.818 0.347
#> 10 0.830 0.996 0.669 0.366 0.881 0.315
# expected behaviour, colnames are retained
rec <- recipes::recipe(sample_data, formula = ~.) %>%
  recipes::update_role(contains("foo"), new_role = "info") %>%
  recipes::step_pca(
    num_comp = 2,
    recipes::all_numeric_predictors()
  ) %>%
  recipes::prep(strings_as_factors = FALSE)
rec$template
#> # A tibble: 10 × 4
#> foo...5 foo...6 PC1 PC2
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.812 0.000651 -0.792 0.344
#> 2 0.803 0.942 -1.21 0.104
#> 3 0.737 0.403 -1.57 -0.108
#> 4 0.826 0.787 -1.31 0.160
#> 5 0.314 0.530 -1.32 0.260
#> 6 0.650 0.266 -0.728 -0.475
#> 7 0.617 0.356 -0.994 0.266
#> 8 0.664 0.324 -1.10 -0.0298
#> 9 0.818 0.347 -0.845 -0.569
#> 10 0.881 0.315 -1.36 -0.138
# when the columns affected by name repair have names that are not
# aligned with their position in the dataset,
# the names are changed after step_pca + prep
sample_data <- tibble::tibble(
  foo...10 = runif(10), # forcing different numbering
  foo...11 = runif(10),
  x1 = runif(10),
  x2 = runif(10),
  x3 = runif(10),
  x4 = runif(10)
)
sample_data
#> # A tibble: 10 × 6
#> foo...10 foo...11 x1 x2 x3 x4
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.438 0.846 0.258 0.768 0.131 0.885
#> 2 0.816 0.255 0.0259 0.492 0.784 0.0304
#> 3 0.271 0.760 0.861 0.00659 0.00730 0.838
#> 4 0.788 0.696 0.601 0.845 0.283 0.587
#> 5 0.286 0.531 0.500 0.676 0.582 0.0797
#> 6 0.289 0.553 0.417 0.258 0.682 0.922
#> 7 0.239 0.322 0.423 0.937 0.338 0.245
#> 8 0.531 0.210 0.826 0.139 0.162 0.522
#> 9 0.227 0.136 0.199 0.922 0.515 0.434
#> 10 0.217 0.107 0.0141 0.0988 0.116 0.0111
rec <- recipes::recipe(sample_data, formula = ~.) %>%
  recipes::update_role(contains("foo"), new_role = "info") %>%
  recipes::step_pca(
    num_comp = 2,
    recipes::all_numeric_predictors()
  ) %>%
  recipes::prep(strings_as_factors = FALSE)
#> New names:
#> • `foo...10` -> `foo...1`
#> • `foo...11` -> `foo...2`
rec
#>
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> predictor: 4
#> info: 2
#>
#> ── Training information
#> Training data contained 10 data points and no incomplete rows.
#>
#> ── Operations
#> • PCA extraction with: x1, x2, x3, x4 | Trained
rec$template
#> # A tibble: 10 × 4
#> foo...1 foo...2 PC1 PC2
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.438 0.846 -1.10 -0.0898
#> 2 0.816 0.255 -0.619 0.575
#> 3 0.271 0.760 -0.854 -0.838
#> 4 0.788 0.696 -1.20 0.0149
#> 5 0.286 0.531 -0.897 0.354
#> 6 0.289 0.553 -1.10 -0.257
#> 7 0.239 0.322 -1.01 0.355
#> 8 0.531 0.210 -0.806 -0.514
#> 9 0.227 0.136 -1.07 0.422
#> 10 0.217 0.107 -0.115 0.0920
Created on 2024-07-09 with reprex v2.1.1
Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.3.1 (2023-06-16 ucrt)
#> os Windows 10 x64 (build 19045)
#> system x86_64, mingw32
#> ui RTerm
#> language EN
#> collate Italian_Italy.utf8
#> ctype Italian_Italy.utf8
#> tz Europe/Rome
#> date 2024-07-09
#> pandoc 3.1.11 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> class 7.3-22 2023-05-03 [2] CRAN (R 4.3.1)
#> cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.1)
#> codetools 0.2-19 2023-02-01 [2] CRAN (R 4.3.1)
#> colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.1)
#> data.table 1.14.8 2023-02-17 [1] CRAN (R 4.3.1)
#> digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.1)
#> dplyr * 1.1.3 2023-09-03 [1] CRAN (R 4.3.1)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.3.1)
#> evaluate 0.22 2023-09-29 [1] CRAN (R 4.3.1)
#> fansi 1.0.5 2023-10-08 [1] CRAN (R 4.3.1)
#> fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.1)
#> forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.1)
#> fs 1.6.3 2023-07-20 [1] CRAN (R 4.3.1)
#> future 1.33.0 2023-07-01 [1] CRAN (R 4.3.1)
#> future.apply 1.11.0 2023-05-21 [1] CRAN (R 4.3.1)
#> generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.1)
#> ggplot2 * 3.4.4 2023-10-12 [1] CRAN (R 4.3.1)
#> globals 0.16.2 2022-11-21 [1] CRAN (R 4.3.0)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.1)
#> gower 1.0.1 2022-12-22 [1] CRAN (R 4.3.0)
#> gtable 0.3.4 2023-08-21 [1] CRAN (R 4.3.1)
#> hardhat 1.3.0 2023-03-30 [1] CRAN (R 4.3.1)
#> hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.1)
#> htmltools 0.5.6.1 2023-10-06 [1] CRAN (R 4.3.1)
#> ipred 0.9-14 2023-03-09 [1] CRAN (R 4.3.1)
#> knitr 1.44 2023-09-11 [1] CRAN (R 4.3.1)
#> lattice 0.21-8 2023-04-05 [2] CRAN (R 4.3.1)
#> lava 1.7.2.1 2023-02-27 [1] CRAN (R 4.3.1)
#> lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.1)
#> listenv 0.9.0 2022-12-16 [1] CRAN (R 4.3.1)
#> lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.3.1)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.1)
#> MASS 7.3-60 2023-05-04 [2] CRAN (R 4.3.1)
#> Matrix 1.5-4.1 2023-05-18 [2] CRAN (R 4.3.1)
#> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.1)
#> nnet 7.3-19 2023-05-03 [2] CRAN (R 4.3.1)
#> parallelly 1.36.0 2023-05-26 [1] CRAN (R 4.3.0)
#> pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.1)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.1)
#> prodlim 2023.08.28 2023-08-28 [1] CRAN (R 4.3.1)
#> purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.1)
#> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.3.3)
#> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.3.3)
#> R.oo 1.26.0 2024-01-24 [1] CRAN (R 4.3.3)
#> R.utils 2.12.3 2023-11-18 [1] CRAN (R 4.3.3)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.1)
#> Rcpp 1.0.11 2023-07-06 [1] CRAN (R 4.3.1)
#> readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.1)
#> recipes 1.0.8 2023-08-25 [1] CRAN (R 4.3.1)
#> reprex 2.1.1 2024-07-06 [1] CRAN (R 4.3.3)
#> rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.1)
#> rmarkdown 2.25 2023-09-18 [1] CRAN (R 4.3.1)
#> rpart 4.1.19 2022-10-21 [2] CRAN (R 4.3.1)
#> rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.1)
#> scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.1)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.1)
#> stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)
#> stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.1)
#> styler 1.10.3 2024-04-07 [1] CRAN (R 4.3.3)
#> survival 3.5-5 2023-03-12 [2] CRAN (R 4.3.1)
#> tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.1)
#> tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.1)
#> tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.1)
#> tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.1)
#> timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.1)
#> timeDate 4022.108 2023-01-07 [1] CRAN (R 4.3.0)
#> tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.1)
#> utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.1)
#> vctrs 0.6.4 2023-10-12 [1] CRAN (R 4.3.1)
#> withr 2.5.1 2023-09-26 [1] CRAN (R 4.3.1)
#> xfun 0.40 2023-08-09 [1] CRAN (R 4.3.1)
#> yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)
#>
#> [1] C:/Users/ilari/AppData/Local/R/win-library/4.3
#> [2] C:/Program Files/R/R-4.3.1/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
Hello @ilaria-kode 👋
Thanks for filing this bug report. This does appear to be a bug.
In step_pca() we call vctrs::vec_cbind() on the data, which is where this issue comes from.
sample_data <- tibble::tibble(
  foo...10 = runif(10),
  foo...11 = runif(10),
  x1 = runif(10),
  x2 = runif(10),
  x3 = runif(10),
  x4 = runif(10)
)
vctrs::vec_cbind(sample_data)
#> New names:
#> • `foo...10` -> `foo...1`
#> • `foo...11` -> `foo...2`
#> # A tibble: 10 × 6
#> foo...1 foo...2 x1 x2 x3 x4
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.845 0.394 0.820 0.0637 0.00586 0.165
#> 2 0.409 0.367 0.0105 0.492 0.411 0.757
#> 3 0.409 0.702 0.0760 0.829 0.917 0.780
#> 4 0.435 0.331 0.335 0.940 0.202 0.334
#> 5 0.953 0.0558 0.0857 0.395 0.434 0.129
#> 6 0.130 0.340 0.258 0.161 0.793 0.939
#> 7 0.507 0.000829 0.296 0.547 0.318 0.115
#> 8 0.293 0.00540 0.733 0.860 0.739 0.374
#> 9 0.113 0.0957 0.153 0.684 0.894 0.397
#> 10 0.590 0.737 0.724 0.955 0.329 0.301
Created on 2024-07-09 with reprex v2.1.0
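The renaming can also be reproduced on the names alone, which suggests it comes from the "unique" name repair that vec_cbind() applies by default rather than from the PCA computation. The following is a minimal sketch, assuming vctrs is installed; the small data frame is made up for illustration. It is not a statement of how recipes will or should fix this.

```r
library(vctrs)

# Column names taken from the example above.
nms <- c("foo...10", "foo...11", "x1", "x2", "x3", "x4")

# "unique" repair drops an existing `...<number>` suffix and then
# re-numbers the resulting duplicates by position.
repaired <- vec_as_names(nms, repair = "unique", quiet = TRUE)

# "minimal" repair leaves the names as they are.
untouched <- vec_as_names(nms, repair = "minimal")

# The same option is exposed by vec_cbind() as `.name_repair`.
df <- data.frame(a = 1:2, b = 3:4)
names(df) <- c("foo...10", "foo...11")
kept <- names(vec_cbind(df, .name_repair = "minimal"))

print(repaired)
print(untouched)
print(kept)
```

Here `repaired` matches the renamed columns in the output above, while `untouched` and `kept` retain the original names.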
More reading on why this is happening: r-lib/vctrs#685
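As for keeping the names untouched in the meantime: one possible workaround, not confirmed anywhere in this thread, is to rename the affected columns before building the recipe so that they no longer end in the `...<number>` pattern. The sketch below uses base R only; `protect_suffix()` is a hypothetical helper, not part of recipes or vctrs, and the data frame is made up for illustration.

```r
# Hypothetical helper: replace a trailing `...<digits>` with `_<digits>`
# so that "unique" name repair has no suffix to strip.
protect_suffix <- function(nms) {
  sub("\\.\\.\\.([0-9]+)$", "_\\1", nms)
}

sample_data <- data.frame(a = runif(3), b = runif(3), x1 = runif(3))
names(sample_data) <- c("foo...10", "foo...11", "x1")

# Rename before passing the data to recipes::recipe().
names(sample_data) <- protect_suffix(names(sample_data))
print(names(sample_data))
```

The renamed columns (`foo_10`, `foo_11`) are already unique and carry no `...<number>` suffix, so name repair should have nothing to change. If the original names matter downstream, they would need to be restored after baking.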