tidymodels/recipes

step_pca + prep changes non-predictor column names when the names end with ... followed by a number

ilaria-kode opened this issue · 1 comment

The problem

When prepping a recipe that includes a step_pca step, columns that are not used by the PCA and whose names end with the "...[:digit:]" pattern (which is what you usually get, for example, when loading a dataset with .name_repair = "unique") have their names changed by the prep function.

I have noticed that this only happens if the number in the affected column names is not "aligned" with the column's position in the data frame (i.e. the column is named foo...6 but is in position 1 in the resulting data frame; see the example).

Is there a way to change this behaviour and force the recipe to keep the column names untouched?

Reproducible example

library(tidyverse)

# when the columns affected by name repair have names that are aligned with
# their position in the dataset, the names are kept the same
sample_data <- tibble::tibble(
  x1 = runif(10),
  x2 = runif(10),
  x3 = runif(10),
  x4 = runif(10),
  foo = runif(10),
  foo = runif(10),
  .name_repair = "unique"
)
#> New names:
#> • `foo` -> `foo...5`
#> • `foo` -> `foo...6`
sample_data
#> # A tibble: 10 × 6
#>        x1    x2     x3    x4 foo...5  foo...6
#>     <dbl> <dbl>  <dbl> <dbl>   <dbl>    <dbl>
#>  1 0.0328 0.796 0.0740 0.543   0.812 0.000651
#>  2 0.313  0.752 0.837  0.428   0.803 0.942   
#>  3 0.451  0.758 0.864  0.991   0.737 0.403   
#>  4 0.0677 0.636 0.937  0.758   0.826 0.787   
#>  5 0.103  0.852 0.682  0.801   0.314 0.530   
#>  6 0.522  0.120 0.233  0.708   0.650 0.266   
#>  7 0.110  0.737 0.605  0.389   0.617 0.356   
#>  8 0.199  0.471 0.684  0.735   0.664 0.324   
#>  9 0.659  0.106 0.536  0.555   0.818 0.347   
#> 10 0.830  0.996 0.669  0.366   0.881 0.315

# expected behaviour, colnames are retained
rec <- recipes::recipe(sample_data, formula = ~.) %>%
  recipes::update_role(contains("foo"), new_role = "info") %>%
  recipes::step_pca(
    num_comp = 2,
    recipes::all_numeric_predictors()
  ) %>%
  recipes::prep(strings_as_factors = FALSE)

rec$template
#> # A tibble: 10 × 4
#>    foo...5  foo...6    PC1     PC2
#>      <dbl>    <dbl>  <dbl>   <dbl>
#>  1   0.812 0.000651 -0.792  0.344 
#>  2   0.803 0.942    -1.21   0.104 
#>  3   0.737 0.403    -1.57  -0.108 
#>  4   0.826 0.787    -1.31   0.160 
#>  5   0.314 0.530    -1.32   0.260 
#>  6   0.650 0.266    -0.728 -0.475 
#>  7   0.617 0.356    -0.994  0.266 
#>  8   0.664 0.324    -1.10  -0.0298
#>  9   0.818 0.347    -0.845 -0.569 
#> 10   0.881 0.315    -1.36  -0.138

# when the columns affected by name repair have names that are not
# aligned with their position in the dataset,
# the names are changed after step_pca + prep
sample_data <- tibble::tibble(
  foo...10 = runif(10), # forcing different numbering
  foo...11 = runif(10),
  x1 = runif(10),
  x2 = runif(10),
  x3 = runif(10),
  x4 = runif(10)
)
sample_data
#> # A tibble: 10 × 6
#>    foo...10 foo...11     x1      x2      x3     x4
#>       <dbl>    <dbl>  <dbl>   <dbl>   <dbl>  <dbl>
#>  1    0.438    0.846 0.258  0.768   0.131   0.885 
#>  2    0.816    0.255 0.0259 0.492   0.784   0.0304
#>  3    0.271    0.760 0.861  0.00659 0.00730 0.838 
#>  4    0.788    0.696 0.601  0.845   0.283   0.587 
#>  5    0.286    0.531 0.500  0.676   0.582   0.0797
#>  6    0.289    0.553 0.417  0.258   0.682   0.922 
#>  7    0.239    0.322 0.423  0.937   0.338   0.245 
#>  8    0.531    0.210 0.826  0.139   0.162   0.522 
#>  9    0.227    0.136 0.199  0.922   0.515   0.434 
#> 10    0.217    0.107 0.0141 0.0988  0.116   0.0111

rec <- recipes::recipe(sample_data, formula = ~.) %>%
  recipes::update_role(contains("foo"), new_role = "info") %>%
  recipes::step_pca(
    num_comp = 2,
    recipes::all_numeric_predictors()
  ) %>%
  recipes::prep(strings_as_factors = FALSE)
#> New names:
#> • `foo...10` -> `foo...1`
#> • `foo...11` -> `foo...2`
rec
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs 
#> Number of variables by role
#> predictor: 4
#> info:      2
#> 
#> ── Training information 
#> Training data contained 10 data points and no incomplete rows.
#> 
#> ── Operations 
#> • PCA extraction with: x1, x2, x3, x4 | Trained
rec$template
#> # A tibble: 10 × 4
#>    foo...1 foo...2    PC1     PC2
#>      <dbl>   <dbl>  <dbl>   <dbl>
#>  1   0.438   0.846 -1.10  -0.0898
#>  2   0.816   0.255 -0.619  0.575 
#>  3   0.271   0.760 -0.854 -0.838 
#>  4   0.788   0.696 -1.20   0.0149
#>  5   0.286   0.531 -0.897  0.354 
#>  6   0.289   0.553 -1.10  -0.257 
#>  7   0.239   0.322 -1.01   0.355 
#>  8   0.531   0.210 -0.806 -0.514 
#>  9   0.227   0.136 -1.07   0.422 
#> 10   0.217   0.107 -0.115  0.0920

Created on 2024-07-09 with reprex v2.1.1

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.1 (2023-06-16 ucrt)
#>  os       Windows 10 x64 (build 19045)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language EN
#>  collate  Italian_Italy.utf8
#>  ctype    Italian_Italy.utf8
#>  tz       Europe/Rome
#>  date     2024-07-09
#>  pandoc   3.1.11 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version    date (UTC) lib source
#>  class          7.3-22     2023-05-03 [2] CRAN (R 4.3.1)
#>  cli            3.6.1      2023-03-23 [1] CRAN (R 4.3.1)
#>  codetools      0.2-19     2023-02-01 [2] CRAN (R 4.3.1)
#>  colorspace     2.1-0      2023-01-23 [1] CRAN (R 4.3.1)
#>  data.table     1.14.8     2023-02-17 [1] CRAN (R 4.3.1)
#>  digest         0.6.33     2023-07-07 [1] CRAN (R 4.3.1)
#>  dplyr        * 1.1.3      2023-09-03 [1] CRAN (R 4.3.1)
#>  ellipsis       0.3.2      2021-04-29 [1] CRAN (R 4.3.1)
#>  evaluate       0.22       2023-09-29 [1] CRAN (R 4.3.1)
#>  fansi          1.0.5      2023-10-08 [1] CRAN (R 4.3.1)
#>  fastmap        1.1.1      2023-02-24 [1] CRAN (R 4.3.1)
#>  forcats      * 1.0.0      2023-01-29 [1] CRAN (R 4.3.1)
#>  fs             1.6.3      2023-07-20 [1] CRAN (R 4.3.1)
#>  future         1.33.0     2023-07-01 [1] CRAN (R 4.3.1)
#>  future.apply   1.11.0     2023-05-21 [1] CRAN (R 4.3.1)
#>  generics       0.1.3      2022-07-05 [1] CRAN (R 4.3.1)
#>  ggplot2      * 3.4.4      2023-10-12 [1] CRAN (R 4.3.1)
#>  globals        0.16.2     2022-11-21 [1] CRAN (R 4.3.0)
#>  glue           1.6.2      2022-02-24 [1] CRAN (R 4.3.1)
#>  gower          1.0.1      2022-12-22 [1] CRAN (R 4.3.0)
#>  gtable         0.3.4      2023-08-21 [1] CRAN (R 4.3.1)
#>  hardhat        1.3.0      2023-03-30 [1] CRAN (R 4.3.1)
#>  hms            1.1.3      2023-03-21 [1] CRAN (R 4.3.1)
#>  htmltools      0.5.6.1    2023-10-06 [1] CRAN (R 4.3.1)
#>  ipred          0.9-14     2023-03-09 [1] CRAN (R 4.3.1)
#>  knitr          1.44       2023-09-11 [1] CRAN (R 4.3.1)
#>  lattice        0.21-8     2023-04-05 [2] CRAN (R 4.3.1)
#>  lava           1.7.2.1    2023-02-27 [1] CRAN (R 4.3.1)
#>  lifecycle      1.0.3      2022-10-07 [1] CRAN (R 4.3.1)
#>  listenv        0.9.0      2022-12-16 [1] CRAN (R 4.3.1)
#>  lubridate    * 1.9.3      2023-09-27 [1] CRAN (R 4.3.1)
#>  magrittr       2.0.3      2022-03-30 [1] CRAN (R 4.3.1)
#>  MASS           7.3-60     2023-05-04 [2] CRAN (R 4.3.1)
#>  Matrix         1.5-4.1    2023-05-18 [2] CRAN (R 4.3.1)
#>  munsell        0.5.0      2018-06-12 [1] CRAN (R 4.3.1)
#>  nnet           7.3-19     2023-05-03 [2] CRAN (R 4.3.1)
#>  parallelly     1.36.0     2023-05-26 [1] CRAN (R 4.3.0)
#>  pillar         1.9.0      2023-03-22 [1] CRAN (R 4.3.1)
#>  pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.3.1)
#>  prodlim        2023.08.28 2023-08-28 [1] CRAN (R 4.3.1)
#>  purrr        * 1.0.2      2023-08-10 [1] CRAN (R 4.3.1)
#>  R.cache        0.16.0     2022-07-21 [1] CRAN (R 4.3.3)
#>  R.methodsS3    1.8.2      2022-06-13 [1] CRAN (R 4.3.3)
#>  R.oo           1.26.0     2024-01-24 [1] CRAN (R 4.3.3)
#>  R.utils        2.12.3     2023-11-18 [1] CRAN (R 4.3.3)
#>  R6             2.5.1      2021-08-19 [1] CRAN (R 4.3.1)
#>  Rcpp           1.0.11     2023-07-06 [1] CRAN (R 4.3.1)
#>  readr        * 2.1.4      2023-02-10 [1] CRAN (R 4.3.1)
#>  recipes        1.0.8      2023-08-25 [1] CRAN (R 4.3.1)
#>  reprex         2.1.1      2024-07-06 [1] CRAN (R 4.3.3)
#>  rlang          1.1.1      2023-04-28 [1] CRAN (R 4.3.1)
#>  rmarkdown      2.25       2023-09-18 [1] CRAN (R 4.3.1)
#>  rpart          4.1.19     2022-10-21 [2] CRAN (R 4.3.1)
#>  rstudioapi     0.15.0     2023-07-07 [1] CRAN (R 4.3.1)
#>  scales         1.2.1      2022-08-20 [1] CRAN (R 4.3.1)
#>  sessioninfo    1.2.2      2021-12-06 [1] CRAN (R 4.3.1)
#>  stringi        1.7.12     2023-01-11 [1] CRAN (R 4.3.0)
#>  stringr      * 1.5.0      2022-12-02 [1] CRAN (R 4.3.1)
#>  styler         1.10.3     2024-04-07 [1] CRAN (R 4.3.3)
#>  survival       3.5-5      2023-03-12 [2] CRAN (R 4.3.1)
#>  tibble       * 3.2.1      2023-03-20 [1] CRAN (R 4.3.1)
#>  tidyr        * 1.3.0      2023-01-24 [1] CRAN (R 4.3.1)
#>  tidyselect     1.2.0      2022-10-10 [1] CRAN (R 4.3.1)
#>  tidyverse    * 2.0.0      2023-02-22 [1] CRAN (R 4.3.1)
#>  timechange     0.2.0      2023-01-11 [1] CRAN (R 4.3.1)
#>  timeDate       4022.108   2023-01-07 [1] CRAN (R 4.3.0)
#>  tzdb           0.4.0      2023-05-12 [1] CRAN (R 4.3.1)
#>  utf8           1.2.3      2023-01-31 [1] CRAN (R 4.3.1)
#>  vctrs          0.6.4      2023-10-12 [1] CRAN (R 4.3.1)
#>  withr          2.5.1      2023-09-26 [1] CRAN (R 4.3.1)
#>  xfun           0.40       2023-08-09 [1] CRAN (R 4.3.1)
#>  yaml           2.3.7      2023-01-23 [1] CRAN (R 4.3.0)
#> 
#>  [1] C:/Users/ilari/AppData/Local/R/win-library/4.3
#>  [2] C:/Program Files/R/R-4.3.1/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

Hello @ilaria-kode 👋

Thanks for filing this bug report. This does appear to be a bug.

In step_pca() we call vctrs::vec_cbind() on the data, which is where we are getting this issue.

sample_data <- tibble::tibble(
  foo...10 = runif(10),
  foo...11 = runif(10),
  x1 = runif(10),
  x2 = runif(10),
  x3 = runif(10),
  x4 = runif(10)
)
vctrs::vec_cbind(sample_data)
#> New names:
#> • `foo...10` -> `foo...1`
#> • `foo...11` -> `foo...2`
#> # A tibble: 10 × 6
#>    foo...1  foo...2     x1     x2      x3    x4
#>      <dbl>    <dbl>  <dbl>  <dbl>   <dbl> <dbl>
#>  1   0.845 0.394    0.820  0.0637 0.00586 0.165
#>  2   0.409 0.367    0.0105 0.492  0.411   0.757
#>  3   0.409 0.702    0.0760 0.829  0.917   0.780
#>  4   0.435 0.331    0.335  0.940  0.202   0.334
#>  5   0.953 0.0558   0.0857 0.395  0.434   0.129
#>  6   0.130 0.340    0.258  0.161  0.793   0.939
#>  7   0.507 0.000829 0.296  0.547  0.318   0.115
#>  8   0.293 0.00540  0.733  0.860  0.739   0.374
#>  9   0.113 0.0957   0.153  0.684  0.894   0.397
#> 10   0.590 0.737    0.724  0.955  0.329   0.301

Created on 2024-07-09 with reprex v2.1.0
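
The renaming can also be seen in isolation from recipes. The sketch below is only an illustration of the vctrs name-repair behaviour, not the fix that recipes will necessarily adopt. It uses vctrs::vec_as_names() to show what "unique" repair does to names with a ...<number> suffix, and then shows that passing .name_repair = "minimal" to vctrs::vec_cbind() leaves the names as they are. The object names (nms, repaired, df, kept) are made up for this example.

```r
library(vctrs)

nms <- c("foo...10", "foo...11", "x1", "x2")

# "unique" repair first strips an existing `...<number>` suffix, then
# appends a new suffix based on the current position wherever the
# stripped names clash. Both foo columns collapse to "foo", so they are
# renumbered from their positions (1 and 2).
repaired <- vec_as_names(nms, repair = "unique", quiet = TRUE)
repaired
# the names are now foo...1, foo...2, x1, x2

# "minimal" repair does not touch existing names.
df <- tibble::tibble(
  foo...10 = runif(3),
  foo...11 = runif(3),
  x1 = runif(3)
)
kept <- vec_cbind(df, .name_repair = "minimal")
names(kept)
# the names are still foo...10, foo...11, x1
```

This also explains why the first example in the report looked fine: there the suffixes (5 and 6) happened to match the column positions, so the repaired names were identical to the original ones.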

More reading on why this is happening: r-lib/vctrs#685
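
Until this is changed in recipes, one possible workaround (a sketch, not an official recommendation) is to rename the affected columns before building the recipe so that they no longer end in ...<number>. The replacement pattern and the safe_names variable below are assumptions chosen for illustration; any naming scheme without the ...<number> suffix should behave the same way.

```r
library(vctrs)

sample_data <- tibble::tibble(
  foo...10 = runif(10),
  foo...11 = runif(10),
  x1 = runif(10),
  x2 = runif(10)
)

# Replace a trailing `...<number>` with `_<number>` so that "unique"
# name repair has no suffix to strip and renumber.
safe_names <- sub("\\.\\.\\.([0-9]+)$", "_\\1", names(sample_data))
names(sample_data) <- safe_names
names(sample_data)
# the names are now foo_10, foo_11, x1, x2

# "unique" repair leaves these names unchanged.
vec_as_names(safe_names, repair = "unique", quiet = TRUE)
```

With the columns renamed this way, the recipe from the report (update_role(contains("foo"), ...) followed by step_pca() and prep()) should keep the foo_10 and foo_11 names, since there is nothing for the repair step to rewrite.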