unexpected altering of data when guessing column data types on data import with vroom()
k-hench opened this issue ยท 3 comments
Hello ๐ ,
I stumbled on what I believe to be a bug, or at least quite some dangerous behavior where in the data import with vroom::vroom()
the data are altered in unexpected (for me at least) ways.
The issue is reminds me a little of the infamous re-formatting of data in the Excell:
In a nutshell, in cells with aggregated data (eg in the form 1,0
), vroom()
in some cases appears drop the ,
character and then interpret the concatenated value (10
) as numeric
.
However, this does not appear to happen in a consistent way, such that sometimes 0,1
is actually interpreted as "0,1"
(character
). I believe that this depends on the leading value being a 0
vs. [1-9]
.
This behavior can be avoided when specifying the column types (vroom(..., col_types = "<types>")
), however I believe that the default guess mode likely used frequently enough to raise the issue.
If the descried behavior is actually just a description of vroom()
working as intended and this is just a case of rtfm, then please never mind. (However, having used vroom()
for many years and yet being surprised by this myself, I hope that this heads-up from the user-perspective might still be of value.)
Minimal example
library(vroom)
packageVersion("vroom")
#> [1] '1.6.3.9000'
Creating minimal data set
write.table(data.frame(foo = "1,0", bar = "0,1"),
file = "~/Downloads/wtf1.tsv",
sep = "\t",
quote = FALSE,
row.names = FALSE)
The exported file (wtf1.tsv
) should look like this:
foo bar
1,0 0,1
Native R
data import works as expected
read.table("~/Downloads/wtf1.tsv", header = TRUE)
#> foo bar
#> 1 1,0 0,1
On import with vroom::vroom()
and automated data type detection, the ,
character is omitted and the value altered from 1,0
to 10
:
vroom("~/Downloads/wtf1.tsv", delim = "\t")
#> Rows: 1 Columns: 2
#> โโ Column specification โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
#> Delimiter: "\t"
#> chr (1): bar
#> num (1): foo
#>
#> โน Use `spec()` to retrieve the full column specification for this data.
#> โน Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 ร 2
#> foo bar
#> <dbl> <chr>
#> 1 10 0,1
Importing the data with specified column types ("character") does not alter the data however:
vroom("~/Downloads/wtf1.tsv", delim = "\t", col_types = "c")
#> # A tibble: 1 ร 2
#> foo bar
#> <chr> <chr>
#> 1 1,0 0,1
Minimal 'real-life' example
I stumbled on the behavior while using readr::read_tsv()
, which I believe uses vroom()
under the hood.
The original data is a summary file produced by the genomics software gatk
, which is very widely used through genomics community.
A slimmed down version of the original file looks like this (wtf2.tsv
):
CHROM POS TYPE 105388.AD 105391.AD 105407.AD ES2551.AD ES2692.AD ES2697.AD ES2713.AD ES2816.AD
NC_072356.1 2392 SNP 0,26 0,21 0,33 21,0 28,0 16,0 22,0 16,0
NC_072356.1 2742 SNP 0,36 0,18 0,39 26,0 33,0 25,0 29,0 25,0
Again, the native R
import works as expected:
read.table("~/Downloads/wtf2.tsv", header = TRUE)
#> CHROM POS TYPE X105388.AD X105391.AD X105407.AD ES2551.AD ES2692.AD
#> 1 NC_072356.1 2392 SNP 0,26 0,21 0,33 21,0 28,0
#> 2 NC_072356.1 2742 SNP 0,36 0,18 0,39 26,0 33,0
#> ES2697.AD ES2713.AD ES2816.AD
#> 1 16,0 22,0 16,0
#> 2 25,0 29,0 25,0
However, in the vroom()
import, the columns starting with 0,[0-9]*
are being parsed as numeric
with the individual values being concatenated as the digits in the new value (switch in parsing behavior between columns ES2551.AD
and ES2692.AD
)
vroom("~/Downloads/wtf2.tsv", delim = "\t")
#> Rows: 2 Columns: 11
#> โโ Column specification โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
#> Delimiter: "\t"
#> chr (5): CHROM, TYPE, 105388.AD, 105391.AD, 105407.AD
#> dbl (1): POS
#> num (5): ES2551.AD, ES2692.AD, ES2697.AD, ES2713.AD, ES2816.AD
#>
#> โน Use `spec()` to retrieve the full column specification for this data.
#> โน Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 2 ร 11
#> CHROM POS TYPE `105388.AD` `105391.AD` `105407.AD` ES2551.AD ES2692.AD
#> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 NC_072356โฆ 2392 SNP 0,26 0,21 0,33 210 280
#> 2 NC_072356โฆ 2742 SNP 0,36 0,18 0,39 260 330
#> # โน 3 more variables: ES2697.AD <dbl>, ES2713.AD <dbl>, ES2816.AD <dbl>
Again, specifying the col_types
avoids the dropping of the ,
and the concatenation of the individual values.
vroom("~/Downloads/wtf2.tsv", delim = "\t", col_types = "cdccccccccc")
#> # A tibble: 2 ร 11
#> CHROM POS TYPE `105388.AD` `105391.AD` `105407.AD` ES2551.AD ES2692.AD
#> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 NC_072356โฆ 2392 SNP 0,26 0,21 0,33 21,0 28,0
#> 2 NC_072356โฆ 2742 SNP 0,36 0,18 0,39 26,0 33,0
#> # โน 3 more variables: ES2697.AD <chr>, ES2713.AD <chr>, ES2816.AD <chr>
To pinpoint the cause of the parsing behavior switch, I changed a single cell (21,0
<-> 0,21
for ES2551.AD
) and created an altered version of the data (wtf3.tsv
):
CHROM POS TYPE 105388.AD 105391.AD 105407.AD ES2551.AD ES2692.AD ES2697.AD ES2713.AD ES2816.AD
NC_072356.1 2392 SNP 0,26 0,21 0,33 0,21 28,0 16,0 22,0 16,0
NC_072356.1 2742 SNP 0,36 0,18 0,39 26,0 33,0 25,0 29,0 25,0
Indeed the leading 0
now seems to cause the column ES2551.AD
to be parsed as character
and to conserve the ,
also in the default guess mode (columns ES2692.AD:ES2816.AD
still exhibit the issue though):
vroom("~/Downloads/wtf3.tsv", delim = "\t")
#> Rows: 2 Columns: 11
#> โโ Column specification โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
#> Delimiter: "\t"
#> chr (6): CHROM, TYPE, 105388.AD, 105391.AD, 105407.AD, ES2551.AD
#> dbl (1): POS
#> num (4): ES2692.AD, ES2697.AD, ES2713.AD, ES2816.AD
#>
#> โน Use `spec()` to retrieve the full column specification for this data.
#> โน Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 2 ร 11
#> CHROM POS TYPE `105388.AD` `105391.AD` `105407.AD` ES2551.AD ES2692.AD
#> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 NC_072356โฆ 2392 SNP 0,26 0,21 0,33 0,21 280
#> 2 NC_072356โฆ 2742 SNP 0,36 0,18 0,39 26,0 330
#> # โน 3 more variables: ES2697.AD <dbl>, ES2713.AD <dbl>, ES2816.AD <dbl>
And again, the col_types
can be used to avoid the behavior:
vroom("~/Downloads/wtf3.tsv", delim = "\t", col_types = "cdccccccccc")
#> # A tibble: 2 ร 11
#> CHROM POS TYPE `105388.AD` `105391.AD` `105407.AD` ES2551.AD ES2692.AD
#> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 NC_072356โฆ 2392 SNP 0,26 0,21 0,33 0,21 28,0
#> 2 NC_072356โฆ 2742 SNP 0,36 0,18 0,39 26,0 33,0
#> # โน 3 more variables: ES2697.AD <chr>, ES2713.AD <chr>, ES2816.AD <chr>
Session Info
sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 11 (bullseye)
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3; LAPACK version 3.9.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Europe/Berlin
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] vroom_1.6.3.9000
#>
#> loaded via a namespace (and not attached):
#> [1] crayon_1.5.2 vctrs_0.6.3 cli_3.6.1 knitr_1.42
#> [5] rlang_1.1.1 xfun_0.39 purrr_1.0.1 styler_1.10.1
#> [9] glue_1.6.2 bit_4.0.5 htmltools_0.5.5 fansi_1.0.4
#> [13] rmarkdown_2.21 R.cache_0.16.0 tibble_3.2.1 evaluate_0.20
#> [17] tzdb_0.4.0 fastmap_1.1.1 yaml_2.3.7 lifecycle_1.0.3
#> [21] compiler_4.3.1 fs_1.6.1 pkgconfig_2.0.3 rstudioapi_0.14
#> [25] R.oo_1.24.0 R.utils_2.12.2 digest_0.6.31 utf8_1.2.3
#> [29] reprex_2.0.2.9000 tidyselect_1.2.0 parallel_4.3.1 pillar_1.9.0
#> [33] magrittr_2.0.3 R.methodsS3_1.8.1 tools_4.3.1 withr_2.5.0
#> [37] bit64_4.0.5
Created on 2023-06-27 with reprex v2.0.2.9000
vroom's default locale assumes that .
is the decimal mark, which then has implications for type guessing.
If you want type guessing to work on files like this, you need to inform vroom that ,
is the decimal mark.
library(vroom)
tmp <- tempfile()
vroom_write_lines(c("foo\tbar", "1,0\t0,1"), tmp)
vroom(tmp, delim = "\t", locale = locale(decimal_mark = ","))
#> Rows: 1 Columns: 2
#> โโ Column specification โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
#> Delimiter: "\t"
#> dbl (2): foo, bar
#>
#> โน Use `spec()` to retrieve the full column specification for this data.
#> โน Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 ร 2
#> foo bar
#> <dbl> <dbl>
#> 1 1 0.1
Created on 2023-06-27 with reprex v2.0.2.9000
thank you very much for the quick reply, however my point is not about decimal marks:
In the example the 1,0
is a list of values (1
and 0
) and NOT 1.0
.
The use case here is that the column foo
contains two counts that are to be separated:
library(tidyverse)
vroom::vroom("~/Downloads/wtf1.tsv", delim = "\t", col_types = "c") |>
separate(foo, into = c("ref", "alt"), sep = ",")
#> # A tibble: 1 ร 3
#> ref alt bar
#> <chr> <chr> <chr>
#> 1 1 0 0,1
This fails if the col_types
are not specified as the separator ,
is removed from the column foo
:
vroom::vroom("~/Downloads/wtf1.tsv", delim = "\t") |>
+ separate(foo, into = c("ref", "alt"), sep = ",")
#>Rows: 1 Columns: 2
#>โโ Column specification โ
#> Delimiter: "\t"
#> chr (1): bar
#> num (1): foo
#>
#> โน Use `spec()` to retrieve the full column specification for this data.
#> โน Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 ร 3
#> ref alt bar
#> <chr> <chr> <chr>
#> 1 10 NA 0,1
#> Warning message:
#> Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [1].
Ah, I see.
I think you've just bumped up against the hard fact that type guessing is hard and very fraught! I don't see anything that vroom could change that would be a net positive for users, in the large, even if you could imagine tweaks that are advantageous for this particular file.