bfcquery returns inconsistent column types for empty rows
The columns create_time and access_time are character vectors when the query result is non-empty, and double vectors when it is empty.
I expect them to return the same type consistently, perhaps always character vectors, although it's not clear why they are not date or datetime types instead.
Returning inconsistent types throws an error when row-binding multiple queries with purrr::map_df where some of the queries match a cached resource and some do not:
> files_remote
[1] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1480nnn/GSM1480327/suppl/GSM1480327_K562_PROseq_minus.bw"
[2] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1480nnn/GSM1480327/suppl/GSM1480327_K562_PROseq_plus.bw"
> map_df(files_remote, bfcquery, x = bfc)
Error: Can't combine `create_time` <character> and `create_time` <double>.
Run `rlang::last_error()` to see where the error occurred.
> map_df(files_remote[1], bfcquery, x = bfc)
# A tibble: 1 x 10
rid rname create_time access_time rpath rtype fpath last_modified_t… etag
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
1 BFC6 ftp:… 2020-06-29… 2020-06-29… /hom… web ftp:… NA NA
# … with 1 more variable: expires <dbl>
> map_df(files_remote[2], bfcquery, x = bfc)
# A tibble: 0 x 10
# … with 10 variables: rid <chr>, rname <chr>, create_time <dbl>,
# access_time <dbl>, rpath <chr>, rtype <chr>, fpath <chr>,
# last_modified_time <dbl>, etag <chr>, expires <dbl>
>
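Until the types are consistent, one possible workaround is to coerce the time columns to character before binding. The sketch below uses only base R. normalize_time_cols is a hypothetical helper (not part of BiocFileCache), and hit and miss are made-up stand-ins for a non-empty and an empty bfcquery result, with placeholder timestamps:

```r
# Hypothetical helper: force the timestamp columns to character so that
# empty and non-empty query results agree on their column types.
normalize_time_cols <- function(df, cols = c("create_time", "access_time")) {
  for (col in intersect(cols, names(df))) {
    df[[col]] <- as.character(df[[col]])
  }
  df
}

# Made-up stand-ins for a non-empty and an empty bfcquery() result
hit <- data.frame(rid = "BFC6",
                  create_time = "2020-01-01 00:00:00",
                  access_time = "2020-01-01 00:00:00",
                  stringsAsFactors = FALSE)
miss <- data.frame(rid = character(0),
                   create_time = numeric(0),
                   access_time = numeric(0),
                   stringsAsFactors = FALSE)

# Normalize every result, then compare the class of create_time
results <- lapply(list(hit, miss), normalize_time_cols)
time_classes <- vapply(results, function(df) class(df$create_time), character(1))
print(time_classes)
```

With a real cache the helper could presumably be applied to each result before binding, for example map_df(files_remote, function(q) normalize_time_cols(bfcquery(bfc, q))), but that call is untested here. If last_modified_time or expires ever hold values, they may need to be added to cols as well.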
I'm a little behind on my R installation and can update if you can't reproduce the problem:
> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: PureOS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] usethis_1.6.1 tidyr_1.1.0 tibble_3.0.1
[4] stringr_1.4.0 purrr_0.3.4 dplyr_1.0.0
[7] rtracklayer_1.44.4 GenomicRanges_1.36.1 GenomeInfoDb_1.20.0
[10] IRanges_2.18.3 S4Vectors_0.22.1 GEOquery_2.52.0
[13] Biobase_2.44.0 BiocGenerics_0.30.0 evolength_0.0.0.9000
[16] testthat_2.3.2
loaded via a namespace (and not attached):
[1] httr_1.4.1 pkgload_1.1.0
[3] bit64_0.9-7 Rdpack_0.11-1
[5] assertthat_0.2.1 BiocFileCache_1.8.0
[7] blob_1.2.1 GenomeInfoDbData_1.2.1
[9] Rsamtools_2.0.3 remotes_2.1.1
[11] sessioninfo_1.1.1 lattice_0.20-41
[13] pillar_1.4.4 RSQLite_2.2.0
[15] backports_1.1.7 glue_1.4.1
[17] limma_3.40.6 digest_0.6.25
[19] XVector_0.24.0 Matrix_1.2-18
[21] XML_3.99-0.3 pkgconfig_2.0.3
[23] devtools_2.3.0 bibtex_0.4.2.2
[25] zlibbioc_1.30.0 processx_3.4.2
[27] BiocParallel_1.18.1 generics_0.0.2
[29] ellipsis_0.3.1 withr_2.2.0
[31] SummarizedExperiment_1.14.1 cli_2.0.2
[33] magrittr_1.5 crayon_1.3.4
[35] memoise_1.1.0 ps_1.3.3
[37] fs_1.4.1 fansi_0.4.1
[39] xml2_1.3.2 pkgbuild_1.0.8
[41] tools_3.6.3 prettyunits_1.1.1
[43] hms_0.5.3 matrixStats_0.56.0
[45] gbRd_0.4-11 lifecycle_0.2.0
[47] DelayedArray_0.10.0 callr_3.4.3
[49] Biostrings_2.52.0 RcppHMM_1.2.2
[51] compiler_3.6.3 rlang_0.4.6
[53] grid_3.6.3 RCurl_1.98-1.2
[55] rstudioapi_0.11 rappdirs_0.3.1
[57] bitops_1.0-6 DBI_1.1.0
[59] curl_4.3 R6_2.4.1
[61] GenomicAlignments_1.20.1 utf8_1.1.4
[63] bit_1.1-15.2 rprojroot_1.3-2
[65] readr_1.3.1 desc_1.2.0
[67] stringi_1.4.6 Rcpp_1.0.4.6
[69] vctrs_0.3.0 dbplyr_1.4.4
[71] tidyselect_1.1.0
>
Sorry for the long delay. I'm looking into this and I'm not quite sure how to correct it. It seems like a bug when using dplyr::filter that somehow changes the column types.
> tbl
# Source: table<resource> [?? x 11]
# Database: sqlite 3.35.2
# [/home/shepherd/.cache/BiocFileCache/BiocFileCache.sqlite]
id rid rname create_time access_time rpath rtype fpath last_modified_t…
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 BFC1 annot… 2020-07-20… 2021-03-30… 534a… web http… 2021-03-15 14:4…
2 2 BFC2 annot… 2020-07-20… 2021-03-30… 534a… rela… 534a… NA
3 4 BFC4 AH800… 2020-07-27… 2021-03-30… 21c6… web http… NA
> tbl %>% dplyr::filter(rid == NA_character_)
# Source: lazy query [?? x 11]
# Database: sqlite 3.35.2
# [/home/shepherd/.cache/BiocFileCache/BiocFileCache.sqlite]
# … with 11 variables: id <int>, rid <chr>, rname <chr>, create_time <dbl>,
# access_time <dbl>, rpath <chr>, rtype <chr>, fpath <chr>,
# last_modified_time <dbl>, etag <chr>, expires <dbl>
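To check whether the type change comes from dplyr::filter or from the database layer underneath, it may help to take dplyr out of the picture. The sketch below queries an in-memory SQLite table directly through DBI and RSQLite, once with a matching row and once without. The table definition is an assumption for illustration only (in particular, the DATETIME declaration is a guess and is not copied from the BiocFileCache schema):

```r
library(DBI)

# In-memory database with a toy "resource" table (assumed schema)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(con, "CREATE TABLE resource (
                  rid TEXT,
                  create_time DATETIME DEFAULT CURRENT_TIMESTAMP
                )")
dbExecute(con, "INSERT INTO resource (rid) VALUES ('BFC1')")

# Same query shape, one with a matching row and one with none
with_rows <- dbGetQuery(con, "SELECT * FROM resource WHERE rid = 'BFC1'")
no_rows   <- dbGetQuery(con, "SELECT * FROM resource WHERE rid = 'missing'")

# Compare the R classes chosen for each column in the two results
classes_with <- vapply(with_rows, function(x) class(x)[1], character(1))
classes_none <- vapply(no_rows, function(x) class(x)[1], character(1))
print(classes_with)
print(classes_none)

dbDisconnect(con)
```

If the two class vectors differ here as well, the type is being chosen when an empty result set is fetched rather than by dplyr::filter itself.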
Even if I omit dplyr::filter, an empty BiocFileCache defaults to double for the time columns or, as in the second bfcquery below, for columns containing only NA. Can empty results preserve a consistent type?
library(purrr)
library(stringr)
library(BiocFileCache)
path <- tempfile()
bfc <- BiocFileCache(path, ask = FALSE)
files_remote <-
    str_c(file.path("ftp://ftp.ncbi.nlm.nih.gov",
                    "geo/samples/GSM1480nnn/GSM1480327/suppl",
                    "GSM1480327_K562_PROseq_"),
          c("minus", "plus"),
          ".bw")
map_df(files_remote, bfcquery, x = bfc)
#> # A tibble: 0 × 10
#> # ℹ 10 variables: rid <chr>, rname <chr>, create_time <dbl>, access_time <dbl>,
#> #   rpath <chr>, rtype <chr>, fpath <chr>, last_modified_time <dbl>,
#> #   etag <chr>, expires <dbl>
bfcadd(bfc, files_remote[1])
#> |======================================================================| 100%
#> BFC1
#> "/tmp/RtmpDRIP5H/file2ff2a62a8acdc8/2ff2a64678a220_GSM1480327_K562_PROseq_minus.bw"
map_df(files_remote[1], bfcquery, x = bfc)
#> # A tibble: 1 × 10
#> rid rname create_time access_time rpath rtype fpath last_modified_time etag
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
#> 1 BFC1 ftp:… 2024-12-06… 2024-12-06… /tmp… web ftp:… NA NA
#> # ℹ 1 more variable: expires <dbl>
map_df(files_remote, bfcquery, x = bfc)
#> Error in `dplyr::bind_rows()`:
#> ! Can't combine `..1$create_time` <character> and `..2$create_time` <double>.
#> Run `rlang::last_trace()` to see where the error occurred.
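A small check like the following could help confirm which columns disagree between an empty and a non-empty result, and might serve as the basis for a regression test. It is a base R sketch: type_mismatches is a hypothetical helper, and hit and miss are made-up stand-ins for the two bfcquery results above rather than real cache output:

```r
# Hypothetical helper: report shared columns whose classes differ
# between two data frames (or tibbles).
type_mismatches <- function(a, b) {
  shared <- intersect(names(a), names(b))
  cls_a <- vapply(a[shared], function(x) class(x)[1], character(1))
  cls_b <- vapply(b[shared], function(x) class(x)[1], character(1))
  differ <- cls_a != cls_b
  data.frame(column = shared[differ],
             first  = unname(cls_a[differ]),
             second = unname(cls_b[differ]),
             stringsAsFactors = FALSE)
}

# Made-up stand-ins mirroring the column types shown in the output above
hit <- data.frame(rid = "BFC1",
                  create_time = "2020-01-01 00:00:00",
                  expires = NA_real_,
                  stringsAsFactors = FALSE)
miss <- data.frame(rid = character(0),
                   create_time = numeric(0),
                   expires = numeric(0),
                   stringsAsFactors = FALSE)

mismatches <- type_mismatches(hit, miss)
print(mismatches)
```

An empty data frame from type_mismatches would mean the two results can be row-bound without a type conflict on their shared columns.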