Bioconductor/BiocFileCache

bfcquery returns inconsistent column types for empty rows

Opened this issue · 3 comments

omsai commented

The create_time and access_time columns are character vectors when the query result is non-empty, and double vectors when it is empty.
I expect them to have the same type in both cases, perhaps always character vectors, although it's not clear why they aren't date or datetime types instead.
The inconsistent types cause an error when row-binding multiple queries with purrr::map_df, where some of the queries match a resource and some return no rows:

> files_remote
[1] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1480nnn/GSM1480327/suppl/GSM1480327_K562_PROseq_minus.bw"
[2] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1480nnn/GSM1480327/suppl/GSM1480327_K562_PROseq_plus.bw" 
> map_df(files_remote, bfcquery, x = bfc)
Error: Can't combine `create_time` <character> and `create_time` <double>.
Run `rlang::last_error()` to see where the error occurred.
> map_df(files_remote[1], bfcquery, x = bfc)
# A tibble: 1 x 10
  rid   rname create_time access_time rpath rtype fpath last_modified_t… etag 
  <chr> <chr> <chr>       <chr>       <chr> <chr> <chr>            <dbl> <chr>
1 BFC6  ftp:… 2020-06-29… 2020-06-29… /hom… web   ftp:…               NA NA   
# … with 1 more variable: expires <dbl>
> map_df(files_remote[2], bfcquery, x = bfc)
# A tibble: 0 x 10
# … with 10 variables: rid <chr>, rname <chr>, create_time <dbl>,
#   access_time <dbl>, rpath <chr>, rtype <chr>, fpath <chr>,
#   last_modified_time <dbl>, etag <chr>, expires <dbl>
>  
omsai commented

I'm a little behind on my R installation and can update if you can't reproduce the problem:

> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: PureOS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] usethis_1.6.1        tidyr_1.1.0          tibble_3.0.1        
 [4] stringr_1.4.0        purrr_0.3.4          dplyr_1.0.0         
 [7] rtracklayer_1.44.4   GenomicRanges_1.36.1 GenomeInfoDb_1.20.0 
[10] IRanges_2.18.3       S4Vectors_0.22.1     GEOquery_2.52.0     
[13] Biobase_2.44.0       BiocGenerics_0.30.0  evolength_0.0.0.9000
[16] testthat_2.3.2      

loaded via a namespace (and not attached):
 [1] httr_1.4.1                  pkgload_1.1.0              
 [3] bit64_0.9-7                 Rdpack_0.11-1              
 [5] assertthat_0.2.1            BiocFileCache_1.8.0        
 [7] blob_1.2.1                  GenomeInfoDbData_1.2.1     
 [9] Rsamtools_2.0.3             remotes_2.1.1              
[11] sessioninfo_1.1.1           lattice_0.20-41            
[13] pillar_1.4.4                RSQLite_2.2.0              
[15] backports_1.1.7             glue_1.4.1                 
[17] limma_3.40.6                digest_0.6.25              
[19] XVector_0.24.0              Matrix_1.2-18              
[21] XML_3.99-0.3                pkgconfig_2.0.3            
[23] devtools_2.3.0              bibtex_0.4.2.2             
[25] zlibbioc_1.30.0             processx_3.4.2             
[27] BiocParallel_1.18.1         generics_0.0.2             
[29] ellipsis_0.3.1              withr_2.2.0                
[31] SummarizedExperiment_1.14.1 cli_2.0.2                  
[33] magrittr_1.5                crayon_1.3.4               
[35] memoise_1.1.0               ps_1.3.3                   
[37] fs_1.4.1                    fansi_0.4.1                
[39] xml2_1.3.2                  pkgbuild_1.0.8             
[41] tools_3.6.3                 prettyunits_1.1.1          
[43] hms_0.5.3                   matrixStats_0.56.0         
[45] gbRd_0.4-11                 lifecycle_0.2.0            
[47] DelayedArray_0.10.0         callr_3.4.3                
[49] Biostrings_2.52.0           RcppHMM_1.2.2              
[51] compiler_3.6.3              rlang_0.4.6                
[53] grid_3.6.3                  RCurl_1.98-1.2             
[55] rstudioapi_0.11             rappdirs_0.3.1             
[57] bitops_1.0-6                DBI_1.1.0                  
[59] curl_4.3                    R6_2.4.1                   
[61] GenomicAlignments_1.20.1    utf8_1.1.4                 
[63] bit_1.1-15.2                rprojroot_1.3-2            
[65] readr_1.3.1                 desc_1.2.0                 
[67] stringi_1.4.6               Rcpp_1.0.4.6               
[69] vctrs_0.3.0                 dbplyr_1.4.4               
[71] tidyselect_1.1.0           
> 
lshep commented

Sorry for the long delay. I'm looking into this and I'm not quite sure how to correct it. It seems like a bug when using dplyr::filter that somehow changes the column types.

> tbl
# Source:   table<resource> [?? x 11]
# Database: sqlite 3.35.2
#   [/home/shepherd/.cache/BiocFileCache/BiocFileCache.sqlite]
      id rid   rname  create_time access_time rpath rtype fpath last_modified_t…
   <int> <chr> <chr>  <chr>       <chr>       <chr> <chr> <chr> <chr>           
 1     1 BFC1  annot… 2020-07-20… 2021-03-30… 534a… web   http… 2021-03-15 14:4…
 2     2 BFC2  annot… 2020-07-20… 2021-03-30… 534a… rela… 534a… NA              
 3     4 BFC4  AH800… 2020-07-27… 2021-03-30… 21c6… web   http… NA              


> tbl %>% dplyr::filter(rid == NA_character_)
# Source:   lazy query [?? x 11]
# Database: sqlite 3.35.2
#   [/home/shepherd/.cache/BiocFileCache/BiocFileCache.sqlite]
# … with 11 variables: id <int>, rid <chr>, rname <chr>, create_time <dbl>,
#   access_time <dbl>, rpath <chr>, rtype <chr>, fpath <chr>,
#   last_modified_time <dbl>, etag <chr>, expires <dbl>
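
One way to check whether the type switch comes from dplyr::filter or from the database layer underneath it is to run the same kind of query through DBI directly. The sketch below is only a diagnostic under assumptions: the in-memory table, its DATETIME column declaration, and the inserted row are made up for illustration and are not taken from the actual BiocFileCache schema.

```r
library(DBI)

# In-memory SQLite database. The table layout and the DATETIME declaration
# are assumptions for illustration, not the real BiocFileCache schema.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(con, "CREATE TABLE resource (rid TEXT, create_time DATETIME)")
dbExecute(con, "INSERT INTO resource VALUES ('BFC1', '2020-06-29 12:00:00')")

# One query that matches a row and one that matches nothing.
hit  <- dbGetQuery(con, "SELECT * FROM resource WHERE rid = 'BFC1'")
miss <- dbGetQuery(con, "SELECT * FROM resource WHERE rid = 'BFC2'")

# Compare the R classes of the same columns with and without matching rows.
print(sapply(hit, class))
print(sapply(miss, class))

dbDisconnect(con)
```

If the two printed class vectors differ here as well, the behaviour would seem to originate below dplyr, in how the empty result set is typed, rather than in dplyr::filter itself.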

Even without calling dplyr::filter directly, querying an empty BiocFileCache defaults to double for the time columns and, as in the second bfcquery below, for columns containing only NA. Can an empty result preserve a consistent type?

library(purrr)
library(stringr)
library(BiocFileCache)

path <- tempfile()

bfc <- BiocFileCache(path, ask = FALSE)

files_remote <-
  str_c(file.path("ftp://ftp.ncbi.nlm.nih.gov",
                  "geo/samples/GSM1480nnn/GSM1480327/suppl",
                  "GSM1480327_K562_PROseq_"),
        c("minus", "plus"),
        ".bw")

map_df(files_remote, bfcquery, x = bfc)
#> # A tibble: 0 × 10
#> # ℹ 10 variables: rid <chr>, rname <chr>, create_time <dbl>, access_time <dbl>,
#> #   rpath <chr>, rtype <chr>, fpath <chr>, last_modified_time <dbl>,
#> #   etag <chr>, expires <dbl>

bfcadd(bfc, files_remote[1])
#> |======================================================================| 100%
#> BFC1 
#> "/tmp/RtmpDRIP5H/file2ff2a62a8acdc8/2ff2a64678a220_GSM1480327_K562_PROseq_minus.bw"

map_df(files_remote[1], bfcquery, x = bfc)
#> # A tibble: 1 × 10
#>   rid   rname create_time access_time rpath rtype fpath last_modified_time etag 
#>   <chr> <chr> <chr>       <chr>       <chr> <chr> <chr>              <dbl> <chr>
#> 1 BFC1  ftp:… 2024-12-06… 2024-12-06… /tmp… web   ftp:…                 NA NA   
#> # ℹ 1 more variable: expires <dbl>

map_df(files_remote, bfcquery, x = bfc)
#> Error in `dplyr::bind_rows()`:
#> ! Can't combine `..1$create_time` <character> and `..2$create_time` <double>.
#> Run `rlang::last_trace()` to see where the error occurred.
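
Until the types are consistent, a possible workaround on the caller's side is to coerce the time columns to character before row-binding. This is a minimal sketch, not a fix in BiocFileCache: `as_chr_times` is a hypothetical helper, and the two tibbles are hand-built stand-ins for the matching and empty bfcquery results shown above.

```r
library(dplyr)

# Hypothetical helper: force the time columns to character so that empty and
# non-empty query results share one type before they are row-bound.
as_chr_times <- function(df, cols = c("create_time", "access_time")) {
  mutate(df, across(any_of(cols), as.character))
}

# Stand-ins for the two bfcquery() results shown in this thread.
hit  <- tibble(rid = "BFC1",
               create_time = "2020-06-29 12:00:00",
               access_time = "2020-06-29 12:00:00")
miss <- tibble(rid = character(),
               create_time = double(),
               access_time = double())

# Coerce each result first, then bind; the bind no longer mixes types.
combined <- bind_rows(lapply(list(hit, miss), as_chr_times))
print(combined)
```

Against a real cache the same idea would presumably look like `map_df(files_remote, function(q) as_chr_times(bfcquery(bfc, q)))`, which I have not run here. The `cols` argument can be extended to last_modified_time and expires if those columns turn out to switch types in the same way.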