rnabioco/valr

tabix, bam, and vcf reading

kriemo opened this issue · 2 comments

It would be nice to have functions that pull data from indexed genomic data formats (tabix, BAM, and VCF) into tidy tibbles. There are Bioconductor approaches for data import, but I think it would be nice to have a tidyverse-style, readr-like interface (e.g. read_bam, read_tabix, read_vcf) for simple data import.

I've written some htslib-based functions for custom work to read tabix or BAM files, which could serve as a template. Rhtslib also makes it pretty easy to use htslib across multiple operating systems, so the build wouldn't be too difficult to manage.

Having these functions would make it feasible to perform summary operations on datasets that are too large to fit in memory.

e.g.

library(valr)
library(dplyr)
library(purrr)

# x: a smaller tibble of intervals held in memory
lst_of_tbls <- split(x, x$chrom) # split by chrom (or provide some custom splitting function)

map(lst_of_tbls,
  ~ read_bam("large.bam", tbl = .x) %>% # proposed reader (does not exist yet): pull in only relevant intervals using htslib
      bed_map(.x, .) %>% # perform a valr or related summary operation
      group_by(something) %>% # placeholder grouping column
      summarize() # return a smaller summary tibble
)

bigWig support would be nice, too.

Implementing these readers would duplicate much of the work already in Bioconductor. It's not much effort to coerce the output from GenomicAlignments or rtracklayer into a tibble for use with valr, so I'm going to close this issue.
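
For anyone landing here later, a minimal sketch of that coercion is below. It assumes the GenomicRanges and tibble packages are installed. `granges_to_tbl` is a hypothetical helper name, not a valr function, and the small GRanges object is made up for illustration. The one detail worth watching is coordinates: GRanges are 1-based and closed, while valr follows the BED convention (0-based start, half-open), so the start is shifted by one.

```r
library(GenomicRanges)
library(tibble)

# Hypothetical helper (not part of valr): convert a GRanges object into a
# tibble with the chrom/start/end columns that valr functions expect.
granges_to_tbl <- function(gr) {
  df <- as.data.frame(gr)
  tibble(
    chrom  = as.character(df$seqnames),
    start  = df$start - 1L, # 1-based closed -> 0-based half-open (BED-style)
    end    = df$end,
    strand = as.character(df$strand)
  )
}

# Toy input for illustration; in practice this would come from a
# Bioconductor reader.
gr <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges   = IRanges(start = c(101, 501, 201), end = c(200, 600, 300)),
  strand   = c("+", "-", "*")
)

tbl <- granges_to_tbl(gr)
print(tbl)
```

In practice the input would probably be the GRanges returned by `rtracklayer::import()`, or `granges()` applied to the result of `GenomicAlignments::readGAlignments()`. Any metadata columns would need to be carried over separately, since this sketch keeps only the coordinates and strand.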