rnabioco/raer

organizing databases and data resources

Closed this issue · 6 comments

Editing analyses require many files. Collecting them in one place would make development easier. A short-term solution is to put all of the files into an S3 bucket (or similar) and download them locally for development. A long-term solution is to move the data into an ExperimentHub package, which could then be used to import the external datasets needed for a vignette.

Regarding the two-cell-line idea for the bulk RNA-seq data: the data in the package is already ADAR WT and KO (https://github.com/rnabioco/raer/blob/main/inst/extdata/make_test_data.sh). I forgot that I had already put this into the package. It would be useful to have the full dataset, or at least a larger subset, organized into a data package for use in a vignette or for benchmarking.

I'll move some data into an S3 bucket to get this started.

I've put a few files relevant to both editing pipelines into an S3 bucket.

s3_url <- "https://raer-test-data.s3.us-west-2.amazonaws.com"

# ~4 GB of files as of now
files <- c(
  "Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam_subsampled_cb_sorted.bam", # glioblastoma single-cell dataset from 10x (downsampled to 5%)
  "Homo_sapiens.GRCh38.dna.primary_assembly.UCSC.fa", # FASTA file
  "Homo_sapiens.GRCh38.dna.primary_assembly.UCSC.fa.fai", # FASTA index
  "cbs.txt", # list of cell barcodes in the BAM file
  "TABLE1_hg38.txt.gz", # table of known editing sites (hg38) from REDIportal
  "SRR5564269_dedup_subsampled_sorted.bam", # same test data as in the package, but aligned to hg38
  "SRR5564269_dedup_subsampled_sorted.bam.bai",
  "SRR5564277_dedup_subsampled_sorted.bam",
  "SRR5564277_dedup_subsampled_sorted.bam.bai"
)

urls <- paste0(s3_url, "/", files)
names(urls) <- files

to_download <- urls[!file.exists(names(urls))]

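# Raise the download timeout once (some files are large), then fetch missing files
options(timeout = max(600, getOption("timeout")))
for (i in seq_along(to_download)) {
  # mode = "wb" keeps binary files (BAM, gz) intact on Windows
  download.file(to_download[[i]], names(to_download)[i], mode = "wb")
}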

Part of the bulk RNA-seq editing pipeline will involve excluding known SNPs. Perhaps we could show how to use pre-built Bioconductor objects for this:
https://bioconductor.org/packages/release/data/annotation/html/SNPlocs.Hsapiens.dbSNP151.GRCh38.html
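
A rough sketch of what the SNP exclusion step could look like, assuming candidate sites are held in a GRanges object. The helper name `exclude_known_snps` and the toy coordinates are made up for illustration and are not part of raer. The commented-out lines show how known SNPs might be pulled from the SNPlocs package with `snpsByOverlaps()` from BSgenome; as far as I know that package uses NCBI-style sequence names ("1" rather than "chr1"), so the styles would need to match first.

```r
library(GenomicRanges)

# Hypothetical helper: drop candidate editing sites that overlap known SNPs.
# `sites` and `known_snps` are GRanges (or GPos) objects whose sequence
# names use the same style.
exclude_known_snps <- function(sites, known_snps) {
  sites[!overlapsAny(sites, known_snps, ignore.strand = TRUE)]
}

# Toy data for illustration only (not real coordinates).
sites <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges = IRanges(start = c(100, 200, 300), width = 1),
  strand = c("+", "-", "+")
)
known_snps <- GRanges(
  seqnames = c("chr1", "chr2"),
  ranges = IRanges(start = c(200, 999), width = 1)
)

filtered <- exclude_known_snps(sites, known_snps)

# With the SNPlocs package linked above, known SNPs could be retrieved
# along these lines (large download, so not run here):
#   library(SNPlocs.Hsapiens.dbSNP151.GRCh38)
#   library(GenomeInfoDb)
#   seqlevelsStyle(sites) <- "NCBI"
#   known_snps <- snpsByOverlaps(SNPlocs.Hsapiens.dbSNP151.GRCh38, sites)
```

Ignoring strand in the overlap seems reasonable here, since SNP positions are unstranded while editing sites usually carry the strand of the transcript.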

ideas for bulk RNA-seq datasets to process:

single cell data:

spatial data:

  • human/mouse brain data

kriemo commented

Minimal datasets have been moved to an ExperimentHub package: https://github.com/rnabioco/raerdata
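
For anyone who wants to browse the records, a minimal sketch using the generic ExperimentHub interface. It assumes the raerdata records are discoverable with the search term "raerdata" and that a network connection is available; the package may also export its own accessor functions, which are not shown here.

```r
library(ExperimentHub)

# Connect to the hub (metadata is downloaded and cached on first use).
eh <- ExperimentHub()

# Search the hub metadata; the search term is an assumption.
raer_resources <- query(eh, "raerdata")
raer_resources

# An individual record could then be fetched by index or by its hub ID,
# for example: raer_resources[[1]]
```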