GoekeLab/proActiv

preparePromoterAnnotation for different species


Hi! I'm trying to run proActiv on several vertebrate species. When I run the preparePromoterAnnotation function, the species I want to analyze is not listed in genomeStyles. Is there a way to still get a promoterAnnotation object when the species is not defined in genomeStyles, or what would you recommend?

Thanks for your help!

Hi @gabee-chan, thanks for your question! We'll look into it and get back to you soon.

Hi @gabee-chan, you can try installing the package from GitHub and creating your promoter annotation again.

I've tried creating promoter annotations for Zebrafish (Danio rerio) and it works. One caveat is that the seqnames in your TxDb or GTF file should follow standard naming (e.g. "chr1" or "1").

> pa <- preparePromoterAnnotation(file = "Danio_rerio.GRCz11.103.chr.gtf.gz", species = "Danio_rerio")
Parsing input file...
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Extract exons by transcripts...
Identify overlapping first exons for each gene...
Prepare mapping between transcripts, tss, promoters and genes...
Prepare annotated intron ranges...
Annotating reduced exon ranges...
Prepare promoter coordinates and first exon ranges...
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type stop_codon. This
  information was ignored.
> 
> head(pa@promoterCoordinates)
GRanges object with 6 ranges and 4 metadata columns:
      seqnames    ranges strand | promoterId internalPromoter firstExonEnd      intronId
         <Rle> <IRanges>  <Rle> |  <integer>        <logical>    <integer> <IntegerList>
  [1]     chr9  34121839      - |          1            FALSE     34121792        106045
  [2]     chr9  34089156      + |          2            FALSE     34090811        100784
  [3]     chr4  15103696      - |          3            FALSE     15103602         43293
  [4]     chr4  15011341      + |          4            FALSE     15011512         37110
  [5]    chr12  33484458      + |          5            FALSE     33485048        129560
  [6]    chr24  22074272      - |          6            FALSE     22074235        242963
  -------
  seqinfo: 26 sequences from an unspecified genome; no seqlengths

Session information:

R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_Singapore.1252  LC_CTYPE=English_Singapore.1252   
[3] LC_MONETARY=English_Singapore.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Singapore.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] proActiv_1.1.21 testthat_3.0.1 

loaded via a namespace (and not attached):
  [1] colorspace_2.0-0            ellipsis_0.3.1              rprojroot_2.0.2            
  [4] biovizBase_1.38.0           htmlTable_2.1.0             XVector_0.30.0             
  [7] GenomicRanges_1.42.0        base64enc_0.1-3             fs_1.5.0                   
 [10] dichromat_2.0-0             rstudioapi_0.13             remotes_2.2.0              
 [13] bit64_4.0.5                 AnnotationDbi_1.52.0        fansi_0.4.1                
 [16] xml2_1.3.2                  splines_4.0.3               knitr_1.30                 
 [19] geneplotter_1.68.0          pkgload_1.1.0               Formula_1.2-4              
 [22] Rsamtools_2.6.0             annotate_1.68.0             cluster_2.1.0              
 [25] dbplyr_2.0.0                png_0.1-7                   compiler_4.0.3             
 [28] httr_1.4.2                  backports_1.2.1             lazyeval_0.2.2             
 [31] assertthat_0.2.1            Matrix_1.2-18               cli_2.2.0                  
 [34] htmltools_0.5.0             prettyunits_1.1.1           tools_4.0.3                
 [37] gtable_0.3.0                glue_1.4.2                  GenomeInfoDbData_1.2.4     
 [40] dplyr_1.0.2                 rappdirs_0.3.1              Rcpp_1.0.5                 
 [43] Biobase_2.50.0              vctrs_0.3.6                 Biostrings_2.58.0          
 [46] rtracklayer_1.49.5          xfun_0.19                   stringr_1.4.0              
 [49] ps_1.5.0                    lifecycle_0.2.0             ensembldb_2.14.0           
 [52] devtools_2.3.2              XML_3.99-0.5                zlibbioc_1.36.0            
 [55] scales_1.1.1                BSgenome_1.58.0             VariantAnnotation_1.36.0   
 [58] ProtGenerics_1.22.0         hms_0.5.3                   MatrixGenerics_1.2.0       
 [61] parallel_4.0.3              SummarizedExperiment_1.20.0 AnnotationFilter_1.14.0    
 [64] RColorBrewer_1.1-2          curl_4.3                    memoise_1.1.0              
 [67] gridExtra_2.3               ggplot2_3.3.3               biomaRt_2.46.0             
 [70] rpart_4.1-15                latticeExtra_0.6-29         stringi_1.5.3              
 [73] RSQLite_2.2.1               genefilter_1.72.0           S4Vectors_0.28.1           
 [76] desc_1.2.0                  checkmate_2.0.0             GenomicFeatures_1.42.1     
 [79] BiocGenerics_0.36.0         pkgbuild_1.2.0              BiocParallel_1.24.1        
 [82] GenomeInfoDb_1.26.2         rlang_0.4.9                 pkgconfig_2.0.3            
 [85] matrixStats_0.57.0          bitops_1.0-6                lattice_0.20-41            
 [88] purrr_0.3.4                 htmlwidgets_1.5.3           GenomicAlignments_1.26.0   
 [91] bit_4.0.4                   processx_3.4.5              tidyselect_1.1.0           
 [94] magrittr_2.0.1              DESeq2_1.30.0               R6_2.5.0                   
 [97] IRanges_2.24.1              generics_0.1.0              Hmisc_4.4-2                
[100] DelayedArray_0.16.0         DBI_1.1.0                   pillar_1.4.7               
[103] foreign_0.8-80              withr_2.3.0                 survival_3.2-7             
[106] RCurl_1.98-1.2              nnet_7.3-14                 tibble_3.0.4               
[109] crayon_1.3.4                BiocFileCache_1.14.0        jpeg_0.1-8.1               
[112] progress_1.2.2              usethis_2.0.0               locfit_1.5-9.4             
[115] grid_4.0.3                  data.table_1.13.6           blob_1.2.1                 
[118] callr_3.5.1                 digest_0.6.27               xtable_1.8-4               
[121] openssl_1.4.3               stats4_4.0.3                munsell_0.5.0              
[124] Gviz_1.34.0                 sessioninfo_1.1.1           askpass_1.1 
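
To check the seqnames caveat above before calling preparePromoterAnnotation, it can help to inspect the sequence names in the annotation. The following is a minimal sketch using GenomeInfoDb accessors on a toy GRanges object. The toy seqnames are assumptions made for illustration; for a real GTF you would first import the file (for example with rtracklayer) and inspect that object instead.

```r
library(GenomicRanges)
library(GenomeInfoDb)

# Toy ranges with Ensembl/NCBI-style seqnames (illustrative values only)
gr <- GRanges(c("1", "2", "MT"), IRanges(1:3, width = 5))

# List the sequence names and the naming style(s) GenomeInfoDb detects
seqlevels(gr)
seqlevelsStyle(gr)

# Optionally convert to UCSC-style names (with a "chr" prefix) on a copy
gr_ucsc <- gr
seqlevelsStyle(gr_ucsc) <- "UCSC"
seqlevels(gr_ucsc)
```

Either style should be acceptable according to the comment above, so the conversion is only needed if other inputs use a different convention.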

Hi @jleechung ! Thanks for your help!

For some organisms I'm able to get the promoterAnnotation object, but not for others :(

For example, for gorilla (Gorilla gorilla) I'm getting the output below:

Parsing input file...
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Extract exons by transcripts...
Identify overlapping first exons for each gene...
Prepare mapping between transcripts, tss, promoters and genes...
Prepare annotated intron ranges...
Annotating reduced exon ranges...
Error in [[<-(tmp, name, value = new("SimpleIntegerList", elementType = "integer", :
  33718 elements in value to replace 36607 elements
Calls: preparePromoterAnnotation ... annotateReducedExonRanges -> $<- -> $<- -> [[<- -> [[<-
In addition: Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type stop_codon. This
  information was ignored.
Execution halted

I'm wondering why this is happening, given that the Gorilla GTF uses the standard naming ("1").

Thanks for your help!
Gaby

Hi @gabee-chan, I looked into the Gorilla GTF file and I think chromosome names like "2A" and "2B" are causing the problem. Internally, proActiv uses the keepStandardChromosomes function from GenomeInfoDb, which takes a species argument to check which chromosomes are standard for the species in question. When no species is provided, the function may drop sequences with names like "2A" (see below), resulting in the error.

We'll try to make the code more robust to support other species (maybe by allowing the user to define seqnames for "non-standard" species), but this will take a while.

> gr <- GRanges(c("chr1", "chr2A", "chr2B", "chr3"), IRanges(1:4, width=5))
> keepStandardChromosomes(gr, pruning.mode = "tidy")
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1       1-5      *
  [2]     chr3       4-8      *
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
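
To see which sequence names would be pruned in your own annotation, the example above can be extended as follows. This is a rough diagnostic sketch, not a fix inside proActiv: it uses the GenomeInfoDb functions standardChromosomes and keepSeqlevels on the same toy ranges, and the `wanted` vector is a hypothetical, hand-picked list of names to keep.

```r
library(GenomicRanges)
library(GenomeInfoDb)

# Same toy ranges as in the example above
gr <- GRanges(c("chr1", "chr2A", "chr2B", "chr3"), IRanges(1:4, width = 5))

# Seqlevels GenomeInfoDb treats as standard when no species is given
std <- standardChromosomes(gr)

# Seqlevels that keepStandardChromosomes() would therefore prune
dropped <- setdiff(seqlevels(gr), std)
dropped

# Alternative: keep an explicit, user-chosen set of seqlevels
# (hypothetical choice for illustration)
wanted <- c("chr1", "chr2A", "chr2B", "chr3")
gr_kept <- keepSeqlevels(gr, wanted, pruning.mode = "tidy")
seqlevels(gr_kept)
```

Because proActiv calls keepStandardChromosomes internally, filtering your own object this way does not change what preparePromoterAnnotation does; it only helps identify which names in a GTF are likely to trigger the mismatch.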

Hi @gabee-chan, I also ran into the problem caused by chromosome names like "2A" and "2B". Did you fix it?