A method for sequence-based prediction of RNA-DNA triplices.

TriplexAligner


An R package for sequence-based prediction of RNA:DNA:DNA triple helix formation.

Install

Installation of TriplexAligner requires the devtools and BiocManager packages. Once these are installed, TriplexAligner may be installed with the following command:

devtools::install_github(repo = 'SchulzLab/TriplexAligner', repos = BiocManager::repositories(), dependencies = TRUE)

Usage

The TriplexAligner R package is used through a single function of the same name, TriplexAligner(). The user supplies input RNA and DNA in one of several formats (explained below) and specifies the species of interest. The output of TriplexAligner() is a data.frame of local alignment results with accompanying score metrics and Karlin-Altschul statistics.

Parameters

There are several required and optional parameters for the TriplexAligner() function, detailed below:

| Parameter | Description | Default |
| --- | --- | --- |
| rna_input | Input RNA. See rna_format for format options. | None |
| dna_input | Input DNA. See dna_format for format options. | None |
| rna_format | One of "symbol" or "fasta". | None |
| dna_format | One of "symbol", "fasta" or "bed". | None |
| species | Either "human"/"hs" or "mouse"/"mm". | None |
| up | Number of base pairs upstream of the supplied DNA region to consider (applies only when dna_format is "symbol" or "bed"). | 0 |
| down | Number of base pairs downstream of the supplied DNA region to consider (applies only when dna_format is "symbol" or "bed"). | 0 |

Examples

# Predict triplex formation between MALAT1 and the promoter of GAPDH
TriplexAligner(rna_input = 'MALAT1', dna_input = 'GAPDH', rna_format = 'symbol', dna_format = 'symbol', species = 'hs')
# Predict triplex formation between MALAT1 and DNA regions in a bed file (e.g. ATAC-sequencing peaks)
TriplexAligner(rna_input = 'MALAT1', dna_input = 'foo.bed', rna_format = 'symbol', dna_format = 'bed', species = 'hs')
# Predict triplex formation between MALAT1 & NEAT1 and DNA regions in a fasta file
TriplexAligner(rna_input = c('MALAT1', 'NEAT1'), dna_input = 'foo.fa', rna_format = 'symbol', dna_format = 'fasta', species = 'hs')

Details

rna_input

This is the parameter used to supply the RNA of interest to TriplexAligner. It may currently be supplied in two formats, specified by rna_format:

  • symbol : Specifying that the input RNA is in the symbol format (a character vector of species-appropriate gene symbols e.g. c("MALAT1", "NEAT1") for humans, or c("Malat1", "Neat1") for mice) will lead TriplexAligner to retrieve all transcript sequences for each supplied symbol, as annotated by either TxDb.Hsapiens.UCSC.hg38.knownGene or TxDb.Mmusculus.UCSC.mm10.knownGene. Each transcript is then separately used as input to TriplexAligner.
  • fasta : Specifying that the input RNA is in the fasta format means that the RNA sequences will be read from the fasta file supplied to rna_input. The headers in the fasta file will be used as the corresponding transcript names in the results. Note: there is no need to convert T to U.
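As a minimal sketch of the fasta route, the snippet below writes a two-record fasta file with base R and reads back the headers that would become transcript names in the results. The file name, record names, and sequences are made up for illustration; the commented-out call uses only the parameters documented above.

```r
# Hypothetical RNA sequences; T is left as-is, since no T-to-U conversion is needed.
rna_seqs <- c(
  my_rna_1 = "ACGTTGCAAGGCTTAGCCTA",
  my_rna_2 = "TTCCTTCTCTCCTTTCCCTT"
)

# Write one ">header" line followed by one sequence line per record.
fasta_path <- file.path(tempdir(), "my_rnas.fa")
fasta_lines <- as.vector(rbind(paste0(">", names(rna_seqs)), unname(rna_seqs)))
writeLines(fasta_lines, fasta_path)

# Read the headers back; these are the names expected in the results.
fasta_headers <- sub("^>", "", grep("^>", readLines(fasta_path), value = TRUE))
print(fasta_headers)

# With the package installed, the file could then be passed as rna_input:
# TriplexAligner(rna_input = fasta_path, dna_input = 'GAPDH',
#                rna_format = 'fasta', dna_format = 'symbol', species = 'hs')
```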

dna_input

This is the parameter used to supply the DNA of interest to TriplexAligner. It may currently be supplied in three formats, specified by dna_format:

  • symbol : Specifying that the input DNA is in the symbol format (a character vector of species-appropriate gene symbols e.g. c("GAPDH", "CDH5") for humans, or c("Gapdh", "Cdh5") for mice) will lead TriplexAligner to retrieve all sequences surrounding the transcription start sites of each supplied gene symbol, as annotated by either TxDb.Hsapiens.UCSC.hg38.knownGene or TxDb.Mmusculus.UCSC.mm10.knownGene. The number of base pairs upstream and downstream of the transcription start site taken into consideration may be specified using the up and down parameters of TriplexAligner().
  • fasta : Specifying that the input DNA is in the fasta format means that the DNA sequences will be read from the fasta file supplied to dna_input. The headers in the fasta file will be used as the corresponding DNA region names in the results.
  • bed : Specifying that the DNA regions of interest are supplied in the bed format means that sequences corresponding to the supplied region bed file (e.g. foo.bed) will be retrieved using the GenomicRanges and species-appropriate BSgenome packages, and then used in TriplexAligner(). The input bed file should be a 4-column file with the columns being chromosome, start, stop, and name. Note: the bed file should have no header, and names must be unique, seeing as the names in the fourth column of the bed file will be used as the names of the corresponding sequences.
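The bed requirements above (four columns, no header, unique names) can be met with base R alone. The sketch below writes such a file; the coordinates, region names, and file name are hypothetical, and tab separation is assumed because it is the usual BED convention rather than something this README states.

```r
# Hypothetical regions (e.g. ATAC-sequencing peaks); coordinates are made up.
peaks <- data.frame(
  chromosome = c("chr1", "chr1", "chr12"),
  start      = c(1000000L, 1005000L, 6534000L),
  stop       = c(1000500L, 1005400L, 6534800L),
  name       = c("peak_1", "peak_2", "peak_3"),
  stringsAsFactors = FALSE
)

# Names in the fourth column must be unique, as they label the sequences.
stopifnot(!anyDuplicated(peaks$name))

# Write tab-separated text with no header, no quotes, and no row names.
bed_path <- file.path(tempdir(), "peaks.bed")
write.table(peaks, bed_path, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Read the file back to confirm its shape.
bed_check <- read.table(bed_path, sep = "\t", header = FALSE,
                        stringsAsFactors = FALSE)
print(bed_check)
```

The resulting path can then be supplied as dna_input together with dna_format = 'bed'.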

species

This parameter is used to ensure that the correct sequences are retrieved and considered for RNA:DNA:DNA triple helix formation, and that Karlin-Altschul statistics are calculated correctly. At the present time, only human ("human"/"hs") and mouse ("mouse"/"mm") are supported, but this will be expanded in the future.

up

This parameter extends the region supplied by either the symbol or bed DNA formats by a user-defined number of base pairs upstream. This may be used to consider wider areas around gene transcription start sites when dna_input is in the symbol format, or to extend regions supplied in the bed format.

down

This parameter extends the region supplied by either the symbol or bed DNA formats by a user-defined number of base pairs downstream. This may be used in conjunction with up to consider wider areas around gene transcription start sites when dna_input is in the symbol format, or to extend regions supplied in the bed format.
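To illustrate how up and down combine, the following sketch collects the documented arguments in a list and runs the prediction only if the package is installed. The window sizes are arbitrary choices for illustration, and calling the function as TriplexAligner::TriplexAligner assumes it is exported under that name.

```r
# Arguments as documented in the Parameters section; window sizes are arbitrary.
triplex_args <- list(
  rna_input  = "MALAT1",
  dna_input  = "GAPDH",
  rna_format = "symbol",
  dna_format = "symbol",
  species    = "hs",
  up         = 1000,  # base pairs upstream of the transcription start site
  down       = 500    # base pairs downstream of the transcription start site
)

# Only run the prediction when TriplexAligner is available.
if (requireNamespace("TriplexAligner", quietly = TRUE)) {
  result <- do.call(TriplexAligner::TriplexAligner, triplex_args)
  print(head(result))
}
```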

Output

TriplexAligner outputs a data.frame with one row per transcript/DNA/code combination. Note: if rna_input is supplied with rna_format = "symbol", then all transcripts from the species-appropriate TxDb package will be considered. TriplexAligner uses Karlin-Altschul statistics to assess local alignments, and the corresponding values are provided in the output data.frame. An example output is shown below:

| TxStart | TxEnd | TxSeq | DNAStart | DNAEnd | DNASeq | Score | BitScore | EValue | logE | Code | DNA_name | RNA_name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 147 | 197 | UCC... | 2208 | 2258 | GGT... | 261.05 | 133.76 | 0.00 | 26.87 | Code 1 | DAD1 | foo |
| 589 | 597 | UCU... | 863 | 871 | ACT... | 35.47 | 20.33 | 18989690.25 | -7.28 | Code 2 | DAD1 | foo |
| 698 | 711 | ACA... | 1240 | 1253 | ACT... | 62.32 | 33.46 | 2111.98 | -3.32 | Code 3 | DAD1 | foo |
| 698 | 711 | ACA... | 994 | 1007 | TGC... | 50.63 | 31.30 | 9474.47 | -3.98 | Code 4 | DAD1 | foo |
| 836 | 858 | UGG... | 2173 | 2195 | ACC... | 61.31 | 36.50 | 257.60 | -2.41 | Code 5 | DAD1 | foo |
| 714 | 757 | UGG... | 849 | 892 | GAA... | 59.84 | 35.31 | 584.38 | -2.77 | Code 6 | DAD1 | foo |
| 805 | 837 | AUA... | 562 | 594 | ATA... | 56.78 | 32.56 | 3942.06 | -3.60 | Code 7 | DAD1 | foo |
| 572 | 670 | UUU... | 1653 | 1751 | TTC... | 49.24 | 29.10 | 43423.78 | -4.64 | Code 8 | DAD1 | foo |
| 21 | 61 | CUU... | 977 | 1017 | TTG... | 225.18 | 115.65 | 0.00 | 21.42 | Code 1 | GAPDH | foo |
| 589 | 597 | UCU... | 2475 | 2483 | TCT... | 35.23 | 20.20 | 20703348.37 | -7.32 | Code 2 | GAPDH | foo |
| 695 | 711 | AGU... | 870 | 886 | TGC... | 59.26 | 31.92 | 6163.40 | -3.79 | Code 3 | GAPDH | foo |
| 698 | 711 | ACA... | 1541 | 1554 | CGC... | 50.98 | 31.50 | 8207.92 | -3.91 | Code 4 | GAPDH | foo |
| 834 | 875 | GAU... | 2361 | 2402 | CTA... | 60.29 | 35.92 | 383.46 | -2.58 | Code 5 | GAPDH | foo |
| 736 | 764 | GUA... | 842 | 870 | TGG... | 66.66 | 39.15 | 40.88 | -1.61 | Code 6 | GAPDH | foo |
| 177 | 199 | CGC... | 1541 | 1563 | CGC... | 58.21 | 33.32 | 2322.41 | -3.37 | Code 7 | GAPDH | foo |
| 573 | 691 | UUC... | 381 | 499 | TCA... | 58.02 | 34.04 | 1414.54 | -3.15 | Code 8 | GAPDH | foo |
| 630 | 688 | UAU... | 2562 | 2620 | TCG... | 259.83 | 133.14 | 0.00 | 26.68 | Code 1 | CD40 | foo |
| 589 | 597 | UCU... | 1082 | 1090 | ACA... | 35.59 | 20.39 | 18186809.85 | -7.26 | Code 2 | CD40 | foo |
| 698 | 711 | ACA... | 1089 | 1102 | ACA... | 61.41 | 33.00 | 2904.10 | -3.46 | Code 3 | CD40 | foo |
| 698 | 711 | ACA... | 2429 | 2442 | CGC... | 51.07 | 31.56 | 7910.57 | -3.90 | Code 4 | CD40 | foo |
| 173 | 189 | UUA... | 295 | 311 | AAT... | 62.64 | 37.24 | 153.35 | -2.19 | Code 5 | CD40 | foo |
| 714 | 757 | UGG... | 853 | 896 | GAA... | 59.84 | 35.31 | 584.38 | -2.77 | Code 6 | CD40 | foo |
| 76 | 98 | AAA... | 2041 | 2063 | AAA... | 71.25 | 40.28 | 18.64 | -1.27 | Code 7 | CD40 | foo |
| 589 | 672 | UCU... | 1740 | 1823 | TGT... | 52.93 | 31.18 | 10297.54 | -4.01 | Code 8 | CD40 | foo |
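
Because the output is an ordinary data.frame, it can be filtered with base R. The sketch below rebuilds a few rows of the example output by hand (sequence and position columns omitted) and keeps the alignments with a positive logE. Two points are assumptions rather than documented behaviour: that logE equals -log10(EValue), which is inferred from the example values, and the logE > 0 cutoff, which is an arbitrary choice for illustration.

```r
# A few rows copied from the example output above.
res <- data.frame(
  Score    = c(261.05, 35.47, 225.18, 66.66, 71.25, 52.93),
  BitScore = c(133.76, 20.33, 115.65, 39.15, 40.28, 31.18),
  EValue   = c(0.00, 18989690.25, 0.00, 40.88, 18.64, 10297.54),
  logE     = c(26.87, -7.28, 21.42, -1.61, -1.27, -4.01),
  Code     = c("Code 1", "Code 2", "Code 1", "Code 6", "Code 7", "Code 8"),
  DNA_name = c("DAD1", "DAD1", "GAPDH", "GAPDH", "CD40", "CD40"),
  RNA_name = "foo",
  stringsAsFactors = FALSE
)

# Where EValue is not shown as 0.00, compare logE with -log10(EValue).
shown <- res$EValue > 0
max_diff <- max(abs(-log10(res$EValue[shown]) - res$logE[shown]))
print(max_diff)

# Keep alignments with logE > 0 (E-value below 1), strongest first.
# Filtering on logE avoids relying on E-values that display as 0.00.
hits <- res[res$logE > 0, ]
hits <- hits[order(hits$logE, decreasing = TRUE), ]
print(hits[, c("DNA_name", "Code", "BitScore", "logE")])
```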