/Aligater

Software suite for detection/analysis of chimeric RNAs from LIGR-seq data

Primary LanguagePerlMIT LicenseMIT

aligater

Software suite developed for detection of chimeric or circular RNAs from LIGR-seq data (see citation below).

Sharma, E., Sterne-Weiler, T., O’Hanlon D., Blencowe, BJ. (2016) “Global Mapping of Human RNA-RNA Interactions.” Molecular Cell. 62(4):618-26

Table of Contents

Requirements

julia v0.4

  • ArgParse
  • Match
  • Distributions
  • GZip

perl v5 Packages

  • Parallel::ForkManager
  • Getopt::Long
  • DBI

external software

  • bowtie2 - short-read alignment
  • blastn - ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
  • blast databses ftp://ftp.ncbi.nlm.nih.gov/blast/db/: nt, human_genomic, other_genomic

optional

  • RactIP - intermolecular in silico RNA folding
  • nibFrag - faToNib to make nib genome files from fasta format
  • faToNib - nibFrag to quickly retrieve sequences from nib formatted chromosomes

Installation

###System###

Install julia release here, which must be = v0.4 (v0.5 and v0.6 are not supported). If you are new to julia, or installing programs via command line, there is a helpful guide here. The packages can then be installed manually by opening the julia REPL:

  pkgs = ("ArgParse", "Match", "Distributions", "GZip")
  map( Pkg.add, pkgs )
  map( Pkg.test, pkgs ) 

The perl packages can be installed using cpan -i Parallel::ForkManager

###Database Setup###

The current database is for hg19, you can download it here: hg19 transcriptome

It is also possible to use a different build or species, but automation of this process is still in development. In the meantime you should be able to reverse engineer the file formats in the hg19 version. Briefly fasta transcriptome headers need to be in the format of >ENST00001_ENSG00001_GENESYM_BIOTYPE, for example >ENST00000432079.1_ENSG00000116747.8_TROVE2_protein-coding. If you are having trouble feel free to open an issue or e-mail me.

Set up the default database like:

$ git clone https://github.com/timbitz/Aligater.git
$ cd Aligater
$ wget http://hgwdev.sdsc.edu/~timsw/GRCh37.v19.bt2.tar.gz
$ tar xzvf GRCh37.v19.bt2.tar.gz

Overview

Usage

The aligater executable is a wrapper for a set of tools that are meant to work as input/output for one another, often compatible with piping one into the next.

$ aligater -h
     aligater [sub-command] [-h]
         - align   : align short-reads to transcriptome
         - detect  : detect chimeric reads by recursive chaining of transcriptome SAM blocks
         - post    : post-process LIG format files with BLAST or RACTIP
         - reclass : create 2D k-mer db and reclassify chimeras using heirarchical type
         - stats   : compare crosslinked to mock-treated samples using multinomial statistics
         - table   : compile interaction results into tabular format

###Alignment and Detection###

Using the default alignment parameters is recommended for LIGR-seq and the step can be run as simply as:

$ filename="somefile.fastq.gz"
$ nodir=`basename $filename`
$ prefastq=${nodir%%.*}

$ aligater align -x db $filename > sam/$prefastq.sam

or --bam flag will pipe the output of aligater align into samtools view -bS

But feel free to alter other alignment parameters as you see fit for custom purposes:

$ aligater align -h

Example:

$ align_max=50
$ aligater align -p $cores -k $align_max --bam -x db/GRCh37.v19 $filename > bam/$prefastq.bam

The next step is detection, which can be piped from the first: aligater align | aligater detect > out.lig

$ detectparam='--gtf [annoFile.gtf(.gz)] --gfam [gene_fam.txt(.gz)] --rmsk [maskerFile.bed(.gz)]'
$ aligater detect $detectparam < sam/$prefastq.sam > lig/$prefastq.lig

for example with the default transcriptome you should have something like:

$ detectparam='--gtf db/GRCh37.v19.gtf.gz --gfam db/hg.gene_fam.txt.gz --rmsk db/GRCh37.repeatMasker.slim.bed.gz'

will print a lig formatted file to STDOUT.

or if bam output was specified from the first command.

$ samtools view -h bam/$prefastq.bam | aligater detect $detectparam > lig/$prefastq.lig

###Post Processing###

There are a few parts to the post processing step, a number of filtering flags, a --blast (mandatory) step and a folding step using --ractip (optional), each requiring the use of external dependencies (installed to your $path), blastn and ractip, see requirements.

aligater post [filtering_flags] [--blast] [--ractip] 

It is recommended that these commands be run separately, to effectively make use system resources. For example, --ractip takes several hours with substantial CPU on multiple cores (-p), but doesn't require much RAM.
Meanwhile, --blast is very CPU heavy and requires significant RAM as well!

The first command aligater post --blast should be considered mandatory, it is not recommended to skip this step. This step requires proper installation of blastn and the blast databases to the environmental variable BLASTDB.

$ export BLASTDB="/path/to/blast/databases"
$ blastn -version
blastn: 2.2.28+
Package: blast 2.2.28, build Mar 12 2013 16:52:31

$ aligater post [--tmp (def: /tmp/aligater)] [-p] --loose --blast < lig/$prefastq.lig > lig/$prefastq.blast.lig

Edit the tmp directory to hit and number of threads -p to use as necessary.

###Reclassification###

This is a two part step, you need to first create a junction library .jlz from the annotations of all samples and replicates that you plan to compare:

$ cat lig/*blast.filtered.lig | aligater reclass --uniq --save database.jlz

and then the actual reclassification step on each lig file:

$ aligater reclass --load database.jlz -g -b -u < lig/$input.lig > lig/$input.reclass.lig

###Statistics###

This step produces the final output of the aligater package and takes a number of command line options that require that the input file names be properly formatted!

The input expects two foreground files, a crosslinking reagent treated (xlink) or mock-treated/control (unxlink) which are identical file names other than some specific identifier that you could specify as a wildcard. For example:

$ xlinkfile="sample_rep1-xlink.final.lig"
$ unxlinkfile="sample_rep1-unx.final.lig"

$ foregroundfile="sample_rep1-%.final.lig"

along with an --nd xlink,unx flag which specifies the strings to insert into the % wildcard to make the two input files.

Similarly, the two background files containing non-chimeric expression level .lig files must be similarly formatted with a % wildcard that will be interpolated with the same --nd strings.

$ backgroundfile="sample_rep1-%.expression.lig"

$ nameParam="--fore $foregroundfile --back $backgroundfile --nd xlink,unx"

Additionally we have to supply the comma delimited normalization constants (total fastq input read numbers) --nc for the foreground files:

$ totalAMTreads=73217814
$ totalMOCKreads=62648851

$ normParam="--nc $totalAMTreads,$totalMOCKreads"

This makes up the core arguments to aligater stats:

$ aligater stats $nameParam $normParam > output.pvl

There are a number of other arguments which greatly expand the data compiled by aligater stats such as the --vs option which allows an arbitrary number of variable columns to summarize for each interaction 'col:type,col:type,etc..'. Each entry consists of a column number (1-based) followed by a : and a data type character [pcfdn] where each character stands for:

  • p : paired column with colon delimited entries for example BIOTYPEA:BIOTYPEB
  • s : string containing column to include
  • f : floating point number within the scope of Float64
  • d : integer number within the scope of Int64
  • n : some member of the Number abstract type, could be complex

For default lig format you can use: --vs 18:f,24:p,21:f,22:d,23:d which should summarize most of the important columns.

NOTE: If you skipped the RactIP folding step in aligater post --ractip then you will have a different number of columns than the default expects. To change command line options for aligater stats to be compatible with your input .lig files, use --gi 19 and --vs 18:p.

Additionally there is a --filt optional flag that can be used to specify a single column and regex pattern for filtering purposes. For example you may want to filter for only intermolecular interactions using --filt 1:I.

###File Formats###

The aligater detect step outputs a basic .lig (tab delimited) format file, and the aligater post outputs an extended version of the .lig fileformat. The standard format is:

Column Example Regex Description
1 I [IPS] Single letter code classification: I=Intermolecular, P=Putatively Paralogous, S=Intramolecular
2 A:B [A-Z:]+ Chimeric Read Structure: A,B,C refer to molecules, : denotes a ligation. A:A = intramolecular ligation, A:B = intermolecular ligation, A:B:A = intermolecular ligation from A to B and then from B back to A
3 1:3 [\d:]+ Local alignments (from SAM) that have been chained to make up the chimera
4 RNU4-1:RNU6-1251P \S+ Gene symbols of transcripts aligned to chimeric read
5 ENSG00000200795.1:ENSG00000201372.1 \S+ Gene ids of chimeric segments
6 ENST00000363925.1:ENST00000364502.1 \S+ Transcript ids of chimeric segments
7 snRNA:snRNA \S+ Biotypes of transcripts in chimeric segments
8 U4_snRNA:U6_snRNA \S+ or NA If alignment overlaps RepeatMasker annotation, RepeatNames or NA are given in A:B order
9 snRNA:snRNA \S+ or NA If alignment overlaps RepeatMasker annotation, then RepeatClass or NA is given in A:B order
10 68c3777d9e14bf33 \S+ or [a-f1-9]+ Readname or md5 hash of readname
11 ACTGGCAATTTAAAATTGGAA+ [ATGCN]+ Read Sequence, _ indicates ligation site
12 124 \d+ Alignment Score
13 1,30>30>17 [\d,>()]+ Chimera uniqueness, deprecated
14 74,26 [\d,]+ Alignment positions in transcripts
15 74,27 [\d,]+ or NA,NA RepeatMask offset positions
16 55,41 [\d,]+ Alignment lengths
17 chr12:120730966:-,chr20:42101679:+ POS,POS where POS=\S+:\d+:[+-] Genome start position of alignment segment

The aligater stats step outputs a .pvl file consistent with Extended Data Table 1 of Sharma E, Sterne-Weiler T, et al. 2016:

Extended Data Table 1
Gene-ids : Comma delimited HUGO or Repeat family name in lexographical order
OE[+amt/-amt] : (+AMT/-AMT) / (Expected(+AMT)/Expected(-AMT))
AMT/Mock : (+AMT/-AMT)
Exp[AMT/Mock] : Expected(+AMT)/Expected(-AMT)
AMTReads : Number of reads in the +AMT sample + 1 pseudo count
OE-amt : Observed / Expected for +AMT sample (see Methods)
MockReads : Number of reads in the -AMT sample + 1 pseudo count
OE-mock : Observed / Expected for -AMT sample (see Methods)
AMT+ pval : Binomial p-value for significance in the +AMT sample
Mock pval : Binomial p-value for significance in the -AMT sample
AMT RPM : The transcript's expression in Reads per Million from -ligase sample
Mock RPM : The transcript's expression in Reads per Million from -ligase sample
Transcript Biotypes : Gene biotypes from GENCODE annotations
... any other summary line from --vs string