- Splice graph representations of transcriptome structure
- Build an index for any species with a genome and annotation file
- de novo event discovery (splicing from/to any combination of annotated donor/acceptor splice sites)
- High speed PolyA+ Spliced Read Alignment (Read lengths <= 255)
- Repetitive read assignment for gene families
- Bias correction methods for 5' sequence and GC-content
- On-the-fly alignment/re-analysis of SRR accession ids using ebi.ac.uk
- Fast and robust quantification of transcriptome structure and expression using EM
- Dynamic building and entropic measurements of splicing events of any complexity
- Event-specific Percent-spliced-in (PSI)
- Gene expression (TPM)
- Differential splicing comparisons
- Probabilistic calculations of delta PSI leveraging multi-sample biological replicates
Paper pre-print: http://www.biorxiv.org/content/early/2017/07/03/158519
Get latest version of Julia if you don't have it. If you are new to julia, there is a helpful guide here
Install Whippet through the Julia REPL:
julia> Pkg.add("Whippet")
julia> using Whippet
The Whippet package directory can now be found in the default location:
$ cd ~/.julia/v0.6/Whippet
Notes:
- If you are having trouble finding the Whippet directory, you can ask the Julia REPL:
julia> Pkg.dir("Whippet")
- For all executables in
Whippet/bin
, you can use the-h
flag to get a list of the available command line options, their usage and defaults. - You can always update to the latest version of Whippet using
Pkg.update()
in the Julia REPL! - You should install Julia and its packages locally, if you absolutely have to install system-wide, there is some help here
- It might be convenient to add a link (
ln -s ~/.julia/v0.6/Whippet
) to this directory for easy access, or exportWhippet/bin
to your path.
You need your genome sequence in fasta, and a gene annotation file in GTF or Refflat format. Default annotation supplied for hg19 GENCODEv25 TSL1-level transcriptome in Whippet/anno
.
$ julia bin/whippet-index.jl --fasta hg19.fa.gz --gtf anno/gencode_hg19.v25.tsl1.gtf.gz
Notes:
- Whippet only uses GTF
exon
lines (others are ignored). These must contain bothgene_id
andtranscript_id
attributes (which should not be the same as one another!). This GTF file should be consistent with the GTF2.2 specification, and should have all entries for a transcript in a continuous block. - You can specify the output name and location of the index to build using the
-x / --index
parameter. The default (for both whippet-index.jl and whippet-quant.jl) is a generic index namedgraph
located at~/.julia/v0.6/Whippet/index/graph.jls
, so you must have write-access to this location to use the default.
$ julia bin/whippet-quant.jl file.fastq.gz
Or you can provide a link/s to the fastq file/s on the web using --url
, or just the experiment accession id (SRR id) using --ebi
! These will align on the fly as the reads are downloaded from the web directly into alignment. For example:
$ julia bin/whippet-quant.jl --ebi SRR1199010
is equivalent to
$ julia bin/whippet-quant.jl --url http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR119/000/SRR1199010/SRR1199010.fastq.gz
$ julia bin/whippet-quant.jl fwd_file.fastq.gz rev_file.fastq.gz
To stream reads from ebi.ac.uk, just provide a paired-end SRR id to to the --ebi
flag:
$ julia bin/whippet-quant.jl --ebi ERR1994736
is equivalent to
$ julia bin/whippet-quant.jl --url http://ftp.sra.ebi.ac.uk/vol1/fastq/ERR199/006/ERR1994736/ERR1994736_1.fastq.gz http://ftp.sra.ebi.ac.uk/vol1/fastq/ERR199/006/ERR1994736/ERR1994736_2.fastq.gz
To specify output location or a specific index:
$ julia bin/whippet-quant.jl fwd_file.fastq.gz -o outputname -x customindex.jls
You can also output the alignments in SAM format with the --sam
flag and convert to bam with a pipe:
$ julia bin/whippet-quant.jl fwd_file.fastq.gz --sam | samtools view -bS - > fwd_file.bam
It is also possible to pool fastq files at runtime using shell commands, and the optional (--force-gz
) for pooled gz files (files without .gz suffix)
$ julia bin/whippet-quant.jl <( cat time-series_{1,2,3,4,5}.fastq.gz ) --force-gz -o interval_1-5
Compare .psi.gz
files from from two samples -a
and -b
with any number of replicates (comma delimited list of files or common pattern matching) per sample.
$ ls *.psi.gz
sample1-r1.psi.gz sample1-r2.psi.gz sample2-r1.psi.gz sample2-r2.psi.gz
$ julia bin/whippet-delta.jl -a sample1 -b sample2
OR
$ julia bin/whippet-delta.jl -a sample1-r1.psi.gz,sample1-r2.psi.gz -b sample2-r1.psi.gz,sample2-r2.psi.gz
Note: comparisons of single files still need a comma: -a singlefile_a.psi.gz, -b singlefile_b.psi.gz,
The output format for whippet-quant.jl
is saved into two core quant filetypes, .psi.gz
and .tpm.gz
files.
Each .tpm.gz
file contains a simple format compatible with many downstream tools (one for the TPM of each annotated isoform, and another at the gene-level):
Gene/Isoform | TpM | Read Counts |
---|---|---|
NFIA | 2897.11 | 24657.0 |
Meanwhile the .psi.gz
file is a bit more complex and requires more explanation. Quantified nodes can fall into a number of "node-type" categories:
Type | Interpretation |
---|---|
CE | Core exon, which may be bounded by one or more alternative AA/AD nodes |
AA | Alternative Acceptor splice site |
AD | Alternative Donor splice site |
RI | Retained intron |
TS | Tandem transcription start site |
TE | Tandem alternative polyadenylation site |
AF | Alternative First exon |
AL | Alternative Last exon |
BS | Circular back-splicing (output only when --circ flag is given) |
Each node is defined by a type (above) and has a corresponding value for Psi
or the Percent-Spliced-In.
The .psi.gz
file itself is outputted in the form:
Gene | Node | Coord | Strand | Type | Psi | CI Width | CI Lo,Hi | Total Reads | Complexity | Entropy | Inc Paths | Exc Paths |
---|---|---|---|---|---|---|---|---|---|---|---|---|
NFIA | 2 | chr1:61547534-61547719 | + | AF | 0.782 | 0.191 | 0.669,0.86 | 49.0 | K1 | 0.756 | 2-4-5:0.782 | 1-5:0.218 |
NFIA | 4 | chr1:61548433-61548490 | + | CE | 0.8329 | 0.069 | 0.795,0.864 | 318.0 | K2 | 1.25 | 2-4-5:0.3342,3-4-5:0.4987 | 1-5:0.1671 |
NFIA | 5 | chr1:61553821-61554352 | + | CE | 0.99 | NA | NA | NA | NA | NA | NA | NA |
NFIA | 6 | chr1:61743192-61743257 | + | CE | 0.99 | NA | NA | NA | NA | NA | NA | NA |
Whippet outputs quantification at the node-level (one PSI value for each node, not exon). For example, Whippet CE ('core exon') nodes may be bounded by one or more AA or AD nodes. So when those AA or AD nodes are spliced in, the full exon consists of the nodes combined (AA+CE).
We believe that this is (and should be) the correct generalization of event-level splicing quantification for the following reasons:
- Whippet allows for dynamic quantification of observed splicing patterns. Therefore, in order to maintain consistent output from different Whippet runs on various samples, the basic unit of quantification needs to be a
node
. - Whippet can handle highly complex events, in contrast to many other splicing quantification tools which only report binary event types. Since, for example, it is possible for both (CE+AA) and CE exons to be excluded from the mature message, complex events may involve a number of overlapping full exons. If Whippet output was enumerated for all such combinations, the output for complex events would grow exponentially and approach uselessness.
- The general definition of Percent-spliced-in for an exon (or node) is the percentage of transcripts that 'have that exon spliced in', irrespective of the upstream or downstream splice sites that connect to it (those merely alter another node's PSI). Therefore, we feel that the output of PSI values should not change based on upstream or downstream splice-sites as they might with some other programs.
This must be taken into account when intersecting .psi.gz
coordinate output with other formats that only represent full exons (which can be one or more adjacent nodes combined).
Note: Complexity refers to the discrete categories based-on the ceiling(log2(number of paths)) through each AS event. Entropy refers to the shannon-entropy of the relative expression of the paths through the AS event.
The raw junctions are output in the format of .jnc.gz
files, which look like:
Chrom | Donor_Coord | Acceptor_Coord | GeneID:Nodes:Type | ReadCount | Strand |
---|---|---|---|---|---|
chr1 | 150600068 | 150601890 | ENSG00000143420.17_1:2-3:CON_ANNO | 113.04 | - |
chr1 | 150598284 | 150599943 | ENSG00000143420.17_1:3-6:ALT_ANNO | 110.12 | - |
chr1 | 150595335 | 150598118 | ENSG00000143420.17_1:6-9:CON_ANNO | 35.74 | - |
chr1 | 161185160 | 161187776 | ENSG00000158869.10_1:1-2:CON_ANNO | 20.47 | + |
chr1 | 161185160 | 161188031 | ENSG00000158869.10_1:1-3:ALT_UNIQ | 3.11 | + |
The Type
can be one of three:
Type | Meaning |
---|---|
CON_ANNO | Annotated constitutive junction |
ALT_ANNO | Annotated alternative junction |
ALT_UNIQ | Unannotated alternative junction |
The output format of .diff.gz
files from whippet-delta.jl
is:
Gene | Node | Coord | Strand | Type | Psi_A | Psi_B | DeltaPsi | Probability | Complexity | Entropy |
---|---|---|---|---|---|---|---|---|---|---|
ENSG00000117448.13_2 | 9 | chr1:46033654-46033849 | + | CE | 0.95971 | 0.97876 | -0.019047 | 0.643 | K0 | 0.0 |
ENSG00000117448.13_2 | 10 | chr1:46034157-46034356 | + | CE | 0.9115 | 0.69021 | 0.22129 | 0.966 | K1 | 0.874 |
Between the set of replicates from -a and -b whippet-delta.jl
outputs a mean Psi_A (from emperical sampling of the joint posterior distribution) and mean Psi_B. It also outputs the mean deltaPsi (Psi_A - Psi_B) and the probability that there is some change in deltaPsi, such that P(|deltaPsi| > 0.0). To determine the significance of output in .diff.gz
files, two filters are strongly encouraged: (1) significant nodes should have a high probability (>=0.95) suggesting there is likely to be a difference between the samples, and (2) significant nodes should have an absolute magnitude of change (|DeltaPsi| > x) where x is some value you consider to be a biologically relevant magnitude of change (we suggest |DeltaPsi| > 0.1).
If you are building an index for another organism, there are some general guidelines that can help to ensure that the index you build is as effective as it can be. In general you should seek to:
- Use only the highest quality annotations you can find.
- If a high-quality annotation set like GENCODE, contains both high and low quality transcripts, filter it! For human, we only use TSL-1 annotations.
- Avoid giving annotations with 'indels' such as ESTs or mRNAs without filtering out invalid splice sites first.
- If you plan to align very short reads (~36nt), decrease the Kmer size (we have used 6nt before), otherwise the default Kmer size (9nt) should be used.
With all of the executables in Whippet/bin
, you can use the -h
flag to get a list of the available command line options and their usage. If you are having trouble using or interpreting the output of Whippet
then please ask a question in our gitter chat!. If you are having trouble running a Whippet executable in /bin, try running Pkg.test("Whippet")
from the Julia REPL, Whippet should run cleanly if your environment is working correctly (though it is untested on Windows, it should still work). If you still think you have found a bug feel free to open an issue in github or make a pull request!