SPOT
SPOT (SPlicing Outlier deTection) is a probabilistic framework for detecting splicing outliers from RNA-seq data. Briefly, SPOT fits a Dirichlet-Multinomial distribution directly to the counts of reads split across alternatively spliced exon-exon junctions for each gene. SPOT then identifies individuals that deviate significantly from the expectation under this fitted distribution. Please see [ref bioRxiv preprint] for more details.
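For readers unfamiliar with the Dirichlet-Multinomial, the sketch below evaluates its log-probability mass function for the junction counts of one sample in one cluster. It is the standard textbook form of the distribution, not code taken from spot.py; the function name, the example counts, and the concentration parameters are made up for illustration.

```python
from math import lgamma

def dirichlet_multinomial_logpmf(counts, alpha):
    """Log-probability of `counts` under a Dirichlet-Multinomial with
    concentration parameters `alpha` (one per junction in the cluster)."""
    if len(counts) != len(alpha):
        raise ValueError("counts and alpha must have the same length")
    n = sum(counts)
    alpha_0 = sum(alpha)
    # Normalising term: Gamma(alpha_0) * Gamma(n + 1) / Gamma(n + alpha_0)
    log_p = lgamma(alpha_0) + lgamma(n + 1) - lgamma(n + alpha_0)
    # Per-junction terms: Gamma(x_k + a_k) / (Gamma(a_k) * Gamma(x_k + 1))
    for x_k, a_k in zip(counts, alpha):
        log_p += lgamma(x_k + a_k) - lgamma(a_k) - lgamma(x_k + 1)
    return log_p

# Hypothetical cluster with three junctions and made-up parameters.
example_log_p = dirichlet_multinomial_logpmf([12, 3, 0], [5.0, 1.5, 0.5])
print(example_log_p)
```

A sample whose junction counts receive a very low probability under the fitted parameters is, informally, a candidate outlier for that cluster.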
Input data format
SPOT identifies outliers at the level of a LeafCutter cluster (see https://github.com/davidaknowles/leafcutter for more details). Therefore, to generate input data for SPOT you must follow the pre-processing steps described in LeafCutter:
- Align reads and generate exon-exon junction files. (described in steps 0 and 1 here: http://davidaknowles.github.io/leafcutter/articles/Usage.html)
- Run LeafCutter "Intron clustering" (described in step 2 here: http://davidaknowles.github.io/leafcutter/articles/Usage.html). This will generate a file with the suffix "perind_numers.counts.gz". This file is the input file for SPOT. SPOT will accept this file either gzip-compressed or uncompressed. Briefly, each column in this file corresponds to an RNA-seq sample and each row corresponds to an intron, identified as chromosome:intron_start:intron_end:cluster_id.
An example of a previously generated SPOT input file can be found at 'example_data/exon_exon_junction_file.txt'.
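To make the row and column layout concrete, here is a minimal sketch that groups per-sample counts by LeafCutter cluster. It assumes whitespace-delimited text with a header line of sample names; the function names and the example rows are hypothetical and are not part of SPOT.

```python
from collections import OrderedDict

def parse_intron_id(intron_id):
    """Split 'chromosome:intron_start:intron_end:cluster_id' into its parts."""
    chrom, start, end, cluster_id = intron_id.rsplit(":", 3)
    return chrom, int(start), int(end), cluster_id

def read_junction_counts(lines):
    """Group per-sample read counts by LeafCutter cluster.

    Assumes whitespace-delimited text: a header line of sample names,
    then one line per intron (ID first, followed by one count per sample).
    """
    lines = iter(lines)
    samples = next(lines).split()
    clusters = OrderedDict()
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        counts = [int(x) for x in fields[1:]]
        # Some writers label the ID column in the header; drop that label.
        if len(samples) == len(counts) + 1:
            samples = samples[1:]
        cluster_id = parse_intron_id(fields[0])[3]
        clusters.setdefault(cluster_id, []).append((fields[0], counts))
    return samples, clusters

# Made-up rows in the layout described above.
example = [
    "sampleA sampleB sampleC",
    "chr1:100:200:clu_1 21 13 18",
    "chr1:100:300:clu_1 4 0 7",
    "chr2:500:900:clu_2 9 11 10",
]
samples, clusters = read_junction_counts(example)
print(samples)
print(list(clusters))
```

For a real file, pass an open file handle instead of a list; on Python 3, `gzip.open(path, "rt")` yields text lines from gzip-compressed input.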
Note that in our paper [https://www.biorxiv.org/content/10.1101/786053v1], we applied the following custom filters to the LeafCutter files before running SPOT, in order to remove exon-exon junctions with low expression while retaining rare exon-exon junctions:
- Removed exon-exon junctions where no sample has >= 15 split reads
- Re-defined LeafCutter cluster assignments after removal of exon-exon junctions (according to the above filter) and removed exon-exon junctions that no longer shared a splice site with any other exon-exon junction.
- Removed all exon-exon junctions belonging to a LeafCutter cluster where more than 10% of the samples had fewer than 3 reads summed across all exon-exon junctions assigned to that LeafCutter cluster.
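The three filters above can be sketched as follows. This is an illustration only, not the script used in the paper: the function name and data layout are hypothetical, and the re-clustering rule (junctions on the same chromosome that share a start or an end coordinate are grouped together, ignoring strand) is an assumption about how cluster assignments were re-defined.

```python
from collections import defaultdict

def filter_junctions(junctions, min_reads=15, min_cluster_reads=3, max_low_frac=0.10):
    """junctions maps (chrom, start, end) -> list of per-sample split-read counts.
    Returns a dict mapping each retained junction to a new cluster index."""
    # Filter 1: drop junctions where no sample has >= min_reads split reads.
    kept = {j: c for j, c in junctions.items() if max(c) >= min_reads}

    # Filter 2 (assumed rule): re-cluster with union-find over splice sites, so
    # junctions sharing a start or an end coordinate land in the same cluster.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for chrom, start, end in kept:
        parent[find((chrom, "start", start))] = find((chrom, "end", end))

    components = defaultdict(list)
    for junction in kept:
        chrom, start, _ = junction
        components[find((chrom, "start", start))].append(junction)

    result = {}
    n_clusters = 0
    for members in sorted(sorted(m) for m in components.values()):
        # A junction alone in its cluster shares no splice site with any other.
        if len(members) < 2:
            continue
        # Filter 3: drop the cluster if more than max_low_frac of the samples
        # have fewer than min_cluster_reads reads summed over its junctions.
        totals = [sum(col) for col in zip(*(kept[j] for j in members))]
        low = sum(1 for t in totals if t < min_cluster_reads)
        if low > max_low_frac * len(totals):
            continue
        for j in members:
            result[j] = n_clusters
        n_clusters += 1
    return result

# Made-up counts for four samples.
example_junctions = {
    ("chr1", 100, 200): [20, 18, 25, 30],
    ("chr1", 100, 300): [16, 2, 5, 7],
    ("chr1", 150, 300): [3, 2, 1, 0],      # fails filter 1
    ("chr1", 500, 600): [40, 40, 40, 40],  # shares no splice site: fails filter 2
    ("chr2", 100, 200): [15, 0, 0, 0],     # cluster fails filter 3
    ("chr2", 100, 250): [15, 1, 0, 0],
}
filtered = filter_junctions(example_junctions)
print(filtered)
```

The filtered junctions would still need to be written back out in the LeafCutter count format, with the new cluster labels, before being passed to SPOT.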
Running SPOT
Once you have generated the SPOT input file (with help from LeafCutter), SPOT can be easily run using the following command:
python spot.py --juncfile $junction_file_name --outprefix $output_root
SPOT deliverables
SPOT will generate two files:
- $output_root"md.txt"
- $output_root"emperical_pvalue.txt"
Both files have dimension C x N, where C is the number of clusters and N is the number of samples. Each element of the first file is the Mahalanobis distance of a particular sample for a particular cluster. Each element of the second file is the empirical splicing outlier p-value of a particular sample for a particular cluster.
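As a usage sketch, the snippet below lists (cluster, sample) pairs whose p-value falls below a cutoff. The exact layout of the output files is not documented here, so the parser assumes a header line of sample names followed by one whitespace-delimited row per cluster (cluster ID first); check this against your own output before relying on it. The function name and the 0.01 threshold are hypothetical and are not a recommendation from the authors.

```python
def call_outliers(lines, threshold=0.01):
    """Return (cluster_id, sample, p_value) for every p-value below `threshold`.

    Assumes a header line of sample names, then one row per cluster:
    cluster ID followed by one p-value per sample. Verify against your file.
    """
    lines = iter(lines)
    header = next(lines).split()
    hits = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        cluster_id, values = fields[0], fields[1:]
        # Align sample names with values even if the header labels the ID column.
        samples = header[-len(values):]
        for sample, value in zip(samples, values):
            try:
                p_value = float(value)
            except ValueError:
                continue  # skip non-numeric entries such as "NA"
            if p_value < threshold:
                hits.append((cluster_id, sample, p_value))
    return hits

# Made-up p-value matrix with two clusters and three samples.
example_pvalues = [
    "cluster_id s1 s2 s3",
    "clu_1 0.5 0.001 0.2",
    "clu_2 NA 0.3 0.004",
]
outliers = call_outliers(example_pvalues)
print(outliers)
```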
Dependencies
Python packages:
- numpy
- sys (Python standard library)
- pystan
- gzip (Python standard library)
Testing environment
SPOT was developed and tested using the following versions:
- python 2.7.15
- numpy 1.15.4
- pystan 2.17.1.0
Authors
- Ben Strober -- BennyStrobes -- bstrober3@gmail.com