Benchmarking splice prediction tools

This repository contains the scripts used for the analysis of the manuscript "Benchmarking deep learning splice prediction tools using functional splice assays".

All variants and splice prediction scores used for the analysis are included in data/variant_scores.xlsx. The datasets are:

ABCA4 NCSS (non-canonical splice site) variants
ABCA4 DI (deep intronic) variants
MYBPC3 NCSS variants

Analysis scripts

`analysis_variants.py`

This script provides information about the variants in the dataset:

number of variants
- number of non splice altering variants
- number of splice altering variants
number of variants located close to the splice donor site (SDS)
- number of splice altering variants at the SDS
- number of non splice altering variants at the SDS
number of variants located close to the splice acceptor site (SAS)
- number of splice altering variants at the SAS
- number of non splice altering variants at the SAS
number of splice altering and non splice altering variants at each position in the non-canonical splice site

`confusion_matrix.py`

This script calculates, for each splice prediction tool used in the analysis, the optimal threshold and the corresponding confusion matrix. The confusion matrix is printed in the following layout:

TN	FP
FN	TP
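
As an illustration of that layout, here is a minimal sketch in plain Python. It is not the code from `confusion_matrix.py`; the scores, labels, threshold and the helper name `confusion_counts` are made up for the example, and it assumes a variant is called splice altering when its score is at or above the threshold.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels.

    Assumed encoding: 0 = non splice altering, 1 = splice altering.
    """
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return [[tn, fp], [fn, tp]]


# Hypothetical values, not taken from data/variant_scores.xlsx.
y_true = [0, 0, 1, 1, 1, 0]
scores = [0.1, 0.6, 0.9, 0.2, 0.8, 0.3]
threshold = 0.5

# Turn continuous tool scores into binary calls.
y_pred = [1 if score >= threshold else 0 for score in scores]

matrix = confusion_counts(y_true, y_pred)
for row in matrix:
    print("\t".join(str(count) for count in row))
```

The first row holds the counts for the non splice altering variants (TN, FP) and the second row those for the splice altering variants (FN, TP).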

`create_vcffile.py`

This script converts the variants into VCF format, which is required as input for CADD, MMSplice and SpliceAI. The script makes use of the pyhgvs package and requires a reference genome file and RefSeq transcripts.
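
To show what the output roughly looks like, the sketch below writes a minimal VCF from variants that have already been resolved to genomic coordinates. It does not reproduce `create_vcffile.py`: the HGVS-to-coordinate step handled by pyhgvs is left out, and the function name `to_vcf` and the placeholder variants are assumptions made for the example.

```python
def to_vcf(variants):
    """Build minimal VCF text from (chrom, pos, ref, alt, variant_id) tuples."""
    lines = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    # Sort records by chromosome and position before writing them.
    for chrom, pos, ref, alt, variant_id in sorted(variants, key=lambda v: (v[0], v[1])):
        # QUAL, FILTER and INFO are left as missing values (".").
        lines.append("\t".join([chrom, str(pos), variant_id, ref, alt, ".", ".", "."]))
    return "\n".join(lines) + "\n"


# Placeholder records for illustration only, not real variants from the dataset.
variants = [
    ("1", 200, "C", "T", "placeholder_2"),
    ("1", 100, "A", "G", "placeholder_1"),
]

vcf_text = to_vcf(variants)
print(vcf_text, end="")
```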

`functions.py`

This file contains functions that are used in the other scripts. These include:

delta_score: A function to calculate the delta score
read_scores_from_excel: A function to read the variants and scores from an excel sheet
reverse_sequence: Converts a sequence into the sequence of the complementary strand
Find_Optimal_Cutoff: Find the optimal probability cutoff point for a classification model related to event rate
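
Two of these helpers can be sketched as follows. The implementations in `functions.py` may differ: the sketch assumes that the delta score is the variant score minus the wild-type score, and that the complementary strand is returned as the reverse complement. The names `delta` and `reverse_complement` are chosen for the example only.

```python
# Translation table mapping each base to its complement (case is preserved).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the complementary strand, read in the 5' to 3' direction."""
    return sequence.translate(COMPLEMENT)[::-1]


def delta(wild_type_score, variant_score):
    """Assumed definition: change in score caused by the variant."""
    return variant_score - wild_type_score


print(reverse_complement("ATGC"))  # GCAT
print(delta(8.0, 3.5))             # -4.5
```

Under this sign convention, a negative delta means the tool scores the splice site lower for the variant than for the wild-type sequence.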

`roc.py`

The roc.py script creates the receiver operating characteristic (ROC) curve for the dataset. It also prints the area under the curve (AUC) for each tool.
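
For readers unfamiliar with how ROC points and the AUC are derived from tool scores, here is a dependency-free sketch. It is not the plotting code from `roc.py`; the labels and scores are made up, and it assumes that higher scores indicate a splice altering variant (label 1).

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points, calling a variant positive when score >= threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing called positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical example: 1 = splice altering, 0 = non splice altering.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

points = roc_points(labels, scores)
area = auc(points)
print(round(area, 3))  # 0.889
```

In this toy example, 8 of the 9 possible (splice altering, non splice altering) pairs are ranked correctly, which matches the AUC of 8/9.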

`roc_best5tools.py`

This script plots the ROC curves of the five best-performing tools for the dataset, including their AUC values.

Splice prediction tools

Alamut 3/4 consensus (consensus of GeneSplicer, MaxEntScan, NNSPLICE and SpliceSiteFinder-like)
CADD
DSSP
GeneSplicer
MaxEntScan
MMSplice
NNSPLICE
Spidex
SpliceAI
SpliceRover
SpliceSiteFinder-like
