Analysis scripts used for the manuscript Benchmarking deep learning splice prediction tools using functional splice assays

Primary language: Python. License: MIT.

Benchmarking splice prediction tools

This repository contains the scripts used for the analyses in the manuscript "Benchmarking deep learning splice prediction tools using functional splice assays".

All variants and splice prediction scores used for the analysis are included in data/variant_scores.xlsx. The datasets are:

  • ABCA4 NCSS (non-canonical splice site) variants
  • ABCA4 DI (deep intronic) variants
  • MYBPC3 NCSS variants

Analysis scripts

analysis_variants.py

This script provides information about the variants in the dataset:

  • number of variants:
    • number of non splice altering variants
    • number of splice altering variants
  • number of variants located close to the splice donor site (SDS)
    • number of splice altering variants at the SDS
    • number of non splice altering variants at the SDS
  • number of variants located close to the splice acceptor site (SAS)
    • number of splice altering variants at the SAS
    • number of non splice altering variants at the SAS
  • number of splice altering and non splice altering variants at each position in the non-canonical splice site
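
The counts listed above could be produced along the following lines. This is a minimal sketch, not the repository's code: the record fields (`splice_altering`, `site`, `position`) and the example records are hypothetical placeholders, not values from data/variant_scores.xlsx.

```python
from collections import Counter

# Hypothetical records; field names and values are illustrative only.
variants = [
    {"splice_altering": True,  "site": "SDS", "position": "+3"},
    {"splice_altering": False, "site": "SDS", "position": "+5"},
    {"splice_altering": True,  "site": "SAS", "position": "-3"},
    {"splice_altering": True,  "site": "SDS", "position": "+3"},
]

def summarize(records):
    """Count variants overall, per splice site, and per position."""
    total = Counter(r["splice_altering"] for r in records)
    per_site = Counter((r["site"], r["splice_altering"]) for r in records)
    per_position = Counter((r["position"], r["splice_altering"]) for r in records)
    return total, per_site, per_position

total, per_site, per_position = summarize(variants)
print("splice altering:", total[True])
print("non splice altering:", total[False])
for (site, altering), count in sorted(per_site.items()):
    print(site, "splice altering" if altering else "non splice altering", count)
```

Keying each counter on a tuple keeps the splice altering and non splice altering counts for a site or position side by side.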

confusion_matrix.py

This script calculates, for each splice prediction tool used in the analysis, the optimal threshold and the corresponding confusion matrix. The confusion matrix is printed with the true classes as rows and the predicted classes as columns:

TN FP
FN TP
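
The layout above can be illustrated with a small, dependency-free sketch. It assumes that a score at or above the threshold is called splice altering and that the "optimal" threshold maximizes Youden's J (sensitivity + specificity - 1); the criterion actually used by confusion_matrix.py may differ. The function names and the example data are hypothetical.

```python
def confusion_matrix_at(labels, scores, threshold):
    """Return [[TN, FP], [FN, TP]] when scores >= threshold are called positive."""
    tn = fp = fn = tp = 0
    for label, score in zip(labels, scores):
        predicted = score >= threshold
        if label and predicted:
            tp += 1
        elif label:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    return [[tn, fp], [fn, tp]]

def best_threshold(labels, scores):
    """Pick the observed score that maximizes Youden's J (an assumed criterion).

    Requires at least one positive and one negative label.
    """
    best, best_j = None, -1.0
    for t in sorted(set(scores)):
        (tn, fp), (fn, tp) = confusion_matrix_at(labels, scores, t)
        j = tp / (tp + fn) + tn / (tn + fp) - 1
        if j > best_j:
            best, best_j = t, j
    return best

# Made-up labels (1 = splice altering) and scores, for illustration only.
labels = [0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.6, 0.35, 0.7, 0.8, 0.9]
threshold = best_threshold(labels, scores)
for row in confusion_matrix_at(labels, scores, threshold):
    print(*row)
```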

create_vcffile.py

This script converts the variants into VCF format, which is required for CADD, MMSplice and SpliceAI. The script uses the pyhgvs package and requires a reference genome file and RefSeq transcripts.
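
As a rough illustration of the output side of this conversion, the sketch below writes a minimal VCF file. It assumes the variants have already been resolved to chromosome, position, reference allele and alternative allele (the step for which the script relies on pyhgvs), and it does not call pyhgvs itself. The function name `write_vcf` and the coordinates are hypothetical placeholders.

```python
import io

def write_vcf(variants, handle):
    """Write (chrom, pos, ref, alt) tuples as minimal VCF 4.2 records."""
    handle.write("##fileformat=VCFv4.2\n")
    handle.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")
    # Sort by chromosome name and position so the records are in coordinate order.
    for chrom, pos, ref, alt in sorted(variants, key=lambda v: (v[0], v[1])):
        handle.write(f"{chrom}\t{pos}\t.\t{ref}\t{alt}\t.\t.\t.\n")

# Placeholder coordinates for illustration; not real variants from the dataset.
example = [("1", 2000, "G", "A"), ("1", 1000, "C", "T")]
buffer = io.StringIO()
write_vcf(example, buffer)
print(buffer.getvalue())
```

Writing to a file-like handle means the same function works with an open file or an in-memory buffer.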

functions.py

This file contains functions that are used in other scripts. These include:

  • delta_score: A function to calculate the delta score
  • read_scores_from_excel: A function to read the variants and scores from an excel sheet
  • reverse_sequence: Converts a sequence into the sequence of the complementary strand
  • Find_Optimal_Cutoff: Find the optimal probability cutoff point for a classification model related to event rate
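
To give a feel for two of these helpers, here is a hedged sketch. The names and signatures below are hypothetical and need not match functions.py. In particular, the delta score is assumed here to be the absolute difference between the variant score and the wild-type score, and the strand conversion is assumed to be a reverse complement.

```python
# Translation table mapping each base to its complement (case preserved).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(sequence):
    """Return the complementary-strand sequence, read 5' to 3'."""
    return sequence.translate(COMPLEMENT)[::-1]

def delta_score(wild_type_score, variant_score):
    """Assumed definition: absolute change between variant and wild-type score."""
    return abs(variant_score - wild_type_score)

print(reverse_complement("ATGC"))
print(delta_score(0.5, 0.25))
```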

roc.py

The roc.py script creates the receiver operating characteristic (ROC) curve for the dataset. It also prints the area under the curve (AUC) for each tool.
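
The computation behind a ROC curve and its AUC can be sketched without any plotting or machine learning library. This is an illustration under stated assumptions, not the code in roc.py: higher scores are taken to indicate a splice altering variant, the AUC is computed with the trapezoidal rule, and the function names and example data are hypothetical.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs obtained by lowering the threshold over the scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for l, s in zip(labels, scores) if l and s >= t)
        fp = sum(1 for l, s in zip(labels, scores) if not l and s >= t)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up labels (1 = splice altering) and scores, for illustration only.
labels = [0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.6, 0.35, 0.7, 0.8, 0.9]
points = roc_points(labels, scores)
print(f"AUC = {auc(points):.3f}")
```

The resulting points can be passed to any plotting library to draw the curve.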

roc_best5tools.py

This script plots the ROC curves of the five best-performing tools for the dataset, including their AUC values.
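
Selecting the best tools could be as simple as ranking them by AUC, as in the sketch below. This assumes that "best" means highest AUC, which the script may or may not do. The function name, tool names and AUC values are placeholders for illustration only, not results from the manuscript.

```python
def top_tools(auc_by_tool, k=5):
    """Return the k (tool, AUC) pairs with the highest AUC, best first."""
    return sorted(auc_by_tool.items(), key=lambda item: item[1], reverse=True)[:k]

# Placeholder names and values; not results from the manuscript.
example_auc = {"tool_a": 0.91, "tool_b": 0.78, "tool_c": 0.85,
               "tool_d": 0.66, "tool_e": 0.95, "tool_f": 0.72}
for name, value in top_tools(example_auc):
    print(f"{name}: AUC = {value:.2f}")
```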

Splice prediction tools