This is a PyTorch implementation of SpliceAI (paper). This repository provides the training dataset and the SpliceAI model in PyTorch. Follow the steps below to reproduce the results from the SpliceAI paper.
python=3.8.6
numpy
torch
wandb
ref_genome should point to the hg19 genome.fa file.
# Get the sequence between (TSS - 5000, TES + 5000) for each gene.
bash data/grab_sequence.sh
# Create an h5 file with the following keys:
# NAME # Gene symbol
# PARALOG # 0 if no paralogs exist, 1 otherwise
# CHROM # Chromosome number
# STRAND # Strand in which the gene lies (+ or -)
# TX_START # Position where transcription starts
# TX_END # Position where transcription ends
# JN_START # Positions where canonical exons end
# JN_END # Positions where canonical exons start
# SEQ # Nucleotide sequence
python data/create_datafile.py train all
python data/create_datafile.py test 0
python data/create_dataset.py train all 1 pytorch
python data/create_dataset.py test 0 1 pytorch
python bin/train.py
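The first preprocessing step above extracts each gene's sequence with 5000 nt of flanking context on both sides, i.e. the interval (TSS - 5000, TES + 5000). Below is a minimal sketch of that coordinate arithmetic. It assumes 1-based inclusive coordinates with TX_START <= TX_END, and the helper name `flanked_window` and its `chrom_length` parameter are hypothetical, not part of this repository. The actual extraction is done by data/grab_sequence.sh, which may handle chromosome boundaries differently.

```python
from typing import Optional, Tuple

FLANK = 5000  # flanking context on each side, as in (TSS - 5000, TES + 5000)


def flanked_window(
    tx_start: int,
    tx_end: int,
    flank: int = FLANK,
    chrom_length: Optional[int] = None,
) -> Tuple[int, int]:
    """Return (start, end) of the gene body extended by `flank` on both sides.

    Hypothetical helper for illustration only. Coordinates are assumed to be
    1-based and inclusive, with tx_start <= tx_end.
    """
    if tx_start > tx_end:
        raise ValueError("tx_start must not exceed tx_end")
    # Clamp at the first base of the chromosome (assumption, not from the repo).
    start = max(1, tx_start - flank)
    end = tx_end + flank
    # Clamp at the chromosome end only when its length is known.
    if chrom_length is not None:
        end = min(end, chrom_length)
    return start, end


if __name__ == "__main__":
    print(flanked_window(12000, 20000))
    print(flanked_window(3000, 9000, chrom_length=10000))
```

Clamping is included because genes near a chromosome end have less than 5000 nt of context available on that side.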
The model achieves 0.95 top-k accuracy and 0.98 AUPRC after 30k steps.
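For readers unfamiliar with the top-k accuracy metric: in SpliceAI-style evaluation, k is the number of true splice sites of a given class in the test set, the k highest-scoring positions are taken as predictions, and the metric is the fraction of those that are true sites. Below is a minimal pure-Python sketch under that definition. The function name `top_k_accuracy` and the toy data are illustrative assumptions, and the evaluation code in bin/train.py may differ in detail.

```python
from typing import Sequence


def top_k_accuracy(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Fraction of the k top-scoring positions that are true sites,
    where k is the number of true sites in y_true.

    Illustrative helper, not taken from this repository.
    """
    if len(y_true) != len(y_score):
        raise ValueError("y_true and y_score must have the same length")
    k = sum(1 for label in y_true if label == 1)
    if k == 0:
        raise ValueError("y_true contains no positive positions")
    # Rank position indices by predicted score, highest first.
    ranked = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    hits = sum(1 for i in ranked[:k] if y_true[i] == 1)
    return hits / k


if __name__ == "__main__":
    # Toy example: three true sites among eight positions.
    labels = [0, 1, 0, 0, 1, 0, 1, 0]
    scores = [0.10, 0.90, 0.20, 0.80, 0.70, 0.05, 0.30, 0.15]
    print(top_k_accuracy(labels, scores))
```

In the toy example, two of the three highest-scoring positions are true sites, so the metric is 2/3. Because k is tied to the number of true sites, the metric needs no fixed score threshold.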