Prediction of transcription start sites (TSS) in the corn (Zea mays) genome using a trained neural network.

PyCorn

Introduction

Our pipeline is an open-source tool for genome-wide prediction of transcription start sites (TSS) from maize genome data. Using a trained neural network, the pipeline takes a sequence as input and outputs the coordinates of possible TSS locations.

The pipeline is composed of two main stages: training and testing. During the training phase, the parameters of the neural network are set. A trained neural network is supplied by default, and instructions are provided for users who wish to train the network themselves. For the testing phase, the user supplies a file that contains genomic data from the Zea mays genome in FASTA format. The output file will contain the coordinates of possible TSS locations.

Installation

First, install scikit-neuralnetwork:

pip install scikit-neuralnetwork

To install the pipeline, simply clone the repository:

git clone https://github.com/adamscarlat/pyCorn.git

Or download it as a zip file.

Training Phase

Data Preparation

To build the neural-network model, we used 75,681 pre-labeled genomic coordinates of predominant TSS from Mejia-Guerra et al., 2015 as our positive data, and used pybedtools to generate the corresponding sequences. For negative data, we picked nucleotides from the rest of the corn genome that were not labeled as predominant TSS. To capture the sequence pattern around each TSS (or non-TSS in the negative data), we chose a frame centered on the target nucleotide that extends 500 nucleotides upstream and 499 nucleotides downstream, for a total of 1,000 nucleotides.

The coordinates, from either the original bed file provided by Mejia-Guerra or our choice of negative data, along with the whole corn genomic sequence are fed into pybedtools to generate a sequence in FASTA format. This serves as our main training data.
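The frame described above can be sketched in plain Python. This is only an illustration: the function name `tss_window` is hypothetical, and the sketch assumes 0-based coordinates with an exclusive end, as in BED intervals. It does not call pybedtools.

```python
def tss_window(center, upstream=500, downstream=499):
    """Return a 0-based, half-open (start, end) interval around `center`.

    Assumption: `center` is the 0-based position of the target nucleotide.
    The name and signature are illustrative, not part of PyCorn.
    """
    start = center - upstream
    end = center + downstream + 1  # +1 because the end is exclusive
    if start < 0:
        raise ValueError("window would start before the beginning of the sequence")
    return start, end


start, end = tss_window(10000)
print(start, end, end - start)  # 9500 10500 1000
```

With these defaults the target nucleotide always sits at index 500 of the 1,000-nucleotide frame.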

Sequences that contain more than 10 'N' nucleotides (bad reads) are ignored by PyCorn.
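A minimal sketch of such a filter is shown below. The helper names are hypothetical and the code is not taken from PyCorn; it only illustrates the "more than 10 'N'" rule, counting lowercase `n` as well, which is an assumption.

```python
def has_too_many_n(sequence, max_n=10):
    """True if the sequence contains more than `max_n` 'N' nucleotides."""
    return sequence.upper().count("N") > max_n


def filter_sequences(records, max_n=10):
    """Keep only (name, sequence) pairs that pass the 'N' threshold."""
    return [(name, seq) for name, seq in records if not has_too_many_n(seq, max_n)]


records = [
    ("kept", "ACGT" * 5 + "N" * 10),     # exactly 10 'N': still accepted
    ("dropped", "ACGT" * 5 + "N" * 11),  # 11 'N': ignored
]
print([name for name, _ in filter_sequences(records)])  # ['kept']
```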

Construction of the Neural Network

Input nodes: 1000 nodes that represent the sequence of input data

Hidden layer: 1 hidden layer with 128 nodes

Output nodes: 2 nodes that specify whether the central nucleotide is a TSS or not
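To get a feel for the size of this network, the sketch below counts its trainable parameters. It assumes fully connected layers with one bias per node, which the description above does not state explicitly, and the function name is hypothetical.

```python
# Layer sizes as listed above: input, hidden, output.
LAYER_SIZES = [1000, 128, 2]


def count_parameters(layer_sizes):
    """Count weights and biases, assuming fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection
        total += n_out         # one bias per receiving node
    return total


print(count_parameters(LAYER_SIZES))  # 128386
```

Under these assumptions almost all parameters (128,128 of 128,386) sit between the input and hidden layers.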

Finding the Best Model

By varying the amount of positive and negative training data, as well as some other parameters, we can find the best neural-network model. You can find more details about those parameters in the Performance Evaluation section. The best model is serialized and stored as a separate file which can be replaced in the future if a better model is found.

Running the Pipeline

You can find the main script pycorn.py in App\src\.

The command obeys the following format:

$ python pycorn.py -i inputFile -o outputFile -w windowSlideSize

inputFile denotes the file containing the genomic sequence in FASTA format
outputFile denotes the file that receives the predicted TSS positions and their neighboring nucleotides
windowSlideSize denotes the window size used when scanning the genome

Example:

$ python pycorn.py  -i myGenome.fa  -o tssLocation -w 100

NOTE: If no parameters are supplied, test data from testInputSmall will be used with a default window size of 100. This test data simulates a small genomic sequence. The output will be saved to testResult.txt.

The parameter windowSlideSize defines the resolution of the search. In our testing, a smaller window size gave better resolution up to a certain point. A window size that is too small results in too many false positives, while a window size that is too large results in too many false negatives. The recommended window slide size is between 50 and 100.
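One way to picture the scan is sketched below. It assumes that windowSlideSize is the step between consecutive 1,000-nucleotide frames and that each frame's candidate TSS is the nucleotide at index 500; neither detail is confirmed here, and the function name is hypothetical.

```python
def window_starts(sequence_length, window_length=1000, slide=100):
    """List the 0-based start of every full window, moving `slide` at a time."""
    if slide <= 0:
        raise ValueError("slide must be a positive integer")
    return list(range(0, sequence_length - window_length + 1, slide))


starts = window_starts(1250, slide=100)
print(starts)                       # [0, 100, 200]
print([s + 500 for s in starts])    # candidate centers: [500, 600, 700]
print(len(window_starts(1250, slide=50)))  # 6
```

Under this reading, halving the slide size roughly doubles the number of positions that the network evaluates.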

Input File

Input sequences must be in FASTA format: a single-line description starting with a >, followed by lines of sequence data.

An example of a valid input file (chromosome 1 of Zea mays):

>1 dna:chromosome chromosome:AGPv3:1:1:301476924:1
GAATTCCAAAGCCAAAGATTGCATCAGTTCTGCTGCTATTTCCTCCTATCATTCTTTCTG
ATGTTGAAAATGATATTAAGCCTAGGATTCGTGAATGGGAGAAGGTATTTTTGTTCATGG
TAGTCATTGGAACCTGCTAGATTGTACACTTGACAATAACATATATTAATATTAGTGACC
CCATTTTTAAATTTCCTAGGCTGGCATTGAACAAGACTATGTTAGTAGGATGTTGTTGAA
...
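For readers unfamiliar with the format, a minimal FASTA reader can be sketched as follows. It is a generic illustration with a hypothetical function name, not the parser PyCorn uses.

```python
def parse_fasta(lines):
    """Return a list of (description, sequence) pairs from FASTA lines."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = [
    ">1 dna:chromosome chromosome:AGPv3:1:1:301476924:1",
    "GAATTCCAAAGCCAAAGATTGCATCAGTTCTGCTGCTATTTCCTCCTATCATTCTTTCTG",
    "ATGTTGAAAATGATATTAAGCCTAGGATTCGTGAATGGGAGAAGGTATTTTTGTTCATGG",
]
description, sequence = parse_fasta(example)[0]
print(description.split()[0], len(sequence))  # 1 120
```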

Output File

You can check the format of output files:

$ cat ../output

You can find the results of PyCorn in this directory.

output
This file contains the predicted position of transcription start sites.
    Column_No.	Description
        1		Coordinate of transcription start sites in genome
        2		Sequence of transcription start site and its neighbors on both sides
For example,
	167 	...ACGTG[C]ACGGT...
	653		...TGCCA[G]CGTGT...
	1355	...GATCG[A]TGCCA...
			......
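The two-column layout can be read back with a few lines of Python. The sketch below assumes that columns are separated by whitespace and that the predicted site is the single nucleotide inside square brackets, as in the example rows above; the function name is hypothetical.

```python
import re

# Coordinate, whitespace, then the sequence context with no spaces in it.
LINE_PATTERN = re.compile(r"^\s*(\d+)\s+(\S+)\s*$")
SITE_PATTERN = re.compile(r"\[([ACGTN])\]")


def parse_output_line(line):
    """Return (coordinate, site_nucleotide, context), or None for other lines."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    context = match.group(2)
    site = SITE_PATTERN.search(context)
    return int(match.group(1)), (site.group(1) if site else None), context


print(parse_output_line("\t167 \t...ACGTG[C]ACGGT..."))
# (167, 'C', '...ACGTG[C]ACGGT...')
print(parse_output_line("\t\t\t......"))  # None
```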

Performance Evaluation

The performance of the trained model was evaluated by collecting the training error of each epoch. We observed a decrease of 90% in the training error over a span of 40 epochs. The validation rate obtained for the given network was 77%.

To test the result of the pipeline, we matched our results against a sequence of 60,000 nucleotides from the Zea mays genome that does not contain a TSS. We tested five different neural network models, which are displayed below:

Model   Sequence Length   Negative Sequences   Positive Sequences   sensitivity   specificity
1       400               100,000              75,000               0.809         0.702
2       400               100,000              15,000               0.046         0.994
3       400               100,000              75,000               0.648         0.799
4       800               100,000              75,000               0.44          0.898
5       400               100,000              15,000               0.181         0.967

Sequence Length - length of the training example and testing sequences
Negative Sequences - number of negative sequences in the negative set
Positive Sequences - number of positive sequences in the positive set
sensitivity - true positive rate
specificity - true negative rate
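For reference, the two rates can be computed from confusion-matrix counts as sketched below. The counts in the usage lines are invented purely for illustration and are not results from PyCorn.

```python
def sensitivity(true_positives, false_negatives):
    """True positive rate: TP / (TP + FN)."""
    total = true_positives + false_negatives
    return true_positives / total if total else float("nan")


def specificity(true_negatives, false_positives):
    """True negative rate: TN / (TN + FP)."""
    total = true_negatives + false_positives
    return true_negatives / total if total else float("nan")


# Hypothetical counts, chosen only to show the arithmetic.
print(sensitivity(true_positives=80, false_negatives=20))  # 0.8
print(specificity(true_negatives=70, false_positives=30))  # 0.7
```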