On the Relation of Gene Essentiality to Intron Structure: A Computational and Deep Learning Approach
Ethan Schonfeld, Edward Vendrow, Joshua Vendrow, Elan Schonfeld.
This repository contains the code for data processing, figure generation, and model training, as well as the weights of our final models.
Figure 1: Details of convolutional neural network and testing results
a, Our model uses a convolutional architecture to predict intron essentiality. The convolutional layer contains multiple filters that detect motifs within the intronic sequence. The pooling layer then averages each filter’s response across the sequence to measure the cumulative presence of each motif. The resulting values are fed into a fully connected layer followed by a two-value softmax output layer, whose outputs are the probabilities that the intron belongs to an essential or a nonessential gene. The best-performing model from our hyperparameter search used 128 convolutional filters with a window size of 24 and a fully connected layer with 128 neurons. We found the best results when training with an L2 regularization parameter of 10^-6 and a dropout rate of 0.2. We trained two models, one on the first 1000 bp of introns and one on the last 1000 bp. The first 1000 bp include the 5’ splice site; the last 1000 bp include the 3’ splice site and the branch site. In all following results, each model is tested on the same section of the intronic sequence that it was trained on.
b, The model trained on the first 1000 bp of introns had an AUC of 0.747, and the model trained on the last 1000 bp of introns had an AUC of 0.739. We predicted gene essentiality using a majority classifier over all introns of a gene. The majority classifier built on the model trained on the first 1000 bp reached an AUC of 0.843, and the majority classifier built on the model trained on the last 1000 bp reached an AUC of 0.824. Averaging the outputs of both majority classifiers improved performance further: this combined classification strategy achieved an AUC of 0.857.
c, Because the first intron is known to have unique properties, we separately tested the models on first introns only and saw improved performance. On first introns, the model trained on the first 1000 bp of introns had an AUC of 0.792 and the model trained on the last 1000 bp of introns had an AUC of 0.791. Averaging the outputs of both models into a dual average prediction further improved first-intron essentiality prediction, achieving an AUC of 0.835.
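The gene-level aggregation described in panel b can be sketched as follows. This is a minimal illustration, not the repository's implementation: it assumes that each intron casts a vote at a 0.5 probability threshold, that a gene's score is the fraction of its introns voting "essential", and that the combined strategy averages the two gene-level scores. The function names (`gene_scores`, `combine`, `auc`) and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def gene_scores(intron_probs, threshold=0.5):
    """Majority-vote score per gene: fraction of introns called essential.

    intron_probs: iterable of (gene_id, p_essential) pairs, one per intron.
    The 0.5 threshold is an assumption made for this sketch.
    """
    votes = defaultdict(list)
    for gene_id, p_essential in intron_probs:
        votes[gene_id].append(1 if p_essential >= threshold else 0)
    return {gene: mean(v) for gene, v in votes.items()}


def combine(first_scores, last_scores):
    """Average the two gene-level scores for genes scored by both models."""
    shared = first_scores.keys() & last_scores.keys()
    return {g: (first_scores[g] + last_scores[g]) / 2 for g in shared}


def auc(scores, labels):
    """Rank-based AUC: P(positive score > negative score), ties count 0.5."""
    pos = [scores[g] for g in scores if labels[g] == 1]
    neg = [scores[g] for g in scores if labels[g] == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy per-intron outputs from the two models (made-up numbers).
first_model = [("geneA", 0.9), ("geneA", 0.7), ("geneA", 0.4),
               ("geneB", 0.2), ("geneB", 0.6)]
last_model = [("geneA", 0.8), ("geneA", 0.3), ("geneA", 0.6),
              ("geneB", 0.1), ("geneB", 0.4)]
labels = {"geneA": 1, "geneB": 0}  # 1 = essential, 0 = nonessential

first_scores = gene_scores(first_model)
last_scores = gene_scores(last_model)
combined = combine(first_scores, last_scores)
print(combined, auc(combined, labels))
```

Using the vote fraction rather than a hard majority label keeps a continuous score per gene, which is what an AUC computation needs; a hard label would be obtained by thresholding that fraction at 0.5.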
Figure 2: Introns of essential genes differ from introns of nonessential genes by size, number, and position
a, The dashed green line represents the mean, and the notches are calculated using a Gaussian-based asymptotic approximation to represent confidence intervals around the medians (orange lines). The first introns of essential (p=0.0001), conditional (p<0.00001), and nonessential (p<0.00001) genes are larger than the later introns; however, the size difference between first and later introns is smaller for essential genes than for nonessential genes. The nonessential first intron is much longer (mean three times greater) than the essential first intron (p<0.00001). Later introns of nonessential genes are also longer than later introns of essential genes (p<0.00001), but this gap is smaller than the gap between first intron sizes. Introns of conditional genes typically fall in between.
b, Essential genes have a greater number of introns than both conditional (p=0.0383) and nonessential (p=0.0003) genes.
c, However, essential genes have a smaller total length of intronic sequence than both conditional (p<0.00001) and nonessential (p<0.00001) genes.
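For readers unfamiliar with the notch calculation mentioned in panel a, the sketch below computes notch limits as median ± 1.57 × IQR / √n, which is the Gaussian-based asymptotic approximation that matplotlib documents for notched box plots. That the figure was drawn this way is an assumption; the function name `notch_interval` and the toy lengths are hypothetical.

```python
from math import sqrt
from statistics import quantiles


def notch_interval(values):
    """Return (low, median, high) notch limits for a notched box plot.

    Uses median +/- 1.57 * IQR / sqrt(n). Quartiles are computed with
    linear interpolation over the data, including both end points.
    """
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    half_width = 1.57 * (q3 - q1) / sqrt(len(values))
    return median - half_width, median, median + half_width


# Toy intron lengths in bp (made-up numbers).
intron_lengths = [100, 200, 300, 400, 500]
low, median, high = notch_interval(intron_lengths)
print(low, median, high)
```

When the notches of two boxes do not overlap, the medians are conventionally taken to differ at roughly the 95% level; the p-values quoted in the caption come from the authors' own tests, which this sketch does not reproduce.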
Figure 3: Introns of essential genes differ from introns of nonessential genes by GC density and lower frequency of unusual 5’ / 3’ splice sites
a, The first introns of essential (p<0.00001), conditional (p<0.00001), and nonessential (p<0.00001) genes have a higher GC density than the later introns. First introns of essential (p=0.003) and conditional (p<0.00001) genes have a higher GC density than first introns of nonessential genes. The ratio of first-intron GC density to later-intron GC density is 1.1 for nonessential genes, 1.2 for conditional genes, and 1.3 for essential genes. The enrichment of GC density in first introns is therefore greatest for essential genes.
b, Introns of essential genes have unusual sequences at the 5’ splice site less frequently than introns of conditional genes, which in turn have them less frequently than introns of nonessential genes. The first intron of essential genes is less likely to have an unusual 5’ splice site than the first intron of conditional or nonessential genes. Additionally, essential first introns are less likely to have an unusual 5’ splice site than essential later introns. A conditional first intron is less likely to have an unusual 5’ splice site than a nonessential first intron, so this effect correlates with essentiality. The first intron of nonessential genes is the most likely to have an unusual 5’ splice site.
c, The first intron of essential genes is less likely to have an unusual 3’ splice site than the first intron of conditional genes, which in turn is less likely to have an unusual 3’ splice site than the first intron of nonessential genes. This effect again correlates with essentiality.
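The quantities compared in this figure can be illustrated with a short sketch. Two assumptions are made here that the caption does not state: GC density is taken to be the fraction of G or C bases in an intron, and an "unusual" splice site is taken to mean any departure from the canonical GT at the 5’ end or AG at the 3’ end of the intron. All function names and sequences below are hypothetical.

```python
def gc_density(seq):
    """Fraction of bases in seq that are G or C (assumed definition)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def has_unusual_5prime(intron, canonical="GT"):
    """True if the intron does not start with the canonical 5' dinucleotide."""
    return not intron.upper().startswith(canonical)


def has_unusual_3prime(intron, canonical="AG"):
    """True if the intron does not end with the canonical 3' dinucleotide."""
    return not intron.upper().endswith(canonical)


def first_to_later_gc_ratio(introns):
    """GC density of the first intron over the mean GC density of the rest.

    introns: list of intron sequences for one gene, in transcript order.
    Requires at least two introns.
    """
    first, later = introns[0], introns[1:]
    later_gc = sum(gc_density(s) for s in later) / len(later)
    return gc_density(first) / later_gc


# Toy gene with three short "introns" (made-up sequences).
introns = ["GTGCGCGCAG", "GTATATATAG", "GCATGCATAC"]
print(gc_density(introns[0]))            # GC fraction of the first intron
print(first_to_later_gc_ratio(introns))  # first vs. later GC density
print([has_unusual_5prime(s) for s in introns])
print([has_unusual_3prime(s) for s in introns])
```

The ratios reported in panel a (1.1, 1.2, 1.3) are computed per gene class rather than per gene, so a faithful reproduction would pool introns by class before taking the ratio.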