protein-DNA-binding-prediction

Usage of the Python programs

python3 encodingSeq_train.py train.data 10    # 10 = flanking length
python3 encodingSeq_test.py test.data 10      # 10 = flanking length

Task description

The objective of this task is to predict whether the CTCF protein can bind to a given DNA sequence. I approached it with a convolutional neural network (CNN); the approach described below was inspired by MNIST hand-written digit classification.

Data description

  • Training data

sample size: 77531
data format example (abbrev.): >chr3:13238050-13238150 CTGGCTGTCA...AGAAGAACAC 1

  • Testing data

sample size: 19383
data format example (abbrev.): CAGTTGGCCT...CACAAGTAGA

  • Testing data with label

sample size: 19383 (9709 positive, 9674 negative)
data format example (abbrev.): >chr20:42901189-42901289 CAGTTGGCCT...CACAAGTAGA 1

file name          chromosome number   loci   sequence length   label
train.data         chr #               loci   101               0 negative, 1 positive
test.data          N/A                 N/A    101               N/A
test_ans.data.txt  chr #               loci   101               0 negative, 1 positive
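
For reference, a minimal sketch of how one labeled record in the format above might be parsed; the function and field names here are illustrative only and are not part of the actual scripts.

def parse_labeled_line(line):
    # Split '>chrN:start-end SEQUENCE LABEL' into its three whitespace-separated fields.
    header, seq, label = line.strip().split()
    chrom, loci = header.lstrip('>').split(':')   # e.g. 'chr3', '13238050-13238150'
    return chrom, loci, seq, int(label)

# parse_labeled_line('>chr20:42901189-42901289 CAGTTGGCCT...CACAAGTAGA 1')
# -> ('chr20', '42901189-42901289', 'CAGTTGGCCT...CACAAGTAGA', 1)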

Preprocessing

encodingSeq_train.py and encodingSeq_test.py convert train.data and test.data into one-hot encoded sequences and save them in pickle format. The flanking length of the sequence can be chosen by the user via the second command-line argument.
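
The actual encoding is done by encodingSeq_train.py / encodingSeq_test.py; the sketch below only illustrates the idea, assuming the flanking length keeps the central base plus flank bases on each side of the 101-bp sequence (the real scripts may differ).

import numpy as np

BASE_INDEX = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def one_hot_encode(seq, flank=10):
    center = len(seq) // 2
    window = seq[center - flank: center + flank + 1]        # 2 * flank + 1 bases
    encoded = np.zeros((len(window), 4), dtype=np.float32)  # one row per base, one column per A/C/G/T
    for i, base in enumerate(window):
        if base in BASE_INDEX:                              # unknown bases (e.g. N) stay all zeros
            encoded[i, BASE_INDEX[base]] = 1.0
    return encoded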

Model description

import tensorflow as tf  # TensorFlow 1.x API; layer*_weights, layer*_biases, stride_1 and stride_2 are defined earlier in the script

def model_train(data):
    # Training-time network: two convolutional layers and two fully connected
    # layers, each followed by dropout (keep probability 0.75).
    conv = tf.nn.conv2d(data, layer1_weights, [1, stride_1, stride_1, 1], padding='SAME')
    hidden = tf.nn.relu(conv + layer1_biases)
    drop = tf.nn.dropout(hidden, 0.75)
    conv = tf.nn.conv2d(drop, layer2_weights, [1, stride_2, stride_2, 1], padding='SAME')
    hidden = tf.nn.relu(conv + layer2_biases)
    drop = tf.nn.dropout(hidden, 0.75)
    # Flatten the convolutional feature maps for the fully connected layers.
    shape = drop.get_shape().as_list()
    reshape = tf.reshape(drop, [shape[0], shape[1] * shape[2] * shape[3]])
    hidden = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_biases)
    drop = tf.nn.dropout(hidden, 0.75)
    hidden = tf.nn.relu(tf.matmul(drop, layer4_weights) + layer4_biases)
    return tf.matmul(hidden, layer5_weights) + layer5_biases

def model(data):
    # Evaluation-time network: the same architecture without dropout,
    # used for the validation and test predictions.
    conv = tf.nn.conv2d(data, layer1_weights, [1, stride_1, stride_1, 1], padding='SAME')
    hidden = tf.nn.relu(conv + layer1_biases)
    conv = tf.nn.conv2d(hidden, layer2_weights, [1, stride_2, stride_2, 1], padding='SAME')
    hidden = tf.nn.relu(conv + layer2_biases)
    shape = hidden.get_shape().as_list()
    reshape = tf.reshape(hidden, [shape[0], shape[1] * shape[2] * shape[3]])
    hidden = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_biases)
    hidden = tf.nn.relu(tf.matmul(hidden, layer4_weights) + layer4_biases)
    return tf.matmul(hidden, layer5_weights) + layer5_biases
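
The layer*_weights, layer*_biases, stride_1 and stride_2 referenced above are defined elsewhere in the training script, and their exact shapes are not shown in this README. The sketch below only illustrates how such TensorFlow 1.x variables could be declared; all sizes are placeholder assumptions, not the real settings.

patch, depth1, depth2 = 5, 16, 32   # placeholder filter size and depths, not the real ones
stride_1, stride_2 = 1, 2           # placeholder strides

layer1_weights = tf.Variable(tf.truncated_normal([patch, patch, 1, depth1], stddev=0.1))
layer1_biases = tf.Variable(tf.zeros([depth1]))
layer2_weights = tf.Variable(tf.truncated_normal([patch, patch, depth1, depth2], stddev=0.1))
layer2_biases = tf.Variable(tf.constant(1.0, shape=[depth2]))
# layer3_weights and layer4_weights are fully connected; their input dimension depends on the
# flattened convolutional output, and layer5_weights maps to the two classes (bind / no bind).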

Training parameters and settings

  • batch size = 290 (train_dataset.shape[0] // 200)
  • training steps = 15000
  • learning rate: 0.25 initially, with exponential decay (decay steps = 5000, decay rate = 0.96)
  • optimizer: GradientDescentOptimizer
  • regularization is applied to reduce overfitting (a sketch of this setup follows the list)
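
Putting these settings together, a hedged sketch of the training setup in the TensorFlow 1.x API; the placeholder names tf_train_dataset / tf_train_labels and the L2 form of the regularization term are assumptions, since the README does not specify them.

global_step = tf.Variable(0, trainable=False)
learning_rate = tf.train.exponential_decay(0.25, global_step, decay_steps=5000, decay_rate=0.96)

logits = model_train(tf_train_dataset)   # tf_train_dataset / tf_train_labels: assumed placeholder names
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits))
loss += 1e-4 * (tf.nn.l2_loss(layer3_weights) + tf.nn.l2_loss(layer4_weights))   # assumed L2 term

optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)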

Results

  • Training process
Steps minibatch loss minibatch accuracy validation accuracy
0 1.036431 52.759 % 50.663 %
500 0.683988 62.069 % 60.486 %
1000 0.624465 65.862 % 71.129 %
1500 0.633703 68.276 % 78.105 %
2000 0.470493 78.621 % 84.543 %
2500 0.467052 81.724 % 80.426 %
3000 0.340815 85.517 % 87.535 %
3500 0.343168 87.241 % 88.155 %
4000 0.281665 90.000 % 88.222 %
4500 0.316459 89.310 % 89.223 %
5000 0.306602 88.621 % 89.651 %
5500 0.323350 85.517 % 89.656 %
6000 0.370277 84.828 % 88.278 %
6500 0.326758 86.897 % 89.764 %
7000 0.319943 87.586 % 90.069 %
7500 0.260177 90.345 % 90.058 %
8000 0.330911 87.931 % 89.795 %
8500 0.266430 91.034 % 90.378 %
9000 0.265763 90.000 % 90.528 %
9500 0.285275 91.724 % 90.358 %
10000 0.260739 90.690 % 90.487 %
10500 0.273746 89.655 % 89.919 %
11000 0.281265 91.034 % 90.590 %
11500 0.278727 91.034 % 90.512 %
12000 0.310407 88.621 % 90.595 %
12500 0.282559 87.586 % 90.523 %
13000 0.295398 89.310 % 90.812 %
13500 0.254486 90.000 % 90.636 %
14000 0.266505 89.655 % 90.672 %
14500 0.257149 90.690 % 90.579 %
15000 0.259550 88.966 % 90.559 %


  • Test accuracy: 90.517 %

Future work

The model used in this task is not very complicated, and a deeper model with more layers might achieve better results. An RNN could also be an alternative way to build or improve the model.

References

MNIST CNN tutorial
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/udacity

DeepBind
http://www.nature.com/nbt/journal/v33/n8/full/nbt.3300.html
http://www.nature.com/nbt/journal/v33/n8/extref/nbt.3300-S2.pdf