
Usage of python program

python3 10(flanking length)
python3 10(flanking length)

Task description

The objective of this task is to predict whether the CTCF protein binds to a given DNA sequence. I approached it with a convolutional neural network (CNN); the approach described below was inspired by MNIST handwritten digit classification.

Data description

  • Training data

sample size: 77531
data format example (abbrev.): >chr3:13238050-13238150 CTGGCTGTCA...AGAAGAACAC 1

  • Testing data

sample size: 19383
data format example (abbrev.): CAGTTGGCCT...CACAAGTAGA

  • Testing data with label

sample size: 19383 (9709 positive, 9674 negative)
data format example (abbrev.): >chr20:42901189-42901289 CAGTTGGCCT...CACAAGTAGA 1

file name                   chromosome number   loci   sequence length   label
(training data)             chr #               loci   101               0 negative, 1 positive
(testing data)              N/A                 N/A    101               N/A
(testing data with label)   chr #               loci   101               0 negative, 1 positive
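As a rough illustration of the record layout described above, the sketch below parses one labelled or unlabelled record. It assumes the header, sequence, and label are separated by whitespace, as in the abbreviated examples; `Record` and `parse_record` are hypothetical names, and the sequence in the usage line is synthetic rather than taken from the data.

```python
from typing import NamedTuple, Optional


class Record(NamedTuple):
    chrom: Optional[str]
    start: Optional[int]
    end: Optional[int]
    sequence: str
    label: Optional[int]


def parse_record(text: str) -> Record:
    """Parse one whitespace-separated record (hypothetical helper)."""
    fields = text.split()
    chrom = start = end = label = None
    if fields[0].startswith(">"):
        # Header looks like ">chr3:13238050-13238150".
        chrom, span = fields[0][1:].split(":")
        start, end = (int(x) for x in span.split("-"))
        fields = fields[1:]
    sequence = fields[0].upper()
    if len(fields) > 1:
        # 0 = negative, 1 = positive
        label = int(fields[1])
    return Record(chrom, start, end, sequence, label)


# Synthetic 101-base sequence, used only to exercise the parser.
rec = parse_record(">chr3:13238050-13238150 " + "ACGT" * 25 + "A 1")
print(rec.chrom, rec.end - rec.start + 1, len(rec.sequence), rec.label)
```

The loci in the header are inclusive, so `end - start + 1` gives the sequence length of 101.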


I use two preprocessing scripts to convert the data files to pickle format, turning the sequence data into one-hot encoded form. The flanking length of the sequence can be set by the user.
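The one-hot step could look something like the sketch below. It is not the project's actual preprocessing code: it assumes that a flanking length of f keeps the centre base plus f bases on each side (2f + 1 bases in total), that the channel order is A, C, G, T, and that any other character (such as N) maps to an all-zero row. `BASES`, `trim_to_flank`, and `one_hot` are hypothetical names.

```python
BASES = "ACGT"  # assumed channel order, not confirmed by the source


def trim_to_flank(seq, flank):
    """Keep the centre base plus `flank` bases on each side (assumed meaning)."""
    center = len(seq) // 2
    if flank < 0 or flank > center:
        raise ValueError("flanking length does not fit the sequence")
    return seq[center - flank:center + flank + 1]


def one_hot(seq):
    """Encode each base as a 4-element row; unknown bases become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


# Synthetic 101-base sequence trimmed with a flanking length of 10.
window = trim_to_flank("ACGT" * 25 + "A", 10)
encoded = one_hot(window)
print(len(encoded), len(encoded[0]))
```

With a flanking length of 10, a 101-base sequence becomes a 21 x 4 matrix under these assumptions.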

Model description

# Written against the TensorFlow 1.x API, where the second argument of
# tf.nn.dropout is the keep probability. The layer*_weights, layer*_biases,
# and stride_* variables are defined elsewhere in the program.
import tensorflow as tf

def model_train(data):
    # Training graph: two convolutional layers and two fully connected
    # hidden layers, with dropout after the first three hidden layers.
    conv = tf.nn.conv2d(data, layer1_weights, [1, stride_1, stride_1, 1], padding='SAME')
    hidden = tf.nn.relu(conv + layer1_biases)
    drop = tf.nn.dropout(hidden, 0.75)
    conv = tf.nn.conv2d(drop, layer2_weights, [1, stride_2, stride_2, 1], padding='SAME')
    hidden = tf.nn.relu(conv + layer2_biases)
    drop = tf.nn.dropout(hidden, 0.75)
    # Flatten the feature maps before the fully connected layers.
    shape = drop.get_shape().as_list()
    reshape = tf.reshape(drop, [shape[0], shape[1] * shape[2] * shape[3]])
    hidden = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_biases)
    drop = tf.nn.dropout(hidden, 0.75)
    hidden = tf.nn.relu(tf.matmul(drop, layer4_weights) + layer4_biases)
    return tf.matmul(hidden, layer5_weights) + layer5_biases

def model(data):
    # Evaluation graph: same layers and weights, without dropout.
    conv = tf.nn.conv2d(data, layer1_weights, [1, stride_1, stride_1, 1], padding='SAME')
    hidden = tf.nn.relu(conv + layer1_biases)
    conv = tf.nn.conv2d(hidden, layer2_weights, [1, stride_2, stride_2, 1], padding='SAME')
    hidden = tf.nn.relu(conv + layer2_biases)
    # Flatten the feature maps before the fully connected layers.
    shape = hidden.get_shape().as_list()
    reshape = tf.reshape(hidden, [shape[0], shape[1] * shape[2] * shape[3]])
    hidden = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_biases)
    hidden = tf.nn.relu(tf.matmul(hidden, layer4_weights) + layer4_biases)
    return tf.matmul(hidden, layer5_weights) + layer5_biases
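The reshape step in the model requires the first dimension of `layer3_weights` to equal `shape[1] * shape[2] * shape[3]`. With `padding='SAME'`, each convolution produces an output of size ceil(input / stride) along each spatial axis, so that number can be worked out ahead of time. The sketch below does the arithmetic in plain Python; the input size, strides, and depth in the usage line are made-up values, since the source does not state them, and `same_padding_out` and `flattened_size` are hypothetical names.

```python
import math


def same_padding_out(size, stride):
    """Output length of a 'SAME'-padded convolution along one axis."""
    return math.ceil(size / stride)


def flattened_size(height, width, strides, depth):
    """Number of features left after the conv layers, given the last layer's depth."""
    for stride in strides:
        height = same_padding_out(height, stride)
        width = same_padding_out(width, stride)
    return height * width * depth


# Hypothetical values: 21 x 4 one-hot input, stride 2 in both conv layers,
# and 16 output channels in the second conv layer.
print(flattened_size(21, 4, strides=(2, 2), depth=16))
```

Under these example values the height goes 21, 11, 6 and the width goes 4, 2, 1, so the flattened vector has 6 * 1 * 16 = 96 features.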

Training parameters and settings

  • batch size = 290 (train_dataset.shape[0] // 200)
  • training steps = 15000
  • learning rate = 0.25 (starting rate) with exponential decay after 5000 steps, decay rate = 0.96
  • optimizer: GradientDescentOptimizer
  • regularization to reduce overfitting
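To make the learning rate schedule concrete, the sketch below reproduces the usual exponential decay formula, rate = base_rate * decay_rate ^ (step / decay_steps), in plain Python. It reads the "5000 steps" above as the decay interval and assumes staircase (stepwise) decay; both are assumptions, and `decayed_learning_rate` is a hypothetical helper rather than the project's code.

```python
def decayed_learning_rate(step, base_rate=0.25, decay_steps=5000,
                          decay_rate=0.96, staircase=True):
    """Exponentially decayed learning rate at a given training step."""
    # Staircase decay changes the rate only at multiples of decay_steps.
    exponent = step // decay_steps if staircase else step / decay_steps
    return base_rate * decay_rate ** exponent


for step in (0, 5000, 10000, 15000):
    print(step, round(decayed_learning_rate(step), 6))
```

Under this reading the rate drops only slightly over the 15000 training steps, from 0.25 to about 0.221.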


  • Training process
Steps   minibatch loss   minibatch accuracy   validation accuracy
0       1.036431         52.759 %             50.663 %
500     0.683988         62.069 %             60.486 %
1000    0.624465         65.862 %             71.129 %
1500    0.633703         68.276 %             78.105 %
2000    0.470493         78.621 %             84.543 %
2500    0.467052         81.724 %             80.426 %
3000    0.340815         85.517 %             87.535 %
3500    0.343168         87.241 %             88.155 %
4000    0.281665         90.000 %             88.222 %
4500    0.316459         89.310 %             89.223 %
5000    0.306602         88.621 %             89.651 %
5500    0.323350         85.517 %             89.656 %
6000    0.370277         84.828 %             88.278 %
6500    0.326758         86.897 %             89.764 %
7000    0.319943         87.586 %             90.069 %
7500    0.260177         90.345 %             90.058 %
8000    0.330911         87.931 %             89.795 %
8500    0.266430         91.034 %             90.378 %
9000    0.265763         90.000 %             90.528 %
9500    0.285275         91.724 %             90.358 %
10000   0.260739         90.690 %             90.487 %
10500   0.273746         89.655 %             89.919 %
11000   0.281265         91.034 %             90.590 %
11500   0.278727         91.034 %             90.512 %
12000   0.310407         88.621 %             90.595 %
12500   0.282559         87.586 %             90.523 %
13000   0.295398         89.310 %             90.812 %
13500   0.254486         90.000 %             90.636 %
14000   0.266505         89.655 %             90.672 %
14500   0.257149         90.690 %             90.579 %
15000   0.259550         88.966 %             90.559 %

  • Test accuracy: 90.517 %
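The accuracy figures above are percentages of correctly classified sequences. A minimal sketch of that calculation follows, assuming the model emits one score per class and the labels are one-hot encoded (as in the MNIST-style setup); `argmax` and `accuracy` are hypothetical helpers, and the predictions in the usage lines are invented.

```python
def argmax(values):
    """Index of the largest value in a sequence."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(predictions, labels):
    """Percentage of rows whose top-scoring class matches the one-hot label."""
    correct = sum(argmax(p) == argmax(y) for p, y in zip(predictions, labels))
    return 100.0 * correct / len(predictions)


# Invented scores for four sequences: columns are (negative, positive).
preds = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
labels = [[1, 0], [0, 1], [0, 1], [0, 1]]
print(f"{accuracy(preds, labels):.3f} %")
```

As a sanity check on the table, a minibatch accuracy of 52.759 % at step 0 is what 153 correct predictions out of a batch of 290 would give.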

Future work

The model I used in this task is fairly simple, so deeper models with more layers may give better results. A recurrent neural network (RNN) might also be another way to build or improve the model.


MNIST CNN tutorial
