uci-cbcl/DanQ

How to convert the convolution kernels to motifs?

Closed this issue · 19 comments

Hello! Can you explain how to convert the convolution kernels to motifs in detail ? Although I have read the paper about DeepBind model , I cannot understand even find the responding contents.Thanks!

I'm trying to do the same thing but I can't find any resources anywhere that describe it. I asked a question on reddit with no luck.

I also e-mailed the authors of DeepBind and DeepSEA but no one is responding :( It's very esoteric in the methods and hard to conceptualize.

For example,

Are the position weight matrices generated from the convolution or the max-pooling layers? The convolutional layer weights are of shape (n_kernels, 4 nucleotides, n_filters) and it looks like each 2D slice of (4 nucleotides, n_filters) may be the position weight matrix but I am not sure. I understand how convolutional and pooling layers work but there are not many resources that describe how to interpret the weights.

The weights are negative and positive value so I don't know if it would make sense to scale them, absolute value them, or how to interpret the axis.

Does the ordering of the dimensions have any relevance to the one-hot encoded DNA in the input?

Hi! You'll find that I'm a lot more responsive than the authors of DeepBind and DeepSEA :)

Converting convolutional filters directly to PWMs is very difficult. Instead, we use an empirical approach. You need a large collection of sequences (e.g. 10,000). Out of personal preference, I use sequences that are positive for NRSF binding from any cell (this is completely arbitrary, you can pick whatever sequences you want).

To simplify things, let's say I have one convolutional filter of length 26 bps I want to convert to a PWM, and 1 sample DeepSEA sequence of length 1,000 bps. I would scan the sample sequences with the convolutional filter at each possible position. Since my filter is 26 bps, it can scan a 1,000 bp sequence at 1000 - 26 + 1 = 975 locations. After collecting 975 activation values, I would select the maximum value. If that maximum value is above 0, I would save the 26 bp subsequence corresponding to that activation (this can be achieved with the argmax function). Repeat this procedure for the rest of your sample 1,000 bp sequences, and align all the valid 26 bp subsequences to form a position frequency matrix, which can then be converted to a position weight matrix fairly straightforward. Some code example of this can be found in my repository for the FactorNet project: https://github.com/uci-cbcl/FactorNet/blob/master/visualize.py

For your convenience, I've attached code to do this with DeepSEA data (with the newest version of Keras). Hopefully you can understand it without comments, but if you need help understanding it, I'll be happy to answer:

from __future__ import print_function
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv1D, MaxPooling1D, LSTM
from keras import backend as K

import numpy as np
import h5py
import scipy.io
np.random.seed(420) # for reproducibility

print('Loading data')
trainmat = h5py.File('data/train.mat')
validmat = scipy.io.loadmat('data/valid.mat')
testmat = scipy.io.loadmat('data/test.mat')

X_train = np.transpose(np.array(trainmat['trainxdata']),axes=(2,0,1))
y_train = np.array(trainmat['traindata']).T

trainmat.close()

total_pos_train = y_train.sum()
print('Sparsity: %0.4f' % (1 - 1.0 * total_pos_train / np.prod(y_train.shape)))

X_valid = np.transpose(validmat['validxdata'],axes=(0,2,1))
y_valid = validmat['validdata']

X_test = np.transpose(testmat['testxdata'],axes=(0,2,1))
y_test = testmat['testdata']

model = Sequential()

conv_layer = Conv1D(filters=320,
                    kernel_size=26,
                    strides=1,
                    padding='valid',
                    activation='relu',
                    input_shape=(1000,4))

model.add(conv_layer)
model.add(MaxPooling1D(pool_size=13,
                       strides=13))

model.add(LSTM(320, return_sequences=True))

model.add(Flatten())
model.add(Dense(925,
                activation='relu'))
model.add(Dense(919,
                activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

model.fit(X_train, y_train,
          batch_size=100,
          epochs=2,
          validation_data=(X_valid, y_valid),
          shuffle=True)

t_results = model.evaluate(X_test, y_test,
                           batch_size=100,
                           verbose=1)
print(t_results)

conv_output = conv_layer.get_output_at(0)
f = K.function([model.input], [K.argmax(conv_output, axis=1), K.max(conv_output, axis=1)])

motifs = np.zeros((320, 26, 4))
nsites = np.zeros(320)

y_test_NRSF = y_test[:, [184, 223, 269, 292, 313, 314, 351, 373, 377, 382, 383, 399]].sum(axis=1) > 0
X_test_NRSF = X_test[y_test_NRSF]

for i in range(0, len(X_test_NRSF), 100):
    x = X_test_NRSF[i:i+100]
    z = f([x])
    max_inds = z[0] # N x M matrix, where M is the number of motifs
    max_acts = z[1]
    for m in range(320):
        for n in range(len(x)):
            # Forward strand
            if max_acts[n, m] > 0:
                nsites[m] += 1
                motifs[m] += x[n, max_inds[n, m]:max_inds[n, m] + 26, :]

print('Making motifs')

motifs = motifs[:, :, [0, 2, 1, 3]]

motifs_file = open('motifs.txt', 'w')
motifs_file.write('MEME version 4.9.0\n\n'
                  'ALPHABET= ACGT\n\n'
                  'strands: + -\n\n'
                  'Background letter frequencies (from uniform background):\n'
                  'A 0.25000 C 0.25000 G 0.25000 T 0.25000\n\n')

for m in range(320):
    if nsites[m] == 0:
        continue
    motifs_file.write('MOTIF M%i O%i\n' % (m, m))
    motifs_file.write("letter-probability matrix: alength= 4 w= %i nsites= %i E= 1337.0e-6\n" % (26, nsites[m]))
    for j in range(26):
        motifs_file.write("%f %f %f %f\n" % tuple(1.0 * motifs[m, j, 0:4] / np.sum(motifs[m, j, 0:4])))
    motifs_file.write('\n')

motifs_file.close()

And please, cite DanQ and FactorNet if you can. Thanks.

First off, thank you so much for responding to this and doing this research. I just recently started learning about Deep Neural Networks and asked on reddit for some recommendations of use cases in biological frameworks. Your paper seemed the most interesting and it was using Keras which seemed perfect.

I want to try out some of this code with a smaller dataset (DeepSEA is massive) but I am a little confused on a couple things:

(1) What is the shape and dimensions of Y? It looks like X is (number of sequences, 1000 positions, 4 nts) but I'm not sure what Y is. Is it one-hot encoded [OHE] for 919 categories? Or is it a multilabel vector whose dimensions are 919 where there can be multiple 1s (instead of OHE).

(2) In recurrent neural networks when return_sequences = True does it output a value for each LSTM cell (320 in this case) and if it's False does it generate a single value for the entire 1000 nt sequence? (c | one to many) and (d | many to one), respectively from this snippet of Deep Learning with Keras

(3) Is the 925 neurons in model.add(Dense(925, activation='relu')) significant (i.e. p * q from some other step)?

(4)

"After collecting 975 activation values, I would select the maximum value. If that maximum value is above 0, I would save the 26 bp subsequence corresponding to that activation (this can be achieved with the argmax function)."

Does this mean that for every kernel (320 of them of shape (4, 26)) you get argmax and max of the the weight matrix OR the actual output values that go from the Conv1D to the MaxPooling1D layer?

I have a couple deas of how I can apply this for current research and will definitely cite DanQ and FactorNet (at the very least) if publications or posters come from this. Right now I am in the phase of learning how these networks work! Your work is really interesting and a very enlightening usage case since all I see are image/speech recognition tutorials for conv nets on the interwebs.

  1. Y has the shape (number of sequences, 919). The 919 corresponds to binary values indicating the presence/absence of a chromatin mark (e.g. does CTCF bind here in the GM12878 cell line?). Hence, each row is a binary vector of length 919. This is an example of multi-task learning, where each training sample can have multiple unrelated, non-mutually exclusive labels.

2)You're almost correct. If return_sequences = True, then it would output a value for each LSTM cell at each scanning position. If return_sequences = False, then it would output a value for each LSTM cell at only the LAST scanning position. You can play around with this pretty easily in keras. Try different combinations and use the model.summary() command to see what you get.

3)No, I just picked the number 925 because it looked nice. The DeepSEA authors used 925 as well, and I thought it would look nice if I used the same number.

  1. The latter. You know those values that the MaxPooling1D layer is applied to? I'm using those values. You can see it most clearly in these lines of code:
conv_output = conv_layer.get_output_at(0)
f = K.function([model.input], [K.argmax(conv_output, axis=1), K.max(conv_output, axis=1)])

This is making a lot of sense now. I'm going to have to brush up on exactly how LSTMs work but this answers most of my questions about implementing the actual network. Thanks again this is very helpful.

@daquang, I was taking a look at the code above and trying to decipher the motif creation portion. Would you mind describing what is happening in :

motifs = np.zeros((320, 26, 4))
nsites = np.zeros(320)

y_test_NRSF = y_test[:, [184, 223, 269, 292, 313, 314, 351, 373, 377, 382, 383, 399]].sum(axis=1) > 0
X_test_NRSF = X_test[y_test_NRSF]

for i in range(0, len(X_test_NRSF), 100):
    x = X_test_NRSF[i:i+100]
    z = f([x])
    max_inds = z[0] # N x M matrix, where M is the number of motifs
    max_acts = z[1]
    for m in range(320):
        for n in range(len(x)):
            # Forward strand
            if max_acts[n, m] > 0:
                nsites[m] += 1
                motifs[m] += x[n, max_inds[n, m]:max_inds[n, m] + 26, :]

I'm not sure what the [184, 223, 269, 292, 313, 314, 351, 373, 377, 382, 383, 399] indicies represent. Can you explain what these values are and why they are used to index the test set?

The (320, 26, 4) shape of motifs seems to represent the filters, sequence window, and nucleotide channel which gets populated by f = K.function([model.input], [K.argmax(conv_output, axis=1), K.max(conv_output, axis=1)]). Is this correct?

Can you explain why you are iterating through range(0, len(X_test_NRSF), 100)?

What do you mean by z[0] (where M is the number of motifs)? Is this the number of sequence windows that contain a certain motif?

What exactly is X_test_NRSF[i:i+100] and why is length 100?

Thanks! I think this should be useful for @believezl as well.

The indices [184, 223, 269, 292, 313, 314, 351, 373, 377, 382, 383, 399] correspond to NRSF labels for different cell lines. As I said before, you need to collect some example sequences, and I arbitrarily chose any sequence that overlaps an NRSF binding site. I could have easily just used the entire X_test data tensor, but then this script would take a lot longer to complete. This is just my way of reducing the number of example sequences to process. If you wanted to, you can just take the first 10,000 sequences for example: X_test[0:10000].

You are correct about the shape of motifs.

Can you explain why you are iterating through range(0, len(X_test_NRSF), 100)? I need to iterate through the example sequences in minibatches. The f function applies the convolution operation on a minibatch of data, finds the max activation, and the location of the max activation. I highly suggest you play around with this (probably in the ipython terminal) on you own to understand this better.

What do you mean by z[0]? z is a tuple output from the functional f. z is size 2. The first item contains a matrix of activations from a minibatch input. Suppose x_test is a minibatch of sequences and has the size (100, 1000, 4) (again, 100 is the minibatch size and is completely arbitrary). Then z = f(x_test). z[0] would be a matrix of size (100, 320), and contain integers telling you the location of the maximum activation for each minibatch sample sequence for each filter. For example, z[0][50,12] would tell you the location on sequence 50 where filter 12 achieves its maximum activation. z[1] has the same shape as z[0], but tells you the actual max activation value.

This is some pretty advanced usage of keras, so I highly suggest you check out the keras website for more info (https://keras.io/getting-started/faq/#how-can-i-obtain-the-output-of-an-intermediate-layer).

What exactly is X_test_NRSF[i:i+100] and why is length 100? 100 is the batch size. I need to process the example sequences in minibatches. 100 is a completely arbitrary number.

Thank you. The motif code makes sense now and I extract get motifs from the activations. However, I had a question about the return_sequences=True (Please let me know if this warrants it's own GitHub issue).

The shapes do not seem to match in the last layer. It appears that when return_sequences = True, each "sequence" of the biLSTM is being categorized.
image

For a simplified version of this, I set return_sequences=False and the i/o is set for classifying each sequence.

NUMBER_OF_POSITIONS = 500

# Build model
# ===========
# Input Layer
input_layer = Input(shape=(NUMBER_OF_POSITIONS,4))
# Hidden Layers
y = Conv1D(320, 26, strides=1, activation="relu", kernel_initializer="glorot_normal")(input_layer)
y = MaxPooling1D(pool_size=13, strides=13)(y)
y = Dropout(0.2)(y)
y = Bidirectional(LSTM(320, return_sequences = False, activation="tanh", kernel_initializer="glorot_normal"))(y)
y = Dropout(0.5)(y)
y = Flatten()(y)
y = Dense(128, activation='relu', kernel_initializer="glorot_normal")(y)
y = Dropout(0.5)(y)
y = Dense(1)(y)

# Output layer
output_layer = Activation("sigmoid")(y)

model = Model(input_layer, output_layer)
model.compile(optimizer=tf.keras.optimizers.Adam(lr=0.001), loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_38 (InputLayer)        (None, 500, 4)            0         
_________________________________________________________________
conv1d_40 (Conv1D)           (None, 475, 320)          33600     
_________________________________________________________________
max_pooling1d_40 (MaxPooling (None, 36, 320)           0         
_________________________________________________________________
dropout_98 (Dropout)         (None, 36, 320)           0         
_________________________________________________________________
bidirectional_30 (Bidirectio (None, 640)               1640960   
_________________________________________________________________
dropout_99 (Dropout)         (None, 640)               0         
_________________________________________________________________
flatten_37 (Flatten)         (None, 640)               0         
_________________________________________________________________
dense_76 (Dense)             (None, 128)               82048     
_________________________________________________________________
dropout_100 (Dropout)        (None, 128)               0         
_________________________________________________________________
dense_77 (Dense)             (None, 1)                 129       
_________________________________________________________________
activation_9 (Activation)    (None, 1)                 0         
=================================================================
Total params: 1,756,737
Trainable params: 1,756,737
Non-trainable params: 0

If return_sequences = True as in the original, I have to remove the last Flatten() layer to successfully compile and I get the following summary:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_42 (InputLayer)        (None, 500, 4)            0         
_________________________________________________________________
conv1d_44 (Conv1D)           (None, 475, 320)          33600     
_________________________________________________________________
max_pooling1d_44 (MaxPooling (None, 36, 320)           0         
_________________________________________________________________
dropout_109 (Dropout)        (None, 36, 320)           0         
_________________________________________________________________
bidirectional_34 (Bidirectio (None, None, 640)         1640960   
_________________________________________________________________
dropout_110 (Dropout)        (None, None, 640)         0         
_________________________________________________________________
dense_83 (Dense)             (None, None, 128)         82048     
_________________________________________________________________
dropout_111 (Dropout)        (None, None, 128)         0         
_________________________________________________________________
dense_84 (Dense)             (None, None, 1)           129       
_________________________________________________________________
activation_12 (Activation)   (None, None, 1)           0         
=================================================================
Total params: 1,756,737
Trainable params: 1,756,737
Non-trainable params: 0
_________________________________________________________________

This appears to want an extra dimension. In the stackoverflow discussion it appears that return_sequences = True classifies each step.

Classifying.

If you're going to classify the entire sequence as a whole, your last LSTM must use return_sequences=False. (Or you may use some flatten + dense instead after)

If you're going to classify each step of the sequence, all your LSTMs should have return_sequences=True. You should not flatten the data after them.

Am I missing a key step of the logic? I thought each sequence was being classified; though, It seems that each MaxPooled filter is being classified when return_sequences=True but the shapes do not match. When you get a chance, can you print your model.summary()? Apologies for all of the questions.

I'm not quite understanding your issue/question here. Are you getting the above error when you tried the following code:

model = Sequential()

conv_layer = Conv1D(filters=320,
                    kernel_size=26,
                    strides=1,
                    padding='valid',
                    activation='relu',
                    input_shape=(1000,4))

model.add(conv_layer)
model.add(MaxPooling1D(pool_size=13,
                       strides=13))

model.add(LSTM(320, return_sequences=True))

model.add(Flatten())
model.add(Dense(925,
                activation='relu'))
model.add(Dense(919,
                activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

When I run model.summary() I get the follow:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv1d_1 (Conv1D)            (None, 975, 320)          33600     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 75, 320)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 75, 320)           820480    
_________________________________________________________________
flatten_1 (Flatten)          (None, 24000)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 925)               22200925  
_________________________________________________________________
dense_2 (Dense)              (None, 919)               850994    
=================================================================
Total params: 23,905,999
Trainable params: 23,905,999
Non-trainable params: 0
_________________________________________________________________

The code isn't returning any errors for me. What versions of keras and tensorflow are you using?

That's really interesting and it must be a version issue. I'm using the keras in tensorflow

import tensorflow as tf
tf.keras.__version__
# '2.0.8-tf'
tf.__version__
# '1.4.0'

The summary output above is really helpful and I can probably tweak some parameters to mimic that. When I copy and paste the code exactly as you have above, I get the error below so it must be a version issue.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-389-956501e5fb90> in <module>()
     16 model.add(Flatten())
     17 model.add(Dense(925,
---> 18                 activation='relu'))
     19 model.add(Dense(919,
     20                 activation='sigmoid'))

~/anaconda/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/models.py in add(self, layer)
    499           output_tensors=self.outputs)
    500     else:
--> 501       output_tensor = layer(self.outputs[0])
    502       if isinstance(output_tensor, list):
    503         raise TypeError('All layers in a Sequential model '

~/anaconda/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/engine/topology.py in __call__(self, inputs, **kwargs)
    250     """
    251     # Actually call the layer (optionally building it).
--> 252     output = super(Layer, self).__call__(inputs, **kwargs)
    253 
    254     # Update learning phase info.

~/anaconda/lib/python3.6/site-packages/tensorflow/python/layers/base.py in __call__(self, inputs, *args, **kwargs)
    557           input_shapes = [x.get_shape() for x in input_list]
    558           if len(input_shapes) == 1:
--> 559             self.build(input_shapes[0])
    560           else:
    561             self.build(input_shapes)

~/anaconda/lib/python3.6/site-packages/tensorflow/python/layers/core.py in build(self, input_shape)
    125     input_shape = tensor_shape.TensorShape(input_shape)
    126     if input_shape[-1].value is None:
--> 127       raise ValueError('The last dimension of the inputs to `Dense` '
    128                        'should be defined. Found `None`.')
    129     self.input_spec = base.InputSpec(min_ndim=2,

ValueError: The last dimension of the inputs to `Dense` should be defined. Found `None`.

If I set return_sequences = False, I get:

Layer (type)                 Output Shape              Param #   
=================================================================
conv1d_47 (Conv1D)           (None, 975, 320)          33600     
_________________________________________________________________
max_pooling1d_47 (MaxPooling (None, 75, 320)           0         
_________________________________________________________________
lstm_31 (LSTM)               (None, 320)               820480    
_________________________________________________________________
flatten_42 (Flatten)         (None, 320)               0         
_________________________________________________________________
dense_88 (Dense)             (None, 925)               296925    
_________________________________________________________________
dense_89 (Dense)             (None, 919)               850994    
=================================================================
Total params: 2,001,999
Trainable params: 2,001,999
Non-trainable params: 0
____________________________

If I keep return_sequences = True and remove Flatten() after the LSTM I get:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv1d_49 (Conv1D)           (None, 975, 320)          33600     
_________________________________________________________________
max_pooling1d_49 (MaxPooling (None, 75, 320)           0         
_________________________________________________________________
lstm_33 (LSTM)               (None, None, 320)         820480    
_________________________________________________________________
dense_92 (Dense)             (None, None, 925)         296925    
_________________________________________________________________
dense_93 (Dense)             (None, None, 919)         850994    
=================================================================
Total params: 2,001,999
Trainable params: 2,001,999
Non-trainable params: 0
_________________________________________________________________

It looks like there is a fundamental change in the LSTM layer between our versions. What version are you using?

I am using 2.1.12

Yea that was it. I'll just use keras instead of the tensorflow.keras.

Thanks a lot! Another question is that my all prediction probability value is quiet lower than 0.3 . So how to compute the accuracy and is it by defining the threshold?
Thanks again for your help!

Can you give more details on your question? It's difficult to understand what you're asking. Also, if you could put up code, summaries, and anything else it would be easier to understand what you're asking.

@jolespin After running DanQ_test.py, we can get a n*919 array which shows the probability respectively of the 919 classes, for example, [0.01 0.002 0.003 0.03 ... 0.03 0.02 0.004]. My question is how I can determine which probability can be equal to the lable 1 and which equal to lable 0.Then I can compute the accuracy.

We usually use the default threshold of 0.5

Dear @daquang,

I've created a toy example to get known this type of deep learning.

There is a positive set of sequence with A k-mers and a negative without them.

neg set:
1 ATCGATCGATCGATCGATCGATACGATCCACCACACCACCACACACAC
2 ATCGACTCGATCCACAACCACACACCACCTACACCACCACCCGACCGT
3 CCCCTACGTCACGATCAGCTGACTACGATCAGCATCGCATCGTACGGG
4 ACAACCACACATCGATCGCTACGTAGCATCGATGCTCGTACAGTAGCT
....

pos set:

1 CTAGTACAGTCGTCTATCGATCATAAAAAAAAAAAACATCGTAGCTAG
2 TCTTATCATCGATCACTATAAAAAAAAAAAATCGACACACACAGCACA
3 ACTGATCGACTCACCACACAAAAAAATGATCAGACACTTTTTTTTTCA
4 TATCGACCCCCCCCGTATCACTAAAAACTCGTACTGATCACCACCCCC
...

I've attached the test data and my script which were based on your tutorial above.

Can such a model recognize this pattern? Do you have any idea what mistake I've made?

Thanks for your help!

toy.zip

@daquang Thank you very much for providing the code description and patient explanation.But I still have some questions:

  1. What‘s the physical meaning of nsites when max_acts[n, m] > 0 in the code?What if max_acts[n, m] < 0?
  2. The 1000bp input can get the output of 975 through the 26bp convolution operation. Each output represents the characteristics of a 26bp sequence. Why can we use the position and data of the maximum to represent the PWM, but the data of the convolution kernel is more like PWM?Is there any mathematical theory support for this approach? What information can I learn to understand that the output of the convolution kernel can be obtained by the maximum value of the PWM?
  1. nsites is the number of sequences that contain a convolution activation >0. If the kernel is not activated in any sequences, it would have 0 nsites.

  2. This is simply a heuristic. It is generally thought that a transcription factor binds to a single site within a stretch of DNA. Thus, it makes sense to only include one subsequence for the PWM calculation.