Convolutional neural network architectures for predicting DNA–protein binding
cgreene opened this issue · 10 comments
This is a benchmarking paper focusing on convolutional network architectures for predicting transcription factor binding from sequence. They use the DeepBind (Alipanahi et al. 2015) architecture as a baseline, and proceed to vary some aspects of this architecture: the number of layers, the number of convolution kernels/filters, and the type of pooling.
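For concreteness, here's a minimal sketch of that kind of one-layer baseline - in PyTorch rather than their Caffe, and with illustrative sizes, so a sketch of the idea rather than their actual model:

```python
import torch
import torch.nn as nn

class ConvNetBaseline(nn.Module):
    def __init__(self, num_kernels=16, kernel_width=24, dropout=0.5):
        super().__init__()
        # 4 input channels = one-hot A/C/G/T; convolve along the sequence
        self.conv = nn.Conv1d(4, num_kernels, kernel_size=kernel_width)
        self.pool = nn.AdaptiveMaxPool1d(1)  # global max pooling over positions
        self.drop = nn.Dropout(dropout)
        self.fc = nn.Linear(num_kernels, 1)  # binary bound / not-bound output

    def forward(self, x):                    # x: (batch, 4, seq_len)
        h = torch.relu(self.conv(x))         # (batch, num_kernels, out_len)
        h = self.pool(h).squeeze(-1)         # (batch, num_kernels)
        return self.fc(self.drop(h))         # logits

logits = ConvNetBaseline()(torch.zeros(8, 4, 101))  # dummy batch -> (8, 1)
```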
One nice thing about this paper is that they provide a Docker image that they say is runnable on any GPU machine (although I did not try this). Everything is implemented in Caffe, which I have had some trouble installing on my Macs, so a Docker image is probably a good idea.
They actually look at two different classification tasks (the quoted parts are verbatim from the paper):
- the motif discovery task, which "classifies sequences that are bound by a transcription factor from negative sequences that are dinucleotide shuffles of the positively bound sequences" (a shuffling sketch follows this list), and
- the motif occupancy task, which "discriminates genomic motif instances that are bound by a transcription factor (positive set) from motif instances that are not bound by the same transcription factor (negative set) in the same cell type"
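For anyone unfamiliar with dinucleotide shuffling: the standard Altschul-Erickson approach preserves a sequence's exact overlapping dinucleotide counts by walking a random Eulerian path through its nucleotide transition graph. Here's a small rejection-sampling sketch of that idea (my illustration, not code from the paper):

```python
import random
from collections import defaultdict

def dinucleotide_shuffle(seq, rng=random):
    """Shuffle seq while preserving its exact overlapping dinucleotide counts.

    Rejection-sampling sketch of the Altschul-Erickson idea: randomly order
    each nucleotide's outgoing edges in the transition graph and retry until
    the walk is Eulerian (uses every edge). Fine for motif-length sequences;
    not tuned for speed.
    """
    edges = defaultdict(list)
    for a, b in zip(seq, seq[1:]):
        edges[a].append(b)            # one edge per overlapping dinucleotide
    while True:
        trial = {a: rng.sample(bs, len(bs)) for a, bs in edges.items()}
        out, cur = [seq[0]], seq[0]
        try:
            for _ in range(len(seq) - 1):
                cur = trial[cur].pop()
                out.append(cur)
        except (KeyError, IndexError):
            continue                  # walk got stuck early; resample
        if not any(trial.values()):   # every edge consumed -> valid shuffle
            return "".join(out)

print(dinucleotide_shuffle("ACGTACGTAACCGGTT"))
```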
Apart from the architectural parameters mentioned above, they seem to do parameter searches separately for each motif to try to optimize dropout rate, momentum, and the "delta" parameter in the AdaDelta optimizer. There are up to 690 classification tasks.
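Roughly, such a per-motif search could look like the sketch below. The grids are made up, and mapping Caffe's AdaDelta "momentum"/"delta" onto PyTorch Adadelta's rho/eps is my assumption:

```python
import random
import torch.optim as optim

rng = random.Random(0)

def sample_hparams():
    return {"dropout": rng.choice([0.1, 0.25, 0.5]),
            "rho": rng.choice([0.9, 0.95, 0.99]),      # ~ momentum
            "eps": rng.choice([1e-8, 1e-6, 1e-4])}     # ~ "delta"

# Per motif/task: draw settings, train, keep the best on a validation split.
for trial in range(3):
    hp = sample_hparams()
    model = ConvNetBaseline(dropout=hp["dropout"])      # sketch from above
    opt = optim.Adadelta(model.parameters(), rho=hp["rho"], eps=hp["eps"])
    # ...train `model` with `opt`, record validation performance...
```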
Some questions I had after reading this paper:
- Does it always have to be a binary classification task for each of the 690 motifs? Wouldn't it be quite natural to do multi-class classification? Too difficult / too little training data?
- This paper, like others, represents sequences as "one-hot" matrices that are fed to convolutional layers. For images, the (2D) convolutional layers receive inputs of the form [image height, image width, # color channels]. For sequence classification, the standard approach appears to be to also use 2D convolutions where the height is set to 1 and the nucleotides are the "color channels". Is it self-evident that this is better than using a single color channel and treating the one-hot input matrix as a 2D "image"? (Both conventions are sketched after this list.)
- The sampling of the parameter space is still quite sparse here, for understandable reasons (it takes time to run these models!).
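To make the second question concrete, here's how the two input conventions look in code (a small NumPy illustration; the variable names are mine):

```python
import numpy as np

NUC = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    x = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq):
        x[i, NUC[base]] = 1.0
    return x

x = one_hot("ACGTACGT")                      # (seq_len, 4)

# Convention 1 (standard): nucleotides as channels, convolution over
# positions only. Input shape (batch, channels=4, length) for a Conv1d.
as_channels = x.T[np.newaxis, ...]           # (1, 4, 8)

# Convention 2 (the question above): a single-channel 4 x L "image",
# convolved with 2D filters. Input shape (batch, 1, height=4, width=L).
as_image = x.T[np.newaxis, np.newaxis, ...]  # (1, 1, 4, 8)
```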
The biggest, and in my view fatal, flaw in this and many other related papers is that they artificially balance their positives and negatives and report auROCs. auROCs are totally misleading in the context of heavily unbalanced classification tasks and are the wrong performance measure to optimize when the negatives far outnumber the positives: a model with a superior auROC can look amazing and still be terribly inferior in precision and recall, which makes it highly suboptimal in realistic unbalanced settings. As a community, we need to stop reporting performance on artificially balanced datasets. Everyone knows this, yet everyone continues to do it; I'm not sure why. This of course has nothing to do with deep learning specifically. The same flaw is present in countless other papers performing prediction tasks across the genome (almost all of which are heavily unbalanced).
Absolutely agree about the auROCs. Should at least be complemented with precision/recall curves. Although you say everyone knows this, I have come across people who believe that ROC curves somehow account for class imbalance.
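Here's a quick synthetic demo of the point (made-up scores, not the paper's data): with the same score distributions, auROC barely moves when you add negatives, while the area under the precision/recall curve collapses:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

def simulate(n_pos, n_neg):
    # Same score distributions in both settings; only the class ratio changes.
    scores = np.concatenate([rng.normal(1.0, 1.0, n_pos),   # positives
                             rng.normal(0.0, 1.0, n_neg)])  # negatives
    labels = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
    return labels, scores

for name, (n_pos, n_neg) in {"balanced": (1000, 1000),
                             "1:100 unbalanced": (1000, 100000)}.items():
    y, s = simulate(n_pos, n_neg)
    print(f"{name:>18}: auROC={roc_auc_score(y, s):.3f} "
          f"auPRC={average_precision_score(y, s):.3f}")
# auROC is essentially unchanged by the imbalance; auPRC drops sharply.
```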
As you have worked on this type of problem (I think), do you have any comments on my second question above, related to "color channels" and 1D vs 2D convolutions? Also, is there any potential in using some sort of recurrent neural networks (LSTMs and the like) for learning motifs and regulatory regions? Cheers!
It doesn't make much sense to have 2D convolutions on a 1-hot encoding IMHO, since the channels (nucleotides) are mutually exclusive and your filters would want to know the identity of every nucleotide at every position. Also, yes, dumping an LSTM/RNN on top of a conv net, or a pure LSTM/RNN, can help. It doesn't seem to help a lot for most TF binding tasks, but it seems to help when learning more complex grammars. See, for example, this paper that modifies the DeepSEA model by dumping an RNN on top of it: http://nar.oxfordjournals.org/content/44/11/e107.
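For reference, a DanQ-flavored sketch of the conv-plus-recurrent idea (layer sizes are illustrative, not the published model's):

```python
import torch
import torch.nn as nn

class ConvRNN(nn.Module):
    """Conv layer to detect motifs, then a bidirectional LSTM over the
    pooled feature maps to capture longer-range "grammar"."""
    def __init__(self, num_kernels=32, kernel_width=19, hidden=32):
        super().__init__()
        self.conv = nn.Conv1d(4, num_kernels, kernel_size=kernel_width)
        self.pool = nn.MaxPool1d(4)
        self.rnn = nn.LSTM(num_kernels, hidden, batch_first=True,
                           bidirectional=True)
        self.fc = nn.Linear(2 * hidden, 1)

    def forward(self, x):                        # x: (batch, 4, seq_len)
        h = self.pool(torch.relu(self.conv(x)))  # (batch, kernels, steps)
        h, _ = self.rnn(h.transpose(1, 2))       # LSTM wants (batch, steps, features)
        return self.fc(h[:, -1, :])              # logits from final timestep
```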
Thanks for the comments! I had seen DanQ but forgotten that they used an RNN in it.
@akundaje: Also possible that reviewers ask for auROCs. I had a paper where reviewers would not let up on them, despite our argument that what mattered was precision over the first 150 predictions, which was what we could afford to experimentally validate. We ended up including it somewhere, I think potentially in the supplement.
Since then, I've come around a little bit on them. They're not a great measure in many cases, but as a quick and dirty diagnostic - particularly if one is comparing to a previous algorithm for which they are all that is available - they can still be helpful. It would have been nice to see precision-based metrics included for a systematic comparison, but I don't want someone to read this and say: "oh man - I should never use auROCs!"
One solution might be to cite something like http://doi.org/10.1371/journal.pone.0118432 that discusses the perils of auROC on imbalanced data but not get into it deeply in this review. http://doi.org/10.1145/1143844.1143874 also has a simple example (Figure 7) with imbalanced data.
> I don't want someone to read this and say: "oh man - I should never use auROCs!"
Not never, but there is a class of problems for which auROC should be discouraged. This point is not made often enough.
Tweets on this paper at ISMB this year, mainly by me.
https://twitter.com/search?f=tweets&q=HZ%20%23ISMB16%20since%3A2016-07-11%20until%3A2016-07-12
A serious flaw in this analysis is that they do not examine how their method performs on different classes of TFs.
I agree. I didn't mean they should never be used at all. But let me just put it this way: a ton of high-profile papers with auROCs > 0.8 get essentially close to 0 recall at 25 and 50% FDR. Those models are essentially useless in real-world scenarios.
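To make "recall at 25/50% FDR" concrete, here's a small helper (my illustration) that reads it off a precision/recall curve:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def recall_at_fdr(y_true, y_score, fdr=0.25):
    """Max recall achievable at false discovery rate <= fdr
    (FDR = 1 - precision)."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    ok = precision >= 1.0 - fdr
    return recall[ok].max() if ok.any() else 0.0

# e.g. on the unbalanced simulation above, a model with a high auROC
# can still have near-zero recall at 25% FDR.
```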