Discover regulatory DNA elements using chromatin signatures and artificial neural network.

Question

gwaybio opened this issue 8 years ago · 1 comments

https://dx.doi.org/10.1093/bioinformatics/btq248

Answer 1 · 2016-08-15T21:20:01.000Z

Clearly written article predicting the location of enhancers using chromatin signatures. The method (CSI-ANN) does not have great performance predicting enhancers in HeLa cells or CD4+ T cells, but it significantly outperformed the state of the art in 2010 (see Table 2). It is possible (maybe even likely) that noise in the gold standard for enhancer locations is diluting the measured performance. I have not seen several of the computational steps used in this context before, but I think they are clever manipulations of the data that actually seem to make sense.

Biology

The authors use six chromatin marks from ENCODE to predict enhancers in HeLa cells and 39 histone marks to predict enhancers in CD4+ T cells.

Computation

The authors use the chromatin marks to engineer a single feature that is input into a time-delay neural network (TDNN).
The single feature is built using Fisher discriminant analysis (FDA) applied to the histone marks:
- Inputs are the mean and "energy transformed" marks
- FDA finds the linear combination of features that maximally separates background from enhancers
- This feature is computed genome-wide using a sliding window of 2.5 kb (with a 1.25 kb step size)
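
To make the feature engineering step concrete, here is a minimal NumPy sketch of a two-class Fisher discriminant over windowed mean and energy features. It is not the authors' code: the 250 bp bin size, the definition of "energy" as the mean of squared signal, the synthetic gamma-distributed data, and the names `window_features` and `fisher_direction` are all assumptions made for illustration.

```python
import numpy as np

def window_features(signal, window, step):
    """Mean and energy per sliding window.

    signal: array of shape (n_bins, n_marks).
    Returns an array of shape (n_windows, 2 * n_marks).
    """
    starts = range(0, signal.shape[0] - window + 1, step)
    feats = []
    for s in starts:
        chunk = signal[s:s + window]
        mean = chunk.mean(axis=0)
        energy = (chunk ** 2).mean(axis=0)  # assumed definition of "energy"
        feats.append(np.concatenate([mean, energy]))
    return np.asarray(feats)

def fisher_direction(X_enh, X_bg, ridge=1e-6):
    """Two-class Fisher discriminant: w is proportional to Sw^-1 (mu_enh - mu_bg)."""
    mu_e, mu_b = X_enh.mean(axis=0), X_bg.mean(axis=0)
    centered_e, centered_b = X_enh - mu_e, X_bg - mu_b
    # Within-class scatter matrix, summed over both classes
    Sw = centered_e.T @ centered_e + centered_b.T @ centered_b
    Sw += ridge * np.eye(Sw.shape[0])  # small ridge keeps the solve stable
    w = np.linalg.solve(Sw, mu_e - mu_b)
    return w / np.linalg.norm(w)

# Synthetic demo (hypothetical data, three marks)
rng = np.random.default_rng(0)
bin_bp = 250                                   # assumed bin size
window, step = 2500 // bin_bp, 1250 // bin_bp  # 2.5 kb window, 1.25 kb step
n_marks = 3
background = rng.gamma(2.0, 1.0, size=(400, n_marks))
enhancer = rng.gamma(2.0, 1.0, size=(400, n_marks)) + np.array([2.0, 1.0, 0.0])

X_bg = window_features(background, window, step)
X_enh = window_features(enhancer, window, step)
w = fisher_direction(X_enh, X_bg)

# Projecting onto w collapses all marks into the single engineered feature
score_bg, score_enh = X_bg @ w, X_enh @ w
```

The projection `X @ w` is the single value per window that would then be passed to the network.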
TDNN
- One input layer, one hidden layer, one output layer
  - A supervised algorithm with an architecture similar to the one seen in #22
- The way I see it, a TDNN has operations similar to convolutions
  - The "delay" can capture local dependencies and changes among peaks of the engineered variable
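
The convolution analogy above can be sketched as follows. This is a hypothetical forward pass, assuming a single input feature, a `tanh` hidden layer, and a sigmoid output; the layer sizes, activations, and the name `tdnn_forward` are my assumptions, not details taken from the paper.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def tdnn_forward(x, W_hidden, b_hidden, w_out, b_out):
    """Forward pass of a one-hidden-layer TDNN over a 1D feature track.

    x: engineered feature along the genome, shape (T,).
    W_hidden: shape (n_hidden, n_delays), shared across all positions.
    Returns one score in (0, 1) per position where a full delay window fits.
    """
    n_delays = W_hidden.shape[1]
    # Each row holds n_delays consecutive values: the "delay line"
    windows = sliding_window_view(x, n_delays)  # (T - n_delays + 1, n_delays)
    hidden = np.tanh(windows @ W_hidden.T + b_hidden)
    logits = hidden @ w_out + b_out
    return 1.0 / (1.0 + np.exp(-logits))

# Random weights and input, for illustration only
rng = np.random.default_rng(0)
x = rng.normal(size=50)
n_hidden, n_delays = 4, 5
W_hidden = rng.normal(size=(n_hidden, n_delays))
b_hidden = np.zeros(n_hidden)
w_out = rng.normal(size=n_hidden)
scores = tdnn_forward(x, W_hidden, b_hidden, w_out, 0.0)

# The pre-activation of one hidden unit is a "valid" 1D cross-correlation
pre_act = sliding_window_view(x, n_delays) @ W_hidden[0]
as_conv = np.correlate(x, W_hidden[0], mode="valid")
```

Because the same weights are applied at every position, each hidden unit behaves like a 1D convolution filter whose width is the number of delays.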
Trained with particle swarm optimization
Training and testing on two different cell types with reasonable performance
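
For readers unfamiliar with particle swarm optimization, below is a minimal sketch of a standard global-best PSO minimizing a generic loss. It is not necessarily the variant used in the paper: the inertia and acceleration constants, the swarm size, and the name `pso_minimize` are assumptions, and the sphere function stands in for the network's training loss over its flattened weights.

```python
import numpy as np

def pso_minimize(loss, dim, n_particles=20, n_iter=100,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best PSO; returns (best_position, best_value)."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-1.0, 1.0, size=(n_particles, dim))
    vel = np.zeros_like(pos)
    # Personal bests start at the initial positions
    pbest_pos = pos.copy()
    pbest_val = np.array([loss(p) for p in pos])
    g_idx = pbest_val.argmin()
    gbest_pos, gbest_val = pbest_pos[g_idx].copy(), pbest_val[g_idx]

    for _ in range(n_iter):
        r1 = rng.random(pos.shape)
        r2 = rng.random(pos.shape)
        # Pull each particle toward its own best and the swarm's best
        vel = (inertia * vel
               + c1 * r1 * (pbest_pos - pos)
               + c2 * r2 * (gbest_pos - pos))
        pos = pos + vel
        vals = np.array([loss(p) for p in pos])
        improved = vals < pbest_val
        pbest_pos[improved] = pos[improved]
        pbest_val[improved] = vals[improved]
        g_idx = pbest_val.argmin()
        if pbest_val[g_idx] < gbest_val:
            gbest_pos, gbest_val = pbest_pos[g_idx].copy(), pbest_val[g_idx]
    return gbest_pos, gbest_val

# Stand-in objective: in a TDNN setting, `loss` would map a flattened
# weight vector to the training error
def sphere(p):
    return float(np.sum(p ** 2))

best, best_val = pso_minimize(sphere, dim=3)
```

PSO needs only loss evaluations, not gradients, so it can be applied to the network weights without backpropagation.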

General comments

Good discussion points about their feature engineering decisions; namely, a non-linear feature extractor may work better (an autoencoder, maybe?). I also think the lack of gold standards here harms the reported performance, something that could be a major problem for supervised learning problems and, although less so, for unsupervised tasks.