Computational Phenotype Discovery Using Unsupervised Feature Learning over Noisy, Sparse, and Irregular Clinical Data
agitter opened this issue · 4 comments
http://doi.org/10.1371/journal.pone.0066341
Could be a nice way to complement the papers we have on autoencoders for gene expression (e.g., #6), showing how autoencoders can be used with clinical data.
Inferring precise phenotypic patterns from population-scale clinical data is a core computational task in the development of precision, personalized medicine. The traditional approach uses supervised learning, in which an expert designates which patterns to look for (by specifying the learning task and the class labels), and where to look for them (by specifying the input variables). While appropriate for individual tasks, this approach scales poorly and misses the patterns that we don’t think to look for. Unsupervised feature learning overcomes these limitations by identifying patterns (or features) that collectively form a compact and expressive representation of the source data, with no need for expert input or labeled examples. Its rising popularity is driven by new deep learning methods, which have produced high-profile successes on difficult standardized problems of object recognition in images. Here we introduce its use for phenotype discovery in clinical data. This use is challenging because the largest source of clinical data – Electronic Medical Records – typically contains noisy, sparse, and irregularly timed observations, rendering them poor substrates for deep learning methods. Our approach couples dirty clinical data to deep learning architecture via longitudinal probability densities inferred using Gaussian process regression. From episodic, longitudinal sequences of serum uric acid measurements in 4368 individuals we produced continuous phenotypic features that suggest multiple population subtypes, and that accurately distinguished (0.97 AUC) the uric-acid signatures of gout vs. acute leukemia despite not being optimized for the task. The unsupervised features were as accurate as gold-standard features engineered by an expert with complete knowledge of the domain, the classification task, and the class labels. Our findings demonstrate the potential for achieving computational phenotype discovery at population scale. 
We expect such data-driven phenotypes to expose unknown disease variants and subtypes and to provide rich targets for genetic association studies.
Focused on learning patterns in longitudinal data using autoencoders. Should be discussed alongside #25 and #63. It's pretty clear we need to cover learning from the EHR, since this seems to be a burgeoning area in our domain of interest.
Because the classification problem is relatively easy, the evaluations are somewhat limited (Table 3), but the autoencoder features perform about as well as the expert-engineered features.
Really nice early paper. It focuses on longitudinal measurements of a single continuous feature and uses sparse autoencoders (no noise or dropout) with a squared-error loss.
The key novel contributions of the paper seem to be (1) the preprocessing step that uses Gaussian process regression to make irregularly timed longitudinal measurements usable and (2) the idea of applying t-SNE to the learned feature representations for subtype discovery.
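To make the GP preprocessing step concrete, here is a minimal sketch (not the authors' code): it assumes scikit-learn's `GaussianProcessRegressor`, a made-up patient series, and guessed kernel hyperparameters, but it shows the idea of converting irregular observations into a dense, regularly sampled series for downstream feature learning.

```python
# Hypothetical sketch: Gaussian process regression turns irregularly timed,
# noisy lab measurements into a dense, regularly sampled series. Simulated
# data; kernel choices and length scale are assumptions, not the paper's.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Irregular observation times (days) and noisy values for one patient.
t_obs = np.sort(rng.uniform(0, 365, size=25))
y_obs = 5.0 + 0.5 * np.sin(t_obs / 60.0) + rng.normal(0, 0.1, size=t_obs.size)

# RBF kernel captures smooth longitudinal trends; WhiteKernel absorbs
# measurement noise so the posterior mean is not forced through every point.
kernel = 1.0 * RBF(length_scale=30.0) + WhiteKernel(noise_level=0.01)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(t_obs[:, None], y_obs)

# Evaluate the posterior on a regular daily grid; fixed-length windows of
# this grid can then be fed to an autoencoder.
t_grid = np.arange(0.0, 365.0)
y_mean, y_std = gp.predict(t_grid[:, None], return_std=True)
print(y_mean.shape)  # (365,)
```

A nice side effect of the probabilistic framing is that `y_std` flags stretches with no observations as uncertain rather than silently interpolating through them.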
I scanned this paper too, and I also thought it was a nice, well-explained paper. To add to @brettbj's review:
- They trained on 30-day patches of input features (longitudinal uric acid measurements for ~4,300 individuals)
- They also contribute visualizations of the learned representations
- The network has two hidden layers
- They visualize (rather nicely) the learned features of both layers
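For intuition about the model family discussed above, here is an entirely hypothetical, plain-NumPy sketch of a sparse autoencoder with squared-error loss, trained on fixed-length patches of a longitudinal signal. It has one hidden layer (the paper stacks two), and the toy data, layer size, and hyperparameters are all assumptions.

```python
# Hypothetical sketch of a sparse autoencoder: squared-error reconstruction
# loss plus a KL-style sparsity penalty on mean hidden activations, trained
# by full-batch gradient descent on simulated 30-day patches.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy data: 200 thirty-day patches of a smooth signal plus noise,
# standing in for regularly resampled uric acid windows.
t = np.arange(30)
phases = rng.uniform(0, 2 * np.pi, size=(200, 1))
X = np.sin(t / 5.0 + phases) + 0.1 * rng.normal(size=(200, 30))

n_in, n_hid = X.shape[1], 10
W1 = rng.normal(scale=0.1, size=(n_in, n_hid))
W2 = rng.normal(scale=0.1, size=(n_hid, n_in))
b1, b2 = np.zeros(n_hid), np.zeros(n_in)

rho, beta, lr = 0.05, 0.01, 0.05  # target sparsity, penalty weight, step

def recon_loss(X):
    return np.mean((sigmoid(X @ W1 + b1) @ W2 + b2 - X) ** 2)

loss_before = recon_loss(X)
for _ in range(500):
    H = sigmoid(X @ W1 + b1)              # hidden activations
    R = H @ W2 + b2                       # linear reconstruction
    err = R - X                           # squared-error gradient term
    rho_hat = np.clip(H.mean(axis=0), 1e-6, 1 - 1e-6)
    # Gradient of the KL(rho || rho_hat) sparsity penalty.
    sparse_grad = beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat))
    dH = err @ W2.T + sparse_grad         # backprop into hidden layer
    dZ1 = dH * H * (1 - H)                # sigmoid derivative
    W2 -= lr * (H.T @ err) / len(X)
    b2 -= lr * err.mean(axis=0)
    W1 -= lr * (X.T @ dZ1) / len(X)
    b1 -= lr * dZ1.mean(axis=0)

loss_after = recon_loss(X)
print(f"reconstruction loss: {loss_before:.3f} -> {loss_after:.3f}")
```

After training, inspecting the columns of `W1` (each unit's input weights over the 30-day window) is the single-layer analogue of the feature visualizations mentioned above.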