A machine learning algorithm for identifying atopic dermatitis in adults from electronic health records

This work aims to identify patients for inclusion in genome-wide association studies (GWAS). The scripts in this repository implement a machine learning-based phenotype algorithm for identifying patients with atopic dermatitis (AD). The main algorithms, which are still under development, are in models.R. That script provides the infrastructure for experimenting with lasso logistic regression, adaptive lasso logistic regression, random forests, and support vector machines.

The features used as input to the algorithm were derived from the electronic health record (EHR) stored in the Northwestern Medicine Enterprise Data Warehouse. We include information codified in the EHR, such as diagnosis codes, prescribed medications, laboratory results, and demographics. Codified information is easily extracted from the EHR and requires little preprocessing before it can be used as a feature in machine learning algorithms.

Other features were derived by applying natural language processing (NLP) to roughly 15,000 physician encounter notes. We curated a dictionary of concepts specific to this phenotype and applied a dictionary look-up algorithm to identify these concepts in the free text of the notes. Each concept mention was tagged with several assertion statuses: negation, subject (patient vs. family), and history (patient/family history vs. current). The output of the NLP pipeline is a large dataset of counts: the number of times each concept appeared in each encounter note, for 562 patients. Several preprocessing steps are needed before these NLP-derived features can be used in machine learning algorithms:

helper_functions/preprocessing.R aggregates the count data at the patient level. After preprocessing, our dataset consists of one row for each patient with the total number of occurrences of each medical concept across notes.

helper_functions/dimensionality_reduction.R groups together concepts that belong to broader clinical categories. For example, to capture the symptom of itching, our dictionary contains concepts such as 'pruritus' (itching), as well as phenotypically relevant conditions that lead to itching (e.g., 'prurigo') or that result from itching (e.g., 'lichenification'). To reduce the dimensionality of the dataset and provide a stronger signal for the broad category of itching, these similar concepts are grouped together and their counts summed. Similarly, medications recommended for treating AD (both those mentioned in the clinical notes and those codified in the EHR) are grouped into broad categories (e.g., antihistamines, topical corticosteroids).
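
The two steps above can be sketched in base R as follows. This is only an illustration, not the code in helper_functions/: the data frame layout, the column names (patient_id, note_id, concept, count), the toy values, and the concept_groups mapping are all assumptions made for the example.

```r
# Hypothetical note-level NLP output: one row per (patient, note, concept).
# The column names and values are illustrative, not the repository's schema.
note_counts <- data.frame(
  patient_id = c("p1", "p1", "p1", "p2", "p2"),
  note_id    = c("n1", "n1", "n2", "n3", "n3"),
  concept    = c("pruritus", "prurigo", "pruritus", "lichenification", "eczema"),
  count      = c(2, 1, 3, 1, 4),
  stringsAsFactors = FALSE
)

# Step 1 (patient-level aggregation): sum counts across notes so that each
# patient has a single total per concept.
patient_long <- aggregate(count ~ patient_id + concept,
                          data = note_counts, FUN = sum)

# Step 2 (concept grouping): map related concepts to a broader category.
# This mapping is a made-up example; concepts without an entry keep their name.
concept_groups <- c(pruritus = "itching",
                    prurigo = "itching",
                    lichenification = "itching")
mapped <- unname(concept_groups[patient_long$concept])
patient_long$feature <- ifelse(is.na(mapped), patient_long$concept, mapped)

# Pivot to one row per patient and one column per grouped feature;
# xtabs() sums the counts that fall into the same (patient, feature) cell.
patient_features <- as.data.frame.matrix(
  xtabs(count ~ patient_id + feature, data = patient_long)
)

print(patient_features)
```

In this toy example the three itching-related concepts collapse into a single itching column, so the feature matrix has two columns instead of four.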

Other helper functions prepare features from codified sources for inclusion in the algorithms. The representation of these features can be customized for experimentation (e.g., log-transforming some vs. all features).
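
As a rough sketch of what such a customizable representation might look like, the function below log-transforms either a chosen subset of columns or all of them. The function name log_transform_features, the example column names, and the use of log1p (log(1 + x), chosen here so that zero counts stay finite) are assumptions for illustration and may differ from the repository's helpers.

```r
# Hypothetical helper: log-transform selected feature columns.
# `columns` defaults to every column, i.e. "transform all features".
log_transform_features <- function(features, columns = names(features)) {
  stopifnot(all(columns %in% names(features)))
  out <- features
  # log1p(x) = log(1 + x); assumed here so that zero counts map to 0.
  out[columns] <- lapply(out[columns], log1p)
  out
}

# Toy codified features (made-up names and values).
codified <- data.frame(
  n_ad_dx_codes = c(0, 3, 12),
  n_steroid_rx  = c(1, 0, 7),
  age           = c(34, 51, 27)
)

# Transform only the count-like columns, leaving age untouched ...
some_logged <- log_transform_features(
  codified, columns = c("n_ad_dx_codes", "n_steroid_rx")
)

# ... or transform every column.
all_logged <- log_transform_features(codified)
```

Keeping the column selection as an argument makes it easy to compare model performance under different feature representations without editing the modeling code.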