CONTENTS

1. Introduction
2. Background
3. Prerequisites
4. Documentation
5. Contact


1. INTRODUCTION

This tool builds a text classifier for retrieving relevant documents from dbGaP. Before running it, check the Prerequisites section below and make sure all the listed tools are installed.

To run it, check out the entire codebase and navigate to the root directory (which contains this file). Then type:

    $> cd code
    $> python classify.py

You should see classification results for 6 different queries, related to heart ailments, atherosclerosis, blood ailments, lung problems, diabetes and other general problems, respectively. These 6 categories are not hard-coded; they are determined by the data file and its contents. A different data file with different tags (or labels) would yield different results.


2. BACKGROUND

This tool was used to produce results that were published in the proceedings of the American Thoracic Society. The background and objectives are detailed below.

dbGaP was developed by NHLBI for genome-wide association study data.

* Growing quickly; 285 studies as of May 8, 2012.
* Users are unable to reap the full benefits: data content is not standardized.
* The NHLBI-funded project at the UCSD Division of Biomedical Informatics that addresses these limitations is entitled Phenotype Discoverer; it was funded under the Phenotype Finder IN Database resources (PFINDR) initiative.

To enhance data retrieval, we developed two text classifiers and tested them against standard manual keyword search for accuracy, precision, recall and F-measure.

The current keyword-dependent search function in dbGaP is insufficient for focused data retrieval. The low level of precision in all categories indicates that researchers using these search strategies need to spend significant time and effort to retrieve relevant studies.

Standardized annotation of data content for data retrieval is needed to facilitate validating studies, discovering control populations, performing virtual experiments, meta-analysis and more. The dbGaP content will be standardized using a text-mining approach based on natural language processing (NLP) and semantic integration. The database will be fully annotated using standardized terminologies. In addition, a new user-friendly query interface will be created.


3. PREREQUISITES

The following prerequisite packages must be installed on your system. Newer versions should also work, but the code has only been run and tested with the versions listed below.

* Python version 2.7.1
* NumPy (http://numpy.scipy.org/)
* SciPy version 0.10.1 (http://www.scipy.org/)
* scikit-learn version 0.10 (http://scikit-learn.org/stable/)
* textmining version 1.0 (http://www.christianpeccei.com/projects/textmining)


4. DOCUMENTATION

Rough draft:
https://github.com/abhishek-kumar/dbgap-classifier/blob/master/doc/dbgap-classifier-draft.pdf?raw=true


5. CONTACT

Abhishek Kumar
abhishek@ucsd.edu
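
NOTE ON EVALUATION MEASURES

The Background section names four measures (accuracy, precision, recall and F-measure) without defining them. The sketch below shows the standard definitions for a single query. It is not taken from classify.py; the function name retrieval_metrics, the study IDs and the collection size are all made up for illustration.

```python
from __future__ import division  # true division on Python 2.7 as well


def retrieval_metrics(relevant, retrieved, collection_size):
    """Score one query.

    relevant        -- set of study IDs that truly match the query
    retrieved       -- set of study IDs returned by the search or classifier
    collection_size -- total number of studies in the collection
    """
    tp = len(relevant & retrieved)       # relevant and retrieved
    fp = len(retrieved - relevant)       # retrieved but not relevant
    fn = len(relevant - retrieved)       # relevant but missed
    tn = collection_size - tp - fp - fn  # correctly left out

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    accuracy = (tp + tn) / collection_size

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}


# Hypothetical example: 10 studies in total, 4 relevant, 5 retrieved.
relevant = {"s1", "s2", "s3", "s4"}
retrieved = {"s1", "s2", "s5", "s6", "s7"}
scores = retrieval_metrics(relevant, retrieved, collection_size=10)
print(scores)
```

Low precision, as reported for keyword search above, means that many retrieved studies fall into the "retrieved but not relevant" group and have to be screened out by hand.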
abhishek-kumar/dbgap-classifier
A text classifier for retrieving relevant documents from dbGaP. This was used to produce results published in the proceedings of the American Thoracic Society, 2012.
Python