SNP-Data-Analysis-Project

A GitHub repository compiling the input data, Python and Jupyter Notebook scripts, and all relevant statistical outputs from running the AutoMLPipe-BC automated machine learning pipeline (from the Urbanowicz Lab, https://github.com/UrbsLab) on a large-scale single nucleotide polymorphism (SNP) dataset from patients with congenital heart disease (CHD).

Primary language: Jupyter Notebook

SNP-Data-Analysis-Project

In this PURM research project, I applied the AutoMLPipe-BC automated machine learning pipeline from the Urbanowicz Lab to run a full statistical analysis of a single nucleotide polymorphism (SNP) dataset from patients with congenital heart disease (CHD). The goal of the analysis was to identify the SNPs whose feature values are most strongly correlated with disease outcome. My responsibilities included substantial data wrangling and preprocessing in Python 3 and Jupyter Notebook, both to prepare the very large SNP dataset for initial exploratory analysis and to let the full AutoML pipeline run smoothly. In addition, I simplified the imputation step of the pipeline from a multivariate to a univariate function. Possible future extensions include a multivariate imputation step that iterates over batches of data rather than the full dataset, and support for more complex, even non-tabular, datasets.
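To illustrate what a univariate imputation step means in practice, here is a minimal sketch in plain Python. It is not the code used in this project or in AutoMLPipe-BC: the function names, the `None` missing-value marker, and the 0/1/2 genotype coding are all assumptions made for the example. Each missing value is filled using only its own column (here, the column's most common observed value), whereas a multivariate imputer would also use the other columns to predict it. In scikit-learn, the comparable tools would be `SimpleImputer(strategy="most_frequent")` for the univariate case and `IterativeImputer` for the multivariate case.

```python
from collections import Counter

MISSING = None  # assumed missing-value marker for this sketch


def column_modes(rows):
    """Return the most common observed (non-missing) value of each column."""
    n_cols = len(rows[0])
    modes = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not MISSING]
        # If a column has no observed values, there is nothing to impute with.
        modes.append(Counter(observed).most_common(1)[0][0] if observed else MISSING)
    return modes


def impute_univariate(rows):
    """Fill each missing entry with its own column's mode; input is not modified."""
    modes = column_modes(rows)
    return [
        [modes[j] if value is MISSING else value for j, value in enumerate(row)]
        for row in rows
    ]


# Toy data: rows are patients, columns are SNPs, values are hypothetical
# genotype codes (0, 1, or 2 copies of the minor allele).
genotypes = [
    [0, 1, 2],
    [0, None, 2],
    [1, 1, None],
    [None, 1, 0],
]

imputed = impute_univariate(genotypes)
for row in imputed:
    print(row)
```

Because each column is handled independently, this approach needs only one pass over the data per column, which is one reason a univariate step can be much cheaper than a multivariate one on a very wide SNP matrix.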