/An-explainable-model-of-host-genetic-interactions-linked-to-Covid19-severity

This project focused on the mapping of the host-genetics factors determining COVID-19 severity using Machine learning approaches (supervised, unsupervised Machine learning methods, Pathway signaling processes, and Open Targets web-based Platform). Our study utilized the whole-exome sequencing genome dataset of 2000 European descent patients collected from the GEN-COVID Multicenter Study group (https://clinicaltrials.gov/ct2/show/NCT04549831) coordinated by the University of Siena. The whole-exome genome sequencing dataset contained 1.057M genetic variants of the patients. We used the 2000 patients’ original phenotype information to filter only patients with severity and asymptomatic across all classification criteria (841 patients). We introduced an innovative variant screening strategy that applied K-stratified fold splits of the original dataset to randomly draw a unique 5-fold pool of variants using the patients’ original phenotype information (841 unique patients).

Primary LanguageJupyter Notebook

An Explainable Model of Host Genetic Interactions Linked to Covid-19 Severity

Introduction

This project employed a multifaceted computational strategy to identify the genetic factors contributing to increased risk of severe COVID-19 infection from a Whole Exome Sequencing (WES) dataset of a cohort of 2000 Italian patients. We used a stratified k-fold screening, to rank variants more associated with severity, with training of multiple supervised classifiers, to predict severity on the basis of screened features. More details are found on (DOI:10.21203/rs.3.rs-1062190/v1).

Project Tasks were organized as follows:

  1. Variant screening;

    1a. Stratified K-fold split of sample cohort into train and test sets;

    1b. Variant screening;

    1c. Feature matrix generation;

  2. Supervised Machine Learning Techniques;

    2a. Feature selection: Removal of Multicollinearity;

    2b. Supervised Binary Classification;

    2c. Feature importance scores;

    2d. Final testing on a follow-up cohort;

  3. Unsupervised Machine Learning Technqiues;

    3a. Principal Component Analysis (PCA) and clustering

  4. Variant functional analysis;

    4a. Pathway Enrichment Analysis;

    4b. Retrieving associations between variants and disease traits or phenotypes.

Authors: Anthony Onoja, Nicola Picchiotti, Chiara Fallerini, Margherita Baldassarri, Francesca Fava, Francesca Colombo, Francesca Chiaromonte, Alessandra Renieri, Simone Furini, Francesco Raimondi.

Software and Packages: All the analyses were performed using a customized Python script,Jupyter notebook, with the following libraries: scipy 1.2.0, numpy 1.19.4, scikit-learn 0.23.2., statsmodels 0.11.0 and matplotlib 3.2.1

Note: The WES dataset (variant stat., and feature count matrices) that support the findings of this study are available from GEN-COVID Multicenter Study group (https://clinicaltrials.gov/ct2/show/NCT04549831) and Francesco Raimondi (Ph.D.) head of Bioinformatics group at Bio@SNS lab but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly made available. Data are however available from the authors upon reasonable request and with permission of GEN-COVID Multicenter Study group.

Date: 10/11/2021