Sex differences in viral entry protein expression, host responses to SARS-CoV-2, and in vitro responses to sex steroid hormone treatment in COVID-19

Epidemiological studies suggest that men have a higher COVID-19 mortality rate than women, yet the underlying biology is largely unknown. Here, we seek to delineate sex differences in the expression of the entry genes ACE2 and TMPRSS2, host responses to SARS-CoV-2, and in vitro responses to sex steroid hormone treatment. Using over 220,000 human gene expression profiles covering a wide range of ages, tissues, and diseases, we found that male samples show higher expression levels of ACE2 and TMPRSS2, especially in the older group (>60 years) and in the kidney. Analysis of 6,031 COVID-19 patients at Mount Sinai Health System revealed that men have significantly higher creatinine levels, an indicator of impaired kidney function. Further analysis of 782 COVID-19 patient gene expression profiles taken from upper airway and blood suggested that men and women show profound expression differences in their responses to SARS-CoV-2. Computational deconvolution analysis of these profiles revealed that male COVID-19 patients have enriched kidney-specific mesangial cells in blood compared to healthy patients. Finally, we observed that selective estrogen receptor modulators, but not other hormone drugs (agonists/antagonists of estrogen, androgen, and progesterone), could reduce SARS-CoV-2 infection in vitro.

This is the code repository for studying sex differences in COVID-19 using big data.

Repository structure

main-analysis

Code and input files for main analysis

raw: preprocessed data files for ARCHS4 GPL11154 (A), GEO GPL570 (G), and Treehouse OCTAD (T), obtained from machine learning prediction and human annotation (correction). These files are the input to the main analysis.
All R code for the main analysis, which generates the figures in the main text.
figure: figures generated by the main analysis. Only some figures are kept as examples because of space limits.

All R scripts were verified on RStudio Desktop 1.1.463. Note that the bootstrapping step may take a long time to finish.
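To give a rough sense of why the bootstrapping step is slow, here is a minimal Python sketch of a percentile bootstrap. It is not the repository's R code. This README does not say what quantity is resampled, so the male-versus-female difference in mean expression, the function name `bootstrap_mean_diff`, and the parameter `n_boot` are all assumptions made for illustration.

```python
import random
import statistics


def bootstrap_mean_diff(male, female, n_boot=1000, seed=0):
    """Percentile bootstrap for mean(male) - mean(female).

    Hypothetical helper: the statistic and the resampling scheme are
    assumptions, not taken from the repository's R scripts.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping group sizes fixed.
        m = rng.choices(male, k=len(male))
        f = rng.choices(female, k=len(female))
        diffs.append(statistics.mean(m) - statistics.mean(f))
    diffs.sort()
    # Integer arithmetic avoids floating-point surprises in the indices.
    lo_idx = (n_boot * 25) // 1000
    hi_idx = max((n_boot * 975) // 1000 - 1, lo_idx)
    observed = statistics.mean(male) - statistics.mean(female)
    return observed, diffs[lo_idx], diffs[hi_idx]


if __name__ == "__main__":
    # Placeholder expression values, not real data.
    male_expr = [3.0, 3.2, 3.1, 3.4]
    female_expr = [2.0, 2.1, 1.9, 2.2]
    print(bootstrap_mean_diff(male_expr, female_expr))
```

The cost grows with `n_boot` times the number of samples, and again with the number of genes or tissue groups tested, which is one plausible reason the step takes a while on large collections.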

ml-prediction

Code, input, and output for the machine learning models that predict gender, age, and tissue.

input: ARCHS4 GPL11154 is provided as an example. It is divided into 5 folds (fold_0 to fold_4) for cross-validation, plus fold_all for the final prediction.
All Python code for gender-tissue prediction (deep multi-task model) and age prediction (XGBoost model).
output: prediction results generated by the models.

All the Python scripts were tested under the following environment:

NVIDIA Driver Version: 440.100
CUDA Version: 10.2
Python Version: 3.7
PyTorch Version: 1.5.1

Python Dependencies:

numpy
pandas
sklearn
pickle
xgboost
keras
imblearn
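Before running the scripts, it can help to check which of these modules are importable. The sketch below uses only the standard library. The helper name `missing_modules` is hypothetical, and `torch` is added to the list on the assumption that the PyTorch version noted above means the scripts import it. Note that `pickle` ships with Python, and that `sklearn` and `imblearn` are the import names of the scikit-learn and imbalanced-learn packages.

```python
import importlib.util

# Import names from the dependency list; "torch" is an assumption
# based on the PyTorch version listed in the environment.
MODULES = ["numpy", "pandas", "sklearn", "pickle",
           "xgboost", "keras", "imblearn", "torch"]


def missing_modules(modules=MODULES):
    """Return the import names that cannot be found in this environment."""
    return [name for name in modules
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules are importable.")
```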

Command to run gender-tissue cross-validation:

python predict_gender_tissue.py --fold 0 --dataset GPL11154 --dropout 0.1

Command to run gender-tissue prediction:

python predict_gender_tissue.py --fold all --dataset GPL11154 --dropout 0.1
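To run every cross-validation fold and then the final prediction in one go, the two commands above can be generated in a loop. This is a minimal sketch: it assumes that `--fold` accepts 1 through 4 in the same way it accepts 0 and all (matching the fold_0 to fold_4 and fold_all inputs), and the helper name `build_commands` is hypothetical.

```python
import subprocess

# Assumption: folds 1-4 are passed the same way as the documented 0 and all.
FOLDS = ["0", "1", "2", "3", "4", "all"]


def build_commands(dataset="GPL11154", dropout=0.1):
    """Build one predict_gender_tissue.py command per fold."""
    return [
        ["python", "predict_gender_tissue.py",
         "--fold", fold,
         "--dataset", dataset,
         "--dropout", str(dropout)]
        for fold in FOLDS
    ]


def run_all(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the loop if a fold fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

With `dry_run=True` the script only prints the commands, which makes it easy to confirm the arguments before starting the GPU runs.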

Command to run age prediction:

python GPL11154_XGB_age.py

Patients_data_code

The raw expression data of the COVID-19 patients were obtained from GEO (https://www.ncbi.nlm.nih.gov/geo/) and are provided in the /data/ directory. The raw reads in all datasets were used to identify differentially expressed (DE) genes between every combination of infection and control groups for males and females, using the “Diff_Exp” utility provided in OCTAD (http://octad.org/). The DE genes from each dataset were used as input to “Venn_diagramm_and_enrichment.R” for comparison and gene ontology (GO) enrichment analysis. In addition, the normalized counts (TPM values) of the blood dataset (GSE157103) were used as input to “Immune_cell_composition_and_figure_code.R” to estimate cell composition and its association with COVID-19 severity in male and female patients.
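The comparison of DE gene lists across datasets amounts to computing the regions of a Venn diagram. The Python sketch below illustrates that idea only. It is not the logic of the R script, and the function name `venn_regions`, the group labels, and the gene identifiers are placeholders rather than results.

```python
def venn_regions(gene_sets):
    """Map each combination of group names to the genes found in
    exactly those groups (one entry per non-empty Venn region)."""
    regions = {}
    all_genes = set().union(*gene_sets.values())
    for gene in all_genes:
        # The key lists every group whose DE list contains this gene.
        key = tuple(sorted(name for name, genes in gene_sets.items()
                           if gene in genes))
        regions.setdefault(key, set()).add(gene)
    return regions


if __name__ == "__main__":
    # Placeholder DE gene lists, not real results.
    de_genes = {
        "airway_male": {"GENE_A", "GENE_B", "GENE_C"},
        "blood_male": {"GENE_C", "GENE_D"},
    }
    for groups, genes in sorted(venn_regions(de_genes).items()):
        print(groups, sorted(genes))
```

Each region's gene set could then be passed to a GO enrichment tool, which is the second half of what the R script is described as doing.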

other