covid19-transcriptomics-pathogenesis-diagnostics-results

This repo houses code for the analyses in the pre-print "Upper airway gene expression differentiates COVID-19 from other acute respiratory illnesses and reveals suppression of innate immune responses by SARS-CoV-2" by Mick, Kamm, et al. available at https://doi.org/10.1101/2020.05.18.20105171.

The analyses include:

Differential gene expression on metagenomic RNA-sequencing data from nasopharyngeal/oropharyngeal swabs of patients with COVID-19, other viral acute respiratory illnesses (ARIs) or non-viral ARIs.
Cell type deconvolution of the bulk RNA-seq in the three patient groups.
Diagnostic classifier based on patient gene expression to distinguish COVID-19 from all other ARIs (viral or non-viral).

Sample metadata table

Sample metadata is at data/metatable.csv. This table has the following fields:

CZB_ID - unique sample identifier.
Sequencing_batch.
Gender - as reported by patient, and if missing/ambiguous imputed based on chr Y gene expression.
Age - as reported by patient, and if missing taken as the mean age of patients with age reported in the same patient group. Note that in this public repo we are required to censor patient age at 89 (see below).
SC2_PCR - the result of the clinical PCR test for SARS-CoV-2. This field determines whether a patient belongs to the COVID-19 patient group.
SC2_rpm - reads-per-million mapping to SARS-CoV-2 in the metagenomic sequencing data. These values were generated using the pipeline here: https://github.com/czbiohub/sc2-msspe-bioinfo. Human subtracted FASTQs will soon be available on SRA and we will update the information here.
IDSeq_sample_name - name of the matching IDSeq sample (see below).

A second table data/metatable_with_viral_status.csv is generated by code/viral_calls-2.R, which determines the final assignment of samples to the three patient groups depending on whether other respiratory pathogenic viruses were detected (see below). This table includes an additional field:
viral_status - either SC2 (for SARS-CoV-2), other_virus or no_virus.

A note on patient age

For privacy reasons, ages above 89 were censored in the public metadata table included in this repo. The results reported in the paper rely on the uncensored metadata, which means analyses that make use of age (differential expression and classifier with age covariate) cannot be reproduced precisely here. In practice, the difference in the results is infinitesimal since very few patients were older than 89. We have indicated in the code where the censored metadata table that is available in the repo can be swapped in to reproduce those analyses, up to a very small difference.

Viral calls

Viral calls are generated by the script code/viral_calls-1.R, which creates results/viral_calls/viruses_with_pval.csv, a table of viral read counts, along with p-values for being above background levels according to a negative binomial model. The second script code/viral_calls-2.R aggregates these results and combines them with the metadata table to create data/metatable_with_viral_status.csv.

This script relies on code from the package idseqr which is currently in alpha stage. The specific version of that package used for the analysis is included as a git submodule at code/idseqr.

The script expects the metagenomic count data to reside in data/viral_calls/sample_taxon_reports, but this data is not included in the repo due to size limitations. To download the data, register for a free account on IDSeq, then download the "Sample Taxon Reports" for the project covid19_transcriptomics_pathogenesis_diagnostics. The Download will prompt you to select a "Background" but it doesn't affect any of the variables we use so you can select any background (e.g. "NID Human CSF"). Unzip the reports and then move them into data/viral_calls/sample_taxon_reports. The metadata table described above includes the IDSeq sample name associated with each CZB_ID.

Gene counts table

Transcript quantification was performed with kallisto (v. 0.46.1, with bias correction) against an index consisting of transcripts of protein coding genes (ENSEMBL v. 99), cytosolic and mitochondrial ribosomal RNA and ERCC RNA standards.

Aggregation to the gene level was done with tximport using countsFromAbundance="lengthScaledTPM". The mapping from transcript ID to gene ID is in annotations/tx2gene.txt. Genes were retained for analysis if they had at least 10 counts in at least 20% of the samples in the dataset. The counts for those gene are in data/swab_gene_counts.csv. Note, gene counts in the table were not normalized further. Genes in the table are identified by their ENSEMBL ID. The mapping to gene symbol and chromosome is in annotations/gene2name.txt.

DE analysis

Pairwise DE analyses between the three patient groups as well as regression of DE gene counts against viral abundance (rpM) are performed in code/DE_analysis.R. Results are in results/DE.

Classifier analysis

classifier-1-generate-cv-folds.R generates the cross-validation folds.
classifier-2-fit-models.R fits the classifier models. Results are stored in results/classifier.
classifier-3-aggregate-results.py and classifier-4-sensitivity-specifity.R aggregate results and put them in figures/classifier.

Deconvolution analysis

We used the Human Lung Cell Atlas dataset Travaglini et al. (bioRxiv 2019) to generate the single cell signature files used as the basis for the deconvolution analysis in CiberSortX.

Cell deconvolution analysis is in code/host_deconvolution.ipynb, with results in results/deconvolution.

wconnell/covid19-transcriptomics-pathogenesis-diagnostics-results