This repository contains part of the workflows for reproducing the results from the bioRxiv paper "scRNA-seq analysis of colon and esophageal tumors uncovers abundant microbial reads in myeloid cells undergoing proinflammatory transcriptional alterations" by Welles Robinson, Josh Stone, Fiorella Schischlik, Billel Gasmi, Michael Kelly, Charlie Seibert, Kimia Dadkhah, E. Michael Gertz, Joo Sang Lee, Kaiyuan Zhu, Lichun Ma, Xin Wang, S. Cenk Sahinalp, Rob Patro, Mark D.M. Leiserson, Curtis Harris, Alejandro A. Schäffer, and Eytan Ruppin.
It contains the workflows that analyze microbial reads from 10x and Smart-seq2 scRNA-seq datasets to identify microbial taxa that are differentially abundant or differentially present. Before running this code, the microbial reads must be identified using the CSI-Microbes-identification repository.
The code in this repository was written by Welles Robinson and Fio Schischlik and alpha-tested by Alejandro Schäffer.
This workflow has been tested on macOS Mojave (10.14.6) and on Linux (Biowulf). All steps require at least 10 GB of RAM except for Figure 5A, which requires 30 GB. This workflow expects that conda has been installed. For instructions on how to install conda, see the conda installation documentation.
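Before installing, it can help to confirm that conda is available on your PATH. The following is a minimal sketch, assuming a POSIX shell; the helper name `require_cmd` is hypothetical and is not part of this repository.

```shell
# Hypothetical helper (not part of CSI-Microbes-analysis):
# report whether a command is available on PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# Example usage:
#   require_cmd conda || echo "install conda before continuing"
```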
It should take less than 30 minutes to install the software, which involves downloading the codebase and setting up the environment (not including the time needed to unzip the files, which depends on the OS). There are two ways to download the codebase. To reproduce the key results from our paper, we recommend downloading the latest version of CSI-Microbes-analysis from Zenodo, which contains the intermediate files generated using CSI-Microbes-identification. The intermediate files for a given dataset are located in the <dataset_of_interest>/raw directory. For example, the intermediate files needed to reproduce Aulicino2018 are in Aulicino2018/raw.
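To check that the intermediate files for a dataset are in place before running a workflow, something like the following may be useful. This is a sketch under the assumption that a non-empty <dataset_of_interest>/raw directory is a sufficient check; the helper name `has_intermediate_files` is hypothetical, and it does not verify the contents of the files.

```shell
# Hypothetical helper (not part of CSI-Microbes-analysis):
# succeed only if <root>/<dataset>/raw exists and contains at least one entry.
# Usage: has_intermediate_files <dataset> [root directory, default "."]
has_intermediate_files() {
    dataset="$1"
    root="${2:-.}"
    raw_dir="$root/$dataset/raw"
    if [ -d "$raw_dir" ] && [ -n "$(ls -A "$raw_dir" 2>/dev/null)" ]; then
        echo "ok: $raw_dir"
    else
        echo "missing or empty: $raw_dir" >&2
        return 1
    fi
}

# Example usage, run from the top of the unzipped Zenodo download:
#   has_intermediate_files Aulicino2018
```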
The second way to download the codebase is to clone the GitHub repository as shown below (note that the clone does not contain the intermediate files). The instructions below assume that you have an ssh key associated with your GitHub account. If you do not, you can generate a new ssh key and associate it with your GitHub username by following these instructions.
git clone git@github.com:ruppinlab/CSI-Microbes-analysis.git
Once the codebase is downloaded, you need to create the conda environment (you need to perform this step only once unless you explicitly delete the conda environment).
cd CSI-Microbes-analysis
conda env create -f envs/CSI-Microbes-analysis.yaml
Finally, you need to activate the newly created conda environment (all of the commands assume that the conda environment CSI-Microbes-env is active).
conda activate CSI-Microbes-env
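Because every later command assumes that the environment is active, a quick check can save a confusing failure. The sketch below relies on the `CONDA_DEFAULT_ENV` variable that `conda activate` sets; the helper name `env_is_active` is hypothetical. It assumes the environment was created by name. If it was created with a prefix path instead, the variable holds that path and this check would need adjusting.

```shell
# Hypothetical helper (not part of CSI-Microbes-analysis):
# succeed only if the named conda environment is the active one.
# Usage: env_is_active [environment name, default CSI-Microbes-env]
env_is_active() {
    expected="${1:-CSI-Microbes-env}"
    if [ "${CONDA_DEFAULT_ENV:-}" = "$expected" ]; then
        echo "active: $expected"
    else
        echo "not active: $expected (current: ${CONDA_DEFAULT_ENV:-none})" >&2
        return 1
    fi
}

# Example usage:
#   env_is_active || conda activate CSI-Microbes-env
```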
CSI-Microbes-analysis depends on the following software packages, which are installed via the conda channels conda-forge, bioconda and defaults: dplyr (1.0.5; Wickham et al., 2021), ggforce (0.3.3; Pedersen, 2021), ggplot2 (3.3.3; Wickham, 2016), ggpubr (0.4.0; Kassambara, 2020), rpy2 (3.4.4), scater (1.16.0; McCarthy et al., 2017), scran (1.16.0; Lun et al., 2016), SingleCellExperiment (1.10.1; Lun and Risso, 2020), Snakemake (6.2.1; Köster and Rahmann, 2012), and Seurat (4.0.1; Hao and Hao et al., 2020).
The reproduction of key results and figures from the paper requires intermediate files generated by CSI-Microbes-identification and available for download from Zenodo.
To reproduce the results from Aulicino2018 (Aulicino et al., 2018), you first need to be in the Aulicino2018 directory.
cd Aulicino2018
Then use snakemake to reproduce the key results:
snakemake --cores <number of CPUs> --use-conda all
To reproduce the results from Ben-Moshe2019 (Bossel Ben-Moshe et al., 2019), you first need to be in the Ben-Moshe2019 directory.
cd Ben-Moshe2019
Then use snakemake to reproduce the key results:
snakemake --cores <number of CPUs> --use-conda all
The results from Pelka2021 (Pelka et al., 2021) are split into two directories, one for the microbial results and one for the human results. This example shows how to reproduce the microbial results; the other directory works in a very similar way.
cd Pelka2021
Then use snakemake to reproduce the key results:
snakemake --cores <number of CPUs> --use-conda all
The results from Robinson2023 are split into four directories, by microbial vs. human results and by 10x vs. plexWell. This example shows how to reproduce the microbial results from the 10x dataset; the other directories work in a very similar way.
cd Robinson2023-10x
Then use snakemake to reproduce the key results:
snakemake --cores <number of CPUs> --use-conda all
The results from Zhang2021 (Zhang et al., 2021) are split into two directories, one for the microbial results and one for the human results. This example shows how to reproduce the microbial results; the other directory works in a very similar way.
cd Zhang2021
Then use snakemake to reproduce the key results:
snakemake --cores <number of CPUs> --use-conda all
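Since every dataset uses the same two commands, the per-dataset steps above can be sketched as a loop. The helper below is hypothetical and only prints the commands so they can be reviewed first. It covers only the five directories named in the examples above; the directories for the human results and the plexWell dataset are not named here, so they are left out.

```shell
# Hypothetical helper (not part of CSI-Microbes-analysis):
# print one snakemake invocation per dataset directory named in the examples above.
# Usage: print_run_commands [number of CPUs, default 1]
print_run_commands() {
    cores="${1:-1}"
    for dataset in Aulicino2018 Ben-Moshe2019 Pelka2021 Robinson2023-10x Zhang2021; do
        # Each command runs in a subshell so the working directory is restored afterwards.
        echo "(cd $dataset && snakemake --cores $cores --use-conda all)"
    done
}

# Example usage, from the top-level CSI-Microbes-analysis directory:
#   print_run_commands 4        # review the commands
#   print_run_commands 4 | sh   # execute them one after another
```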
Aulicino, A. et al. Invasive Salmonella exploits divergent immune evasion strategies in infected and bystander dendritic cell subsets. Nat. Commun. 9, 4883 (2018).
Bossel Ben-Moshe, N. et al. Predicting bacterial infection outcomes using single cell RNA-sequencing analysis of human immune cells. Nat. Commun. 10, 3266 (2019).
Paulson, K. G. et al. Acquired cancer resistance to combination immunotherapy from transcriptional loss of class I HLA. Nat. Commun. 9, 3868 (2018).
Pelka, K. et al. Spatially organized multicellular immune hubs in human colorectal cancer. Cell (2021).
Zhang, X. et al. Dissecting esophageal squamous-cell carcinoma ecosystem by single-cell transcriptomic analysis. Nat. Commun. 12, 5291 (2021).
Wickham, H., François, R., Henry, L. and Müller, K. (2021). dplyr: A Grammar of Data Manipulation. R package version 1.0.5. https://CRAN.R-project.org/package=dplyr
Pedersen, T.L. (2021). ggforce: Accelerating 'ggplot2'. R package version 0.3.3. https://CRAN.R-project.org/package=ggforce
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org.
Kassambara, A. (2020). ggpubr: 'ggplot2' Based Publication Ready Plots. R package version 0.4.0. https://CRAN.R-project.org/package=ggpubr.
rpy2. https://rpy2.github.io/
McCarthy, D. J., Campbell, K. R., Lun, A. T. L. and Wills, Q. F. (2017). Scater: pre-processing, quality control, normalisation and visualisation of single-cell RNA-seq data in R. Bioinformatics, 33, 1179-1186. https://doi.org/10.1093/bioinformatics/btw777
Lun, A. T. L., McCarthy, D. J. & Marioni, J. C. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor [version 2; referees: 3 approved, 2 approved with reservations]. F1000Research 5 (2016). https://github.com/MarioniLab/scran
Lun, A. and Risso, D. (2020). SingleCellExperiment: S4 Classes for Single Cell Data. R package version 1.10.1.
Köster, J., & Rahmann, S. (2012). Snakemake-a scalable bioinformatics workflow engine. Bioinformatics, 28(19), 2520–2522. https://doi.org/10.1093/bioinformatics/bts480
Hao and Hao et al. Integrated analysis of multimodal single-cell data. bioRxiv (2020) [Seurat V4]