Location and condition based reconstruction of colon cancer microbiome from human RNA sequencing data
The association between microbes and cancer has been reported repeatedly; however, it is not clear if molecular tumour properties are connected to specific microbial colonisation patterns. This is due mainly to the current technical and analytical strategy limitations to characterise tumour-associated bacteria. Here, we propose an approach to detect bacterial signals in human RNA sequencing data and associate them with the clinical and molecular properties of the tumours. The method was tested on public datasets from The Cancer Genome Atlas and its accuracy was assessed on a new cohort of colorectal cancer patients. Our analysis shows that intratumoural microbiome composition is correlated with survival, anatomic location, microsatellite instability, consensus molecular subtype and immune cell infiltration in colon tumours. In particular, we find Faecalibacterium prausnitzii, Coprococcus comes, Bacteroides spp., Fusobacterium spp. and Clostridium spp. to be strongly associated with tumour properties. We implemented an approach to concurrently analyze clinical and molecular properties of the tumour as well as the composition of the associated microbiome. Our results may improve patient stratification and pave the path for mechanistic studies on microbiota-tumour crosstalk.
This project was developed on a cluster of 12 computing nodes of 28 cores, 128GB of ram running CentOS Linux 7 (Core) (alignment to microbial genomes, second step) and a local computer of 6 cores, 15.5GB of ram running Ubuntu 20.04.4 LTS (Focal Fossa). The alignment to microbial genomes is executed by Pathseq from GATK (version 4.0.10.1). All the scripts are written in R (R version 3.6.1 (2019-07-05) -- "Action of the Toes"). The scripts were tested also on a laptop with 4 cores, 8 GB of ram running Windows 11 Pro 21H2.
To execute all the steps of this workflow, you must pull our docker container and work in it (suggested approach) or manually download the required packages (check the requirements.txt file for R package versions).
The first step is to clone this repository (you need git, if you haven't installed it, check this link):
git clone https://github.com/SamGa3/microbiome_reconstruction.git
To install Docker on your local computer, follow the instructions described here. Remember that you need root permissions to install and run docker. You can then pull the microbiome_reconstruction container:
# Linux/Ubuntu users
docker pull gaiasamb/microbiome_reconstruction
# Windows users
docker pull "gaiasamb/microbiome_reconstruction"
Following this tutorial, you will test the microbial reads extraction from bam files using a toy example and reproduce the figures of the paper for COAD samples. To obtain the same results for another cancer type, you need to adapt the scripts accordingly to input that cancer-type-specific data provided or your own data.
The scripts use Rmarkdown and produce an html file that wraps together the figures. When needed for further analyses, tables are also produced. The structure of this workflow has a scripts folder that contains all the required scripts, categorised by type of analysis. The results of these scripts are stored in a folder with a similar structure to the results folder. Each analysis requires its own functions, written in the functions.R file in each script subfolder, while a general set of functions needed by multiple scripts is listed in general_functions.R script. All the metadata files and the tables produced by the workflow use a "file_id" column as key to link the information of each sample (for TCGA samples it is the file_id provided by the gdc database). To easily recover the patient's barcode, an extra table linking the file_id to the TCGA barcode has been added. IEO metadata are available under request.
You must create the folder structure:
# Linux/Ubuntu users
chmod u+wrx scripts/make_structure.sh
./scripts/make_structure.sh
# Windows users
"scripts/make_structure.sh"
After pulling the docker container as explained above, you must activate it in interactive way:
# Linux/Ubuntu users
sudo docker run -it -e DISPLAY -v /your/path/to/microbiome_reconstruction/:/microbiome_reconstruction 8086d9d9c85a /bin/bash
# Windows users
docker run -it -e DISPLAY -v /your/path/to/microbiome_reconstruction/:/microbiome_reconstruction 8086d9d9c85a /bin/bash
Where 7a13d2e31810 is the image id of the repository, to check your own image id you can run:
sudo docker images
This step was run on the IEO cluster using 16 cores with 50GB of memory, taking advantage of Singularity and gatk docker image. Here, as an example, the command to run Pathseq on the file from sample1:
gatk PathSeqPipelineSpark \
--input data/RNAseq/input_bam/sample1/input_bam.bam \
--filter-bwa-image data/pathseq_tools/pathseq_host.fa.img \
--is-host-aligned true \
--kmer-file data/pathseq_tools/pathseq_host.bfi \
--microbe-fasta data/pathseq_tools/pathseq_microbe.fa \
--microbe-bwa-image data/pathseq_tools/pathseq_microbe.fa.img \
--taxonomy-file data/pathseq_tools/pathseq_taxonomy.db \
--output data/RNAseq/pathseq_output/sample1/bam_out.bam \
--score-metrics data/RNAseq/pathseq_output/sample1/scores.txt \
--filter-metrics data/RNAseq/pathseq_output/sample1/metrics.txt \
--scores-output data/RNAseq/pathseq_output/sample1/score_out.txt \
--spark-runner LOCAL \
--spark-master local[16]
Since this step requires a lot of time, we provide here a toy sample to test Pathseq on your local computer. This example was taken from gatk Pathseq tutorial. The first step is to download the example references:
cd /microbiome_reconstruction/data/pathseq_tools
wget 'ftp://gsapubftp-anonymous@ftp.broadinstitute.org/tutorials/datasets/tutorial_10913.tar.gz' -P /microbiome_reconstruction/data/pathseq_tools
tar –xvf tutorial_10913.tar.gz
Three samples were downloaded from Zmora et al. from ENA with the ERR2756905-7 accession number and aligned with STAR-2.7.7a following this pipeline. We selected the unmapped reads with samtools. The resulting files are located in /microbiome_reconstruction/data/RNAseq/input_bam/ folder. Then you can run the example (a few minutes per sample):
cd /microbiome_reconstruction
gatk PathSeqPipelineSpark \
--input data/RNAseq/input_bam/example1/ERR2756905.bam \
--filter-bwa-image data/pathseq_tools/pathseq_tutorial/hg19mini.fasta.img \
--is-host-aligned true \
--kmer-file data/pathseq_tools/pathseq_tutorial/hg19mini.hss \
--microbe-fasta data/pathseq_tools/pathseq_tutorial/e_coli_k12.fasta \
--microbe-bwa-image data/pathseq_tools/pathseq_tutorial/e_coli_k12.fasta.img \
--taxonomy-file data/pathseq_tools/pathseq_tutorial/e_coli_k12.db \
--output data/RNAseq/pathseq_output/ERR2756905_out/bam_out.bam \
--score-metrics data/RNAseq/pathseq_output/ERR2756905_out/scores.txt \
--filter-metrics data/RNAseq/pathseq_output/ERR2756905_out/metrics.txt \
--scores-output data/RNAseq/pathseq_output/ERR2756905_out/score_out.txt \
--spark-runner LOCAL \
--spark-master local[4]
gatk PathSeqPipelineSpark \
--input data/RNAseq/input_bam/example2/ERR2756906.bam \
--filter-bwa-image data/pathseq_tools/pathseq_tutorial/hg19mini.fasta.img \
--is-host-aligned true \
--kmer-file data/pathseq_tools/pathseq_tutorial/hg19mini.hss \
--microbe-fasta data/pathseq_tools/pathseq_tutorial/e_coli_k12.fasta \
--microbe-bwa-image data/pathseq_tools/pathseq_tutorial/e_coli_k12.fasta.img \
--taxonomy-file data/pathseq_tools/pathseq_tutorial/e_coli_k12.db \
--output data/RNAseq/pathseq_output/ERR2756906_out/bam_out.bam \
--score-metrics data/RNAseq/pathseq_output/ERR2756906_out/scores.txt \
--filter-metrics data/RNAseq/pathseq_output/ERR2756906_out/metrics.txt \
--scores-output data/RNAseq/pathseq_output/ERR2756906_out/score_out.txt \
--spark-runner LOCAL \
--spark-master local[4]
gatk PathSeqPipelineSpark \
--input data/RNAseq/input_bam/example3/ERR2756907.bam \
--filter-bwa-image data/pathseq_tools/pathseq_tutorial/hg19mini.fasta.img \
--is-host-aligned true \
--kmer-file data/pathseq_tools/pathseq_tutorial/hg19mini.hss \
--microbe-fasta data/pathseq_tools/pathseq_tutorial/e_coli_k12.fasta \
--microbe-bwa-image data/pathseq_tools/pathseq_tutorial/e_coli_k12.fasta.img \
--taxonomy-file data/pathseq_tools/pathseq_tutorial/e_coli_k12.db \
--output data/RNAseq/pathseq_output/ERR2756907_out/bam_out.bam \
--score-metrics data/RNAseq/pathseq_output/ERR2756907_out/scores.txt \
--filter-metrics data/RNAseq/pathseq_output/ERR2756907_out/metrics.txt \
--scores-output data/RNAseq/pathseq_output/ERR2756907_out/score_out.txt \
--spark-runner LOCAL \
--spark-master local[4]
From this step on, we will run R scripts only (HUMAnN tool is the only exception, a summary of all the scripts needed to reproduce this paper figures are listed in a R file that can be run outside R using Rscript command), so you need to activate R from the microbiome_reconstruction folder.
cd /microbiome_reconstruction
../R-3.6.1/bin/R
This step of the pipeline creates a table of unambiguous reads and scores of all samples analysed by Pathseq. The results of this step are already available in the final folders (/microbiome_reconstruction/data/RNAseq/bacteria/raw/unamb and /microbiome_reconstruction/data/RNAseq/bacteria/score) for all the analysed TCGA cancer types and IEO cohort samples. Here we apply the pipeline to the toy samples previously analysed to test it on your local computer.
rmarkdown::render("scripts/microbes_values/microbiome_table_maker.Rmd",
params = list(
sample_sheet = "../../metadata/example/gdc_sample_sheet_fac_simile.tsv",
input_folder = "../../data/RNAseq/pathseq_output/",
output_table = "../../data/RNAseq/bacteria/raw/unamb/example_bacteria_species_unamb.txt",
kingdom_level = "Bacteria",
taxon_level = "species",
type_of_value = "unambiguous"
)
)
rmarkdown::render("scripts/microbes_values/microbiome_table_maker.Rmd",
params = list(
sample_sheet = "../../metadata/example/gdc_sample_sheet_fac_simile.tsv",
input_folder = "../../data/RNAseq/pathseq_output/",
output_table = "../../data/RNAseq/bacteria/raw/score/example_bacteria_species_score.txt",
kingdom_level = "Bacteria",
taxon_level = "species",
type_of_value = "score"
)
)
This step of the pipeline creates the table of unambiguous scores. To do this you need both the table of unambiguous reads and score in the data folder. These two tables have been created summarising each sample table output of pathseq (step 3). As suggested by Pathseq, some taxonomic_id must be merged together. The list of these taxa is in the merged.dmp file provided with Pathseq.
rmarkdown::render("scripts/microbes_values/microbiome_estimation.Rmd",
params = list(
unamb="../../data/RNAseq/bacteria/raw/unamb/example_bacteria_species_unamb.txt",
score="../../data/RNAseq/bacteria/raw/score/example_bacteria_species_score.txt",
unamb_score_norm_tab="../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/example/example_bacteria_species_merged_unamb_score_norm.txt"
)
)
To obtain all the tables needed to reproduce this paper results, you can run this file that wraps up all the runs needed.
../R-3.6.1/bin/Rscript scripts/microbes_values/microbiome_estimation_commands.R
This step selects the samples you are interested in given their properties (e.g. only primary tumour, by read length, or to remove duplicates) and creates a new table with the subset of the analysed samples. Here we show how to select only primary, not duplicated COAD samples.
rmarkdown::render("scripts/microbes_values/sample_selection.Rmd",
params = list(
metadata = c("../../metadata/COAD/COAD_technical_metadata.txt", "../../metadata/COAD/COAD_clinical_metadata.txt"),
match_metadata = "file_id",
taxa_tab = "../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/COAD/COAD_bacteria_species_merged_unamb_score_norm.txt",
match_taxa = "rownames",
properties = c("sample_type", "is_ffpe", "library_name"),
selection = list("Primary Tumor", "NO", c("Illumina TruSeq", "unknown")),
output = "../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/COAD/COAD_selectedTumor_bacteria_species_merged_unamb_score_norm.txt"
)
)
A list of all the subset of samples used in this paper is available in scripts/microbes_values/sample_selection_commands.R and can be run altogether:
../R-3.6.1/bin/Rscript scripts/microbes_values/sample_selection_commands.R
The bacterial signal detected can be affected by technical issues. This step detects the strongest batch effect by selecting the technical property from TCGA metadata that better describes the differences between samples for each cancer type:
rmarkdown::render("scripts/technical_batch_effect/batch_detection.Rmd",
params = list(
tissues = c("COAD", "GBM", "LUAD", "LUSC", "HNSC", "OV", "READ", "SKCM", "BRCA"),
metadata = c("../../metadata/COAD/COAD_technical_metadata.txt", "../../metadata/GBM/GBM_technical_metadata.txt",
"../../metadata/LUAD/LUAD_technical_metadata.txt", "../../metadata/LUSC/LUSC_technical_metadata.txt",
"../../metadata/HNSC/HNSC_technical_metadata.txt", "../../metadata/OV/OV_technical_metadata.txt",
"../../metadata/READ/READ_technical_metadata.txt", "../../metadata/SKCM/SKCM_technical_metadata.txt",
"../../metadata/BRCA/BRCA_technical_metadata.txt"),
new_property = list(c(old="plate_id", met="corr_plate_id", new_name="corr_plate_id")),
taxa = c("../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/COAD/COAD_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
"../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/GBM/GBM_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
"../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/LUAD/LUAD_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
"../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/LUSC/LUSC_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
"../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/HNSC/HNSC_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
"../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/OV/OV_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
"../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/READ/READ_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
"../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/SKCM/SKCM_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
"../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/BRCA/BRCA_selectedTumor_bacteria_species_merged_unamb_score_norm.txt"
)
),
output_file = "../../results/technical_batch_effect/COAD_LUAD_LUSC_HNSC_OV_READ_SKCM_BRCA_bacteria_species_merged_unamb_score_norm_batch_detection.html"
)
A list of all the tests done in this paper is available in scripts/technical_batch_effect/batch_effect_detection_commands.R and can be run altogether:
../R-3.6.1/bin/Rscript scripts/technical_batch_effect/batch_effect_detection_commands.R
This step corrects the batch effect detected in the previous step. Here we show how to correct the batch effect due to the plate_id (corrected to sum together plate_ids with too few samples) in COAD:
rmarkdown::render("scripts/technical_batch_effect/batch_correction.Rmd",
params = list(
metadata = c("../../metadata/COAD/COAD_clinical_metadata.txt", "../../metadata/COAD/COAD_technical_metadata.txt"),
join = "columns",
taxa = "../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/COAD/COAD_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
new_property = list(c(old="plate_id", met="corr_plate_id", new_name="corr_plate_id")),
property = "corr_plate_id",
output = "../../data/RNAseq/bacteria/ComBat_plate_id/merged_unamb_score_norm/COAD_ComBat_corr_plate_id_selectedTumor_bacteria_species_merged_unamb_score_norm.txt"
),
output_file = "../../results/technical_batch_effect/COAD_selectedTumor_ComBat_batch_correction_corr_plate_id.html"
)
A list of all the corrections used in this paper is in scripts/technical_batch_effect/batch_correction_commands.R and can be run altogether:
../R-3.6.1/bin/Rscript scripts/technical_batch_effect/batch_correction_commands.R
To compare the differences of bacteria estimations before and after the batch correction of plate_id in COAD, LUAD, LUSC, HNSC, OV, READ, SKCM and BRCA samples:
rmarkdown::render("scripts/technical_batch_effect/batch_correction_comparison.Rmd",
params = list(
tissues = c("COAD", "LUAD", "LUSC", "HNSC", "OV", "READ", "SKCM", "BRCA"),
metadata = list("../../metadata/COAD/COAD_technical_metadata.txt",
"../../metadata/LUAD/LUAD_technical_metadata.txt",
"../../metadata/LUSC/LUSC_technical_metadata.txt",
"../../metadata/HNSC/HNSC_technical_metadata.txt",
"../../metadata/OV/OV_technical_metadata.txt",
"../../metadata/READ/READ_technical_metadata.txt",
"../../metadata/SKCM/SKCM_technical_metadata.txt",
"../../metadata/BRCA/BRCA_technical_metadata.txt"),
taxa_raw = c("../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/COAD/COAD_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
"../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/LUAD/LUAD_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
"../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/LUSC/LUSC_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
"../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/HNSC/HNSC_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
"../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/OV/OV_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
"../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/READ/READ_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
"../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/SKCM/SKCM_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
"../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/BRCA/BRCA_selectedTumor_bacteria_species_merged_unamb_score_norm.txt"),
taxa_corrected = c("../../data/RNAseq/bacteria/ComBat_plate_id/merged_unamb_score_norm/COAD_ComBat_corr_plate_id_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
"../../data/RNAseq/bacteria/ComBat_plate_id/merged_unamb_score_norm/LUAD_ComBat_corr_plate_id_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
"../../data/RNAseq/bacteria/ComBat_plate_id/merged_unamb_score_norm/LUSC_ComBat_corr_plate_id_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
"../../data/RNAseq/bacteria/ComBat_plate_id/merged_unamb_score_norm/HNSC_ComBat_corr_plate_id_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
"../../data/RNAseq/bacteria/ComBat_plate_id/merged_unamb_score_norm/OV_ComBat_corr_plate_id_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
"../../data/RNAseq/bacteria/ComBat_plate_id/merged_unamb_score_norm/READ_ComBat_corr_plate_id_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
"../../data/RNAseq/bacteria/ComBat_plate_id/merged_unamb_score_norm/SKCM_ComBat_corr_plate_id_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
"../../data/RNAseq/bacteria/ComBat_plate_id/merged_unamb_score_norm/BRCA_ComBat_corr_plate_id_selectedTumor_bacteria_species_merged_unamb_score_norm.txt"),
batches = rep("plate_id", 8)
),
output_file = "../../results/technical_batch_effect/COAD_LUAD_LUSC_HNSC_OV_READ_SKCM_BRCA_ComBat_corr_plate_id_bacteria_species_batch_comparison_plate_id.html"
)
Since COAD and READ show a strong batch effect due to the read length used to sequence the samples (48 and 76bp), here we test the differences between samples with different read_length before and after the batch correction of plate_id in COAD and READ samples:
rmarkdown::render("scripts/technical_batch_effect/batch_correction_comparison.Rmd",
params = list(
tissues = c("COAD", "READ"),
metadata = c("../../metadata/COAD/COAD_technical_metadata.txt", "../../metadata/READ/READ_technical_metadata.txt"),
taxa_raw = c("../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/COAD/COAD_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
"../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/READ/READ_selectedTumor_bacteria_species_merged_unamb_score_norm.txt"),
taxa_corrected = c("../../data/RNAseq/bacteria/ComBat_plate_id/merged_unamb_score_norm/COAD_ComBat_corr_plate_id_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
"../../data/RNAseq/bacteria/ComBat_plate_id/merged_unamb_score_norm/READ_ComBat_corr_plate_id_selectedTumor_bacteria_species_merged_unamb_score_norm.txt"),
batches = rep("read_length", 2)
),
output_file = "../../results/technical_batch_effect/COAD_READ_ComBat_corr_plate_id_bacteria_species_batch_comparison_read_length.html"
)
A list of all the comparisons done in this paper is in scripts/technical_batch_effect/batch_effect_comparison_commands.R and can be run altogether:
../R-3.6.1/bin/Rscript scripts/technical_batch_effect/batch_effect_comparison_commands.R
This represents one of the most important steps of the workflow. It applies PCA to the reconstructed microbiome and associates the PCs to the properties of the tumour. Here is the script to find the associations between the COAD reconstructed microbiome and the clinical tumour properties.
rmarkdown::render("scripts/property_association/diversity.Rmd",
params = list(
metadata = c("../../metadata/COAD/COAD_technical_metadata.txt", "../../metadata/COAD/COAD_clinical_metadata.txt"),
join = "columns",
taxa = "../../data/RNAseq/bacteria/ComBat_plate_id/merged_unamb_score_norm/COAD_ComBat_corr_plate_id_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
cat_properties = c("gender", "bmi", "stage", "CMS", "history_of_other_malignancy", "side", "MSI_status", "CIMP_status", "history_colon_polyps"),
values_not_considered = list("unknown", "unknown", "unknown", c("unknown", "NOLBL"), c("unknown","inconsistency"), "unknown",
"unknown", "unknown", "unknown"),
cont_properties = c("percent_normal_cells", "age", "mutation_burden", "stemness",
"aneuploidy_score"),
new_property = list(c(old="plate_id", met="corr_plate_id", new_name="corr_plate_id")),
rotations_path="../../results/property_association/bacteria_species/ComBat_batch_corrected_plate_id/merged_unamb_score_norm/COAD/tables/COAD_ComBat_corr_plate_id_selectedTumor_rotations.txt",
pca_matrix_path = "../../results/property_association/bacteria_species/ComBat_batch_corrected_plate_id/merged_unamb_score_norm/COAD/tables/COAD_ComBat_corr_plate_id_selectedTumor_pca_tab.txt",
palette = c("default", "default", "default", "CMS", rep("default", 10))
),
output_file = "../../results/property_association/bacteria_species/ComBat_batch_corrected_plate_id/merged_unamb_score_norm/COAD/COAD_ComBat_plate_id_selectedTumor_property_association.html"
)
A list of all the tests used in this paper is in scripts/property_association/porperty_association_commands.R and can be run altogether:
../R-3.6.1/bin/Rscript scripts/property_association/property_association_commands.R
This step looks for associations between the reconstructed microbiome and the survival of patients.
Here we test if any PCs of COAD reconstructed microbiome is associated to the DFS of patients:
rmarkdown::render("scripts/survival_analysis/cox_analysis.Rmd",
params = list(
metadata = c("../../metadata/COAD/COAD_clinical_metadata.txt",
"../../results/property_association/bacteria_species/ComBat_batch_corrected_plate_id/merged_unamb_score_norm/COAD/tables/COAD_ComBat_corr_plate_id_selectedTumor_pca_tab.txt"),
join = c("columns"),
taxa = "../../data/RNAseq/bacteria/ComBat_plate_id/merged_unamb_score_norm/COAD_ComBat_corr_plate_id_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
surv_data = "../../metadata/COAD/COAD_cBioPortal_disease_free_survival_firehose.txt",
survival_analysis = c("DFS_YEARS", "patient_status", "DFS"),
categorical_covariates = "",
values_not_considered = "",
timerange_cat = "",
numeric_covariates = c("PC1", "PC2", "PC3", "PC4", "PC5", "PC6"),
timerange_cont = list(c(0, 5), c(0, 5), c(0, 5), c(0, 5), c(0, 5), c(0, 5))
),
output_file = "../../results/survival_analysis/COAD_selectedTumor_COX_DFS.html"
)
Since PC4 has been found associated to DFS in Cox proportional-hazard model test and other COAD tumour properties are associated to PC4, we check if PC4 or any of these properties are associated to the DFS of patients:
rmarkdown::render("scripts/survival_analysis/km_analysis.Rmd",
params = list(
metadata = c("../../metadata/COAD/COAD_clinical_metadata.txt",
"../../results/property_association/bacteria_species/ComBat_batch_corrected_plate_id/merged_unamb_score_norm/COAD/tables/COAD_ComBat_corr_plate_id_selectedTumor_pca_tab.txt"),
join = "columns",
taxa = "../../data/RNAseq/bacteria/ComBat_plate_id/merged_unamb_score_norm/COAD_ComBat_corr_plate_id_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
surv_data = "../../metadata/COAD/COAD_cBioPortal_disease_free_survival_firehose.txt",
survival_analysis = c("DFS_YEARS", "patient_status", "DFS"),
categorical_covariates = c("history_colon_polyps"),
values_not_considered = c("unknown"),
timerange_cat = list(c(0, 5)),
numeric_covariates = c("PC4", "age", "mutation_burden"),
timerange_cont = list(c(0, 5), c(0, 5), c(0, 5)),
break_val = 1
),
output_file = "../../results/survival_analysis/COAD_selectedTumor_KM_DFS_PC4.html"
)
A list of all the tests used in this paper is in scripts/survival_analysis/survival_analysis_commands.R and can be run altogether:
../R-3.6.1/bin/Rscript scripts/survival_analysis/survival_analysis_commands.R
This step selects the species that satisfy the criteria needed for either tissue specificity, not-batch association or most prevalence in tumour COAD reconstructed microbiomes.
rmarkdown::render("scripts/filters/filters.Rmd",
params = list(
metadata_tissue = c(COAD="../../metadata/COAD/COAD_technical_metadata.txt"),
metadata_comp = c(GBM="../../metadata/GBM/GBM_technical_metadata.txt",
LUAD="../../metadata/LUAD/LUAD_technical_metadata.txt",
LUSC="../../metadata/LUSC/LUSC_technical_metadata.txt",
HNSC="../../metadata/HNSC/HNSC_technical_metadata.txt",
OV="../../metadata/OV/OV_technical_metadata.txt",
SKCM="../../metadata/SKCM/SKCM_technical_metadata.txt",
BRCA="../../metadata/BRCA/BRCA_technical_metadata.txt"),
taxa_tissue = c(COAD="../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/COAD/COAD_selectedTumor_bacteria_species_merged_unamb_score_norm.txt"),
taxa_comp = c(GBM="../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/GBM/GBM_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
LUAD="../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/LUAD/LUAD_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
LUSC="../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/LUSC/LUSC_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
HNSC="../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/HNSC/HNSC_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
OV="../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/OV/OV_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
SKCM="../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/SKCM/SKCM_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
BRCA="../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/BRCA/BRCA_selectedTumor_bacteria_species_merged_unamb_score_norm.txt"),
feat_tissue = "Project_ID",
batch_feat = "plate_id",
output_folder = "../../results/filters/",
thr_presence = 0.1
),
output_file = "../../results/filters/COAD_selectedTumor_vs_GBM_LUAD_LUSC_HNSC_OV_SKCM_BRCA_selectedTumor_filters.html"
)
A list of all the tests used in this paper is in scripts/survival_analysis/survival_analysis_commands.R and can be run altogether:
../R-3.6.1/bin/Rscript scripts/filters/filters_commands.R
This step identifies the species that are related to the tumour properties previously detected as associated to tumour composition of COAD samples.
rmarkdown::render("scripts/identification_related_species/taxa_compositions.Rmd",
params = list(
metadata = c("../../metadata/COAD/COAD_clinical_metadata.txt", "../../metadata/COAD/COAD_technical_metadata.txt", "../../metadata/COAD/COAD_immuneInfiltrationRelative_pbelow05_metadata.txt"),
join_by = "columns",
new_property = list(c(old="plate_id", met="corr_plate_id", new_name="corr_plate_id")),
taxa = "../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/COAD/COAD_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
cat_properties = c("CMS", "side", "MSI_status", "bin_Mast_cells_resting", "bin_Mast_cells_activated", "bin_Eosinophils",
"bin_Dendritic_cells_resting", "bin_Macrophages_M2", "bin_T_cells_regulatory_Tregs", "bin_T_cells_CD4_memory_resting",
"bin_B_cells_memory"),
cat_correction = rep("corr_plate_id", 11),
cont_properties = "",
cont_correction = "",
values_not_considered = list(c("unknown", "NOLBL"), "unknown", "unknown", "unknown", "unknown", "unknown", "unknown", "unknown", "unknown",
"unknown", "unknown"),
taxa_selection = c("../../results/filters/Presence_more0.1samples_COAD.txt",
"../../results/filters/HighMeanVsRest_COAD_vs_GBM_LUAD_LUSC_HNSC_OV_SKCM_BRCA.txt"),
taxa_selection_approach="intersect",
palette = c("CMS", "Left_right", "MSI", "jama", "nejm", "default", "default", "default", "default", "default", "default"),
table_path = "../../results/identification_related_species/tables/COAD_selectedTumor_3filters"
),
output_file = "../../results/identification_related_species/COAD_selectTumor_3filters_bacteria_species_compositions.html"
)
A list of all the analyses is in scripts/identification_related_species/identification_related_species_commands.R and scripts/identification_related_species/identification_related_species_wilc_commands.R . They can be run altogether:
../R-3.6.1/bin/Rscript scripts/identification_related_species/identification_related_species_commands.R
../R-3.6.1/bin/Rscript scripts/identification_related_species/identification_related_species_wilc_commands.R
This step classifies the sample properties given their reconstructed microbiome:
rmarkdown::render("scripts/ml/ml_lasso_classifier.Rmd",
params = list(
metadata = "../../metadata/COAD/COAD_clinical_metadata.txt",
new_property = "",
taxa = "../../data/RNAseq/bacteria/ComBat_plate_id/merged_unamb_score_norm/COAD_ComBat_corr_plate_id_selectedCoupledTumorNormal_bacteria_species_merged_unamb_score_norm.txt",
match_metadata_to_taxa = "file_id",
cat_properties = "sample_type",
values_not_considered = list("unknown"),
mutate_cat_properties = list(adj_sample_type=list("Primary Tumor"="", "Solid Tissue Normal"="")),
paired = "patient_id",
filter = c("sd", 200),
total_taxa = "../../data/all_bacteria_species.txt"
),
output_file = "../../results/ml/COAD_CoupledTumorNormal_ml_lasso200.html"
)
A list of all the analyses is in scripts/ml/ml_lasso_classifier_commands.R and can be run altogether:
../R-3.6.1/bin/Rscript scripts/ml/ml_lasso_classifier_commands.R
To detect which microbial pathways are present in the analysed samples, we used the tool HUMAnN 3.0, from Huttenhower lab. Before running the tool, we merged together the samples we are interested in (left or right side of the colon or their random subsets). Here we show an example of the already presented toy samples from Zmora et al. In the first step we convert the bam files from Pathseq to fastq files with samtools:
samtools bam2fq data/RNAseq/pathseq_output/ERR2756905_out/bam_out.bam > data/RNAseq/humann_output/example/ERR2756905_out/fastq_out.fastq
samtools bam2fq data/RNAseq/pathseq_output/ERR2756906_out/bam_out.bam > data/RNAseq/humann_output/example/ERR2756906_out/fastq_out.fastq
samtools bam2fq data/RNAseq/pathseq_output/ERR2756907_out/bam_out.bam > data/RNAseq/humann_output/example/ERR2756907_out/fastq_out.fastq
cat data/RNAseq/humann_output/example/ERR2756905_out/fastq_out.fastq data/RNAseq/humann_output/example/ERR2756906_out/fastq_out.fastq data/RNAseq/humann_output/example/ERR2756907_out/fastq_out.fastq > data/RNAseq/humann_output/example/merged_fastq_out.fastq
The tool step was run on the IEO cluster using local computer of 8 cores, 25 GB of ram. Since the reference databases are big (several GBs) and the setting up of HUMAnN tool is behind the scope of this paper, we provide the command we used and the required HUMAnN output tables. Please refer to the HUMAnN tutorial for setting up and running the tool. This is the command used to run it on the cluster:
humann --input data/RNAseq/humann_output/example/merged_fastq_out.fastq --output data/RNAseq/humann_output/example
The second step of HUMAnN 3.0 tool involves the split of stratified and unstratified pathways (not stratified by microbes) and the normalisation of the data from RPKs to CPM:
# left
humann_split_stratified_table --input data/RNAseq/humann_output/left/COAD_selectedTumor_left_pathabundance.tsv --output data/RNAseq/humann_output/left/
humann_renorm_table --input data/RNAseq/humann_output/left/COAD_selectedTumor_left_pathabundance_unstratified.tsv \
--output data/RNAseq/humann_output/left/COAD_selectedTumor_left_pathabundance_unstratified_cpm.tsv \
--units cpm \
--update-snames
# right
humann_split_stratified_table --input data/RNAseq/humann_output/right/COAD_selectedTumor_right_pathabundance.tsv --output data/RNAseq/humann_output/right/
humann_renorm_table --input data/RNAseq/humann_output/right/COAD_selectedTumor_right_pathabundance_unstratified.tsv \
--output data/RNAseq/humann_output/right/COAD_selectedTumor_right_pathabundance_unstratified_cpm.tsv \
--units cpm \
--update-snames
A list of all the analyses is in scripts/pathway_analysis/samples_management_commands.sh and can be run altogether:
# Linux/Ubuntu users
./scripts/pathway_analysis/samples_management_commands.sh
# Windows users
sed -i -e 's/\r$//' ./scripts/pathway_analysis/samples_management_commands.sh
.scripts/pathway_analysis/samples_management_commands.sh
The selection of the subsets of samples to be merged has been done with this script: :warning: You must run this step from R
rmarkdown::render("scripts/pathway_analysis/sample_bootstrapping.Rmd",
params=list(
metadata = c("../../metadata/COAD/COAD_RNAseq_barcodes.txt", "../../metadata/COAD/COAD_clinical_metadata.txt", "../../metadata/COAD/COAD_technical_metadata.txt"),
match_metadata = "file_id",
properties = c("sample_type", "is_ffpe", "library_name", "side"),
selection = list("Primary Tumor", "NO", c("Illumina TruSeq", "unknown"), "left"),
column_selected = "file_id",
taxa_tab = "../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/COAD/COAD_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
match_taxa = "rownames",
subset_size = 35,
random_tries = 50,
output = "../../results/pathway_analysis/bootstrapped_samples/COAD_selectedTumor_left"
)
)
The scripts to obtain all the subsets of samples are in:
../R-3.6.1/bin/Rscript scripts/pathway_analysis/boothstrapping_commands.R
The workflow listed in the general workflow step is applied to each subset of samples: for privacy reasons, the fastq files of the COAD samples are not reported, while the output table of the runs of HUMAnN 3.0 on the bootstrapped sets of samples is in data/RNAseq/humann_output . The list of the script to obtain unstratified and normalized tables are in:
# Linux/Ubuntu users
./scripts/pathway_analysis/boothstrapped_samples_management_commands.sh
# Windows users
sed -i -e 's/\r$//' ./scripts/pathway_analysis/boothstrapped_samples_management_commands.sh
To see the distributions of the pathways from the bootstrapped samples, we ran:
rmarkdown::render("scripts/pathway_analysis/pathway_analysis.Rmd",
params=list(
pathways1 = c(left="../../data/RNAseq/humann_output/left/COAD_selectedTumor_left_pathabundance_unstratified_cpm.tsv"),
pathways2 = c(right="../../data/RNAseq/humann_output/right/COAD_selectedTumor_right_pathabundance_unstratified_cpm.tsv"),
random_pathways1 = "../../data/RNAseq/humann_output/left_random/",
base_name1 = "COAD_selectedTumor_left_30random",
random_pathways2 = "../../data/RNAseq/humann_output/right_random/",
base_name2 = "COAD_selectedTumor_right_30random",
output = "../../results/pathway_analysis/COAD_side_table.tsv"
),
output_file = "../../results/pathway_analysis/COAD_side_bootstrapping.html"
)
Comparison of FISH quantifications of Akkermansia muciniphila and Faecalibacterium prausnitzii with their estimation from human RNA-seq data from this workflow
rmarkdown::render("scripts/comparison/continuous_metadata_analysis.Rmd",
params=list(
metadata1 = c(FISH="../../metadata/IEO/IEO_FISH_metadata.txt"),
metadata2 = c(RNAseq="../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/IEO/IEO_selectedTumor_bacteria_species_merged_unamb_score_norm.txt"),
property1 = c("ratio_akk_eub_dapi", "ratio_praus_eub_dapi"),
property2 = c("239935", "853"),
taxa_tab = c("../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/IEO/IEO_selectedTumor_bacteria_species_merged_unamb_score_norm.txt",
"../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/IEO/IEO_selectedTumor_bacteria_species_merged_unamb_score_norm.txt"
),
output = "../../results/comparison/IEO_correlations_FISH_RNAseq_akk_faec.txt"
),
output_file = "../../results/comparison/IEO_correlations_FISH_RNAseq_akk_faec.html"
)
Comparison of 16S quantifications with human RNA-seq estimation from this workflow
rmarkdown::render("scripts/comparison/continuous_metadata_analysis_overview.Rmd",
params = list(
metadata1 = c("../../metadata/IEO/IEO_16S_metadata.txt"),
name1 = "16S",
metadata2 = "../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/IEO/IEO_bacteria_genus_merged_unamb_score_norm.txt",
name2 = "RNAseq",
taxa_tab = list("../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/IEO/IEO_selectedTumorNormal_bacteria_species_merged_unamb_score_norm.txt",
"../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/IEO/IEO_selectedTumorNormal_bacteria_species_merged_unamb_score_norm.txt"
),
property1 = c("intersection"),
property2 = c("intersection"),
thrs1 = c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6),
thrs2 = c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6),
palette="viridis7",
output = "../../results/comparison/IEO_correlations_16S_RNAseq_"
),
output_file = "../../results/comparison/IEO_correlations_16S_RNAseq.html"
)
Select the genera with the highest correlation
genus_tab=read.csv("results/comparison/IEO_correlations_16S_RNAseq_spearman.txt", sep="\t", header=TRUE, check.names=FALSE, stringsAsFactors=FALSE)
all_genera=read.csv("data/all_bacteria_genus.txt", sep="\t", header=TRUE, check.names=FALSE, stringsAsFactors=FALSE)
genera_names=sapply(genus_tab[,"tax_id"], function(x){
p=which(x==all_genera[,"tax_id"])
if(length(p)>0){
return(all_genera[p,"taxon_name"])
} else {
return(NA)
}
}
)
genus_tab=data.frame(genus_tab, taxon_name=genera_names)
pos=which(!is.na(genus_tab[,"R_spearm"]) & genus_tab[,"R_spearm"]>0.25)
res=genus_tab[pos,c("tax_id", "taxon_name")]
write.table(res, file="results/comparison/top_spearman_r_over0.25_16SvsRNAseq.txt", quote=FALSE, sep="\t", row.names=FALSE, col.names=TRUE)
Test property association of the highest correlated genera
rmarkdown::render("scripts/property_association/diversity.Rmd",
params = list(
metadata = c("../../metadata/COAD/COAD_technical_metadata.txt", "../../metadata/COAD/COAD_clinical_metadata.txt"),
join = "columns",
taxa = c("../../data/RNAseq/bacteria/raw/merged_unamb_score_norm/COAD/COAD_selectedTumor_bacteria_genus_merged_unamb_score_norm.txt"),
cat_properties = c("gender", "bmi", "stage", "CMS", "history_of_other_malignancy", "side", "MSI_status", "CIMP_status", "history_colon_polyps"),
values_not_considered = list("unknown", "unknown", "unknown", c("unknown", "NOLBL"), c("unknown","inconsistency"), "unknown", "unknown", "unknown", "unknown"),
new_property = list(c(old="plate_id", met="corr_plate_id", new_name="corr_plate_id")),
cont_properties = c("percent_normal_cells", "age", "mutation_burden", "stemness", "aneuploidy_score"),
taxa_selection = c("../../../microbiome_reconstruction/results/comparison/top_spearman_r_over0.25_16SvsRNAseq.txt"),
taxa_selection_approach=c("all"),
total_taxa = "../../data/all_bacteria_genus.txt"
),
output_file = "../../results/property_association/bacteria_genus/raw/merged_unamb_score_norm/COAD/COAD_selectedTumor_property_association.html"
)
CIBERSORTx requires gene expression to estimate immune infiltration. We upload TPM tables to impute cell fraction, used the provided LM22 signature, enabled batch correction B-mode, disabled quantile normalization (as suggested by the authors) and run 1000 permutations.
To download FPKM values we used the GDC Data Transfer Tool using the provided manifest, which download the FPKM values of all the cancer types.
mkdir data/RNAseq/FPKM/tcga/COAD/
../bin/gdc-client download -m data/RNAseq/FPKM/tcga/COAD_fpkm_manifest.tsv -d data/RNAseq/FPKM/tcga/COAD/
gzip -d data/RNAseq/FPKM/tcga/COAD/*/*.gz
The scripts to obtain all the FPKM tables of the cancer types are in:
./scripts/gene_expression/gene_expression_data_download.sh
To create the TPM tables per TCGA cancer type to be uploaded to CIBERSORtx, in R:
rmarkdown::render("scripts/gene_expression/from_FPKM_to_TPM.Rmd",
params=list(
manifest = "../../data/RNAseq/FPKM/tcga/COAD_fpkm_manifest.tsv",
dir = "../../data/RNAseq/FPKM/tcga/COAD/",
converter_tab = "../../data/RNAseq/FPKM/tcga/gene_annotation_v22_gene_length.txt",
output_fpkm = c("../../data/RNAseq/FPKM/tcga/"),
output_tpm = c("../../data/RNAseq/TPM/tcga/")
),
output_file = "../../data/RNAseq/TPM/tpm.html"
)
The scripts to obtain all the TPM tables of the cancer types are in:
../R-3.6.1/bin/Rscript scripts/gene_expression/TPM_conversion_commands.R