adhesiomeR-paper

01-Filter_RefSeq.R

This script was used to filter downloaded genomes based on associated metadata.

Data files necessary to run the scripts are available here.

02-Adhesin_sequence_comparison.R

Analyzes an all-vs-all BLAST comparison of adhesin sequences and generates plots from it.

File with adhesin information necessary to run the script: Adhesins.xlsx. Directory with the BLAST results and plots we obtained: BLAST.

03-Initial_pathotypes_analysis.R

Initial analysis of pathotyped genomes with an older version of adhesiomeR (commit 0f675a9), which should correspond to the current relaxed version with identity and coverage thresholds set to 80%. You can download the result files here.
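
The 80% identity and coverage thresholds mentioned above can be illustrated with a minimal base R sketch. The `hits` table, its column names, and the `filter_hits` helper are hypothetical and are not taken from adhesiomeR, which may define coverage differently.

```r
# Hypothetical BLAST hits; all values and column names are assumptions
# made for illustration only.
hits <- data.frame(
  gene           = c("fimH", "papG", "eae", "csgA"),
  pident         = c(98.5, 79.0, 85.2, 91.0),  # percent identity
  aln_length     = c(290, 300, 900, 100),      # alignment length
  subject_length = c(300, 336, 939, 151),      # reference sequence length
  stringsAsFactors = FALSE
)

# Keep hits whose identity and reference coverage both reach the thresholds.
filter_hits <- function(hits, min_identity = 80, min_coverage = 80) {
  coverage <- 100 * hits$aln_length / hits$subject_length
  hits[hits$pident >= min_identity & coverage >= min_coverage, , drop = FALSE]
}

relaxed <- filter_hits(hits)
print(relaxed$gene)
```

Here papG is dropped for low identity and csgA for low coverage, so only fimH and eae remain.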

04-Fimbrial_adhesins_analysis.R

This script analyzes fimbrial adhesins and their gene co-localization to select the reference set used for calculating bit score thresholds. It uses the results linked in the previous step. You can download all the results we obtained here. Note that this directory contains many large files.

04-Intimin_analysis.R

Analysis of intimin variants to select one representative sequence for the database. This script runs BLAST on the pathotyped genome collection. You can download our results and the source sequences for intimin here.

04-Nonfimbrial_adhesins_analysis.R

Analysis of the genomic context of nonfimbrial adhesins to select sequences for the reference set. The analyses were run multiple times; you can download all our results here. Note that this directory contains many large files.

05-Bitscore_thresholds.R

Filters the results from step 4 to obtain reference sequences and calculates bit score thresholds (Table S5). Files necessary to run this script are linked in the previous steps.

06-Pathotypes_analysis.R

Analysis of pathotyped genomes with the final version of adhesiomeR using both types of search settings. You can download our results for the strict version and the relaxed version.

07-Selecting_k_clara.R

Selects the optimal number of clusters for all adhesins according to the gap statistic.

07-Selecting_k_clara_fimbrial.R

Selects the optimal number of clusters for fimbrial adhesins according to the gap statistic.

07-Selecting_k_clara_nonfimbrial.R

Selects the optimal number of clusters for nonfimbrial adhesins according to the gap statistic.
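
As a rough illustration of choosing the number of clusters with the gap statistic, the sketch below uses `clusGap`, `clara` and `maxSE` from the `cluster` package. The simulated presence/absence matrix, the `clara_k` wrapper, and the settings (`K.max`, `B`, the `firstSEmax` rule) are assumptions made for this example and may differ from what the scripts actually use.

```r
library(cluster)

set.seed(1)
# Simulated presence/absence matrix (assumption): 120 genomes x 12 genes,
# drawn from two groups with different gene frequencies.
group <- rep(1:2, each = 60)
probs <- rbind(rep(c(0.9, 0.1), each = 6), rep(c(0.1, 0.9), each = 6))
x <- t(sapply(group, function(g) rbinom(12, 1, probs[g, ])))

# clusGap expects a function returning a list with a `cluster` component,
# so the CLARA clustering vector is wrapped accordingly.
clara_k <- function(x, k) list(cluster = clara(x, k, samples = 10)$clustering)

gap <- clusGap(x, FUNcluster = clara_k, K.max = 6, B = 20, verbose = FALSE)

# Pick k with one of the standard rules implemented in maxSE().
k_opt <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
print(k_opt)
```

A larger `B` gives more stable gap estimates at the cost of run time.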

08-Profile_analysis_and_initial_clustering.R

This script generates adhesin profiles and performs initial clustering using the optimal numbers of clusters selected in the previous step. It generates HTML reports with an overview of each clustering (based on Clara_clustering.Rmd). The clustering data generated by this script are available for download here.
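
One way to picture an adhesin profile is as the distinct presence/absence pattern shared by one or more genomes. The base R sketch below works under that assumption; the matrix `presence` and the gene and genome names are hypothetical, and the script itself may build profiles differently.

```r
# Hypothetical gene presence/absence matrix (rows: genomes, columns: genes).
presence <- matrix(
  c(1, 0, 1,
    1, 0, 1,
    0, 1, 1,
    1, 0, 1,
    0, 1, 1,
    0, 0, 0),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("genome", 1:6), c("fimH", "papG", "csgA"))
)

# Encode each genome's pattern as a string key, e.g. "101".
profile_key <- apply(presence, 1, paste, collapse = "")

# Unique profiles, and the number of genomes sharing each of them.
profiles <- presence[!duplicated(profile_key), , drop = FALSE]
profile_counts <- as.data.frame(table(profile = profile_key),
                                stringsAsFactors = FALSE)
print(profile_counts)
```

Clustering the unique profiles rather than all genomes keeps large groups of identical genomes from dominating the result, although whether the script does this is not stated here.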

09-Clustering.R

Performs final clustering, assigns names to clusters, calculates gene importance for each clustering, and generates clustering plots showing the fractions of pathotypes and the numbers of profiles and genomes assigned to each cluster. The file with clustering data used in this script is obtained and linked in the previous step.

10-Benchmark.R

Compares adhesiomeR results with the experimental annotations of ETEC strains from Von Mentzer et al. and generates figures. Files necessary to run this analysis: ETEC metadata, adhesiomeR results.
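
A comparison of predictions with experimental annotations can be summarized, for instance, by sensitivity and specificity. The base R sketch below uses made-up values for a single hypothetical strain. The vectors `predicted` and `annotated` are assumptions, and the metrics shown are not necessarily the ones reported by the script.

```r
# Hypothetical presence calls for five colonization factors in one strain.
factors   <- c("CFA_I", "CS1", "CS3", "CS6", "CS21")
predicted <- setNames(c(TRUE, FALSE, TRUE, TRUE, FALSE), factors)  # tool output
annotated <- setNames(c(TRUE, FALSE, TRUE, FALSE, TRUE), factors)  # experiment

# Confusion counts.
tp <- sum(predicted & annotated)
fp <- sum(predicted & !annotated)
fn <- sum(!predicted & annotated)
tn <- sum(!predicted & !annotated)

sensitivity <- tp / (tp + fn)  # fraction of annotated factors recovered
specificity <- tn / (tn + fp)  # fraction of absent factors called absent
print(c(sensitivity = sensitivity, specificity = specificity))
```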