chewBBACA_tutorial: An HTML repository from wangmz0617

Objective

The objective of this tutorial is to illustrate the complete workflow of a chewBBACA pipeline for creating a wgMLST and a cgMLST schema for a colection of 714 Streptococcus agalactiae genomes (32 complete genomes and 682 assemblies deposited on the Sequence Read Archive) by providing step-by-step instructions and displaying the obtained outputs.

All information about NCBI genomes used in this example is on the .tsv file. inside the genomes folder.

The setup is done by the following steps

Install chewBBACA and download the training files repository. Check the installation guide.
do a git clone https://github.com/B-UMMI/chewBBACA_tutorial to download the entire repository in the folder of your choice
cd chewBBACA_tutorial/ to enter the dir
unzip genomes/complete_genomes.zip to extract all the complete genomes files. a directory named complete_genomes will be available at chewBBACA tutorial/

The reported times using the 32 complete genomes were calculated for a Dual QuadCore laptop with intel i7 2.4GHz using 6 cores. Using slower machines or using less number of CPUs can greatly increase the duration of the analyses.

Schema creation

The wgMLST schema was created using the 32 Streptococcus agalactiae complete genomes (32 genomes with a level of assembly classified as complete genome or chromossome) available at NCBI. The sequences are present in the complete_genomes/ folder. The command is the following:

chewBBACA.py CreateSchema -i complete_genomes/ --cpu 6 -o schema_seed --ptf Streptococcus_agalactiae.trn

The command uses 6 CPU and outputs the schema to schema_seed folder using the prodigal training set for Streptococcus agalactiae and took around 16 minutes to complete resulting on a wgMLST schema wiht 3126 loci. At this point the schema is defined as a set of loci each with a single allele.

Allele calling

The next step was to perform allele calling with the created wgMLST schema for the 32 complete genomes.

chewBBACA.py AlleleCall -i complete_genomes/ -g schema_seed/ -o results_cg --cpu 6 --ptf Streptococcus_agalactiae.trn

The allele call used the default BSR threshold of 0.6 (more information on the thresold here) and took approximately 12 mins to complete (an average of 22 secs per genome)

Paralog detection

The next step in the analysis is to determine if some of the loci can be considered paralogs, based on the result of the wgMLST allele calling. The Allele call returns a list of Paralogous genes in the RepeatedLoci.txt file that can be found on the results_cg folder. The output example is present in chewBBACA_tutorial/results_cg/results_20180202T104252/. In chewBBACA_tutorial/results_cg/results_20180202T104252/RepeatedLoci.txt, a set of 27 loci were identified as possible paralogs that were removed from further analysis. For a more detailed description see the Alelle Calling entry on the wiki.

chewBBACA.py RemoveGenes -i results_alleles.tsv -g RepeatedLoci.txt -o alleleCallMatrix_cg

A set of 1133 loci were found to be present in all the analyzed complete genomes, while 1265 loci were present in at least 95%.

chewBBACA.py TestGenomeQuality -i alleleCallMatrix_cg.tsv -n 13 -t 200 -s 5

larger image fig 1 or see interactive plot online

For further analysis only the 1265 loci that we chose to represent the cgMLST schema will be used. The list can be retrieved from the analysis_cg/Genes_95%.txt that TestGenomeQuality creates. There you can find the list of loci present in 95% of the strains per threshold, in this case we will use any threshold from 60 to 195 since the number of loci . You can see the list file with 1265 loci at analysis_cg/listgenes_core.txt and for further use you should add for each loci the full path for each locus fasta file.

Allele call for 682 Streptococcus agalactiae assemblies

682 assemblies of Streptococcus agalactiae available on NCBI were downloaded ( 03-08-2016, downloadable zip file here) and analyzed with MLST in order to exclude possibly mislabeled samples as Streptococcus agalactiae.

Out of the 682 genomes, 2 (GCA_000323065.2_ASM32306v2 and GCA_001017915.1_ASM101791v1) were detected as being of a different species/contamination and removed from the analysis. Allele call was performed on the bona fide Streptococcus agalactiae 680 genomes using the 1265 loci for schema validation. Paralog detection found no paralog loci.

chewBBACA.py AlleleCall -i .genomes/ -g listgenes_core.txt -o results --cpu 6 --ptf Streptococcus_agalactiae.trn

Run on the same laptop with 6 cpu took approximately 30 mins to complete (an average of 2.6 secs per genome)

Since our initial 32 genomes allele call was performed with the wgMLST schema we need to remove the loci that constitute the auxiliary genome, in order to be able to compare with the cgMLST allele call we performed for the 680 genomes. We select the alleleCallMatrix_cg.tsv file we previously created (already paralog free but still a wgMLST profile) and we extract only the loci present in 95% of the matrix.

chewBBACA.py ExtractCgMLST -i alleleCallMatrix_cg.tsv -o cgMLST_completegenomes -p 0.95

Now the file cgMLST_completegenomes/cgMLST.tsv can be concatenated with the allele call result from the 680 genomes results_all/results_20180202T112710/results_alleles.tsv. The concatenated file can be found at analysis_all/cgMLST_all.tsv.

The new concatenated file was analyzed in order to assess the cgMLST allele quality attribution for all the genomes.

chewBBACA.py TestGenomeQuality -i cgMLST_all.tsv -n 13 -t 300 -s 5

larger image here fig 2 or see interactive plot online

While the number of loci present in 95% of genomes remains virtually constant at around 1200 loci, considering all or most of the genomes (90%<x≤100%) the number of loci present is lower and presents some variation when specific genomes are removed from the analysis.

We selected the results at the threshold of 25 for further analysis. Although this selection is somewhat arbitrary, when moving to a lower threshold there is step increase in the number of loci present in 95% and 99% of genomes that could represent the exclusion of a more divergent clade from the analysis. Furthermore, at the threshold of 25 there is an acceptable number of loci present in all considered genomes (650 genomes/440 loci), which we felt would afford a good discriminatory power.

The genomes that were removed at each threshold are indicated in the file analysis_all/removedGenomes.txt and a file analysis_all/removedGenomes_25.txt was created with only the genomes removed at the 25 threshold.

The following command creates a directory analysis_all/cgMLST_25/ and saves the cgMLST schema selected at the chosen threshold to the file cgMLST.tsv.

chewBBACA.py ExtractCgMLST -i cgMLST_all.tsv -o cgMLST_25 -g removedGenomes_25.txt

Minimum Spanning Tree

analysis_all/cgMLST_25/cgMLST.tsv was uploaded to Phyloviz online and can be accessed here

Genome Quality analysis

Since the quality of the used assemblies was not confirmed, it is possible that some of the assemblies included were of low quality. A general analysis of the assemblies show a N50 variation that ranges from 8055 to over 2.2M, while the number of contigs ranges between 1 and 553. These results made us suspect that the quality of the genomes could have affected the allele call results and consequently caused a significant drop in the number of loci detected as present in all genomes.

As stated previously, to obtain the cgMLST schema, some genomes (n=62) had to be removed since they were extremes cases of missing data. In order to assess the possible reason for their poor allele call performance, two plots were built. The removed genomes were then highlighted and dashed lines were drawn linking the values for the same genomes.

The first plot represents the total number of bp in contigs with a size >10 kbp and the N50 of the assemblies, sorted by decreasing values.

larger image fig 3

The second plot represents the total number of contigs and the number of contigs >10 kbp

larger image fig 4

See interactive plot online

At first sight, most of the removed genomes (56/62) were located on the lower range of N50 and bp in contigs >10 kbp (fig.3) and the higher number of contigs (fig.4)

The 5 genomes that were outside this pattern were individually checked :

GCA_000186445.1 here - 21 contigs but only 1 is above 10k (Scaffold with lot of Ns, 134 real contigs)
GCA_000221325.2 here- NCBI curated it out of RefSeq because it had a genome length too large
GCA_000427055.1 here- NCBI curated it out of RefSeq because it had many frameshifted proteins
GCA_000289455.1 here- No ST found. We concluded the assembly has a problem but we have not yet identified it.
GCA_000288835.1 here- NCBI curated it out of RefSeq because it had many frameshifted proteins

Schema Evaluation

Schema Evaluator was run on the cgMLST schema:

chewBBACA.py SchemaEvaluator -i schema_seed/ -l rms/RmS.html -ta 11 --title "cgMLST schema GBS tutorial schema evaluator" --cpu 6

See the schema evaluator page here

wangmz0617/chewBBACA_tutorial