This tutorial describes a complete walkthrough for CONCOCT2 and DESMAN on the STAMPS servers using 4 threads. Some of the pre-processing steps are time consuming and not that informative; these have been pre-run and you will just copy across the output. We will begin by constructing a new working directory in our home directories:
cd ~
mkdir CDTutorial
cd CDTutorial
The tutorial data consists of 32 synthetic samples, each of 500,000 2x150 bp paired-end reads, taken from 15 genomes belonging to 8 species. Copy across the simulation config file just to have a look at it with the less command:
cp /class/stamps-shared/CDTutorial/genome_coverage.tsv .
less genome_coverage.tsv
This has the format "Sample\tSpecies\tStrain\tCoverage\tRelative frequency". We will now use CONCOCT/DESMAN to resolve these de novo.
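As a quick sanity check you can count the distinct samples and strains in the file (a sketch assuming the tab-separated columns above; if the file has a header row, subtract one from each count):
cut -f1 genome_coverage.tsv | sort -u | wc -l
cut -f3 genome_coverage.tsv | sort -u | wc -l
These should come out at 32 samples and 15 strains respectively.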
The simulated reads and genomes are in the tar archives /class/stamps-shared/CDTutorial/Reads.tar.gz and /class/stamps-shared/CDTutorial/Genomes.tar.gz respectively. Please do not run the commands below. They indicate what you would do if you wanted to run the preprocessing steps yourself.
cp /class/stamps-shared/CDTutorial/Reads.tar.gz .
tar -xvzf Reads.tar.gz
cp /class/stamps-shared/CDTutorial/Genomes.tar.gz .
tar -xvzf Genomes.tar.gz
do not run the above
We need to load up the concoct and desman modules:
module load concoct
source /class/stamps-software/concoct_speedup_mp/bin/activate
module load desman
source /class/stamps-software/desman/bin/activate
We now describe how to perform a complete analysis, binning and resolving strains in this synthetic community. Some of these steps are time consuming, so we recommend that you copy across the output instead. This assumes that DESMAN and CONCOCT are installed and their paths set to the variables DESMAN and CONCOCT respectively, e.g. (changing paths to your system set-up):
export CONCOCT=/automounts/classfs/classfs/stamps-software/concoct_speedup_mp/
export CONCOCT_DATA=/class/stamps-shared/CDTutorial/CONCOCT/
export DESMAN=/automounts/classfs/classfs/stamps-software/desman/
export CDSCRIPTS=/class/stamps-shared/CDTutorial/scripts
We will also create a new variable pointing to our current working directory for all this analysis:
export METASIMWD=~/CDTutorial
The first step in the analysis is to assemble the reads.
I previously co-assembled the reads using MEGAHIT with default parameters:
module load megahit/1.0.6
ls Reads/*r1*gz | tr "\n" "," | sed 's/,$//' > r1.csv
ls Reads/*r2*gz | tr "\n" "," | sed 's/,$//' > r2.csv
megahit -1 $(<r1.csv) -2 $(<r2.csv) -t 4 -o Assembly
This would take approximately 20 minutes. Please do not run the above now; instead copy across the results that I generated earlier:
cp /class/stamps-shared/CDTutorial/Assembly.tar.gz .
tar -xvzf Assembly.tar.gz
Evaluate the assembly quality with the script provided:
$CDSCRIPTS/contig-stats.pl < Assembly/final.contigs.fa
This results in the output:
sequence #: 13863 total length: 31472365 max length: 1015941 N50: 9774 N90: 706
Do you think this is a good co-assembly? (Recall that the N50 is the length L such that contigs of length at least L account for half the total assembly length.)
The next step in CONCOCT is to cut up contigs and map the reads back onto them. Again do not run any of the following:
cd Assembly
python $CONCOCT/scripts/cut_up_fasta.py -c 10000 -o 0 -m final.contigs.fa > final_contigs_c10K.fa
bwa index final_contigs_c10K.fa
cd ..
mkdir Map
for file in Reads/*.r1.fq.gz
do
stub=${file%.r1.fq.gz}
base=${stub##*/}
echo $base
bwa mem -t 4 Assembly/final_contigs_c10K.fa $file ${stub}.r2.fq.gz > Map/$base.sam
done
python $DESMAN/scripts/Lengths.py -i Assembly/final_contigs_c10K.fa | tr " " "\t" > Assembly/Lengths.txt
for file in Map/*.sam
do
stub=${file%.sam}
echo $stub
# convert SAM to BAM, keep only mapped reads, sort, then compute the per-contig coverage histogram
samtools view -h -b -S $file > ${stub}.bam
samtools view -b -F 4 ${stub}.bam > ${stub}.mapped.bam
samtools sort -m 1000000000 ${stub}.mapped.bam ${stub}.mapped.sorted
bedtools genomecov -ibam ${stub}.mapped.sorted.bam -g Assembly/Lengths.txt > ${stub}_cov.txt
done
do not run the above
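As an aside: if you had run the mapping yourself, samtools flagstat on one of the sorted BAMs would give a quick check of the mapping rate (the sample name here is hypothetical):
samtools flagstat Map/sample1.mapped.sorted.bam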
Collate coverages together (do not run):
for i in Map/*_cov.txt
do
echo $i
stub=${i%_cov.txt}
stub=${stub#Map\/}
echo $stub
awk -F"\t" '{l[$1]=l[$1]+($2 *$3);r[$1]=$4} END {for (i in l){print i","(l[i]/r[i])}}' $i > Map/${stub}_cov.csv
done
$DESMAN/scripts/Collate.pl Map > Coverage.csv
sed -i 's/\.\.\/CDTutorial\/Map\///g' Coverage.csv
do not run the above
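For reference, the awk one-liner above computes a length-weighted mean coverage: in the bedtools genomecov histogram, column 2 is a depth, column 3 is the number of bases at that depth and column 4 is the contig length, so mean coverage = sum(depth x bases)/length. An equivalent, more readable sketch (the input file name is hypothetical):
awk -F"\t" '{sum[$1] += $2*$3; len[$1] = $4} END {for (c in sum) print c","(sum[c]/len[c])}' Map/sample1_cov.txt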
Rather than running all of that, just copy across the resulting contig coverage file:
cp /class/stamps-shared/CDTutorial/Coverage.csv .
This just contains a header and a row for each contig giving its coverage (how is this defined?) in each sample. If you want to use CONCOCT with a mapper other than bwa, this is the input that is required.
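Have a quick look at the first few rows and columns (the cut is just to keep the output narrow):
head -n 5 Coverage.csv | cut -d"," -f1-6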
We are going to use a more efficient version of CONCOCT, CONCOCT2. I will not show you how to do the refinement step discussed in the presentation. If that interests you we can work through it in an open lab. Now we can run CONCOCT:
mkdir Concoct
mv Coverage.csv Concoct
cd Concoct
tr "," "\t" < Coverage.csv > Coverage.tsv
concoct --coverage_file Coverage.tsv --composition_file ../Assembly/final_contigs_c10K.fa -t 4
As the program runs it outputs the iteration number, the fit and the change in fit, e.g.:
137,-171867.297247,0.000349
138,-171867.297516,0.000269
139,-171867.297723,0.000207
140,-171867.297881,0.000158
141,-171867.298002,0.000121
142,-171867.298094,0.000092
CONCOCT Finished, the log shows how it went.
As the clustering converges the improvement in fit should go to zero, at which point the program terminates.
We can see the arguments that CONCOCT takes as follows:
concoct
Only two generally need adjusting: the number of initial clusters '-c', which defaults to 400 but should be at least twice the number of genomes in your assembly (see below for how to estimate that), and the number of threads '-t', which ensures maximum performance on a given machine.
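For example, knowing (as we establish below) that this community contains only about 8 species, we could have used fewer initial clusters, e.g. (a sketch; the default of 400 used above also works, just more slowly):
concoct --coverage_file Coverage.tsv --composition_file ../Assembly/final_contigs_c10K.fa -c 40 -t 4
The key output file is the cluster assignments: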
head -n 10 clustering_gt1000.csv
This gives for each contig the bin it is placed in as an integer index. To get the number of contigs in each of the 21 clusters that should result:
sed '1d' clustering_gt1000.csv | cut -d"," -f 2 | sort | uniq -c
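If you want to pull out the contig ids assigned to a particular bin, say bin 16, a sketch:
awk -F"," '$2==16 {print $1}' clustering_gt1000.csv | head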
Other files generated by the program are:
- original_data_gt1000.csv: the original data after normalisation
- PCA_components_data_gt1000.csv: the PCA components
- PCA_transformed_data_gt1000.csv: the data after PCA transformation
We can also visualise the clusters in PCA space; run:
Rscript $CDSCRIPTS/ClusterPlot2.R -c clustering_gt1000.csv -p PCA_transformed_data_gt1000.csv -o clustering_gt1000_PCA.pdf
and download the resulting PDF to view it.
To determine how good our binning is we need to annotate the contigs with clusters of orthologous genes (COGs). Find genes using prodigal:
cd ..
mkdir Annotate
cd Annotate/
python $DESMAN/scripts/LengthFilter.py ../Assembly/final_contigs_c10K.fa -m 1000 > final_contigs_gt1000_c10K.fa
prodigal -i final_contigs_gt1000_c10K.fa -a final_contigs_gt1000_c10K.faa -d final_contigs_gt1000_c10K.fna -f gff -p meta -o final_contigs_gt1000_c10K.gff
do not run the above
Assign COGs, changing the -c flag (which sets the number of parallel processes) as appropriate:
export COGSDB_DIR=/class/stamps-shared/CDTutorial/rpsblast_cog_db/
module load parallel/20170222
$CONCOCT/scripts/RPSBLAST.sh -f final_contigs_gt1000_c10K.faa -p -c 4 -r 1
do not run the above
We will also write out lists of the COGs and genes on each contig. These will be useful later:
python $DESMAN/scripts/ExtractCogs.py -b final_contigs_gt1000_c10K.out --cdd_cog_file $CONCOCT_DATA/scgs/cdd_to_cog.tsv -g final_contigs_gt1000_c10K.gff > final_contigs_gt1000_c10K.cogs
python $DESMAN/scripts/ExtractGenes.py -g final_contigs_gt1000_c10K.gff > final_contigs_gt1000_c10K.genes
do not run the above
We will also need contig lengths later on:
python $DESMAN/scripts/Lengths.py -i final_contigs_gt1000_c10K.fa > final_contigs_gt1000_c10K.len
do not run the above
Instead copy across the complete annotation directory:
cd $METASIMWD
cp /class/stamps-shared/CDTutorial/Annotate.tar.gz .
tar -xvzf Annotate.tar.gz
cd Annotate
The annotation files are just simple comma-separated lists; have a look at them:
head final_contigs_gt1000_c10K.cogs
head final_contigs_gt1000_c10K.genes
The formats are:
- final_contigs_gt1000_c10K.cogs: gene_id,cog_id,startpos,endpos,cogstart,cogend,strand
- final_contigs_gt1000_c10K.genes: gene_id,contig_id,startpos,endpos,strand
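As a quick check you can count the number of genes called and the number of distinct COGs hit:
wc -l final_contigs_gt1000_c10K.genes
cut -d"," -f2 final_contigs_gt1000_c10K.cogs | sort -u | wc -l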
We are going to determine the number of complete and pure genomes in our binning using single-copy core gene (SCG) frequencies. Before doing that it is useful to estimate how many genomes we think are in our assembly. We can get this from the assembly by simply calculating the median copy number of the single-copy core genes:
cd $METASIMWD
cp /class/stamps-shared/CDTutorial/scgs.txt .
$CDSCRIPTS/CountSCGs.pl scgs.txt < Annotate/final_contigs_gt1000_c10K.cogs | cut -d"," -f2 | awk -f $CDSCRIPTS/median.awk
This should give 8, which is reassuring since that matches our species number. Now we calculate SCG frequencies on the CONCOCT clusters:
cd Concoct
python $CONCOCT/scripts/COG_table.py -b ../Annotate/final_contigs_gt1000_c10K.out -m $CONCOCT_DATA/scgs/scg_cogs_min0.97_max1.03_unique_genera.txt -c clustering_gt1000.csv --cdd_cog_file $CONCOCT_DATA/scgs/cdd_to_cog.tsv > clustering_gt1000_scg.tsv
This produces a tab separated file of SCG frequencies, which we can then visualise in R:
Rscript $CONCOCT/scripts/COGPlot.R -s clustering_gt1000_scg.tsv -o clustering_gt1000_scg.pdf
You may need to download this PDF off the class servers to visualise it. We should see 8 nearly complete bins and some fragmentary ones.
We want to store a list of the clusters that are 75% complete and pure for the DESMAN analysis:
python $CDSCRIPTS/CompleteClusters.py clustering_gt1000_scg.tsv > Cluster75.txt
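Have a look at which clusters passed:
cat Cluster75.txt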
In real studies it is important to know the coverage of each bin in each sample. This can then be transformed into relative copy numbers for correlation with meta-data:
python $CDSCRIPTS/ClusterMeanCov.py Coverage.csv clustering_gt1000.csv ../Assembly/final_contigs_c10K.fa > Cluster_Cov.csv
We can quickly use R to sum up the cluster coverages:
R
>Cluster_Cov <- read.csv("Cluster_Cov.csv",header=TRUE,row.names=1)
>sort(rowSums(Cluster_Cov))
>q()
We have two clusters, Cluster7 and Cluster16, that are 75% pure and complete and have a total coverage > 100; this is typically the minimum coverage necessary for strain resolution.
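You can pull out just those two rows to check (assuming the row names in Cluster_Cov.csv are of the form Cluster7, Cluster16):
grep -E "^Cluster(7|16)," Cluster_Cov.csv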
These reads were generated synthetically in silico from known genomes. It is therefore possible to compare the CONCOCT binning results to the assignments of contigs to genomes. This will not be possible for a genuine experimental metagenome. The assignments of contigs to genomes have been precomputed for you:
cd $METASIMWD
cp /class/stamps-shared/CDTutorial/AssignContigs.tar.gz .
tar -xvzf AssignContigs.tar.gz
Then move into this directory and run the following CONCOCT script which calculates accuracy statistics for a comparison of bins to reference genomes:
cd AssignContigs
$CONCOCT/scripts/Validate.pl --cfile=../Concoct/clustering_gt1000.csv --sfile=Contig_Species.csv --ffile=../Assembly/final_contigs_c10K.fa
The results should look as follows:
N M TL S K Rec. Prec. NMI Rand AdjRand
5648 5642 2.7103e+07 8 21 0.765935 0.999235 0.869097 0.928564 0.695141
Here S is the number of reference species, K the number of clusters, and Rec. and Prec. the recall and precision of the binning. This confirms the SCG plots: we have some good bins with high precision, but they are fragmented, giving a lower recall.
We can also visualise the resulting confusion matrix.
Rscript $CONCOCT/scripts/ConfPlot.R -c Conf.csv -o Conf.pdf
Download the resulting Conf.pdf to view it.
We now prepare the inputs for DESMAN. First we separate out the contigs, COGs and genes into the separate bins:
cd ..
mkdir Split
cd Split
$DESMAN/scripts/SplitClusters.pl ../Annotate/final_contigs_gt1000_c10K.fa ../Concoct/clustering_gt1000.csv
$CDSCRIPTS/SplitCOGs.pl ../Annotate/final_contigs_gt1000_c10K.cogs ../Concoct/clustering_gt1000.csv
$CDSCRIPTS/SplitGenes.pl ../Annotate/final_contigs_gt1000_c10K.genes ../Concoct/clustering_gt1000.csv
cd ..
Then we select the SCGs for each cluster:
cd $METASIMWD
while read -r cluster
do
echo $cluster
$DESMAN/scripts/SelectContigsPos.pl scgs.txt < Split/${cluster}/${cluster}.cog > Split/${cluster}/${cluster}_core.cogs
done < Concoct/Cluster75.txt
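You can check how many single-copy core genes were found in each of these clusters:
wc -l Split/*/*_core.cogs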
The first step in pre-processing for DESMAN would be to split up the bam files by each cluster in turn. Do not run this yourselves:
mkdir SplitBam
while read -r cluster
do
grep ">" Split/${cluster}/${cluster}.fa | sed 's/>//g' > Split/${cluster}/${cluster}_contigs.txt
$CDSCRIPTS/AddLengths.pl Annotate/final_contigs_gt1000_c10K.len < Split/${cluster}/${cluster}_contigs.txt > Split/${cluster}/${cluster}_contigs.tsv
mkdir SplitBam/${cluster}
for bamfile in Map/*.mapped.sorted.bam
do
stub=${bamfile#Map\/}
stub=${stub%.mapped.sorted.bam}
samtools view -bhL Split/${cluster}/${cluster}_contigs.tsv $bamfile > SplitBam/${cluster}/${stub}_Filter.bam&
done
wait
done < Concoct/Cluster75.txt
do not run the above
and use a third-party program, bam-readcount, to get base frequencies at each position on each contig:
module load bam-readcount
while read -r mag
do
echo $mag
cd SplitBam/${mag}
cp ../../Split/${mag}/${mag}_contigs.tsv ${mag}_contigs.tsv
samtools faidx ../../Split/${mag}/${mag}.fa
echo "${mag}_contigs.tsv"
mkdir ReadcountFilter
for bamfile in *_Filter.bam
do
stub=${bamfile%_Filter.bam}
echo $stub
(samtools index $bamfile; bam-readcount -w 1 -q 20 -l "${mag}_contigs.tsv" -f ../../Split/${mag}/${mag}.fa $bamfile 2> ReadcountFilter/${stub}.err > ReadcountFilter/${stub}.cnt)&
done
cd ../..
wait
done < Concoct/Cluster75.txt
do not run the above
Then we can get the base counts on the core cogs:
mkdir Variants
while read -r cluster
do
echo $cluster
cd ./SplitBam/${cluster}/ReadcountFilter
gzip *cnt
cd ../../..
python $DESMAN/scripts/ExtractCountFreqGenes.py Split/${cluster}/${cluster}_core.cogs ./SplitBam/${cluster}/ReadcountFilter --output_file Variants/${cluster}_scg.freq > Variants/${cluster}log.txt&
done < Concoct/Cluster75.txt
do not run the above
Instead just copy the following tar archive into your working directory:
cp /class/stamps-shared/CDTutorial/Variants.tar.gz $METASIMWD
tar -xvzf Variants.tar.gz
The directory contains 8 .freq files, one for each cluster. If we look at one:
head -n 10 Variants/Cluster2_scg.freq
We see it comprises a header, plus one line for each core gene position, giving base frequencies in the order A,C,G,T. This is the input required by DESMAN.
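To see how many core-gene positions each cluster contributes, count the lines across all the frequency files:
wc -l Variants/*_scg.freq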
This may be necessary to get DESMAN accessory scripts to work:
export PATH=/class/stamps-software/desman/lib/python2.7/site-packages/desman-0.1.dev0-py2.7-linux-x86_64.egg/desman:$PATH
First we detect variants in the two clusters that were identified as 75% pure and complete and had a coverage of greater than 100:
cd $METASIMWD
mkdir SCG_Analysis
for file in ./Variants/Cluster7_scg.freq ./Variants/Cluster16_scg.freq
do
echo $file
stub=${file%.freq}
stub=${stub#./Variants\/}
echo $stub
mkdir SCG_Analysis/$stub
cp $file SCG_Analysis/$stub
cd SCG_Analysis/$stub
Variant_Filter.py ${stub}.freq -o $stub -m 1.0 -f 25.0 -c -sf 0.80 -t 2.5
cd ../..
done
The Variant_Filter.py script produces an output file, ${stub}sel_var.csv, which lists those positions identified as variants by the log-ratio test with FDR < 1.0e-3. We can compare the number of variants in the two clusters:
cd SCG_Analysis
wc -l */*sel_var.csv
We can also go into the Cluster 16 directory and look at the output files:
cd Cluster16_scg
more Cluster16_scgsel_var.csv
The other important file is:
more Cluster16_scgtran_df.csv
This is an estimate of base error rates which is used as a starting point for the haplotype inference.
So, accounting for the header line, we observe 17 and 0 variants in Clusters 16 and 7 respectively. Only for Cluster 16, then, can we attempt to resolve haplotypes. Using the desman executable:
cd $METASIMWD/SCG_Analysis/Cluster16_scg
varFile='Cluster16_scgsel_var.csv'
eFile='Cluster16_scgtran_df.csv'
for g in 1 2 3 4
do
echo $g
for r in 0 1 2 3 4
do
echo $r
desman $varFile -e $eFile -o Cluster16_${g}_${r} -g $g -s $r -m 1.0 -i 100
done
done
cat */fit.txt | cut -d"," -f2- > Dev.csv
sed -i '1iH,G,LP,Dev' Dev.csv
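You can inspect the collated deviance table (a header plus one row per run):
head -n 5 Dev.csv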
We can then visualise the deviance as a function of haplotype number:
$DESMAN/scripts/PlotDev.R -l Dev.csv -o Dev.pdf
The deviance should stop decreasing after two haplotypes: there are clearly two haplotypes present. We can also run the heuristic to determine haplotype number:
python $DESMAN/scripts/resolvenhap.py Cluster16
This should output:
2,2,3,0.0,Cluster16_2_3/Filtered_Tau_star.csv
This has the format:
No of haplotypes in best fit, No. of good haplotypes in best fit, Index of best fit, Average error rate, File with base predictions
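In a scripted pipeline you could capture the prediction file path directly, e.g. (a sketch):
bestfile=$(python $DESMAN/scripts/resolvenhap.py Cluster16 | cut -d"," -f5)
echo $bestfile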
Have a look at the prediction file:
more Cluster16_2_3/Filtered_Tau_star.csv
The position encoding is ACGT, so what are the base predictions at each variant position? We can turn these into actual sequences with the following commands:
cut -d"," -f 1 < Cluster16_scgcogf.csv | sort | uniq | sed '1d' > coregenes.txt
mkdir SCG_Fasta_2_3
python $DESMAN/scripts/GetVariantsCore.py ../../Annotate/final_contigs_gt1000_c10K.fa ../../Split/Cluster16/Cluster16_core.cogs Cluster16_2_3/Filtered_Tau_star.csv coregenes.txt -o SCG_Fasta_2_3/
This generates one fasta file per core gene, each containing the sequences of the two strains:
ls SCG_Fasta_2_3
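Each file should contain one sequence per strain; you can verify the counts with grep:
grep -c ">" SCG_Fasta_2_3/*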
In this case we know that Cluster16 maps onto species 2095, which reassuringly has two strains present. I have pre-calculated the true variants between these two strains:
cp /class/stamps-shared/CDTutorial/DesmanValidation/Cluster16_core_tauRF.csv .
python $DESMAN/scripts/validateSNP2.py Cluster16_2_3/Filtered_Tau_star.csv Cluster16_core_tauRF.csv
This gives distance matrices between the true variants and the predictions, as SNV counts and as fractions:
Intersection: 16
[[ 0 16]
[16 0]]
[[ 0. 1.]
[ 1. 0.]]
Each predicted haplotype should match onto a reference strain with no errors.
The next step would be to calculate the accessory gene presence and absences for these strains. That too will be left to an open lab if there is interest.