Metagenomic analysis of nutrient-limited secondary cooling water

Repository for the article:

Props, R., Monsieurs, P., Vandamme, P., Leys, N., Denef, V.J. and Boon, N. Gene Expansion and Positive Selection as Bacterial Adaptations to Oligotrophic Conditions. mSphere 4:e00011-19. https://doi.org/10.1128/mSphereDirect.00011-19.

1. QC and assembly

Quality control and adapter trimming

Sickle was used for removing erroneous and low-quality reads from the raw data. Scythe was used for removing adapter sequences.

Co-assembly with IDBA-UD on the interleaved fasta file with the following parameters

idba_ud -kmin 41 -kmax 101 -s 10 -i dt_int.fasta

Map reads to co-assembly

Reads were mapped to the co-assembly using bwa-mem on default settings. samtools, bedtools2 and anvi'o were used to format and profile the mapping files for anvi'o (conversion to BAM and sorting).
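The mapping and formatting step could look roughly like the sketch below. It is an illustration under stated assumptions, not the exact commands used for the article: the helper names `run` and `map_sample`, all file names, and the use of separate forward/reverse fastq files are hypothetical. It also assumes the co-assembly has already been indexed with `bwa index` and that the installed bwa is recent enough to support `bwa mem -o`. The anvi'o profiling step is not shown.

```shell
# Sketch only: helper and file names are placeholders, not the authors' values.
# Set DRY_RUN=1 to print the commands instead of executing them.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Map one sample to the co-assembly, then convert to a sorted, indexed BAM.
# Assumes `bwa index <contigs>` was run beforehand.
map_sample() {
    contigs=$1   # co-assembly fasta
    r1=$2        # forward reads (fastq)
    r2=$3        # reverse reads (fastq)
    sample=$4    # output prefix

    # bwa mem with default settings, SAM written to a file
    run bwa mem -o "$sample.sam" "$contigs" "$r1" "$r2"
    # convert SAM to coordinate-sorted BAM
    run samtools sort -o "$sample.bam" "$sample.sam"
    # index the sorted BAM so downstream tools can access it randomly
    run samtools index "$sample.bam"
}
```

Calling `DRY_RUN=1 map_sample contigs.fa s1_R1.fastq s1_R2.fastq s1` prints the three commands without running them, which is a convenient way to check paths before starting a long mapping job.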

2. Binning strategy

Initial binning was performed on the contig splits generated by anvi'o by means of a supervised and manual binning strategy in VizBin on default settings. The bins were inspected and further manually refined in anvi'o. The contig database with the associated bins is available here.

3. Binning QC and refinement

Completeness and redundancy of all MAGs (and of all reference Ramlibacter genomes) were estimated using the marker gene databases in anvi'o as well as the lineage-specific marker sets in CheckM.

4. Calculation of relative abundances

Relative abundances were calculated by mapping all reads to the co-assembly using bwa-mem on default settings and applying the following normalization to bin size (no %GC correction was applied to the data):

Reconstruction of full-length 16S sequences (EMIRGE)

EMIRGE was run on the merged QC'd fastq files with an insert size of 500 and a standard deviation of 500 in order to achieve maximum mapping of reads. Further, EMIRGE was run both with the full NR database clustered at 97% using UCLUST and with the small, manually curated freshwater database (FWDB) available from here. In addition, EMIRGE was run separately with -j set to 0.97 and 1.0, allowing both 97% consensus sequences and unique sequences to be reconstructed. The sequences from all these runs were merged into a single fasta file. Finally, all the reconstructed sequences were clustered at 97% identity using UCLUST. EMIRGE-reconstructed sequences with a normalized prior abundance of less than 5% were removed, and sequences were ordered from high to low abundance before clustering because UCLUST is dependent on the order of the sequences (see here). Classification of the sequences was performed using the TaxAss pipeline as described here. Full-length sequences were classified using the FWDB if their top match to the FWDB exceeded 95% identity.
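The abundance filter and the high-to-low ordering described above can be sketched with standard shell tools. This is a minimal illustration, not the script used for the article: the function name `filter_and_sort_by_normprior` is hypothetical, it assumes every fasta header contains a `NormPrior=<value>` field as its last element (as in the listings in this README), and it assumes the filter is applied before sorting.

```shell
# Keep sequences with NormPrior >= threshold and order them from high to low
# abundance, so that an order-dependent clustering step sees abundant sequences first.
# Usage: filter_and_sort_by_normprior <in.fasta> <min_normprior>
filter_and_sort_by_normprior() {
    in_fasta=$1
    min_prior=$2
    tab=$(printf '\t')
    # 1) Linearize the fasta: one "prior<TAB>header<TAB>sequence" record per line.
    awk '
        function emit() { if (hdr != "") print prior "\t" hdr "\t" seq }
        /^>/ {
            emit()
            hdr = $0; seq = ""
            prior = $0; sub(/.*NormPrior=/, "", prior)   # keep only the value
            next
        }
        { seq = seq $0 }                                  # join wrapped sequence lines
        END { emit() }
    ' "$in_fasta" |
    # 2) Drop records below the threshold (headers without NormPrior evaluate to 0).
    awk -F '\t' -v min="$min_prior" '($1 + 0) >= (min + 0)' |
    # 3) Sort numerically, highest NormPrior first.
    LC_ALL=C sort -t "$tab" -k1,1nr |
    # 4) Convert back to fasta (sequence on a single line).
    awk -F '\t' '{ print $2; print $3 }'
}
```

For example, `filter_and_sort_by_normprior merged.emirged.cons.fasta 0.05 > merged.filtered.fasta` would write the filtered, abundance-ordered sequences to a new file; the output file name here is a placeholder.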

EMIRGE reconstructed sequences

NR_1.0
15|HQ222271.1.1558_Prior=0.791329_Length=1536_NormPrior=0.784664        Bacteria(100);Proteobacteria(100);Betaproteobacteria(100);Burkholderiales(100);Comamonadaceae(100);Comamonadaceae_unclassified(100);
16|EF516083.1.1452_Prior=0.020773_Length=1440_NormPrior=0.021971        Bacteria(100);Proteobacteria(100);Betaproteobacteria(100);Burkholderiales(100);Comamonadaceae(100);Comamonadaceae_unclassified(100);
0|FR853751.1.1492_Prior=0.187898_Length=1480_NormPrior=0.193365 Bacteria(100);Bacteroidetes(100);Sphingobacteriia(100);Sphingobacteriales(100);Chitinophagaceae(100);Sediminibacterium(100);

NR_0.97
15|HQ222271.1.1558_Prior=0.812102_Length=1536_NormPrior=0.806369        Bacteria(100);Proteobacteria(100);Betaproteobacteria(100);Burkholderiales(100);Comamonadaceae(100);Comamonadaceae_unclassified(100);
0|FR853751.1.1492_Prior=0.187898_Length=1480_NormPrior=0.193631 Bacteria(100);Bacteroidetes(100);Sphingobacteriia(100);Sphingobacteriales(100);Chitinophagaceae(100);Sediminibacterium(100);

FWDB_0.97
5_Bctrm474_Prior=1.000000_Length=1481_NormPrior=1.000000        Bacteria(100);Proteobacteria(100);Betaproteobacteria(100);Burkholderiales(100);Comamonadaceae(100);Ramlibacter(91);

FWDB_1.0
5|Bctrm474_Prior=0.097627_Length=1481_NormPrior=0.097627        Bacteria(100);Proteobacteria(100);Betaproteobacteria(100);Burkholderiales(100);Comamonadaceae(100);Comamonadaceae_unclassified(100);
22|LimCurv2_Prior=0.015439_Length=1481_NormPrior=0.015439       Bacteria(100);Proteobacteria(100);Betaproteobacteria(100);Burkholderiales(100);Comamonadaceae(100);Comamonadaceae_unclassified(100);
18|LimSpe14_Prior=0.886934_Length=1481_NormPrior=0.886934       Bacteria(100);Proteobacteria(100);Betaproteobacteria(100);Burkholderiales(100);Comamonadaceae(100);Comamonadaceae_unclassified(100);

Clustering of merged sequence file

usearch -cluster_fast merged.emirged.cons.fasta -id 0.97 -centroids merged.emirged.cons.0.97.fasta -uc merged.emirged.cons.0.97.clusters.uc
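To see which reconstructed sequences were collapsed into which centroid, the `.uc` file written by the command above can be summarized. The sketch below assumes the standard tab-separated UC layout, in which column 1 is the record type (`S` for a cluster seed, `H` for a hit, `C` for a cluster summary), column 9 is the query label and column 10 is the target (centroid) label. The function name `uc_membership` is hypothetical.

```shell
# Print "centroid<TAB>member" for every sequence listed in a UCLUST .uc file.
# Usage: uc_membership <clusters.uc>
uc_membership() {
    awk -F '\t' '
        $1 == "S" { print $9 "\t" $9 }    # seed: the sequence is its own centroid
        $1 == "H" { print $10 "\t" $9 }   # hit: query (col 9) joins centroid (col 10)
        # "C" summary records are skipped; they repeat information from S and H
    ' "$1"
}
```

Running `uc_membership merged.emirged.cons.0.97.clusters.uc` lists one line per input sequence, which makes it easy to check that the final dereplicated set contains the expected centroids.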

Final dereplicated consensus EMIRGE sequences

5_Bctrm474_Prior=1.000000_Length=1481_NormPrior=1.000000        Bacteria(100);Proteobacteria(100);Betaproteobacteria(100);Burkholderiales(100);Comamonadaceae(100);Ramlibacter(91);
0|FR853751.1.1492_Prior=0.187898_Length=1480_NormPrior=0.193365 Bacteria(100);Bacteroidetes(100);Sphingobacteriia(100);Sphingobacteriales(100);Chitinophagaceae(100);Sediminibacterium(100);