This repository contains a collection of code and scripts used in the paper Mercury methylation by metabolically versatile and cosmopolitan marine bacteria (DOI: 10.1038/s41396-020-00889-4) by Lin et al..
The links below the sub-headings lead to the scripts needed for the corresponding steps. Most of the scripts were developed for running on the SLURM workload manager. All code and scripts were created for and tested on Spartan HPC at The University of Melbourne. You may download and adapt the scripts to suit your own requirements.
Take the five samples from "SI047 S3" station as an example.
ids="SRR3724469;SRR3724456;SRR3724482;SRR3724508;SRR3724533" # SRA accessions for the 5 metagenomic samples
urls="SI047_ftp_urls.txt" # output
out_dir="SI047_raw_data" # output
# Using ENA API to retreive data urls
for str in ${ids//;/ } ; do echo 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/'${str:0:6}'/00'${str:0-1}'/'$str'/'$str'_1.fastq.gz' >> $urls; echo 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/'${str:0:6}'/00'${str:0-1}'/'$str'/'$str'_2.fastq.gz' >> $urls; done
# Download raw fastq files to $out_dir
wget -b -q -i $urls -P $out_dir
input="SI047_raw_data"
output="SI047_clean_data"
sbatch trimmomatic-loop.slurm $input $output
input="SI047_clean_data"
output="SI047_megahit"
sbatch megahit-coassembly.slurm $input $output
input="SI047_megahit/final.contigs.fa"
output="SI047_megahit/SI047.2k.fa"
perl remove_small_seqs.pl 2000 $input > $output
metaWRAP
conda env is required
input_fa="SI047_megahit/SI047.2k.fa"
input_fq="SI047_clean_data"
output="SI047_binning_out"
sbatch metaWRAP-binning $input_fa $input_fq $output # Using --metabat1 --metabat2 --maxbin2
input="SI047_binning_out"
output="SI047_bins_refined"
sbatch metaWRAP-BinRefinement.slurm $input $output # Integrating the output from 3 binners
# Pick up some bins of interest to process reassembly
mkdir SI047_bins_interest
cp SI047_bins_refined/metawrap_70_10_bins/bin5.fa SI047_bins_interest
cp SI047_bins_refined/metawrap_70_10_bins/bin12.fa SI047_bins_interest
# Running metaWRAP
input_reads_dir="SI047_bins_refined/metawrap_70_10_bins"
input_bins_dir="SI047_bins_interest"
output="SI047_bins_reassembled"
sbatch metaWRAP-ReassembleBins.slurm $input_reads_dir $input_bins_dir $output
mkdir SI047_final_bins
cp SI047_bins_refined/metawrap_70_10_bins/*fa SI047_final_bins
cp -rf SI047_bins_reassembled/reassembled_bins/*fa SI047_final_bins
input="SI047_final_bins"
output="SI047_final_bins_prokka"
for fa in $input/*.fa;
sbatch prokka.slurm $fa $output
done
input="SI047_final_bins"
output="SI047_final_bins_checkm"
sbatch checkm.slurm $input $output
input="SI047_final_bins"
output="SI047_final_bins_gtdbtk"
sbatch gtdbtk.slurm $input $output
The hmm database for HgcA proteins is needed for this step.
HgcA.hmm
is available upon request.
input="SI047_final_bins_prokka"
output="SI047_HgcA_search"
find $input -name *.faa |
while read faa
do
sbatch hmmsearch.slurm $faa $output
done
cat SI047_HgcA_search/*faa > SI047_hgcA_all.faa # need some outgroups
input="SI047_hgcA_all.faa"
output="SI047_hgcA.faa"
sbatch cd-hit.slurm $input $output
input="SI047_hgcA.faa"
output="SI047_hgcA.aln.faa"
sbatch mafft.slurm $input $output
# Change Fasta to Phylip format
python fasta2relaxedphylip.py -i SI047_hgcA.aln.faa -o SI047_hgcA.aln.phy
# Make Tree
input="SI047_hgcA.aln.phy"
sbatch mafft-iqtree.slurm $input
input_fa="15hgcA.fna"
input_metaG_fq="SI047_clean_data"
out_dir_metaG="hgcA_metaG_abundance_SI047"
out_dir_MC="SI047_MC"
sbatch bwa-bbmap.slurm $input_fa $input_metaG_fq $out_dir
sbatch MicrobeCensus.slurm $input_metaG_fq $out_dir_MC
htseq
conda env is required
# ORF prediction
input_fa="SI047_megahit/SI047.2k.fa"
out_dir_prodigal="SI047_prodigal"
sbatch prodigal.slurm $input_fa $out_dir_prodigal
# mapping reads with bwa
input_fa="SI047_megahit/SI047.2k.fa"
input_metaT_fq="SI047_metaT_clean_data" # derived from a similar procedure to metagenomic clean data
out_dir="SI047_metaT_mapping"
sbatch bwa-samtools.slurm $input_fa $input_metaT_fq $out_dir # mapping and sort
sbatch picard.slurm $out_dir # removing duplicates
# counting mapped reads per gene
input_bams_dir="SI047_metaT_mapping"
input_gff="SI047_prodigal/SI047.gff"
output_count="SI047_reads_count"
sbatch htseq.slurm $input_bams_dir $input_gff $output_count
# calculating gene lengths
input_gff="SI047_prodigal/SI047.gff"
cut -f4,5,9 $input_gff | sed 's/gene_id //g' | gawk '{print $3,$2-$1+1}' | tr ' ' '\t' > ${input_gff%.*}.gl.txt
RPKM: Reads per kilo base per million mapped reads
Formula
RPKM = C / ( (L/1000) * (N/1,000,000) ) = (10^9 * C)/(N * L)
- C - number of reads mapped to a gene sequence
- L - gene length in base-pairs for a gene
- N - total number of mapped reads of a sample
module load pandas/0.23.4-intel-2016.u3-Python-3.5.2 # loading Python3 and Pandas module
module load numpy/1.12.1-intel-2017.u2-Python-3.5.2 # loading Numpy module
count_dir="SI047_reads_count"
gene_len="SI047_prodigal/SI047.gl.txt"
RPKM_out="SI047.rpkm.tsv"
python RPKM_cal.py -c $count_dir -l $gene_len -o $RPKM_out
Heyu Lin heyu.lin@student.unimelb.edu.au
School of Earth Sciences, The University of Melbourne
Please cite the article if the scripts are helpful in your research.
Lin, H., Ascher, D.B., Myung, Y. et al. Mercury methylation by metabolically versatile and cosmopolitan marine bacteria. ISME J (2021). https://doi.org/10.1038/s41396-020-00889-4