/10k_genomes

Various scripts for efficient processing of 10k Salmonella genomes

Primary LanguageShellGNU General Public License v3.0GPL-3.0

10k Salmonella Genomes Project

Efficient data processing for 10k Salmonella genomes project.

Reference

Perez-Sepulveda BM, Heavens D, Pulford CV, Predeus AV et al., An accessible, efficient and global approach for the large-scale sequencing of bacterial genomes, bioRxiv 2020.

Data description

The project had the goal of genome sequencing invasive non-typhoid Salmonella on a large scale. While most (~7000) of the isolates were Salmonella enterica, collaborators also provided Shigella (~2000 isolates), Staphylococcus aureus (~400 isolates), as well as some other species.

Data in the project were generated using a modified Nextera library preparation approach (the LITE protocol) and 2x150 bp Illumina sequencing on HiSeq 4000. Details will be published soon (Perez-Sepulveda et al 2020).

Median sequencing depth was 30x; high-identity, low coverage samples were re-sequenced in an additional round of sequencing. Sequencing was done at Earlham Institute.

Main steps of data processing

  • Coverage estimation;
  • Bacterial species-level abundance estimation with Kraken2 (using 8Gb MiniKraken DB) and Bracken;
  • Trimming of Nextera adapters using bbduk;
  • Assembly with Unicycler;
  • Assembly quality evaluation: calculation of N50, assembly length, and number of contigs;
  • Alignment and PCR/optical duplicate estimation using Picard MarkDuplicates;
  • MLST prediction using mlst;
  • Serovar prediction using SISTR (Salmonella only);
  • Resistance and virulence gene profiling using abricate;
  • Automated annotation using Prokka (Salmonella only);
  • Final quality control based on Enterobase criteria (with modifications - see below).

Directory structure

Choose a working directory ($WDIR variable in scripts) with plenty of space. Within, create the following sub-directories:

  • 0_merged_fastq
  • 1_kraken2
  • 2_duplicates
  • 3_coverage_etc
  • 4_assembly
  • 5_sistr
  • 6_MLST
  • 7_rescue
  • 8_final_fasta
  • 9_prokka

Dependency installation

You can install all tools used by this pipeline with Bioconda. Following commands should help you install exact same versions of packages we have used in the paper:

conda create -n 10k_salmonella
source activate 10k_salmonella
conda install kraken2=2.0.8_beta
conda install unicycler=0.4.7
conda install sistr_cmd=1.0.2
conda install mlst=2.11 
conda install bracken=1.0.0
conda install bowtie2=2.3.5
conda install samtools=1.9
conda install picard=2.21.1
conda install prokka=1.13.7
conda install abricate=1.0.1

Also, you would need to download and install BBtools v38.07.

There's pretty good chance everything will work fine with other tool versions; known exceptions are samtools (version < 1), and old versions of sistr_cmd (different number of fields in the output file).

Parallel execution

All scripts used here were dedicated to the execution of one task for one sample (e.g. one_unicycler.sh) had a parallel wrapper (all_unicycler.sh for this example). The number of parallel jobs should be adjusted according to the system that you are running these scripts on.

Assembly quality control

All assemblies were classified according to the following scheme:

  • If an assembly has passed slightly modified Enterobase criteria for the appropriate species, it was classified as PASS; These criteria for Salmonella enterica are:

    Metrics Criteria
    Assembly length 4.0-5.8Mb
    N50 value >20kb
    Number of contigs <600
    Read assignment by Kraken2+Bracken >70% correct species
  • Alternatively, if an assembly satisfied one of the two conditions listed below, it was classified as RESCUE:

    • Passes relaxed Enterobase criteria: 4Mb < (length) < 5.8Mb, species 90%+, N50 > 10kb, n_contigs < 2,000;
    • or, passes Enterobase criteria if assembly is run on a subset of reads identified as Salmonella by Kraken2 (for impure samples);
  • If assembly does not satisfy any of the above criteria, it is classified as FAIL.

Notes on adapter trimming

Using the Nextera protocol makes it difficult to control insert size, often resulting in adapter sequences being present in the final reads. We have compared several strategies of adapter trimming, including Trimmomatic palindromic mode and bbduk. Bbduk trimming resulted in the highest assembly contiguity and the fastest processing time.