/GBS-Typer-sanger-nf

GBS Typer Pipeline in Nextflow

Primary LanguagePythonGNU General Public License v3.0GPL-3.0

License: GPL v3

GitHub release (latest by date)

GitHub Workflow Status

GBS-Typer-sanger-nf

The GBS Typer is for characterising Group B Strep by serotyping, resistance typing, MLST, surface protein typing and penicillin-binding protein typing. It has been adapted from Ben Metcalf's GBS Typer pipeline in Nextflow for portability and reproducibility.

Contents

Running the pipeline

  • Running the pipeline requires an internet connection
  • Currently it supports only paired-end reads

Installation

GBS Typer relies on Nextflow and Docker. Download:

  1. Docker.
  2. Nextflow.
  3. Move nextflow to your PATH
mv nextflow /usr/local/bin/
  1. Clone repository:
git clone https://github.com/sanger-pathogens/GBS-Typer-sanger-nf.git
cd GBS-Typer-sanger-nf

Usage

  • To run on one sample (called sampleID in directory data)
nextflow run main.nf --reads data/sampleID_{1,2}.fastq.gz --results_dir my_results
  • To run on multiple samples in a directory (called data and a results directory called my_results)
nextflow run main.nf --reads 'data/*_{1,2}.fastq.gz' --results_dir my_results

This will create a results directory (here called my_results) containing output files listed below.

Running the pipeline on Sanger farm

Follow these instructions.

Outputs

Main Report

Running the command with will generate the main report gbs_typer_report.txt. This will include the serotype, MLST type, allelic frequencies from MLST, resistance gene incidence, surface protein types and GBS-specific resistance variants. You can find the description for each of the columns in the report here where the category column is in_silico_analysis.

Other Reports

  1. serotype_res_incidence.txt

Gives the serotype and presence/absence (i.e. pos/neg) of antibiotic resistance genes, e.g. Isolate Strep B sample 25292_2#105 has serotype II and have genes: 23S1, 23S3, GYRA, lsaC and tetM

Sample_id Serotype 23S1 23S3 CAT ermB ermT FOSA GYRA lnuB lsaC mefA MPHC MSRA msrD PARC RPOBGBS-1 RPOBGBS-2 RPOBGBS-3 RPOBGBS-4 SUL2 tetB tetL tetM tetO tetS
25292_2#105 II pos pos neg neg neg neg pos neg pos neg neg neg neg neg neg neg neg neg neg neg neg pos neg neg
  1. gbs_res_variants.txt

Gives the SNP variants for GBS-specific resistance genes e.g. Isolate Strep B sample 25292_2#105 have common variants 23S1, 23S3 and GYRA (shown with *), but replacement of amino acid S by Q in position 17 of the PARC protein sequence

Sample_id 23S1 23S3 GYRA PARC
25292_2#105 * * * Q17S
  1. drug_cat_alleles_variants.txt

Gives the GBS-specific variants and other resistance genes and alleles for drug categories: EC (macrolides, lincosamides, streptogramins or oxazolidinones), FQ (fluoroquinolones), OTHER (other antibiotics) and TET (tetracyclines) e.g. Isolate Step B sample 25292_2#105 have GBS-specific variants: erythromycin-resistant 23S1 and 23S3, fluoroquinolone-resistant PARC and GYRA, and other resistance allele tetracycline-resistant tet(M)_1 of gene tet(M) (as specified by gene[allele])

Sample_id EC FQ OTHER TET
25292_2#105 23S1:23S3 PARC-Q17S:GYRA neg tet(M)[tet(M)_1]
  1. new_mlst_alleles.log

Indicates whether new MLST alleles have been found for each sample (where there are mismatches with sufficient read depth at least the value specified --mlst_min_read_depth [Default: 30]).

  1. FASTA file new_mlst_alleles.fasta and a pileup file new_mlst_pileup.txt

If new_mlst_alleles.log includes "New MLST alleles found."

  1. existing_sequence_types.txt

For other samples that have no new MLST alleles and only have existing sequence types. If "None found" for a sample then no sequence types were found (with sufficient read depth).

  1. surface_protein_incidence.txt

This shows the incidence of different surface protein alleles in the Strep B sample(s), e.g.

Sample_id alp1 alp2/3 alpha hvgA PI1 PI2A1 PI2A2 PI2B rib srr1 srr2
26189_8#338 neg pos neg neg pos pos neg neg neg pos neg
  1. surface_protein_variants.txt

This shows all the surface proteins in the Strep B sample(s), e.g.

Sample_id ALPH hvgA PILI SRR
26189_8#338 alp2/3 neg PI1:PI2A1 srr1

Additional options

Inputs

--contigs                       Path of file containing FASTA contigs. Only use when --run_pbptyper is specified. (Use wildcard '*' to specify multiple files, e.g. 'data/*.fa')
--db_version                    Database version. (Default: latest in `db` directory)
--other_res_dbs                 Path to other resistance reference database. Must be FASTA format. (Default: ResFinder)

Parameters

--gbs_res_min_coverage          Minimum coverage for mapping to the GBS resistance database. (Default: 99.9)
--gbs_res_max_divergence        Maximum divergence for mapping to the GBS resistance database. (Default: 5, i.e. report only hits with <5% divergence)
--other_res_min_coverage        Minimum coverage for mapping to other resistance reference database(s). (Default: 70)
--other_res_max_divergence      Maximum divergence for mapping to other resistance reference database. (Default: 30, i.e. report only hits with <30% divergence)
--restyper_min_read_depth       Minimum read depth where mappings to antibiotic resistance genes with fewer reads are excluded. (Default: 30)
--serotyper_min_read_depth      Minimum read depth where mappings to serotyping genes with fewer reads are excluded. (Default: 0)

Other Workflow Options

--run_sero_res                  Run the main serotyping and resistance workflows. (Default: true)
--run_surfacetyper              Run the surface protein typing workflow. (Default: true)
--run_mlst                      Run the MLST workflow to query existing sequence types and new MLST alleles. (Default: true)
--run_pbptyper                  Run the PBP (Penicillin binding protein) allele typer workflow. Must also specify --contigs input. (Default: false)

Other Parameters

--mlst_min_read_depth           Minimum read depth where mappings to alleles in MLST with fewer reads are excluded. Only operational with --run_mlst. (Default: 30)
--pbp_frac_align_threshold      Minimum fraction of sequence alignment length of PBP gene. Only operational with --run_pbptyper. (Default: 0.5)
--pbp_frac_identity_threshold   Minimum fraction of alignment identity between PBP genes and assemblies. Only operational with --run_pbptyper. (Default: 0.5)
--surfacetyper_min_coverage     Minimum coverage for mapping to the GBS surface protein database. Only operational with --run_surfacetyper. (Default: 70)
--surfacetyper_max_divergence   Maximum divergence for mapping to the GBS surface protein database. Only operational with --run_surfacetyper. (Default: 8, i.e. report only hits with <8% divergence)
--surfacetyper_min_read_depth   Minimum read depth for surface protein typing workflow. Only operational with --run_surfacetyper. (Default: 30)

Advanced

PBP (Penicillin-binding protein) Typing Workflow

To enable the PBP typing workflow provide the --run_pbptyper command line argument and specify the contig FASTA files using and --contigs:

nextflow run main.nf --reads 'data/*_{1,2}.fastq.gz' --results_dir my_results --run_pbptyper --contigs 'data/*.fa'

If existing PBP alleles are found, a tab-delimited file is created in the results directory. The file contains the sample IDs (that are determined from the contig FASTA file names e.g. 25292_2#85 from data/25292_2#85.fa), the contig identifiers with start, end and forward(+)/reverse(-) positions, and the PBP allele identifier.

Sample_id Contig PBP_allele
25292_2#85 .25292_2_85.9:39495-40455(+) 2||GBS_1A
26077_6#118 .26077_6_118.11:39458-40418(+) 1||GBS_1A

If a new PBP allele is found in a sample, a FASTA file of amino acids is created in the 'results' directory. For example, if contig .25292_2_85.9 from sample 25292_2#85 contained a new PBP-1A allele, then .25292_2_85.9_GBS1A-1_PBP_new_allele.faa is generated with contents:

>.26077_6_118.11:39458-40418(+)
DIYNSDTYIAYPNNELQIASTIMDATNGKVIAQLGGRHQNENISFGTNQSVLTDRDWGST
MKPISAYAPAIDSGVYNSTGQSLNDSVYYWPGTSTQLYDWDRQYMGWMSMQTAIQQSRNV
PAVRALEAAGLDEAKSFLEKLGIYYPEMNYSNAISSNNSSSDAKYGASSEKMAAAYSAFA
NGGTYYKPQYVNKIEFSDGTNDTYAASGSRAMKETTAYMMTDMLKTVLTFGTGTKAAIPG
VAQAGKTGTSNYTEDELAKIEATTGIYNSAVGTMAPDENFVGYTSKYTMAIWTGYKNRLT
PLYGSQLDIATEVYRAMMSY

Other examples of running the pipeline

It is recommended you use the default parameters for specifying other resistance databases. However, to use different or multiple resistance databases with the GBS-specific resistance database, e.g. ARG-ANNOT and ResFinder in the db/0.2.1 directory, both with a minimum coverage of 70 and maximum divergence of 30:

nextflow run main.nf --reads 'data/*_{1,2}.fastq.gz' --results_dir my_results --other_res_dbs 'db/0.2.1/ARGannot-DB/ARG-ANNOT.fasta db/0.2.1/ResFinder-DB/ResFinder.fasta' --other_res_min_coverage '70 70' --other_res_max_divergence '30 30'

To run only the surface protein typing workflow, use:

nextflow run main.nf --reads 'data/*_{1,2}.fastq.gz' --results_dir my_results --run_sero_res false --run_mlst false

To run only the MLST workflow, use:

nextflow run main.nf --reads 'data/*_{1,2}.fastq.gz' --results_dir my_results --run_sero_res false --run_surfacetyper false

To run only the serotyping and resistance typing workflows, use:

nextflow run main.nf --reads 'data/*_{1,2}.fastq.gz' --results_dir my_results --run_surfacetyper false --run_mlst false

To run only the PBP typing workflow, use:

nextflow run main.nf --results_dir my_results --run_sero_res false --run_surfacetyper false --run_mlst false --run_pbptyper --contigs 'data/*.fa'

Note: The --reads parameter is not needed for the PBP typing workflow.

Troubleshooting for errors

It is possible that the pipeline may not complete successfully due to issues with input files and/or individual steps of the pipeline. To troubleshoot potential issues, you can use the -with-trace parameter e.g. nextflow run main.nf --reads 'data/*_{1,2}.fastq.gz' --results_dir my_results -with-trace to create a trace file in the current directory while the pipeline is running. The trace file will provide the status of the current directory that provides the status and the location of the log files for each sample and step. For example, if the step failed in [00/8803ea], the command and the error from this step can be viewed by cat work/00/8803ea01f8bc0fb43f68335d82831f/.command.log (Hint: while typing work/00/8803ea, the complete file path can be completed with the TAB button.)

Clean up

The work directory keeps the intermediate files of the pipeline. You can use nextflow clean [run_name|session_id] to clean up or rm -rf work only when no other pipelines are running (more dangerous). The run names and session ids can be found by using nextflow log -q

Software dependencies

All pipeline dependencies are built into the quay.io dependencies image, used by the pipeline. The current program versions in this image are as follows:

Program Version
bedtools 2.29.2
biopython 1.78
bowtie 2.2.9
freebayes 1.3.3+
prodigal 1:2.6.3
python 2 2.7
python 3 3.8
samtools 0.1.18
srst2 0.2.0

Acknowledge us

If you use this software, please acknowledge the contributors as appropriate.

Reporting issues

If any questions or problems, please post them under Issues.