Core SNP Phylogenomics

The Core SNP phylogenomics pipeline identifies high-quality core SNPs among a set of bacterial isolates and generates phylogenetic trees from these SNPs. The pipeline takes as input a reference genome (in FASTA format) and a set of DNA sequencing reads (in FASTQ format) and proceeds through a number of stages to find core SNPs.

Authors

The Core SNP Pipeline was developed by the following individuals: Aaron Petkau, Gary Van Domselaar, Philip Mabon, and Lee Katz.

Tutorial

For a step-by-step tutorial on how to run the core SNP pipeline with some example data, please see https://github.com/apetkau/microbial-informatics-2014/tree/master/labs/core-snp. A description of the data used for this tutorial as well as a virtual machine to run all the necessary software can be found at https://github.com/apetkau/microbial-informatics-2014/.

An additional tutorial which uses simulated data can be found at https://github.com/apetkau/core-phylogenomics-tutorial.

Quick Start

Command

If you have a set of DNA sequencing reads, fastq_reads/*.fastq, and a reference file containing the DNA sequence of the genome to use for reference mapping, reference.fasta, then the following command can be used to generate a core SNP phylogeny.

snp_phylogenomics_control --mode mapping --input-dir fastq_reads/ --output pipeline_out --reference reference.fasta

Output

Once the pipeline is finished, the main output files you will want to look at include:

  • pipeline_out/pseudoalign/pseudoalign-positions.tsv: A tab-separated file containing all variants identified and the positions of each variant on the reference genome.

Example:

#Chromosome	Position	Status	Reference	isolate1	isolate2
contig1	20	valid	A	A	T
contig2	30	filtered-coverage	A	-	A
  • pipeline_out/pseudoalign/matrix.csv: A tab-separated file containing a matrix of high-quality SNP distances between isolates.

Example:

strain	isolate1	isolate2
isolate1	0	5
isolate2	5	0
  • pipeline_out/pseudoalign/pseudoalign.phy: An alignment of variants for each input isolate in phylip format.
  • pipeline_out/phylogeny/pseudoalign.phy_phyml_tree.txt: A phylogenetic tree of the above alignment file generated using PhyML.
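
If you wish to experiment with re-building the tree yourself, PhyML can be run directly on the pseudoalignment file. The command below is only an illustrative sketch; the exact PhyML parameters used by the pipeline may differ.

phyml --input pipeline_out/pseudoalign/pseudoalign.phy --datatype nt --model GTR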

Alternative Commands

In addition to the FASTQ reads above, if you have a set of assembled contigs from isolates, fasta_contigs/*.fasta, that you wish to include in the analysis, you can run the pipeline with:

snp_phylogenomics_control --mode mapping --input-dir fastq_reads/ --contig-dir fasta_contigs/  --output pipeline_out --reference reference.fasta

In addition, if you have a pre-defined set of positions on the reference genome that you wish to exclude (repetitive regions, etc.), supplied as a tab-separated values file (see the detailed documentation below for a description of the file format), you can run the pipeline with the command:

snp_phylogenomics_control --mode mapping --input-dir fastq_reads/ --contig-dir fasta_contigs/  --invalid-pos bad_positions.tsv --output pipeline_out --reference reference.fasta

bad_positions.tsv:

#Contig	start	end
contig1	50	100
contig2	75	100

Stages

The core SNP pipeline proceeds through the following stages (a rough sketch of the corresponding commands is given after the list):

  1. Reference mapping using SMALT.
  2. Variant calling using FreeBayes.
    1. For any assembled contigs passed to the pipeline, generates variant call files (VCF) using MUMmer alignments.
  3. Checking variant calls and depth of coverage using SAMtools.
  4. Aligning high-quality SNPs into a meta-alignment (pseudoalignment) of phylogenetically informative sites.
    1. If an invalid positions file is passed, remove any SNPs within the invalid positions.
  5. Building a phylogenetic tree with PhyML.
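
For reference, the following commands roughly illustrate what stages 1 to 3 correspond to for a single isolate, using the default parameters from the example configuration in the Mapping mode section below (and a recent SAMtools). This is an approximation for orientation only, not the exact command sequence run by the pipeline, and the intermediate file names (reference_index, isolate1.sam, isolate1.bam) are placeholders.

smalt index -k 13 -s 6 reference_index reference.fasta
smalt map -n 24 -f samsoft -r -1 -o isolate1.sam reference_index isolate1.fastq
samtools sort -o isolate1.bam isolate1.sam
samtools index isolate1.bam
freebayes --pvar 0 --ploidy 1 --left-align-indels --min-mapping-quality 30 --min-base-quality 30 --min-alternate-fraction 0.75 -f reference.fasta isolate1.bam > isolate1.vcf
samtools mpileup -f reference.fasta isolate1.bam > isolate1.mpileup

The pipeline then cross-checks the FreeBayes calls against the mpileup results, applies the minimum coverage cutoff, and merges the per-isolate calls into the pseudoalignment.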

Installation

Please refer to the Installation document.

In addition, a virtual machine with all the necessary software to run the pipeline can be installed by following the instructions on https://github.com/apetkau/microbial-informatics-2014#running-the-labs.

Details

The core SNP pipeline has a number of modes of operation in addition to building core SNP phylogenies from reference mapping and variant calling. The mode is selected with the --mode parameter; the available modes are listed below.

  • prepare-fastq: Can be used to remove low-quality reads from FASTQ files and reduce the data size. This can be used for a basic quality check of reads before performing reference mapping and variant calling.
  • mapping: Builds a core SNP phylogeny using reference mapping and variant calling.
  • orthomcl: Builds a core orthologous gene SNP phylogeny by multiply aligning orthologs identified using OrthoMCL and extracting phylogenetically informative sites.
  • blast: Builds a core orthologous gene SNP phylogeny by multiply aligning orthologs identified using a single-directional BLAST and extracting phylogenetically informative sites.

In addition to the above modes, data analysis can be re-submitted from any stage using the --resubmit parameter.

Mode: Prepare-FASTQ

This mode can be used to do some basic quality checks on FASTQ files before running them through the reference mapping pipeline. The stages that are run are as follows:

  1. Removes poor quality reads and trims ends of reads.
  2. Randomly samples reads from FASTQ files to reduce the dataset size to a user-configurable maximum coverage, estimated from the reference genome length (a rough manual estimate is sketched after this list).
  3. Runs FastQC on the quality-filtered and reduced reads dataset.
  4. Generates a report for each of the isolates.
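
As a rough illustration of stage 2, the coverage of an isolate is approximately the total number of sequenced bases divided by the reference genome length. For a single-end FASTQ file this can be estimated by hand with something like the following (isolate1.fastq follows the example file names below):

awk 'NR % 4 == 2 { bases += length($0) } END { print bases }' fastq_reads/isolate1.fastq

Dividing the printed number of bases by the reference length gives the estimated coverage; reads are then randomly sampled so that this estimate does not exceed the max_coverage setting (200 in the example configuration below).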

Command

To run this mode, the following command can be used.

snp_phylogenomics_control --mode prepare-fastq --input-dir fastq/ --output cleaned_out --reference reference.fasta --config options.conf

The main output you will want to use includes:

  • cleaned_out/fastqc/fastqc_stats.csv: A tab-delimited report of the FastQC results for each isolate.
  • cleaned_out/downsampled_fastq: A directory containing all the cleaned and reduced FASTQ files. This directory can be used as input to the mapping mode of the pipeline.
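
For example, after cleaning the reads you could build the core SNP phylogeny from the cleaned data with a command like the following (paths follow the examples above):

snp_phylogenomics_control --mode mapping --input-dir cleaned_out/downsampled_fastq/ --output pipeline_out --reference reference.fasta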

Input

The following is a list of input files and formats for the prepare-fastq mode of the pipeline.

  • --input-dir fastq_reads/: A directory containing FASTQ-formatted DNA sequence reads. Only one file per isolate (paired-end mapping not supported).

Example:

fastq_reads/
	isolate1.fastq
	isolate2.fastq
	isolate3.fastq
  • --reference reference.fasta: A reference FASTA file. This is used to estimate the coverage of each isolate for reducing the amount of data in each FASTQ file.
  • --config options.conf: A configuration file in YAML format. This is used to define specific parameters for each stage of the prepare-fastq mode.

Example:

%YAML 1.1
---
max_coverage: 200
trim_clean_params: '--numcpus 4 --min_quality 20 --bases_to_trim 10 --min_avg_quality 25 --min_length 36 -p 1'
drmaa_params:
    general: "-V"
    trimClean: "-pe smp 4"

Output

  • --output cleaned_out: Defines the output directory to store the files for each stage.

The output directory structure looks as follows:

cleaned_out/
	downsampled_fastq/
	fastqc/
	initial_fastq_dir/
	log/
	reference/
	run.properties
	stages/

A description of each of the directories and files is given below:

  • downsampled_fastq/: A directory containing the quality-filtered and data reduced fastq files.
  • fastqc/: A directory containing any of the FastQC results.
  • initial_fastq_dir/: A directory containing links to the initial input fastq files.
  • log/: A directory containing log files for each of the stages.
  • reference/: A directory containing the input reference file.
  • run.properties: A file containing all the parameters used to quality-filter the fastq files.
  • stages/: A directory containing files indicating which stages of the prepare-fastq mode have been completed.

Mode: Mapping

Input

The following is a list of input files and formats for the pipeline.

  • --reference reference.fasta: A FASTA file containing the genome to be used for reference mapping.

Example:

>contig1
ATCGATCGATCGATCG
ATCGATCGATCGATCG
  • --input-dir fastq_reads/: A directory containing FASTQ-formatted DNA sequence reads. Only one file per isolate (paired-end mapping not supported). The file name is used as the name in the final phylogenetic tree.

Example:

fastq_reads/
	isolate1.fastq
	isolate2.fastq
	isolate3.fastq
  • --contig-dir contig_fasta/: A directory containing assembled contigs to include for analysis. Variants will be called using MUMmer. Only one file per isolate.

Example:

contig_fasta/
	isolate1.fasta
	isolate2.fasta
	isolate3.fasta
  • --invalid-pos bad_positions.tsv: A tab-separated values file containing a list of positions to exclude from the analysis. Any SNPs in these positions will be marked as 'invalid' in the variant table and will be excluded from the matrix of SNP distances and the alignment used to generate the phylogeny. The contig IDs used in this file must correspond to the IDs used in the reference FASTA file.

Example:

#ContigID	Start	End
contig1	1	500
contig2	50	100
  • --config options.conf: A configuration file which can be used to override the default parameters for the different stages. This file must be in YAML format.

Example:

%YAML 1.1
---
min_coverage: 15
freebayes_params: '--pvar 0 --ploidy 1 --left-align-indels --min-mapping-quality 30 --min-base-quality 30 --min-alternate-fraction 0.75'
smalt_index: '-k 13 -s 6'
smalt_map: '-n 24 -f samsoft -r -1'
vcf2pseudo_numcpus: 4
vcf2core_numcpus: 24
trim_clean_params: '--numcpus 4 --min_quality 20 --bases_to_trim 10 --min_avg_quality 25 --min_length 36 -p 1'
drmaa_params:
    general: "-V"
    vcf2pseudoalign: "-pe smp 4"
    vcf2core: "-pe smp 24"
    trimClean: "-pe smp 4"

Output

The detailed output directory tree looks as follows:

pipeline_out/
	fastq/
	invalid/
	log/
	mapping/
	mpileup/
	phylogeny/
		pseudoalign.phy_phyml_stats.txt
		pseudoalign.phy_phyml_tree.txt
		pseudoalign.phy_phyml_tree.txt.pdf
	pseudoalign/
		matrix.csv
		pseudoalign.fasta
		pseudoalign.phy
		pseudoalign-positions.tsv
	reference/
	run.properties
	sam/
	stages/
	vcf/
	vcf2core/
		contig1.gff
		contig1.png
	vcf-split/

A description of each of these directories and files is as follows:

  • fastq/: A directory containing links to each of the input fastq files.
  • invalid/: A directory containing the invalid positions file used if it was passed to the pipeline.
  • log/: Log files for every stage of the pipeline.
  • mapping/: Files for each isolate containing the SMALT reference-mapping information.
  • mpileup/: Files generated from 'samtools mpileup' for each isolate.
  • phylogeny/: Files generated from PhyML when building the phylogeny.
    • pseudoalign.phy_phyml_stats.txt: A statistics file generated by PhyML.
    • pseudoalign.phy_phyml_tree.txt: The phylogenetic tree generated by PhyML in Newick format.
    • pseudoalign.phy_phyml_tree.txt.pdf: A PDF of the phylogenetic tree, rendered using Figtree.
  • pseudoalign/: Contains the "pseudoalignment" of only phylogenetically informative sites used to generate the phylogeny, as well as other information about each of the sites.
    • matrix.csv: A matrix of SNP distances between each isolate.
    • pseudoalign.fasta: An alignment of phylogenetically informative sites in FASTA format.
    • pseudoalign.phy: An alignment of phylogenetically informative sites, in phylip format.
    • pseudoalign-positions.tsv: A tab-separated values file containing a list of all positions identified by the pipeline.
  • reference/: A directory containing links to the reference FASTA file used by some of the tools.
  • run.properties: A properties file containing all the parameters used for the pipeline, in YAML format.
  • sam/: SAM-formatted files generated by SMALT.
  • stages/: A directory of files indicating which stages have been completed by the pipeline.
  • vcf/: The VCF files produced by FreeBayes.
  • vcf2core/: Files used to generate an image of the core genome.
    • contig1.gff: A GFF formatted file listing core genome locations on each contig.
    • contig1.png: An image showing the core genome locations for each contig rendered using GView.
  • vcf-split/: VCF files split up so that one single SNP is represented by one line.

The matrix.csv file lists high-quality SNP distances between each combination of isolates. An example of this file is given below.

Example: matrix.csv

strain	isolate1	isolate2
isolate1	0	5
isolate2	5	0

The pseudoalign-positions.tsv file lists every SNP found by the pipeline along with the corresponding contig/position combination. The Status column gives the status of each position; only positions with a valid status are used to generate the alignment files. The filtered-coverage status marks a position (indicated by a - character) with insufficient coverage to be included as a core SNP. The filtered-mpileup status marks a position (indicated by an N) with conflicting variant calls between FreeBayes and SAMtools mpileup. The filtered-invalid status indicates that the position was excluded because it falls within one of the invalid position regions passed to the pipeline. A quick way to count the valid core SNP positions is shown after the example below.

Example: pseudoalign-positions.tsv

#Chromosome	Position	Status	Reference	isolate1	isolate2
contig1	20	valid	A	A	T
contig2	5	filtered-coverage	A	-	A
contig2	35	filtered-mpileup	A	N	A
contig2	40	filtered-invalid	A	C	A
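
As a quick check, the number of valid core SNP positions can be counted directly from this file, for example:

awk -F'\t' '$3 == "valid"' pipeline_out/pseudoalign/pseudoalign-positions.tsv | wc -l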

The vcf2core/*.gff files list the coordinates that were determined to be part of the core genome (based on the minimum coverage).

Example: contig1.gff

task_2	.	region	10	100	100	+	0	
task_2	.	region	150	200	100	+	0

Resubmitting

To resubmit a particular run of the pipeline starting from a specific stage, the following command can be used.

snp_phylogenomics_control --resubmit output_dir/ --start-stage starting-stage

The output_dir/ is the directory containing the results of a previous run of the pipeline, and --start-stage defines the stage from which the new analysis will begin. For more details on the particular stages that can be used, please run the command:

snp_phylogenomics_control --help