SNP calling pipeline using mccortex
Inspired by/follow up of https://github.com/iqbal-lab/outbryk.git
vcftools
vcflib
bwa
python3
mccortex
Clone this repository:
git clone https://github.com/hcdenbakker/mcOutbryk
Add the mcOutbryk folder to your path
Clone mccortex:
git clone --recursive https://github.com/mcveanlab/mccortex.git
Compile the mccortex63 binary:
cd mccortex
make MAXK=63 all
make MAXK=31 all
Add the mccortex63 binary and the mccortex/scripts folder to your path
Execute the pipeline in the folder with illumina fastq.gz files of the isolates you are interested in. Create a 1 column list with the sample names and decide on an appropriate reference sequence.
For help run:
MCOutbryk. py -h
MCOutbryk - de novo SNP and ambiguous site Caller
optional arguments:
-h, --help show this help message and exit
--reference REFERENCE [REFERENCE ...]
Reference genome in fasta format, include path to file
if file is not in same directory
--sample_list SAMPLE_LIST [SAMPLE_LIST ...]
Text file, one isolate/sample per line
--results_dir RESULTS_DIR [RESULTS_DIR ...]
Directory which will contain results after the
analysis
--processes PROCESSES [PROCESSES ...]
number of processes to run in parallel
--delete_highly_divergent {yes,no}
delete highly divergent isolates automatically
(default: yes)
--div_cutoff [DIV_CUTOFF]
number of SNPs used as cutoff to define highly
divergent isolates (default: 5000)
--K [K] K-mer size (default: 33)
--SNP_vcf SNP_VCF [SNP_VCF ...]
VCF with sites to be called; if this is given, the
steps to create a SNP sites vcf de novo from the
isolates in the sample list will be skipped.
To run mcOutbryk:
MCOutbryk.py --reference reference.fasta --sample_list your_sample_list.txt --results_dir directory_for_results --processes parallel_processes
Warning: while this pipeline is generally heavy on the RAM usage, buidling the uncleaned graphs takes about 10 Gb of memory per process, it is recommended to perform 1 process per 10 Gb of memory (I run 10 parallel processes on a 128 Gb machine)
The pipeline does the following:
-
Checks if raw data for samples in the list are available in the execution directory
-
Create raw and cleaned graphs
-
Bubble call each individual isolate vs reference (and delete cleaned graphs and some intermediary files)
-
Create vcfs with trusted SNPs and create a 'pseudo-vcf' of all trusted SNPs
-
Use 'pseudo-vcf' to genotype (= call SNPs) of all isolates using their raw graph
-
Create SNP sites multi-fasta file (stored in resultsdir/MC_consensus.fasta; ambiguous calls: N, gaps: -, impossible to call: ?)); this file can be used as direct input for FastTree