This repository contains the CWL files needed to run four SNP callers in a single workflow, named snp_callers_workflow.cwl.
This workflow is available on Dockstore as dockstore_workflow_snps.
Input to the workflow is a JSON format file (see example.json) with paths to the following:
- A genome in fasta format, with a samtools index (.fai) and a GATK .dict file (see below) in the same directory
- A tumor sample in bam format, with a samtools index (.bai) in the same directory
- A normal sample from the same patient in bam format, with a samtools index (.bai) in the same directory
- A bed format file with the centromere locations of the genome. hg38.centromere.bed contains centromeres for hg38/GRCh38
- A Cosmic vcf format file with known cancer mutations, with a tabix index (.tbi), see below
- A dbSNP vcf format file, with a tabix index. See for example ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/common_all_20170710.vcf.gz
- The output file, which will be in .tgz format
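Because every index has to sit in the same directory as its input, a quick pre-flight check can save a failed run. The following is a minimal sketch, not part of the workflow: index_for and check_indexes are hypothetical helper names, and the mapping assumes the index extension is appended to the full input filename.

```shell
# Sketch: map each input to the index file expected beside it.
# index_for and check_indexes are hypothetical helper names.

index_for() {
    case $1 in
        *.bam)    echo "$1.bai" ;;   # samtools index
        *.fa)     echo "$1.fai" ;;   # samtools faidx
        *.vcf.gz) echo "$1.tbi" ;;   # tabix
        *)        return 1 ;;        # no index expected (e.g. the bed file)
    esac
}

# Report every missing input or index; return 1 if anything is missing.
check_indexes() {
    local f idx status=0
    for f in "$@"; do
        [ -e "$f" ] || { echo "missing: $f" >&2; status=1; }
        if idx=$(index_for "$f"); then
            [ -e "$idx" ] || { echo "missing index: $idx" >&2; status=1; }
        fi
    done
    return "$status"
}

# Example usage (hypothetical paths):
#   check_indexes genome.fa tumor.bam normal.bam hg38.centromere.bed cosmic.vcf.gz
```

The .dict file is not covered here because its name follows a different rule, described below.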
To create a .dict file, install picard-tools and run
java -jar picard.jar CreateSequenceDictionary REFERENCE=<my_genome>.fa OUTPUT=<my_genome>.dict
Note that while .fai and .bai extensions are appended to the original filename (normal.bam.bai), the .dict extension replaces the .fa extension.
Warning: make sure you do not have other periods in the genome filename. The workflow currently cannot find the .dict file if you do.
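The naming rule and the warning above can be checked before launching. This is a minimal sketch: dict_for and has_extra_periods are hypothetical helper names, and hg38.no_alt.fa is a made-up filename used only to trigger the warning.

```shell
# Sketch: derive the .dict name from a genome fasta and flag extra periods.
# dict_for and has_extra_periods are hypothetical helper names.

# The .dict extension replaces the final extension, so strip from the last period.
dict_for() {
    echo "${1%.*}.dict"
}

# Succeeds when the filename (directory ignored) contains a period
# other than the one in front of the final extension.
has_extra_periods() {
    local name stem
    name="${1##*/}"      # drop the directory part
    stem="${name%.*}"    # drop the final extension
    case $stem in
        *.*) return 0 ;;
        *)   return 1 ;;
    esac
}

genome=hg38.no_alt.fa    # hypothetical filename
if has_extra_periods "$genome"; then
    echo "consider renaming $genome: the workflow may not find its .dict file" >&2
fi
```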
To create .tbi files, first use bgzip to compress your file (you may have to gunzip first), then run
tabix -p vcf cosmic.vcf.gz
Note that you can download a .tbi file directly from the NCBI ftp site for the dbSNP vcf file.
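The gunzip, bgzip, and tabix steps can be chained as in the sketch below. make_tbi is a hypothetical helper name, and the sketch assumes bgzip and tabix are on your PATH. For a .vcf.gz input it always recompresses, since it does not try to tell plain gzip from bgzip.

```shell
# Sketch: recompress a VCF with bgzip and index it with tabix.
# make_tbi is a hypothetical helper name; bgzip and tabix must be installed.

make_tbi() {
    local vcf="$1"
    case $vcf in
        *.vcf.gz)
            # May be plain gzip: decompress and recompress as bgzip.
            gunzip -c "$vcf" | bgzip -c > "$vcf.tmp" || return 1
            mv "$vcf.tmp" "$vcf" || return 1
            ;;
        *.vcf)
            # bgzip replaces file.vcf with file.vcf.gz.
            bgzip "$vcf" || return 1
            vcf="$vcf.gz"
            ;;
        *)
            echo "not a VCF file: $vcf" >&2
            return 1
            ;;
    esac
    tabix -p vcf "$vcf"    # writes the index to $vcf.tbi
}

# Example usage:
#   make_tbi cosmic.vcf.gz
```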
Output will be tarred, gzipped, and copied to the path you listed in your JSON file. It will unpack into the following files:
muse.filtered.vcf
mutect.vcf
somatic_sniper.vcf
pindel.vcf
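To confirm a finished run produced all four files without unpacking the archive, you can inspect its listing. This is a minimal sketch: check_output is a hypothetical helper name, and because the directory layout inside the archive is not documented here, it compares file names only.

```shell
# Sketch: confirm the output tarball lists the four expected VCF files.
# check_output is a hypothetical helper name.

check_output() {
    local tgz="$1" listing vcf status=0
    listing=$(tar -tzf "$tgz") || return 1
    for vcf in muse.filtered.vcf mutect.vcf somatic_sniper.vcf pindel.vcf; do
        # Strip any leading directories, then look for an exact name match.
        if ! printf '%s\n' "$listing" | sed 's|.*/||' | grep -qxF "$vcf"; then
            echo "missing from $tgz: $vcf" >&2
            status=1
        fi
    done
    return "$status"
}

# Example usage (hypothetical path):
#   check_output results.tgz && tar -xzf results.tgz
```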
Provided you have dockstore and docker installed on your system, run
dockstore workflow launch --entry github.com/BD2KGenomics/dockstore_workflow_snps:master --json my.json
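Since the run is long, it may be worth failing fast when a prerequisite is missing. The sketch below only checks that the commands are on your PATH; require_tools is a hypothetical helper name, and it does not verify that the docker daemon is actually running.

```shell
# Sketch: stop early if a required command is not on PATH.
# require_tools is a hypothetical helper name.

require_tools() {
    local tool status=0
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || { echo "not found: $tool" >&2; status=1; }
    done
    return "$status"
}

# Example usage with the launch command above:
#   require_tools dockstore docker &&
#       dockstore workflow launch --entry github.com/BD2KGenomics/dockstore_workflow_snps:master --json my.json
```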
NOTE: This may take more than a day. Use as many processors as possible to speed up the run, and avoid having unnecessary sequences in your genome fasta (chrUn, chr_random).
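One possible way to drop the unnecessary sequences is to list the contigs you want to keep from the .fai index and extract them with samtools faidx. This is a sketch under assumptions: primary_contigs is a hypothetical helper name, the filter assumes UCSC-style names (chrUn_..., ..._random), and the reduced fasta needs its own new .fai and .dict files. Whether your bam files are compatible with a reduced reference is something to verify for your data.

```shell
# Sketch: list contigs from a .fai index, skipping chrUn and *_random names.
# primary_contigs is a hypothetical helper name.

primary_contigs() {
    # Column 1 of a .fai file is the sequence name.
    cut -f1 "$1" | grep -v -E '^chrUn|_random$'
}

# Example usage (hypothetical filenames; requires samtools):
#   primary_contigs genome.fa.fai | xargs samtools faidx genome.fa > genome_primary.fa
#   samtools faidx genome_primary.fa
```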
The workflow calls docker containers maintained by opengenomics and hosted on Dockerhub.
More information about the individual tools can be found by clicking on these links:
MuSE
MuTect
SomaticSniper
Pindel