### INFO muvac is a fully multithreaded pipeline to call, merge and annotate germline variants - from RAW sequencing data in FastQ format (uncompressed, gzip or bzip2), - trimmed using Trimmomatic, - corrected using Rcorrector, - mapped by Segemehl, TopHat2, BWA, - using callers Freebayes, HaplotypeCaller, MuTect2, Samtools, Platypus - and annotation tools Annovar, SnpEff ### INSTALLATION 1) you need to be root to install dependencies, unless already present (to check this go on with step 2) sudo apt-get update sudo apt-get install cmake git g++ gcc make openjdk-8-jdk python-dev unzip 2) create a directory for muvac tools installation, which can be changed here (default: $PWD - absolut path of current working directory) mkdir -p $PWD/muvactools 3) create a shell variable MUVAC assigned to your previous defined installation directory export MUVAC=$PWD/muvactools 4) store the MUVAC variable permanently to ensure that muvac can always find the required tools echo "export MUVAC=$MUVAC" >> ~/.bashrc 5) download the latest muvac release from https://github.com/koriege/muvac/releases wget https://github.com/koriege/muvac/archive/v0.1.9.zip 6) extract the downloaded muvac release unzip v0.1.9.zip 7) enter the muvac directory cd muvac-0.1.9 8) install all tools for variant calling ./muvac -i all 9) (optional) install variant annotation tools for muvac ./muvac -i anno [-t <number>] [-annovar <url>] - use '-t <number>' option to run annotation-database indexing on <number> cpu cores (default: all) - use '-annovar <url>' as optional option to install Annovar (get download url from here http://annovar.openbioinformatics.org/) 10) clean up cd .. && rm -rf muvac-0.1.9 ### UPDATE 1) follow the INSTALLATION steps 5 to 7 ./muvac -i update ### DOWNLOAD HG19/GRCh37.p13 and dbSNP vcf 1) create a genome directory mkdir -p $PWD/genomes 2) download hg19 genome and corresponding dbSNP vcf into your genomes directory ./dlhg19 $PWD/genomes ### RUN - FIRST RUN INFO: genomes will be indexed for chosen mappers => do not run multiple muvac instances in parallel unless indexes are created - make sure to have your genome and dbSNP vcf file in the same directory - make sure your genome file ends with '.fa' and your dbSNP file ends with '.fa.vcf' - make sure the file name prefixes of your genome and dbSNP file are equal (e.g. hg19.fa and hg19.fa.vcf) - make sure the fasta headers in your genome are named and sorted this way '>chrM' '>chr1' ... '>chr22' '>chrX' '>chrY' => running dlhg19 from the DOWNLOAD section will do this for you 1) calling for help? $MUVAC/muvac -h 2) adapt and use the terminal command below $MUVAC/muvac -1 R1.fastq -2 R2.fastq -g genomes/hg19.fa -o results -l results/run.log 3) if no adapter clipping is necessary just call muvac this way (echo 'n') | $MUVAC/muvac -1 R1.fastq -2 R2.fastq -g genomes/hg19.fa -o results -l results/run.log ### INFO: RUN - a comprehensive log file 'run.log' file will be created in your output directory, unless defined otherwise - using the verbose option '-v' will additionally print full log to terminals stdout - example run for multiple patients data to be processed in parallel $MUVAC/muvac -1 patient1_1.fq,patient2_1.fq -2 patient1_2.fq,patient2_2.fq -g genomes/hg19.fa -o results -l results/run.log ### RUN IN BACKGROUND 1) start a screen with a unique name screen -S myMUVACrun 2) run muvac 3) detach the screen by pressing all 3 keys 'ctrl+a+d' 4) list screens running in background screen -ls 4) resume a screen screen -r myMUVACrun ### RESULTS - results can be found in the vcf/(segemehl|tophat|bwa) directories - merged results are named *.merged.vcf - annotation results can be found in the annotation/(segemehl|tophat|bwa)/(annovar|snpeff) directories ### INFO: MERGED RESULTS - bcftools-norm is used to split multiallelic report into single reports with related genotype (GT), genotype quality (GQ) and coverage (DP) => muvac returns the best variants according to GQ and DP - muvac returns the score columns of the first caller in alphabetic order, which is freebayes unless deactivated - '*.reduced.vcf' files contain only the info column of the first caller in alphabetic order, which is freebayes unless deactivated ### INFO: ANNOTATION RESULTS 1) Annovar (see http://annovar.openbioinformatics.org/en/latest/user-guide) - gene annotations (refSeq) are named *.variant_function and *.exonic_variant_function - region annotations (TFBS, miRNA/snoRNA, miRNA-BS, GWAS) are named *.hg19_(tfbsConsSites|wgRna|targetScanS|gwasCatalog) - filtered annotations (NHLBI-ESP, 1000 Genomes, ExAC, dbSNP, NSFP, ClinVar) are named *.hg19_(esp6500siv2_all|1000g2015aug_all|exac03|avsnp147|dbnsfp33a|clinvar_20170130)_dropped - merged annotations are named *.hg19_multianno.vcf 2) SnpEff (see http://snpeff.sourceforge.net/SnpSift.html) - gene annotations (Ensembl, TFBS, Nextprot) are named *.annotated.vcf - region annotations (Ensembl promoter, Ensembl miRNA TSS) are named *.(promoter|miRNApromoter).vcf - filtered annotations (ExAC, dbSNP, NSFP, ClinVar) are named *.(exac|dbsnp|dbnsfp|clinvar).vcf ### INTERMEDIATE RESULTS (in processing order) - qualities directory: raw read quality statistics by FastQC - converted directory: phred64 to phred33 quality converted raw reads by fastx_toolkit - trimmed directory: quality trimmed reads by Trimmomatic - rawcorrected directory: error corrected reads by Rcorrector - mapped/(segemehl|tophat|bwa) directory: sorted mapping data '*.bam' by Segemehl|TopHat2|BWA - noduplicates/(segemehl|tophat|bwa) directory: duplicate marked mapping data '*.bam' by Picard - errorcorrected/(segemehl|tophat|bwa) directory: relaigned and reaclibrated mapping data '*.reordered.bam' by GATK ### GENERAL CALLER PARAMETER if adjustable, all callers are constrained by min reads >= 10 allele frequency >= 0.1 alternative alleles <= 3 min base quality >= 0 max variances = all downsampling = off ### REFERENCES (c) Konstantin Riege konstantin{a}bioinf{.}uni-leipzig{.}de