muvac: A C repository from holgerman

### INFO
muvac is a fully multithreaded pipeline to call, merge and annotate germline variants
- from RAW sequencing data in FastQ format (uncompressed, gzip or bzip2),
- trimmed using Trimmomatic,
- corrected using Rcorrector,
- mapped by Segemehl, TopHat2, BWA,
- using callers Freebayes, HaplotypeCaller, MuTect2, Samtools, Platypus
- and annotation tools Annovar, SnpEff

### INSTALLATION
1) install the dependencies as root, unless they are already present (to check this, continue with step 2)
sudo apt-get update
sudo apt-get install cmake git g++ gcc make openjdk-8-jdk python-dev unzip
2) create a directory for the muvac tools installation; the path can be changed here (default: $PWD, the absolute path of the current working directory)
mkdir -p $PWD/muvactools
3) create a shell variable MUVAC assigned to your previously defined installation directory
export MUVAC=$PWD/muvactools
4) store the MUVAC variable permanently to ensure that muvac can always find the required tools
echo "export MUVAC=$MUVAC" >> ~/.bashrc
5) download the latest muvac release from https://github.com/koriege/muvac/releases
wget https://github.com/koriege/muvac/archive/v0.1.9.zip
6) extract the downloaded muvac release
unzip v0.1.9.zip
7) enter the muvac directory
cd muvac-0.1.9
8) install all tools for variant calling
./muvac -i all
9) (optional) install variant annotation tools for muvac 
./muvac -i anno [-t <number>] [-annovar <url>]
- use '-t <number>' option to run annotation-database indexing on <number> cpu cores (default: all)
- use the optional '-annovar <url>' to also install Annovar (get the download url from http://annovar.openbioinformatics.org/)
10) clean up
cd .. && rm -rf muvac-0.1.9
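
Step 1 can be skipped when the dependencies are already present. A minimal sketch of such a check is shown below. It is not part of muvac: the function name `missing_tools` is hypothetical, and `java` is assumed to be the command provided by openjdk-8-jdk. python-dev ships headers only and is not covered.

```shell
# Print every command from the argument list that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Commands assumed to come from the packages in step 1.
missing=$(missing_tools cmake git g++ gcc make java unzip)
if [ -n "$missing" ]; then
    echo "missing tools:" $missing
else
    echo "all build tools found"
fi
```

If any tool is reported as missing, run the apt-get commands from step 1 first.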

### UPDATE
1) follow the INSTALLATION steps 5 to 7
2) run the update
./muvac -i update

### DOWNLOAD HG19/GRCh37.p13 and dbSNP vcf
1) create a genome directory
mkdir -p $PWD/genomes
2) download hg19 genome and corresponding dbSNP vcf into your genomes directory
./dlhg19 $PWD/genomes

### RUN
- FIRST RUN INFO: genomes will be indexed for chosen mappers
=> do not run multiple muvac instances in parallel until the indexes have been created
- make sure to have your genome and dbSNP vcf file in the same directory
- make sure your genome file ends with '.fa' and your dbSNP file ends with '.fa.vcf'
- make sure the file name prefixes of your genome and dbSNP file are equal (e.g. hg19.fa and hg19.fa.vcf)
- make sure the fasta headers in your genome are named and sorted this way '>chrM' '>chr1' ... '>chr22' '>chrX' '>chrY'
=> running dlhg19 from the DOWNLOAD section will do this for you
1) calling for help?
$MUVAC/muvac -h
2) adapt and use the terminal command below
$MUVAC/muvac -1 R1.fastq -2 R2.fastq -g genomes/hg19.fa -o results -l results/run.log
3) if no adapter clipping is necessary just call muvac this way
(echo 'n') | $MUVAC/muvac -1 R1.fastq -2 R2.fastq -g genomes/hg19.fa -o results -l results/run.log
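
The naming rules for the genome and dbSNP files can be checked before the first run. The sketch below is not part of muvac, and the function names `check_genome` and `fasta_headers` are hypothetical. It only tests that the genome ends with '.fa' and that a matching '.fa.vcf' file sits next to it, and it lists the fasta headers so their order can be compared by eye.

```shell
# Return 0 if the genome path ends with '.fa' and '<genome>.vcf' exists beside it.
check_genome() {
    genome=$1
    case "$genome" in
        *.fa) ;;
        *) echo "genome file must end with .fa: $genome" >&2; return 1 ;;
    esac
    [ -f "$genome" ] || { echo "genome not found: $genome" >&2; return 1; }
    [ -f "$genome.vcf" ] || { echo "dbSNP file not found: $genome.vcf" >&2; return 1; }
}

# Print the fasta headers in file order ('>chrM' '>chr1' ... expected).
fasta_headers() {
    grep '^>' "$1"
}
```

For example, `check_genome genomes/hg19.fa && fasta_headers genomes/hg19.fa` prints the header lines only if both files are in place.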

### INFO: RUN
- a comprehensive log file 'run.log' will be created in your output directory, unless defined otherwise
- using the verbose option '-v' will additionally print the full log to the terminal's stdout
- example run processing the data of multiple patients in parallel
$MUVAC/muvac -1 patient1_1.fq,patient2_1.fq -2 patient1_2.fq,patient2_2.fq -g genomes/hg19.fa -o results -l results/run.log
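
With many patients, the comma-separated lists for '-1' and '-2' can be built instead of typed. The sketch below uses a hypothetical helper `join_by_comma` and only prints the resulting command, so it can be reviewed before it is run.

```shell
# Join all arguments with commas (the subshell keeps the IFS change local).
join_by_comma() {
    ( IFS=,; printf '%s\n' "$*" )
}

# File names taken from the example above; replace them with your own.
r1=$(join_by_comma patient1_1.fq patient2_1.fq)
r2=$(join_by_comma patient1_2.fq patient2_2.fq)

# Print the command instead of running it.
printf '%s\n' "\$MUVAC/muvac -1 $r1 -2 $r2 -g genomes/hg19.fa -o results -l results/run.log"
```

The files must be given in the same order in both lists, so that each R1 file is paired with its R2 file.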

### RUN IN BACKGROUND
1) start a screen with a unique name
screen -S myMUVACrun
2) run muvac
3) detach the screen by pressing 'ctrl+a' followed by 'd'
4) list screens running in background
screen -ls
5) resume a screen
screen -r myMUVACrun

### RESULTS
- results can be found in the vcf/(segemehl|tophat|bwa) directories
- merged results are named *.merged.vcf
- annotation results can be found in the annotation/(segemehl|tophat|bwa)/(annovar|snpeff) directories
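
A quick way to get an overview of the results is to count the variant records per file. The sketch below is not part of muvac: the helper `count_variants` is hypothetical, and the demo file holds made-up records so that the example runs on its own.

```shell
# Count variant records (non-header, non-empty lines) in a VCF file.
count_variants() {
    awk '!/^#/ && NF { n++ } END { print n + 0 }' "$1"
}

# Toy VCF with two made-up records, for illustration only (not muvac output).
demo=$(mktemp)
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'chr1\t100\t.\tA\tG\t50\t.\tDP=20\n'
    printf 'chr1\t200\t.\tC\tT\t60\t.\tDP=15\n'
} > "$demo"
echo "records in demo file: $(count_variants "$demo")"
rm -f "$demo"
```

Assuming 'results' was given as the output directory via '-o', the merged files could be counted with a loop over results/vcf/*/*.merged.vcf.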

### INFO: MERGED RESULTS
- bcftools-norm is used to split multiallelic records into single records, each with its related genotype (GT), genotype quality (GQ) and coverage (DP)
=> muvac returns the best variants according to GQ and DP
- muvac returns the score columns of the first caller in alphabetic order, which is freebayes unless deactivated
- '*.reduced.vcf' files contain only the info column of the first caller in alphabetic order, which is freebayes unless deactivated

### INFO: ANNOTATION RESULTS
1) Annovar (see http://annovar.openbioinformatics.org/en/latest/user-guide)
- gene annotations (refSeq) are named 
  *.variant_function and *.exonic_variant_function
- region annotations (TFBS, miRNA/snoRNA, miRNA-BS, GWAS) are named 
  *.hg19_(tfbsConsSites|wgRna|targetScanS|gwasCatalog) 
- filtered annotations (NHLBI-ESP, 1000 Genomes, ExAC, dbSNP, NSFP, ClinVar) are named 
  *.hg19_(esp6500siv2_all|1000g2015aug_all|exac03|avsnp147|dbnsfp33a|clinvar_20170130)_dropped
- merged annotations are named *.hg19_multianno.vcf
2) SnpEff (see http://snpeff.sourceforge.net/SnpSift.html)
- gene annotations (Ensembl, TFBS, Nextprot) are named *.annotated.vcf
- region annotations (Ensembl promoter, Ensembl miRNA TSS) are named *.(promoter|miRNApromoter).vcf
- filtered annotations (ExAC, dbSNP, NSFP, ClinVar) are named *.(exac|dbsnp|dbnsfp|clinvar).vcf

### INTERMEDIATE RESULTS (in processing order)
- qualities directory: raw read quality statistics by FastQC
- converted directory: phred64 to phred33 quality converted raw reads by fastx_toolkit
- trimmed directory: quality trimmed reads by Trimmomatic
- rawcorrected directory: error corrected reads by Rcorrector
- mapped/(segemehl|tophat|bwa) directory: sorted mapping data '*.bam' by Segemehl|TopHat2|BWA
- noduplicates/(segemehl|tophat|bwa) directory: duplicate marked mapping data '*.bam' by Picard 
- errorcorrected/(segemehl|tophat|bwa) directory: realigned and recalibrated mapping data '*.reordered.bam' by GATK

### GENERAL CALLER PARAMETERS
if adjustable, all callers are constrained by
min reads >= 10
allele frequency >= 0.1
alternative alleles <= 3
min base quality >= 0
max variances = all
downsampling = off

### REFERENCES
(c) Konstantin Riege
konstantin{a}bioinf{.}uni-leipzig{.}de