######################################################### Multi-nucleotide Variation Annotation Corrector (MAC)

This document describes how to run MAC, which screens a list of SNVs called by any existing SNV-based variant caller to identify and fix incorrect amino acid predictions caused by multi-nucleotide variations (MNVs).


MAC is implemented in Perl and tested with Perl v5.16.1, Bio::DB::Sam 1.36, Linux version 2.6.18. MAC can be run under any *nix OS. Before running MAC, please make sure the following packages are properly installed:

1. Perl v5 (tested with Perl v5.16.1). Please ensure your running environment 
	points to the right Perl.
2. Bio::DB::Sam (http://search.cpan.org/~lds/Bio-SamTools/), which also 
	requires SamTools library (http://sourceforge.net/projects/samtools/files/).
3. IPC::System::Simple 

To run with existing annotators, additional packages/files will be needed 
depending on which annotator(s) will be used:

4a. to run with Annovar

4b. to run with SnpEff
	Ensembl gene annotation GTF file:   
	* Please note that Java is required when running SnpEff.

4c. to run with VEP
	Ensembl gene annotation GTF file:

Users can install any one or all of the 3 programs above.


MAC requires 3 input files:

1. A list of SNVs. Each line contains one SNV in the format 
	"chr.posi.ref.mut", e.g. "17.7577084.T.A". Please make sure that the 
	chromosome naming matches the BAM and reference files (with or 
	without the "chr" prefix). Currently only SNVs are supported; any 
	indels or MNVs must be removed from the input file.
2. The BAM file from which these SNVs were detected. The BAM file must be 
	sorted and indexed.
3. The reference genome sequence file in FASTA format, e.g. hg19.fasta. 
	It must be the same reference that was used for the BAM file above.
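Because MAC accepts only SNVs in the "chr.posi.ref.mut" format, it can help to check the input list before a run. The following is a minimal Python sketch of such a check, not part of MAC itself; the function names `parse_snv` and `load_snvs` are hypothetical, and splitting from the right (so that contig names containing dots are tolerated) is an assumption about how such names should be handled.

```python
VALID_BASES = set("ACGT")

def parse_snv(line):
    """Split one 'chr.posi.ref.mut' record into (chrom, pos, ref, mut)."""
    # Split from the right so a contig name that itself contains dots survives.
    parts = line.strip().rsplit(".", 3)
    if len(parts) != 4:
        raise ValueError(f"expected chr.posi.ref.mut, got: {line!r}")
    chrom, pos, ref, mut = parts
    if not chrom or not pos.isdigit():
        raise ValueError(f"bad chromosome or position in: {line!r}")
    # Indels and MNVs have alleles longer than one base and are rejected here.
    if ref not in VALID_BASES or mut not in VALID_BASES or ref == mut:
        raise ValueError(f"not a single-nucleotide substitution: {line!r}")
    return chrom, int(pos), ref, mut

def load_snvs(lines):
    """Parse every non-blank line; raises ValueError on the first bad record."""
    return [parse_snv(line) for line in lines if line.strip()]

snvs = load_snvs(["17.7577084.T.A", "17.7577085.C.G", ""])
```

The chromosome names returned here can then be compared by eye with those in the BAM header and the FASTA file to catch a "chr" prefix mismatch.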


MAC can be run either with or without a selected annotation program. For the user's convenience, MAC has built-in support for three popular annotators: ANNOVAR, SnpEff and VEP (Cingolani, et al., 2012; McLaren, et al., 2010; Wang, et al., 2010). To use a different program, simply do not specify any annotator. MAC will then output each identified Block of Mutations (BM) in the format of haplotypes with read counts. A BM is defined as a group of mutations in which every mutation has at least one read or mate pair (regardless of mutation status) that is shared with at least one other mutation in that group. Users can then annotate any contained MNVs with the annotator of their choice.

Before running MAC, please make sure all required Perl modules are contained in @INC. One simple way is to put all .pm and .pl scripts in the same directory and run from there. The additional program packages for ANNOVAR/SnpEff/VEP can be installed in a separate location, with the full paths of the required programs/files provided when running MAC.
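One way to read the Block of Mutations definition is as the connected components of a "shares at least one read pair" relation between SNVs. The Python sketch below illustrates that reading only; it is not MAC's implementation. The function name `group_into_blocks` and the input layout (a dict from SNV identifier to the set of read-pair names covering it) are assumptions, and excluding SNVs that share no read with any other SNV is also an assumption.

```python
from collections import defaultdict

def group_into_blocks(reads_by_snv):
    """reads_by_snv: dict mapping SNV id -> set of read-pair names covering it."""
    parent = {snv: snv for snv in reads_by_snv}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_snv_for_read = {}
    for snv, reads in reads_by_snv.items():
        for read in reads:
            if read in first_snv_for_read:
                # Two SNVs covered by the same read pair belong to one block.
                parent[find(snv)] = find(first_snv_for_read[read])
            else:
                first_snv_for_read[read] = snv

    blocks = defaultdict(list)
    for snv in reads_by_snv:
        blocks[find(snv)].append(snv)
    # Keep only groups in which each SNV shares a read with another SNV.
    return [sorted(members) for members in blocks.values() if len(members) > 1]

blocks = group_into_blocks({
    "17.7577084.T.A": {"read1", "read2"},
    "17.7577085.C.G": {"read2", "read3"},
    "17.7579000.G.A": {"read9"},
})
```

In this toy input, the first two SNVs share "read2" and form one block, while the third SNV has no shared read and is left out.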

Output file name prefix: MAC uses the input SNV file name as the prefix by default. Users can change it with the "-o" option.

MAC's 4 features:

1. Run without annotation (the output contains no codon/amino acid prediction; users can then annotate with the annotator of their choice)
./MAC_v1.0.pl -i input_SNVs.txt -bam sample.bam -r hg19.fasta

#Output format when running MAC without annotation:
Output file name: ${prefix}_raw_haplotype_counts.txt
	BM_mutations	Haplotype_index	Joint_status	N_reads	Annotation
	17.7577084.T.A,17.7577085.C.G	1/3	11	17	Mutant
	17.7577084.T.A,17.7577085.C.G	2/3	00	18	Wt
	17.7577084.T.A,17.7577085.C.G	3/3	0-	1	Incomplete
Each row contains one haplotype of one BM (Block of Mutations).
Description of columns:
	BM_mutations: all SNVs contained in that BM. In the above example, 
		this BM contains 2 SNVs: 17.7577084.T.A and 17.7577085.C.G. Please 
		note these SNVs may or may not share any protein codon, such 
		information must be obtained by using one of the annotators.
	Haplotype_index: in example "1/3", contains the index number of 
		current haplotype (1) and the number of total haplotypes (3)
	Joint_status: the joint status of the individual SNVs in the current 
		haplotype. The total number of characters in "Joint_status" 
		equals the total number of SNVs in "BM_mutations", with each 
		character corresponding to one SNV, in the same order. There are 
		3 possible statuses: 1, 0, and dash (-), representing mutant, 
		non-mutant, and unknown, respectively. For example, a typical 
		dinucleotide variation will be reported as "11", while two 
		consecutive SNVs on different reads will be reported as "10" or "01".
	N_reads: total number of unique read pairs supporting the current 
		haplotype
	Annotation: when run without an annotator, MAC only reports Mutant, Wt,
		or Incomplete (any haplotype containing unknown "-" status)
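For downstream processing, the columns described above can be read with a few lines of Python. This is a minimal sketch, not part of MAC: it assumes the output file is tab-delimited with the header row shown in the example and without the leading indentation used in this document, and the function names `read_haplotypes` and `decode_status` are hypothetical.

```python
import csv
import io

STATUS_NAMES = {"1": "mutant", "0": "non-mutant", "-": "unknown"}

def read_haplotypes(handle):
    """Read a raw haplotype count table into a list of dicts."""
    rows = []
    for rec in csv.DictReader(handle, delimiter="\t"):
        index, total = rec["Haplotype_index"].split("/")
        rows.append({
            "snvs": rec["BM_mutations"].split(","),
            "index": int(index),
            "total": int(total),
            "status": rec["Joint_status"],
            "n_reads": int(rec["N_reads"]),
            "annotation": rec["Annotation"],
        })
    return rows

def decode_status(row):
    """Pair each SNV of a haplotype with its status, in column order."""
    if len(row["status"]) != len(row["snvs"]):
        raise ValueError("Joint_status length does not match BM_mutations")
    return {snv: STATUS_NAMES[ch] for snv, ch in zip(row["snvs"], row["status"])}

# The example rows from this document, written as an in-memory file.
example = (
    "BM_mutations\tHaplotype_index\tJoint_status\tN_reads\tAnnotation\n"
    "17.7577084.T.A,17.7577085.C.G\t1/3\t11\t17\tMutant\n"
    "17.7577084.T.A,17.7577085.C.G\t2/3\t00\t18\tWt\n"
    "17.7577084.T.A,17.7577085.C.G\t3/3\t0-\t1\tIncomplete\n"
)
rows = read_haplotypes(io.StringIO(example))
total_reads = sum(row["n_reads"] for row in rows)
```

To read a real result, replace the in-memory example with `open("<prefix>_raw_haplotype_counts.txt", newline="")`, using the actual prefix of the run.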

#run with one of the three precompiled annotation programs	
2. Annovar
./MAC_v1.0.pl -i input_SNVs.txt -bam sample.bam -r hg19.fasta -annotator \
	annovar -annovar_annotate_variation FULLPATH/annotate_variation.pl \
	-annovar_coding_change FULLPATH/coding_change.pl -annovar_refgene \
	FULLPATH/hg19_refGene.txt -annovar_refmrna FULLPATH/hg19_refGeneMrna.fa
Please note the following programs/files are from ANNOVAR package:
	*	FULLPATH/annotate_variation.pl (MAC tested with "Revision: 527")
	*	FULLPATH/coding_change.pl (MAC tested with "Revision 
	*	FULLPATH/hg19_refGene.txt
	*	FULLPATH/hg19_refGeneMrna.fa

3. SnpEff
./MAC_v1.0.pl -i input_SNVs.txt -bam sample.bam -r hg19.fasta -annotator \
	snpeff -snpeff_path_to_snpEff FULLPATH/snpEff.jar \
	-ensembl_Homo_sapiens_GRCh37_75_gtf FULLPATH/Homo_sapiens.GRCh37.75.gtf
Please note the following program is from SnpEff package:
	*	FULLPATH/snpEff.jar (MAC tested with "v3.6c")
Please note the following file is from Ensembl website:
	*	FULLPATH/Homo_sapiens.GRCh37.75.gtf

4. VEP
./MAC_v1.0.pl -i input_SNVs.txt -bam sample.bam -r hg19.fasta -annotator \
	vep -vep_path_to_variant_effect_predictor \
	FULLPATH/variant_effect_predictor.pl -ensembl_Homo_sapiens_GRCh37_75_gtf \
	FULLPATH/Homo_sapiens.GRCh37.75.gtf
Please note the following program is from VEP package:
	*	FULLPATH/variant_effect_predictor.pl (MAC tested with "Version 75")
Please note the following file is from Ensembl website:
	*	FULLPATH/Homo_sapiens.GRCh37.75.gtf

#Output format when running MAC with a selected annotator
Output file name: ${prefix}_updated_haplotype_anno_by_${annotator}.txt, 
	where ${annotator} will be one of the 3 selected ones: annovar/snpeff/vep
Example (using SnpEff as annotator):
	BM_mutations	Haplotype_index	Joint_status	N_reads	Annotation
	17.7577084.T.A,17.7577085.C.G	1/2	11	17	TP53:ENST00000269305:Glu285Leu,TP53:ENST00000359597:Glu285Leu,TP53:ENST00000420246:Glu285Leu,TP53:ENST00000445888:Glu285Leu,TP53:ENST00000455263:Glu285Leu,TP53:ENST00000509690:Glu153Leu
	17.7577084.T.A,17.7577085.C.G	2/2	00	18	Wt
The column descriptions are the same as in the section above on 
running MAC without annotation. The only difference is that the column 
"Annotation" now contains the amino acid prediction, in the format 
"gene:mRNA transcript:amino acid change" (e.g. TP53:ENST00000269305:Glu285Leu).
When multiple transcripts are available, all transcripts are reported, 
delimited by commas.
*Please note that when running with an annotator, by default "Incomplete" 
haplotypes ("0-" in the above example) are not reported. The user can use 
the option of "--print_incomplete_haplotype" to include those.
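The "gene:mRNA transcript:amino acid change" entries in the "Annotation" column can be split apart for further analysis. The Python sketch below is illustrative only and not part of MAC; the function names `parse_annotation` and `transcripts_by_change` are hypothetical, and treating any entry without exactly three colon-separated parts (such as "Wt") as a plain label is an assumption.

```python
from collections import defaultdict

def parse_annotation(field):
    """Return (gene, transcript, change) tuples; empty for plain labels such as Wt."""
    calls = []
    for item in field.split(","):
        parts = item.strip().split(":")
        if len(parts) == 3:
            calls.append(tuple(parts))
    return calls

def transcripts_by_change(field):
    """Group transcript IDs by (gene, amino acid change)."""
    grouped = defaultdict(list)
    for gene, transcript, change in parse_annotation(field):
        grouped[(gene, change)].append(transcript)
    return dict(grouped)

# Annotation value from the SnpEff example in this document.
field = (
    "TP53:ENST00000269305:Glu285Leu,TP53:ENST00000359597:Glu285Leu,"
    "TP53:ENST00000420246:Glu285Leu,TP53:ENST00000445888:Glu285Leu,"
    "TP53:ENST00000455263:Glu285Leu,TP53:ENST00000509690:Glu153Leu"
)
grouped = transcripts_by_change(field)
```

Grouping by change makes it easy to see that five transcripts report Glu285Leu while one shorter transcript reports Glu153Leu.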

Additional options:
	-o: output prefix. The default is to use the input SNV file name as 
		the prefix.
	--print_incomplete_haplotype: keep haplotypes containing unknown 
		status in the annotation output.
	--max_allowed_adjacent_distance: max allowed distance between two 
		adjacent SNVs in one Block of Mutations. For MNV annotation, use 2; 
		for SNP phasing, use 1000 (and do not select any annotator). If 
		set to 1, only consecutive SNVs are considered as a BM.
	--printlog: print the counts at each step, for debugging purposes.
	--h: print brief help information.
	--man: print the man page.
	--usage: print usage information.
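The effect of "--max_allowed_adjacent_distance" can be pictured as chaining SNVs whose positions are close enough. The Python sketch below illustrates this under the assumption that "distance" means the difference between the genomic positions of two neighbouring SNVs on the same chromosome, which is consistent with a value of 1 selecting only consecutive SNVs. It ignores the shared-read requirement, is not MAC's implementation, and the function name `chain_by_distance` is hypothetical.

```python
def chain_by_distance(snvs, max_distance):
    """Group (chrom, pos) tuples into chains of neighbouring SNVs.

    Two SNVs that are adjacent in sorted order join the same chain when
    they are on the same chromosome and their positions differ by at
    most max_distance.
    """
    chains = []
    for chrom, pos in sorted(snvs):
        if chains:
            last_chrom, last_pos = chains[-1][-1]
            if last_chrom == chrom and pos - last_pos <= max_distance:
                chains[-1].append((chrom, pos))
                continue
        chains.append([(chrom, pos)])
    return chains

sites = [("17", 7577084), ("17", 7577085), ("17", 7577090)]
mnv_chains = chain_by_distance(sites, 2)         # setting suggested for MNV annotation
phasing_chains = chain_by_distance(sites, 1000)  # setting suggested for SNP phasing
```

With a distance of 2 the first two sites form one chain and the third stands alone; with 1000 all three sites fall into a single chain.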


An example data set is included for testing purposes under "./example/".

Input:
	"input_SNVs.txt": a test input SNV file containing two consecutive SNVs
	"sample.bam": a mini BAM file containing NGS reads overlapping the 
		sites of the input SNVs
	"sample.bam.bai": the index file for the above mini BAM file

Output: When MAC finishes successfully, its output should match the expected output files included under "./example/output/":

#when running MAC without annotation

#when running MAC with ANNOVAR

#when running MAC with SnpEff

#when running MAC with VEP


