- Overview
- [Usage] (#usage)
- [Change log & Download] (#change)
- [Release version 1.2 on Jan 15, 2014] (#v1.2)
- [Release version 1.1 on Feb 15, 2013] (#v1.1)
- [Release version 1.0 on Dec 14, 2012] (#v1.0)
- Statistical framework for heteroplasmy detection
- Fisher's exact test
- [Empirical Bayesian for Binomial proportion with conjugate Beta prior] (#bayes)
- Prerequisites
- [Step1: Intall perl packages required by circos] (#step1)
- [Step2: Intall perl packages required by MitoSeek] (#step2)
- [Step3: Build samtools] (#step3)
- [Others] (#Others)
- [Configure your perl environment] (#perlsetup)
- [Mitochondrial genome information (hg19/rCRS)] (#mitogenome)
- [Mitochondrial genome reference] (#mitoreference)
- [Mitochondrial genome annotation] (#mitoanno)
- [Known pathogenic mutations] (#pathogenic)
- [Whole genome exon coordinates] (#exon)
MitoSeek is an open-source software tool to reliably and easily extract mitochondrial genome information from exome sequencing data. MitoSeek evaluates mitochondrial genome alignment quality, estimates relative mitochondrial copy numbers, and detects heteroplasmy, somatic mutation, and structural variance of the mitochondrial genome.
## Usage Example code for running MitoSeek with given toy dataset could be ```bash perl mitoSeek.pl -i Examples/brca_tumor.bam -j Examples/brca_normal.bam -t 4 -sb 0 -hp 1 -d 5 -str 4 -sp 1 -sa 0 ```This example report could be accessed at Here
Details for each parameters are here
Usage: perl mitoSeek.pl -i inbam
-i [bam] Input bam file
-j [bam] Input bam file2, if this file is provided, it will conduct somatic mutation mining, and it will be
taken as normal tissue.
-t [input type] Type of the bam files, the possible choices are 1=exome, 2=whole genome, 3= RNAseq, 4 = mitochondria only,default = 1
-d [int] The minimum recommended depth requirement for detecting heteroplasmy. Lower depth will severely damage the
confidence of heteroplasmy calling, default=50
-ch Produce circos plot input files and circos plot figure for heteroplasmic mutation,
(-noch to turn off and -ch to turn on), default = on
-hp [int] Heteroplasmy threshold using [int] percent alternative allele observed, default = 5
-ha [int] Heteroplasmy threshold using [int] allele observed, default = 0
-alpha [float] Shape1 parameter of Beta prior distribution, default is 3.87 which is estimated from 600 BRCA samples
-beta [float] Shape2 parameter of Beta prior distribution, default is 174.28, which is estimated from 600 BRCA samples
-A If - A is used, the total read count is the total allele count of all allele observed.
Otherwise, the total read count is the sum of major and minor allele counts. Default = off
-mmq [int] Minimum map quality, default =20
-mbq [int] Minimum base quality, default =20
-sb [int] Remove all sites with strand bias score in the top [int] %, default = 10
-cn Estimate relative copy number of input bam(s), does not work with mitochondria targeted sequencing bam files,
(-noch to turn off and -ch to turn on) default = off.
-sp [int] Somatic mutation detection threshold,int = percent of alternative allele observed in tumor, default int=5
-sa [int] Somatic mutation detection threshold,int = number of alternative allele observed in tumor, default int=3
-cs Produce circos plot input files and circos plot figure for somatic mutation,
(-nocs to turn off and -cs to turn on), default = off
-r [ref] The reference used in the bam file, the possible choices are hg19 and rCRS, default=hg19
-R [ref] The reference used in the output files, the possible choices are hg19 and rCRS, default=hg19
-str [int] Structure variants cutoff for those discordant mapping mates,
int = number of spanning reads supporting this structure variants, default = 2
-strf [int] Structure variants cutoff for those large deletions,
int = template size in bp, default=500
-QC Produce QC result, (--noQC to turn off and -QC to turn on), default=on
-samtools[samtools] Tell where is the samtools program, default is your mitoseek directory/Resources/samtools/samtools
-bwa [bwa] Tell where is the bwa program, default value is 'bwa' which is in your $PATH
-bwaindex [bwaindex] Tell where is the bwa index of non-mitochondrial human genome, no default value
-advance Will get mitochondrial reads in an advanced way, generally followed by 1) Initially extract mitochrodrial reads from
a bam file, then 2) remove those could be remapped to non-mitochondrial human genome by bwa. Advanced extraction needs
-bwaindex option. Default extraction without removing step.
- Download Zip
- [Browse Code ] (https://github.com/riverlee/MitoSeek/tree/v1.2)
Changes are here:
- Reading options from a configure instead of reading from commmand line. The new program is called mitoSeek_new.pl, an example of configure file is called 'para.txt'. Meanwhile, the original program which reads options from command line is also kept, called mitoSeek.pl.
- Download Zip
- [Browse Code ] (https://github.com/riverlee/MitoSeek/tree/v1.1)
Changes are here:
- Add option -advance to improve mitochondrial reads extraction, which is done by 1) Initially extract mitochrodrial reads from a bam file, then 2) remove those could be remapped to non-mitochondrial human genome by bwa. The version v1.0 only implementes the step 1
- Add Statistical framework for heteroplasmy detection, which are fisher test and [empirical bayesian] (#bayes).
- Add option -samtools for people who would like to use their specified samtools
- Add options -bwa and --bwaindex which are needed if use -advance option
- In the heteroplasmy output, it will contain columns of fisher.pvalue, fisher.adjust.pvalue,fisher.phred.score,empirical.probability, and empirical.phred.score
- Thanks to Peter's code for beta calculation in perl at http://home.online.no/~pjacklam/perl/modules/
- Download Zip
- [Browse Code ] (https://github.com/riverlee/MitoSeek/tree/964cd2e61735a60283f0280020cadbb53be3617e)
The phred score of heteroplasmy for Fisher's exact test calucated as
phred.score.fisher = - log10 * log10(fisher.pvalue)
hp
_ ---
/ 100
likelihood.of.heteroplasmy = 1 - | f(x) dx
_/ 0
where
1 a + n12 - 1 b + n11 - 1
f(x) = -------------------------- * x * (1 - x)
beta(n12 + a ,n11 + b )
a and b are share parameters of Beta prior distribution, which are estimated from 600 BRCA samples.
The phred score of heteroplasmy for Empirical Bayesian approach is calculated as
phred.score.empirical = - log10 * log10(1 - likelihood.of.heteroplasmy)
MitoSeek runs on 32-bit or 64-bit GNU/Linux and request perl packages like GD::Graph::boxplot, etc. Here are the steps you need to do to install packages needed for the MitoSeek program.
### Step1: Intall perl packages required by [circos](http://circos.ca/) MitoSeek utilizes [circos](http://circos.ca/) to plot heteroplasmy and somatic mutation, thus, perl packages required by [circos](http://circos.ca/) needed to been installed first.#1) go to the circos package folder which is included as part of the MitoSeek
cd Resources/circos-0.56/bin
#2) Check whether all the packages needed by circos plot are intalled on your PC,
#if this does not work, try 'chmod +x test.modules'
./test.modules
If all the packages are installed on your PC, it will look like this.
ok Carp
ok Config::General
ok Cwd
ok Data::Dumper
ok Digest::MD5
ok File::Basename
ok File::Spec::Functions
ok File::Temp
ok FindBin
ok GD
ok GD::Image
ok GD::Polyline
ok Getopt::Long
ok IO::File
ok List::MoreUtils
ok List::Util
ok Math::Bezier
ok Math::BigFloat
ok Math::Round
ok Math::VecStat
ok Memoize
ok Params::Validate
ok Pod::Usage
ok POSIX
ok Readonly
ok Regexp::Common
ok Set::IntSpan
ok Storable
ok Sys::Hostname
ok Text::Format
ok Time::HiRes
Otherwise, it may look like this.
ok Carp
fail Config::General is not usable (it or a sub-module is missing)
ok Cwd
ok Data::Dumper
ok Digest::MD5
ok File::Basename
ok File::Spec::Functions
ok File::Temp
ok FindBin
fail GD is not usable (it or a sub-module is missing)
fail GD::Image is not usable (it or a sub-module is missing)
fail GD::Polyline is not usable (it or a sub-module is missing)
ok Getopt::Long
ok IO::File
fail List::MoreUtils is not usable (it or a sub-module is missing)
ok List::Util
fail Math::Bezier is not usable (it or a sub-module is missing)
ok Math::BigFloat
fail Math::Round is not usable (it or a sub-module is missing)
fail Math::VecStat is not usable (it or a sub-module is missing)
ok Memoize
ok Params::Validate
ok Pod::Usage
ok POSIX
fail Readonly is not usable (it or a sub-module is missing)
fail Regexp::Common is not usable (it or a sub-module is missing)
fail Set::IntSpan is not usable (it or a sub-module is missing)
ok Storable
ok Sys::Hostname
fail Text::Format is not usable (it or a sub-module is missing)
fail Time::HiRes is not usable (it or a sub-module is missing)
If this happens, try to install the missing packages by cpan (If you you don't have root previlege, please look at [here] (#perlsetup) to set up your own perl environment).
#run in root if the packages will be installed in the system path
#like /usr/local/lib64/perl5
./test.modules |grep fail|cut -f2 -d" "|xargs -I {} cpan {}
Or you can change the mitoSeek.pl script and tell the program where is the samtools on your PC. Here is the line you need to modify.
my $samtools = "$FindBin::Bin/Resources/samtools/samtools"; #Where is the samtools file
#If your own perl libary top folder is ~/perllib
#Then add the following lines in your ~/.bashrc
#because the subfolders could be different due to different versions of perl
#and linux, so to save time by adding subfolders in the PERL5LIB, we use 'for'
#to add all the subfolders.
PERL5LIB=~/perllib
if [ -d ~/perllib ]; then
for i in `find ~/perllib -type d`
do
PERL5LIB=$i:$PERL5LIB
done
fi
export $PERL5LIB
#type cpan into cpan environment
cpan
Then
#In cpan environment
o conf makepl_arg PREFIX="~/perllib"
o conf commit
#exit the cpan environment
exit
Files:
- hg19.fasta
- rCRS.fasta
Links
#1.1) rCRS assembly (rCRS.fasta)
http://www.ncbi.nlm.nih.gov/nuccore/251831106
#1.2) hg19 assembly (hg19.fasta)
http://www.ncbi.nlm.nih.gov/nuccore/NC_001807.4?report=genbank
result folder: yourmitoseek/Resources/
Files:
- hg19_annotation.gb
- rCRS_annotation.gb
Links
#1.1) rCRS assembly (download as genbank format)
http://www.ncbi.nlm.nih.gov/nuccore/251831106
#1.2) hg19 assembly (downlaod as genbank format)
http://www.ncbi.nlm.nih.gov/nuccore/NC_001807.4?report=genbank
Known pathogenic mutations on mitochondria comes from OMIM and it will be used by class/package Mitoanno (Mitoanno.pm)
File Name: yourmitoseek/Resources/OMIMpathogenic.txt
Whole genome exon coordinates are used to estimate the relative mitochondrial copy numbers when the input is exon/RNA-Seq sequencing.
Here is the scripts we download and prepare the exon bed file
#Wkdir: yourmitseek/Resources/genome
#refGene annotation file (refGene.txt)
#description of refGene is here at
#https://cgwb.nci.nih.gov/cgi-bin/hgTables?hgsid=79902&hgta_doSchemaDb=hg18&hgta_doSchemaTable=refGene
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz
gunzip refGene.txt.gz
#get exon region on chromosome 1-22, and X,Y (with prefix chr or without)
perl -lane 'if($F[2]=~/chr\d+$|chrX|chrY/) {@start=split ",",$F[9]; @end=split ",",$F[10]; for($i=0;$i<$F[8];$i++) { print join "\t",$F[2],$start[$i],$end[$i],$F[12]."|".$F[1]}}' refGene.txt >refGeneExon_withChr.bed
perl -lane 'if($F[2]=~/chr\d+$|chrX|chrY/) {$F[2]=~s/chr//g;@start=split ",",$F[9]; @end=split ",",$F[10]; for($i=0;$i<$F[8];$i++) { print join "\t",$F[2],$start[$i],$end[$i],$F[12]."|".$F[1]}}' refGene.txt >refGeneExon_withoutChr.bed
#How to get the total bases from a bed file (getTotalBasesFromBed.R)
./getTotalBasesFromBed.R genome_withChr.bed