A Julia package to represent biological sequences as Markov chains
BioMarkovChains is a Julia Language package. To install it, open Julia's interactive session (the REPL), press the ] key to enter package mode, and type the following command:
pkg> add BioMarkovChains
An important step in developing gene-finding algorithms is having a Markov chain representation of the DNA. To that end, we implemented the BioMarkovChain method, which captures the initial and transition probabilities of a DNA sequence (LongSequence) and creates a dedicated object storing the relevant information of a DNA Markov chain. Here is an example:
Let's find one ORF in a random LongDNA:
using BioSequences, GeneFinder, BioMarkovChains
sequence = randdnaseq(10^3)
orfdna = getorfdna(sequence, min_len=75)[1]
If we translate it, we get a 69aa sequence:
translate(orfdna)
69aa Amino Acid Sequence:
MSCGETTVSPILSRRTAFIRTLLGYRFRSNLPTKAERSRFGFSLPQFISTPNDRQNGNGGCGCGLENR*
Now suppose we want to see how transitions occur in this ORF sequence. We can use the BioMarkovChain method and set it to a 2nd-order Markov chain:
BioMarkovChain(orfdna, 2)
BioMarkovChain with DNAAlphabet{4}() Alphabet:
- Transition Probability Matrix -> Matrix{Float64}(4 × 4):
0.2123 0.2731 0.278 0.2366
0.2017 0.3072 0.2687 0.2224
0.1978 0.2651 0.2893 0.2478
0.2013 0.3436 0.2431 0.212
- Initial Probabilities -> Vector{Float64}(4 × 1):
0.2027
0.2973
0.2703
0.2297
- Markov Chain Order -> Int64:
2
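To make the two quantities in this output concrete, here is a minimal sketch of how first-order initial and transition probabilities could be estimated from a sequence by counting. It is not the package's implementation: the function names `transition_matrix` and `initial_probabilities` are hypothetical, plain strings stand in for BioSequences types, the A, C, G, T ordering is an assumption, and using nucleotide frequencies as the initial distribution is one possible choice among several. Only the first-order case is covered.

```julia
# Hypothetical helpers for illustration; not part of BioMarkovChains.

# Count transitions between consecutive nucleotides, then normalize each row
# so that it sums to 1 (row = current nucleotide, column = next nucleotide).
function transition_matrix(seq::AbstractString; alphabet = "ACGT")
    index = Dict(nt => i for (i, nt) in enumerate(alphabet))
    n = length(alphabet)
    counts = zeros(Float64, n, n)
    chars = collect(seq)
    for k in 1:length(chars)-1
        counts[index[chars[k]], index[chars[k+1]]] += 1
    end
    rowsums = sum(counts, dims = 2)
    # Rows with no observed transitions stay all zeros instead of dividing by zero.
    return counts ./ max.(rowsums, 1)
end

# Relative frequency of each nucleotide, used here as the initial distribution
# (an assumption; other estimators are possible).
function initial_probabilities(seq::AbstractString; alphabet = "ACGT")
    return [count(==(nt), seq) / length(seq) for nt in alphabet]
end

T = transition_matrix("ACGTACGGTA")
p0 = initial_probabilities("ACGTACGGTA")
```

Each row of `T` is a probability distribution over the next nucleotide, so every row with at least one observed transition sums to 1, as do the entries of `p0`.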
This is useful for later building HMMs and for calculating the probability of a sequence under a given model. For instance, the E. coli CDS and No-CDS transition models (Markov chains) are already implemented:
ECOLICDS
BioMarkovChain with DNAAlphabet{4}() Alphabet:
- Transition Probability Matrix -> Matrix{Float64}(4 × 4):
0.31 0.224 0.199 0.268
0.251 0.215 0.313 0.221
0.236 0.308 0.249 0.207
0.178 0.217 0.338 0.267
- Initial Probabilities -> Vector{Float64}(4 × 1):
0.245
0.243
0.273
0.239
- Markov Chain Order -> Int64:
1
What, then, is the probability of the previous random ORF sequence given this model?
dnaseqprobability(orfdna, ECOLICDS)
7.466531836596359e-45
This is, of course, not very informative on its own, but we can later use different criteria to classify new ORFs. For a more detailed explanation, see the docs.
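As a rough illustration of what a sequence probability under a first-order model means, the sketch below multiplies the initial probability of the first nucleotide by the transition probability of each consecutive pair. The matrix and vector values are copied from the ECOLICDS output above; the A, C, G, T row and column ordering, the function name `log_sequence_probability`, and the use of plain strings are assumptions, and this is not claimed to be how dnaseqprobability is implemented. The sum is accumulated in log space because a product of many small probabilities quickly becomes extremely small.

```julia
# Values copied from the ECOLICDS output; the A, C, G, T ordering is assumed.
tpm = [0.310 0.224 0.199 0.268;
       0.251 0.215 0.313 0.221;
       0.236 0.308 0.249 0.207;
       0.178 0.217 0.338 0.267]
initials = [0.245, 0.243, 0.273, 0.239]

# Hypothetical helper: log-probability of `seq` under a first-order Markov chain,
# i.e. log(initial[x1]) + sum over k of log(tpm[x(k-1), x(k)]).
function log_sequence_probability(seq::AbstractString, tpm, initials; alphabet = "ACGT")
    index = Dict(nt => i for (i, nt) in enumerate(alphabet))
    idx = [index[c] for c in seq]
    logp = log(initials[idx[1]])
    for k in 2:length(idx)
        logp += log(tpm[idx[k-1], idx[k]])
    end
    return logp
end

logp = log_sequence_probability("ATGGCA", tpm, initials)
p = exp(logp)  # back to a plain probability; fine for short sequences
```

Working with `logp` directly also makes it easier to compare the same sequence under two models, since the difference of two log-probabilities is a log-ratio.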