Representing biological sequences as Markov chains
BioMarkovChains
A Julia package to represent biological sequences as Markov chains
Installation
BioMarkovChains is a Julia Language package. To install BioMarkovChains, please open Julia's interactive session (known as REPL) and press ] key in the REPL to use the package mode, then type the following command
pkg> add BioMarkovChains
Creating transition models out of DNA sequences
An important step before developing several gene finding algorithms consist of having a Markov chain representation of the DNA. To do so, we implemented the transition_model
method that will capture the initials and transition probabilities of a DNA sequence (LongSequence
) and will create a dedicated object storing relevant information of a DNA Markov chain. Here an example:
Let find one ORF in a random LongDNA
:
using BioSequences, GeneFinder
sequence = randdnaseq(10^3)
orfdna = getorfdna(sequence, min_len=75)[1]
If we translate it, we get a 69aa sequence:
translate(orfdna)
69aa Amino Acid Sequence:
MSCGETTVSPILSRRTAFIRTLLGYRFRSNLPTKAERSRFGFSLPQFISTPNDRQNGNGGCGCGLENR*
Now supposing I do want to see how transitions are occurring in this ORF sequence, the I can use the transition_model
method and tune it to 2nd-order Markov chain:
BioMarkovModel(orfdna, 2)
BioMarkovModel:
- Transition Probability Matrix -> Matrix{Float64}(4 × 4):
0.278 0.272 0.228 0.222
0.276 0.273 0.231 0.22
0.286 0.269 0.242 0.204
0.266 0.275 0.236 0.224
- Initial Probabilities -> Vector{Float64}(4 × 1):
0.277
0.272
0.233
0.218
- Markov Chain Order:1
This is useful to later create HMMs and calculate sequence probability based on a given model, for instance we now have the E. coli CDS and No-CDS transition models implemented:
ECOLICDS
BioMarkovChain:
- Transition Probability Matrix -> Matrix{Float64}(4 × 4):
0.31 0.224 0.199 0.268
0.251 0.215 0.313 0.221
0.236 0.308 0.249 0.207
0.178 0.217 0.338 0.267
- Initial Probabilities -> Vector{Float64}(4 × 1):
0.245
0.243
0.273
0.239
- Markov Chain Order:1
What is then the probability of the previous random Lambda phage DNA sequence given this model?
dnaseqprobability(orfdna, ECOLICDS)
1.9478492511798446e-125
This is off course not very informative, but we can later use different criteria to then classify new ORFs. For a more detailed explanation see the docs
License
MIT License