/BioMarkovChains.jl

Representing biological sequences as Markov chains

Primary LanguageJuliaMIT LicenseMIT


Representing biological sequences as Markov chains

DOI


BioMarkovChains

A Julia package to represent biological sequences as Markov chains

Installation

BioMarkovChains is a   Julia Language   package. To install BioMarkovChains, please open Julia's interactive session (known as REPL) and press ] key in the REPL to use the package mode, then type the following command

pkg> add BioMarkovChains

Creating transition models out of DNA sequences

An important step before developing several gene finding algorithms consist of having a Markov chain representation of the DNA. To do so, we implemented the transition_model method that will capture the initials and transition probabilities of a DNA sequence (LongSequence) and will create a dedicated object storing relevant information of a DNA Markov chain. Here an example:

Let find one ORF in a random LongDNA :

using BioSequences, GeneFinder

sequence = randdnaseq(10^3)
orfdna = getorfdna(sequence, min_len=75)[1]

If we translate it, we get a 69aa sequence:

translate(orfdna)
69aa Amino Acid Sequence:
MSCGETTVSPILSRRTAFIRTLLGYRFRSNLPTKAERSRFGFSLPQFISTPNDRQNGNGGCGCGLENR*

Now supposing I do want to see how transitions are occurring in this ORF sequence, the I can use the transition_model method and tune it to 2nd-order Markov chain:

BioMarkovModel(orfdna, 2)
BioMarkovModel:
  - Transition Probability Matrix -> Matrix{Float64}(4 × 4):
    0.278	0.272	0.228	0.222
    0.276	0.273	0.231	0.22
    0.286	0.269	0.242	0.204
    0.266	0.275	0.236	0.224
  - Initial Probabilities -> Vector{Float64}(4 × 1):
    0.277	
    0.272
    0.233
    0.218
  - Markov Chain Order:1

This is useful to later create HMMs and calculate sequence probability based on a given model, for instance we now have the E. coli CDS and No-CDS transition models implemented:

ECOLICDS
BioMarkovChain:
  - Transition Probability Matrix -> Matrix{Float64}(4 × 4):
    0.31	0.224	0.199	0.268
    0.251	0.215	0.313	0.221
    0.236	0.308	0.249	0.207
    0.178	0.217	0.338	0.267
  - Initial Probabilities -> Vector{Float64}(4 × 1):
    0.245
    0.243
    0.273
    0.239
  - Markov Chain Order:1

What is then the probability of the previous random Lambda phage DNA sequence given this model?

dnaseqprobability(orfdna, ECOLICDS)
1.9478492511798446e-125

This is off course not very informative, but we can later use different criteria to then classify new ORFs. For a more detailed explanation see the docs

License

MIT License