
Representing biological sequences as Markov chains

Primary language: Julia. License: MIT.



BioMarkovChains

A Julia package to represent biological sequences as Markov chains

Installation

BioMarkovChains is a Julia Language package. To install it, open Julia's interactive session (the REPL), press the ] key to enter package mode, and then type the following command:

pkg> add BioMarkovChains

Creating a Markov chain from DNA sequences

An important step in developing several gene finding algorithms consists of building a Markov chain representation of the DNA. To do so, we implemented the BioMarkovChain method, which captures the initial and transition probabilities of a DNA sequence (LongSequence) and creates a dedicated object storing the relevant information of a DNA Markov chain. Here is an example:

Let's find one ORF in a random LongDNA:

using BioSequences, GeneFinder, BioMarkovChains

sequence = randdnaseq(10^3)
orfdna = getorfdna(sequence, min_len=75)[1]

If we translate it, we get a 69aa sequence:

translate(orfdna)
69aa Amino Acid Sequence:
MSCGETTVSPILSRRTAFIRTLLGYRFRSNLPTKAERSRFGFSLPQFISTPNDRQNGNGGCGCGLENR*

Now suppose we want to see how transitions occur in this ORF sequence. We can use the BioMarkovChain method and set it to a 2nd-order Markov chain:

BioMarkovChain(orfdna, 2)
BioMarkovChain with DNAAlphabet{4}() Alphabet:
  - Transition Probability Matrix -> Matrix{Float64}(4 × 4):
   0.2123  0.2731  0.278   0.2366
   0.2017  0.3072  0.2687  0.2224
   0.1978  0.2651  0.2893  0.2478
   0.2013  0.3436  0.2431  0.212
  - Initial Probabilities -> Vector{Float64}(4 × 1):
   0.2027
   0.2973
   0.2703
   0.2297
  - Markov Chain Order -> Int64:
   2
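
To make the idea of transition and initial probabilities concrete, here is a minimal sketch in plain Julia of how a first-order transition matrix might be estimated by counting adjacent nucleotide pairs. It is not the BioMarkovChains implementation: the helper names (transition_matrix, initial_probabilities), the A, C, G, T ordering, and the choice to estimate initial probabilities as plain nucleotide frequencies are all assumptions made for illustration.

```julia
# Hypothetical helpers for illustration; not part of BioMarkovChains.
const NUCLEOTIDES = ['A', 'C', 'G', 'T']  # assumed row/column order

# Count transitions between adjacent nucleotides, then normalize each row
# so that it sums to 1 (a row-stochastic matrix).
function transition_matrix(seq::AbstractString)
    index = Dict(nt => i for (i, nt) in enumerate(NUCLEOTIDES))
    counts = zeros(Float64, 4, 4)
    for (a, b) in zip(seq, Iterators.drop(seq, 1))
        counts[index[a], index[b]] += 1
    end
    rowsums = sum(counts, dims=2)
    # Rows with no observations stay all zeros instead of dividing by zero.
    return counts ./ max.(rowsums, 1)
end

# Assumption: initial probabilities are estimated as nucleotide frequencies.
function initial_probabilities(seq::AbstractString)
    return [count(==(nt), seq) / length(seq) for nt in NUCLEOTIDES]
end

tpm = transition_matrix("ATGCGCGATTAG")
init = initial_probabilities("ATGCGCGATTAG")
```

Each row of tpm describes what follows a given nucleotide, so in this toy sequence the C row puts all of its mass on G because every C is followed by a G.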

This is useful for later creating HMMs and calculating the probability of a sequence under a given model. For instance, the E. coli CDS and No-CDS transition models (Markov chains) are already implemented:

ECOLICDS
BioMarkovChain with DNAAlphabet{4}() Alphabet:
  - Transition Probability Matrix -> Matrix{Float64}(4 × 4):
   0.31    0.224   0.199   0.268
   0.251   0.215   0.313   0.221
   0.236   0.308   0.249   0.207
   0.178   0.217   0.338   0.267
  - Initial Probabilities -> Vector{Float64}(4 × 1):
   0.245
   0.243
   0.273
   0.239
  - Markov Chain Order -> Int64:
   1

What, then, is the probability of the ORF found above in the random DNA sequence, given this model?

dnaseqprobability(orfdna, ECOLICDS)
7.466531836596359e-45
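
As a rough sketch of what such a calculation involves, the probability of a sequence under a first-order Markov chain is the initial probability of its first nucleotide multiplied by the transition probability of every adjacent pair. The plain-Julia example below uses the ECOLICDS values printed above; the function name log_sequence_probability and the assumption that rows and columns are ordered A, C, G, T are illustrative and are not taken from the package, so this may differ from how dnaseqprobability works internally.

```julia
# Values copied from the ECOLICDS output; the A, C, G, T order is an assumption.
const ECOLI_TPM = [0.310 0.224 0.199 0.268;
                   0.251 0.215 0.313 0.221;
                   0.236 0.308 0.249 0.207;
                   0.178 0.217 0.338 0.267]
const ECOLI_INIT = [0.245, 0.243, 0.273, 0.239]
const NT_INDEX = Dict('A' => 1, 'C' => 2, 'G' => 3, 'T' => 4)

# Hypothetical helper: log-probability of a sequence under a first-order chain,
# log P(x) = log init[x1] + sum over i of log tpm[x(i-1), x(i)].
function log_sequence_probability(seq::AbstractString, tpm, init)
    idx = [NT_INDEX[nt] for nt in seq]
    logp = log(init[idx[1]])
    for i in 2:length(idx)
        logp += log(tpm[idx[i - 1], idx[i]])
    end
    return logp
end

logp = log_sequence_probability("ATG", ECOLI_TPM, ECOLI_INIT)
p = exp(logp)  # equals init[A] * tpm[A, T] * tpm[T, G]
```

Working in log space is a design choice: the raw probability shrinks geometrically with sequence length, so summing logarithms avoids floating-point underflow for long sequences.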

This is of course not very informative on its own, but we can later use different criteria to classify new ORFs. For a more detailed explanation, see the docs.