Bayesian Markov Model motif discovery - An expectation maximization algorithm for the de novo discovery of enriched motifs as modelled by higher-order Markov models.

BaMM!motif

Bayesian Markov Model motif discovery software.

Requirements

To compile from source you need

  • GCC compiler version 5.x or lower (we suggest GCC-5.x)
  • CMake 2.8 or later

To plot BaMM logos you need

  • R 2.14.1 or later

How to compile BaMM!motif?

Linux

  mkdir build
  cd build
  cmake ..
  make

OS X

OS X ships clang instead of gcc. We recommend using Homebrew to install gcc.

Once Homebrew is installed, install all required dependencies with the brew command:

  brew tap homebrew/versions
  brew tap homebrew/science
  brew install gcc5 cmake R

Finally, compile BaMM!motif with gcc:

  export CXX=g++-5
  export CC=gcc-5
  export LDFLAGS="-static-libgcc -static-libstdc++"

  mkdir build
  cd build
  cmake ..
  make

How to use BaMM!motif from the command line?

SYNOPSIS

  BaMMmotif DIRPATH FILEPATH [OPTIONS]

DESCRIPTION

  Bayesian Markov Model motif discovery software.

  DIRPATH
      Output directory for the results.

  FILEPATH
      FASTA file with positive sequences of equal length.

OPTIONS

Sequence options

  --negSequenceSet <FILEPATH>
      FASTA file with negative/background sequences used to learn the
      (homogeneous) background BaMM. If not specified, the background BaMM
      is learned from the positive sequences.

  --reverseComp
      Search motifs on both strands (positive sequences and reverse
      complements). This option is recommended, for example, when using
      sequences derived from ChIP-seq experiments.

Options to initialize a single BaMM from file

  --bindingSiteFile <FILEPATH>
      File with binding sites of equal length (one per line).

  --markovModelFile <FILEPATH>
      File with BaMM probabilities as obtained from BaMM!motif (omit
      filename extension).

Options to initialize one or more BaMMs from XXmotif PWMs

  --minPWMs <INTEGER>
      Minimum number of PWMs. The options --maxPValue and --minOccurrence
      are ignored. The default is 1.

  --maxPWMs <INTEGER>
      Maximum number of PWMs.

  --maxPValue <FLOAT>
      Maximum p-value of PWMs. This filter is not applied to the top
      minimum number of PWMs (see --minPWMs). The default is 1.0.

  --minOccurrence <FLOAT>
      Minimum fraction of sequences that contain the motif. This filter is
      not applied to the top minimum number of PWMs (see --minPWMs). The
      default is 0.05.

  --rankPWMs <INTEGER> [<INTEGER>...]
      PWM ranks in XXmotif results. The former options to initialize BaMMs
      from PWMs are ignored.
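
One possible reading of how these filters interact is sketched below in Python. It is an interpretation of the option descriptions above, not the BaMM!motif implementation: the function name select_pwms, the (p-value, occurrence) tuple layout, and treating an unset --maxPWMs as "no limit" are all assumptions made for illustration.

```python
def select_pwms(pwms, min_pwms=1, max_pwms=None, max_p_value=1.0, min_occurrence=0.05):
    """Return the 1-based ranks of the PWMs that would be kept.

    pwms is a list of (p_value, occurrence) tuples ordered by rank, best first.
    The top min_pwms entries bypass both filters, as described for --minPWMs.
    max_pwms=None means "no upper limit" (an assumption, the default is not documented).
    """
    selected = []
    for rank, (p_value, occurrence) in enumerate(pwms, start=1):
        if max_pwms is not None and len(selected) >= max_pwms:
            break
        passes_filters = p_value <= max_p_value and occurrence >= min_occurrence
        if rank <= min_pwms or passes_filters:
            selected.append(rank)
    return selected


# Hypothetical XXmotif results: (p-value, fraction of sequences with the motif).
ranked = [(0.5, 0.01), (1e-5, 0.2), (0.9, 0.5), (1e-3, 0.01)]
print(select_pwms(ranked, min_pwms=1, max_p_value=0.01))  # [1, 2]
```

In this sketch the first PWM is kept although it fails both filters, because it falls within the top --minPWMs ranks.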

Options for (inhomogeneous) motif BaMMs

  -k <INTEGER>
      Order. The default is 2.

  -a|--alpha <FLOAT> [<FLOAT>...]
      Order-specific prior strength. The default is 1.0 (for k = 0) and
      20 x 3^(k-1) (for k > 0). The options -b and -g are ignored.

  -b|--beta <FLOAT>
      Calculate order-specific alphas according to beta x gamma^(k-1) (for
      k > 0). The default is 20.0.

  -g|--gamma <FLOAT>
      Calculate order-specific alphas according to beta x gamma^(k-1) (for
      k > 0). The default is 3.0.

  --extend <INTEGER>{1,2}
      Extend BaMMs by adding uniformly initialized positions to the left
      and/or right of initial BaMMs. Invoking e.g. with --extend 0 2 adds
      two positions to the right of initial BaMMs. Invoking with --extend 2
      adds two positions to both sides of initial BaMMs. By default, BaMMs
      are not extended.
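
The order-specific prior strengths described for -a, -b and -g can be worked out as in the following Python sketch. It only restates the formula from the help text (alpha is 1.0 for order 0 and beta x gamma^(k-1) for higher orders); the function name order_specific_alphas is hypothetical and not part of BaMM!motif.

```python
def order_specific_alphas(k, beta=20.0, gamma=3.0):
    """Return [alpha_0, ..., alpha_k] for a motif BaMM of order k.

    alpha_0 is fixed to 1.0; alpha_j = beta * gamma**(j - 1) for j > 0,
    using the documented defaults beta = 20.0 and gamma = 3.0.
    """
    if k < 0:
        raise ValueError("the order k must be non-negative")
    return [1.0] + [beta * gamma ** (j - 1) for j in range(1, k + 1)]


print(order_specific_alphas(2))  # [1.0, 20.0, 60.0]
```

With the default order 2, this would correspond to passing -a 1.0 20.0 60.0 explicitly.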

Options for the (homogeneous) background BaMM

  -K <INTEGER>
      Order. The default is 2.

  -A|--Alpha <FLOAT>
      Prior strength. The default is 10.0.

EM options

  -q <FLOAT>
      Prior probability for a positive sequence to contain a motif. The
      default is 0.9.

  -e|--epsilon <FLOAT>
      The EM algorithm is deemed to be converged when the sum over the
      absolute differences in BaMM probabilities from successive EM rounds
      is smaller than epsilon. The default is 0.001.
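
The convergence criterion for -e can be illustrated with a short Python sketch. It assumes that the BaMM probabilities of two successive EM rounds are available as flat lists of equal length; the function name em_converged is hypothetical and the code is not taken from BaMM!motif.

```python
def em_converged(previous, current, epsilon=0.001):
    """Return True when the summed absolute change in probabilities is below epsilon."""
    if len(previous) != len(current):
        raise ValueError("both rounds must contain the same number of probabilities")
    total_change = sum(abs(p - c) for p, c in zip(previous, current))
    return total_change < epsilon


# Two hypothetical successive EM rounds for a single order-0 position.
round_a = [0.25, 0.25, 0.25, 0.25]
round_b = [0.2501, 0.2499, 0.25, 0.25]
print(em_converged(round_a, round_b))  # True
```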

XXmotif options

  --XX-ZOOPS
      Use the zero-or-one-occurrence-per-sequence model (default).

  --XX-MOPS
      Use the multiple-occurrence-per-sequence model.

  --XX-OOPS
      Use the one-occurrence-per-sequence model.

  --XX-seeds ALL|FIVEMERS|PALINDROME|TANDEM|NOPALINDROME|NOTANDEM
      Define the nature of seed patterns. The default is to start using ALL
      seed pattern variants.

  --XX-gaps 0|1|2|3
      Maximum number of gaps used for seed patterns. The default is 0.

  --XX-pseudoCounts <FLOAT>
      Percentage of pseudocounts. The default is 10.0.

  --XX-mergeMotifsThreshold LOW|MEDIUM|HIGH
      Define the similarity threshold used to merge PWMs. The default is to
      merge PWMs with LOW similarity in order to reduce runtime.

  --XX-maxPositions <INTEGER>
      Limit the number of motif positions to reduce runtime. The default is
      17.

  --XX-noLengthOptimPWMs
      Omit the length optimization of PWMs.

  --XX-K <INTEGER>
      Order of the (homogeneous) background BaMM. The default is either 2
      (when learned on positive sequences) or 8 (when learned on background
      sequences).

  --XX-A <FLOAT>
      Prior strength of the (homogeneous) background BaMM. The default is
      10.0.

  --XX-jumpStartPatternStage <STRING>
      Jump-start pattern stage using an IUPAC pattern string.

  --XX-jumpStartPWMStage <FILEPATH>
      Jump-start PWM stage reading in a PWM from file.

  --XX-localization
      Calculate p-values for positional clustering of motif occurrences in
      positive sequences of equal length. Improves the sensitivity to find
      weak but positioned motifs.

  --XX-localizationRanking
      Rank motifs according to localization statistics.

  --XX-downstreamPositions <INTEGER>
      Distance between the anchor position (e.g. the transcription start
      site) and the last positive sequence nucleotide. Corrects motif
      positions in result plots. The default is 0.

  --XX-batch
      Suppress progress bars.

Options to score sequences

  --scorePosSequenceSet
      Score positive (training) sequences with optimized BaMMs.

  --scoreNegSequenceSet
      Score background (training) sequences with optimized BaMMs.

  --scoreTestSequenceSet <FILEPATH> [<FILEPATH>...]
      Score test sequences with optimized BaMMs. Test sequences can be
      provided in a single or multiple FASTA files.

Output options

  --saveInitBaMMs
      Write initialized BaMM(s) to disk.

  --saveBaMMs
      Write optimized BaMM(s) to disk.

  --verbose
      Verbose terminal printouts.

  -h, --help
      Print this help.
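
As an illustration of the synopsis, the following Python sketch assembles a command line from a few of the options listed above. The helper build_command and the file names results and peaks.fasta are hypothetical; only the executable name and the flags come from the help text, and the sketch prints the command rather than running it.

```python
import shlex


def build_command(out_dir, fasta, order=2, reverse_comp=False, neg_fasta=None, save_bamms=False):
    """Build a BaMMmotif argument list following BaMMmotif DIRPATH FILEPATH [OPTIONS]."""
    command = ["BaMMmotif", out_dir, fasta, "-k", str(order)]
    if neg_fasta is not None:
        command += ["--negSequenceSet", neg_fasta]
    if reverse_comp:
        command.append("--reverseComp")
    if save_bamms:
        command.append("--saveBaMMs")
    return command


command = build_command("results", "peaks.fasta", reverse_comp=True, save_bamms=True)
print(shlex.join(command))  # BaMMmotif results peaks.fasta -k 2 --reverseComp --saveBaMMs
```

The resulting list could be handed to subprocess.run once BaMMmotif is on the PATH.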

BaMM flat file format

BaMMs are written to flat files when BaMM!motif is invoked with the output option --saveInitBaMMs and/or --saveBaMMs. In this case, BaMM!motif generates three files for each (inhomogeneous) BaMM: one containing the probabilities (filename extension: probs), one containing the conditional probabilities (filename extension: conds), and one containing the background frequencies of mononucleotides in the positive sequences (filename extension: freqs). The first two files share the same format: blank lines separate BaMM positions, and lines 1 to k+1 of each BaMM position contain the (conditional) probabilities for order 0 to order k. For instance, the format for a BaMM of order 2 and length W is as follows:

Filename extension: probs

P1(A) P1(C) P1(G) P1(T)
P1(AA) P1(AC) P1(AG) P1(AT) P1(CA) P1(CC) P1(CG) ... P1(TT)
P1(AAA) P1(AAC) P1(AAG) P1(AAT) P1(ACA) P1(ACC) P1(ACG) ... P1(TTT)

P2(A) P2(C) P2(G) P2(T)
P2(AA) P2(AC) P2(AG) P2(AT) P2(CA) P2(CC) P2(CG) ... P2(TT)
P2(AAA) P2(AAC) P2(AAG) P2(AAT) P2(ACA) P2(ACC) P2(ACG) ... P2(TTT)
...

PW(A) PW(C) PW(G) PW(T)
PW(AA) PW(AC) PW(AG) PW(AT) PW(CA) PW(CC) PW(CG) ... PW(TT)
PW(AAA) PW(AAC) PW(AAG) PW(AAT) PW(ACA) PW(ACC) PW(ACG) ... PW(TTT)
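
A file in this layout could be read with a small Python parser such as the sketch below. It assumes whitespace-separated numbers, blank lines between positions, and 4^(o+1) values on the line for order o, as described above; the function name parse_bamm_flat_file is hypothetical, and the example input is made up rather than real BaMM!motif output.

```python
def parse_bamm_flat_file(text, order):
    """Parse the content of a probs or conds file.

    Returns a list of positions; each position is a list of order + 1 rows,
    and the row for order o holds 4 ** (o + 1) floats.
    """
    positions, current = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:  # a blank line closes the current position
            if current:
                positions.append(current)
                current = []
            continue
        current.append([float(value) for value in fields])
    if current:
        positions.append(current)

    for number, rows in enumerate(positions, start=1):
        if len(rows) != order + 1:
            raise ValueError(f"position {number}: expected {order + 1} rows, found {len(rows)}")
        for o, row in enumerate(rows):
            if len(row) != 4 ** (o + 1):
                raise ValueError(f"position {number}, order {o}: expected {4 ** (o + 1)} values")
    return positions


# Made-up order-1 BaMM of length 2 (uniform dinucleotide probabilities).
dinucleotides = " ".join(["0.0625"] * 16)
example = f"0.1 0.2 0.3 0.4\n{dinucleotides}\n\n0.4 0.3 0.2 0.1\n{dinucleotides}\n"
bamm = parse_bamm_flat_file(example, order=1)
print(len(bamm), bamm[1][0])  # 2 [0.4, 0.3, 0.2, 0.1]
```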

Filename extension: conds

P1(A) P1(C) P1(G) P1(T)
P1(A|A) P1(C|A) P1(G|A) P1(T|A) P1(A|C) P1(C|C) P1(G|C) ... P1(T|T)
P1(A|AA) P1(C|AA) P1(G|AA) P1(T|AA) P1(A|AC) P1(C|AC) P1(G|AC) ... P1(T|TT)

P2(A) P2(C) P2(G) P2(T)
P2(A|A) P2(C|A) P2(G|A) P2(T|A) P2(A|C) P2(C|C) P2(G|C) ... P2(T|T)
P2(A|AA) P2(C|AA) P2(G|AA) P2(T|AA) P2(A|AC) P2(C|AC) P2(G|AC) ... P2(T|TT)
...

PW(A) PW(C) PW(G) PW(T)
PW(A|A) PW(C|A) PW(G|A) PW(T|A) PW(A|C) PW(C|C) PW(G|C) ... PW(T|T)
PW(A|AA) PW(C|AA) PW(G|AA) PW(T|AA) PW(A|AC) PW(C|AC) PW(G|AC) ... PW(T|TT)
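
Judging from the listings above, the entries of each row follow the lexicographic order of k-mers over A, C, G, T, and in the conds file the entry P(x|context) sits where the k-mer "context followed by x" would sit in the probs file. Assuming this ordering holds for real files, the column of an entry can be computed as a base-4 number, as in this Python sketch. The helper names kmer_column and cond_column are hypothetical.

```python
ALPHABET = "ACGT"


def kmer_column(kmer):
    """Return the 0-based column of a k-mer within its row (lexicographic A, C, G, T order)."""
    column = 0
    for base in kmer:
        column = 4 * column + ALPHABET.index(base)
    return column


def cond_column(nucleotide, context):
    """Return the 0-based column of P(nucleotide|context) in a conds row."""
    return kmer_column(context + nucleotide)


print(kmer_column("ACG"))     # 6, the seventh entry of the order-2 row
print(cond_column("A", "C"))  # 4, the fifth entry of the order-1 row
```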

Filename extension: freqs

P(A) P(C) P(G) P(T)

Note that contexts are restricted to the binding site. For instance, P1(G|AC) and P2(G|AC) are defined as P1(G) and P2(G|C), respectively.
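
The restriction of contexts to the binding site can be made concrete with a short Python sketch. It assumes 1-based motif positions and keeps only the part of the context that lies inside the motif, which reproduces the two cases given above; the function name restrict_context is hypothetical.

```python
def restrict_context(position, nucleotide, context):
    """Return the label of the conditional probability that is actually defined.

    At motif position j (1-based) only the j - 1 preceding motif positions can
    serve as context, so the context is trimmed to its rightmost j - 1 letters.
    """
    usable = min(len(context), position - 1)
    kept = context[len(context) - usable:]
    if kept:
        return f"P{position}({nucleotide}|{kept})"
    return f"P{position}({nucleotide})"


print(restrict_context(1, "G", "AC"))  # P1(G)
print(restrict_context(2, "G", "AC"))  # P2(G|C)
print(restrict_context(3, "G", "AC"))  # P3(G|AC)
```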

In addition, BaMM!motif generates three files for the (homogeneous) background BaMM: one containing the probabilities (filename extension: probsBg), one containing the conditional probabilities (filename extension: condsBg), and one containing the background frequencies of mononucleotides (filename extension: freqsBg). For instance, the format for a background BaMM of order 2 is as follows:

Filename extension: probsBg

P(A) P(C) P(G) P(T)
P(AA) P(AC) P(AG) P(AT) P(CA) P(CC) P(CG) ... P(TT)
P(AAA) P(AAC) P(AAG) P(AAT) P(ACA) P(ACC) P(ACG) ... P(TTT)

Filename extension: condsBg

P(A) P(C) P(G) P(T)
P(A|A) P(C|A) P(G|A) P(T|A) P(A|C) P(C|C) P(G|C) ... P(T|T)
P(A|AA) P(C|AA) P(G|AA) P(T|AA) P(A|AC) P(C|AC) P(G|AC) ... P(T|TT)

Filename extension: freqsBg

P(A) P(C) P(G) P(T)

Note that the background frequencies of mononucleotides are identical to the probabilities of mononucleotides in the other two files.

How to plot BaMM logos?

R scripts are provided in the directory R to plot the BaMM logo from a BaMM flat file. To create a BaMM logo, edit the parameter settings in plotBaMM.wrapper.R and source the code in the R session using

source( "plotBaMM.wrapper.R" )

Please find comments on available plotting options in the wrapper.

License

BaMM!motif is released under the GNU General Public License v3 or later. See LICENSE for more details.