
GREMLIN learns an MRF (Potts model) from an input multiple sequence alignment. Implementations are available in C++ and TensorFlow/Python.


GREMLIN_CPP v1.0

Installation

$ g++ -O3 -std=c++0x -o gremlin_cpp gremlin_cpp.cpp -fopenmp

Pass -fopenmp to enable OpenMP so that the program can use multiple CPU cores. The code is OpenMP friendly.

Usage

Note: OpenMP reads the environment variable OMP_NUM_THREADS to decide how many threads (CPU cores) to use.

$ export OMP_NUM_THREADS=16
$ ./gremlin_cpp -i alignment_file -o results
# ---------------------------------------------------------------------------------------------
#                                GREMLIN_CPP v1.0                                              
# ---------------------------------------------------------------------------------------------
#   -i            input alignment (either one sequence per line or in fasta format)
#   -o            save output to
# ---------------------------------------------------------------------------------------------
#  Optional settings                                                                           
# ---------------------------------------------------------------------------------------------
#   -only_neff    only compute neff (effective num of seqs)      [Default=0]
#   -only_v       only compute v (1body-term)                    [Default=0]
#   -gap_cutoff   remove positions with > X fraction gaps        [Default=0.5]
#   -alphabet     select: [protein|rna|binary]                   [Default=protein]
#   -eff_cutoff   seq id cutoff for downweighting similar seqs   [Default=0.8]
#   -lambda       L2 regularization weight                       [Default=0.01]
#   -mrf_i        load MRF
#   -mrf_o        save MRF
#   -pair_i       load list of residue pairs (one pair per line, index 0)
# ---------------------------------------------------------------------------------------------
#  Minimizer settings                                                                          
# ---------------------------------------------------------------------------------------------
#   -min_type     select: [lbfgs|cg|none]                        [Default=lbfgs]
#   -max_iter     number of iterations                           [Default=100]
# ---------------------------------------------------------------------------------------------
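
The -eff_cutoff and -only_neff options refer to the effective number of sequences. As a rough illustration of what such a count usually means, the sketch below weights each sequence by 1 / (number of sequences at or above the identity cutoff, itself included) and sums the weights. This is an assumption about the usual definition, not a transcription of gremlin_cpp.cpp; the function names are hypothetical, and whether the program compares with ">=" or ">" is not stated in this README.

```python
def sequence_identity(a, b):
    """Fraction of aligned columns where two equal-length sequences match."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def effective_sequence_count(seqs, eff_cutoff=0.8):
    """Return (neff, weights) for a list of aligned, equal-length sequences.

    Each weight is 1 / (number of sequences whose identity to this one is
    >= eff_cutoff). The ">=" comparison is an assumption.
    """
    weights = []
    for a in seqs:
        # Every sequence matches itself, so the count is always at least 1.
        neighbours = sum(1 for b in seqs if sequence_identity(a, b) >= eff_cutoff)
        weights.append(1.0 / neighbours)
    return sum(weights), weights
```

With this definition, two identical sequences contribute 0.5 each, so near-duplicates do not inflate the count.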

Parsing output

i j raw apc ii jj
i = index i
j = index j
raw = l2norm(W)
apc = raw - mean(row) * mean(col) / mean(all)
ii = char-position i
jj = char-position j
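
The column layout above could be read with a few lines of Python. The sketch below assumes whitespace-separated fields and skips a header line if one is present; both are assumptions about the file, and the function names are hypothetical. The second function applies the apc formula above to a full symmetric matrix of raw scores, taking the means over the matrix as given (including a zero diagonal), which is also an assumption.

```python
def parse_results(text):
    """Parse lines of the form 'i j raw apc ii jj' into dictionaries."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank or malformed lines and a header line starting with 'i'.
        if len(fields) != 6 or fields[0] == "i":
            continue
        i, j, raw, apc, ii, jj = fields
        rows.append({"i": int(i), "j": int(j), "raw": float(raw),
                     "apc": float(apc), "ii": ii, "jj": jj})
    return rows


def apc_correct(raw):
    """Apply apc = raw - mean(row) * mean(col) / mean(all) to a square matrix."""
    n = len(raw)
    row_mean = [sum(raw[i]) / n for i in range(n)]
    col_mean = [sum(raw[i][j] for i in range(n)) / n for j in range(n)]
    all_mean = sum(row_mean) / n
    return [[raw[i][j] - row_mean[i] * col_mean[j] / all_mean
             for j in range(n)] for i in range(n)]
```

Sorting the parsed rows by the apc column in descending order gives the pairs ranked by corrected coupling strength.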

For the protein alphabet, the output MRF (saved with -mrf_o) contains 21 values for each position (V) and 21 x 21 values for each pair of positions (W).

The order of the values is as follows: "ARNDCQEGHILKMFPSTWYV-" (where "-" is the gap).

For RNA the order is "AUCG-", with 5 values for V and 5x5 for W.

For Binary the order is "01-", with 3 values for V and 3x3 for W.
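
To make the ordering concrete, here is a minimal sketch that labels V and W values with the characters listed above. The alphabet strings come from this README; the helper names are hypothetical, and the row-major layout of a flattened W block is an assumption that should be checked against the actual MRF file.

```python
ALPHABETS = {
    "protein": "ARNDCQEGHILKMFPSTWYV-",
    "rna": "AUCG-",
    "binary": "01-",
}


def label_v(values, alphabet="protein"):
    """Map the V values of one position to their characters."""
    states = ALPHABETS[alphabet]
    if len(values) != len(states):
        raise ValueError("expected %d values, got %d" % (len(states), len(values)))
    return dict(zip(states, values))


def w_entry(w_flat, a, b, alphabet="protein"):
    """Look up the coupling between characters a and b in a flattened W block.

    Assumes row-major order: index = row(a) * number_of_states + column(b).
    """
    states = ALPHABETS[alphabet]
    n = len(states)
    if len(w_flat) != n * n:
        raise ValueError("expected %d values, got %d" % (n * n, len(w_flat)))
    return w_flat[states.index(a) * n + states.index(b)]
```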

Alternative implementations