Dependencies (installation instructions detailed below):
This repository contains a C++ reimplementation of bmDCA adapted from the original code. Method is described in:
Figliuzzi, M., Barrat-Charlaix, P. & Weigt, M. How Pairwise Coevolutionary Models Capture the Collective Residue Variability in Proteins? Molecular Biology and Evolution 35, 1018–1027 (2018).
This code is designed to eliminate the original's excessive file I/O and to parallelize the MCMC in the inference loop.
GCC is used to compile the source code (and dependencies, if necessary).
The code relies on the fopenmp
flag for parallelization, so GCC is preferred
over Clang. It also needs support for the C++11 standard, so any GCC later than
version 4.2 will suffice.
AutoTools are a set of programs used to generate makefiles for cross-platform compilation and installation.
pkg-config is a program that provides a simple interface between installed programs (e.g. libraries and header files) and the compiler. It's used by AutoTools to check for dependencies before compilation.
Armadillo is a C++ linear algebra library. It's used for storing data in matrix structures and performing quick computations in the bmDCA inference loop. To install, again look to your package repository.
To install the dependencies in Linux, simply use your distributions package manager. Commands for Debian/Ubuntu and Arch Linux are provided below:
Run:
sudo apt-get update
sudo apt-get install git gcc g++ automake autoconf pkg-config \
libarmadillo-dev libopenblas-dev libarpack++2-dev
For Arch Linux, GCC should have been installed with the base
and base-devel
metapackages (sudo pacman -S base base-devel
), but if not installed, run:
sudo pacman -S gcc automake autoconf pkgconf
For Arch, Armadillo is not in the package repositories. You will need to check the AUR.
First, install the SuperLU library:
git clone https://aur.archlinux.org/superlu.git
cd superlu
makepkg -si
cd ..
SuperLU is a fast matrix factorization library required as a build dependency
for Armadillo. Other build dependencies will be installed via makepkg
from
the official repositories.
Now, download and install Armadillo:
git clone https://aur.archlinux.org/armadillo.git
cd armadillo
makepkg -si
cd ..
The macOS instructions rely on Xcode developer tools and Homebrew for package management. All commands will be entered into the Terminal.
First, install Xcode developer tools. Open the 'Terminal' application from the launcher and run:
xcode-select --install
This may already have this installed.
Next, install Homebrew. From the online instructions, run:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"
If you run into permissions errors when installing Homebrew, complaining that
root owns the /usr/local/
directory, you can change the ownership by running:
sudo chown -R <user> /usr/local/
where <user>
should be substituted with your username, e.g. john
.
Once Homebrew is installed, run:
brew install gcc automake autoconf pkg-config armadillo
This will install the most recent GCC (9.3.0 as of writing) along with AutoTools and pkg-config.
IMPORTANT: The default gcc
, located in /usr/bin/gcc
is actually aliased
to clang
, which is another compiler. While in principle this is not an issue,
this version of Clang is not compatible with the fopenmp
compiler flag that
is used to enable parallelization of the MCMC sampler. Additionally, libraries
(see Armadillo in the next step) installed via Homebrew are not by default
known to pkg-config
or the linker.
Addressing all of these issues involves overriding the CC
and CXX
environmental variables with the new GCC, updating PKG_CONFIG_PATH
with paths
to any relevant *.pc files, and updating LD_LIBRARY_PATH
with any shared
object library linked at compile time.
Doing this for the first time is a bit bewildering, so for convenience, use the
rcparams
file in the tools
directory in this repository. In it are a few
helper functions and aliases. Simply append the contents of that file to your
shell run commands. If you don't know what shell you're using, run:
echo $SHELL
For bash, copy the contents of rcparams
to ${HOME}/.bashrc
, and for zsh,
copy to ${HOME}/.zshrc
. The general idea is that macOS versions <=10.14
(Mojave and earlier), uses bash as the default shell, and for >=10.15 (Catalina
and later), Apple switched the default shell to zsh.
You can append the rcparams
file by copy-pasting the code in your favorite
text editor. You could also do something like cat tools/rcparams >> ${HOME}/.bashrc
, for example.
As a side note, your macOS may not actually source the .bashrc
file by
default. If you notice that adding the rcparams
function has not effect in
new terminals, check that the ${HOME}/.bash_profile
file exists. In it, there
should be a line like [ -f $HOME/.bashrc ] && . $HOME/.bashrc
. (If the
.bashrc
file exists, use source
on it.) If no such like is there, add it
and reload your terminal.
The libraries and headers will be found via the pkgconfig_find()
and
ld_lib_add()
functions specified in the rcparams
file.
Note: Run commands are executed when the shell starts, not when the files are edited. To update your shell to reflect changes, you can either run:
source ${HOME}/.bashrc
Or simply open a new shell. (For remote systems, you can just log out and log in again.)
Before starting, install MSYS2. This program is a package distribution for GNU/Unix tools that can be used to build programs for Windows.
The installer defaults work fine, and if prompted, open the "MSYS2" shell in the dialog window.
Once MSYS2 is installed and open, update the base libraries by running:
pacman -Syu
This will download and install some packages. You will then be prompted to close the terminal. Close it and open it again. Then, again run:
pacman -Syu
This will upgrade the remaining packages packaged in the installer to their most recent versions.
Next, install the dependencies for bmDCA:
pacman -S nano vim git \
autoconf automake-wrapper pkg-config make \
mingw-w64-x86_64-toolchain \
mingw-w64-x86_64-openmp \
mingw-w64-x86_64-arpack \
mingw-w64-x86_64-lapack \
mingw-w64-x86_64-openblas \
mingw-w64-x86_64-armadillo
The above command will installed the required programs in the /mingw64/bin
directory. Unfortunately, this directory is not on the default PATH. You will
need to add it manually.
Open your .bashrc
file in a text editor (e.g. vim ~/.bashrc
). Nano and Vim
were installed in the above command block.
Once open, add the line (at the end of the file):
export PATH="/mingw64/bin:$PATH"
Then, close and open the MSYS2 terminal again.
Optionally, edit the /etc/pacman.conf
file. Uncomment the line #Color
and
add the line ILoveCandy
. Just a cosmetic flourish for pacman
.
Now that all the dependencies have been installed, compile and install bmDCA
globally (default: /usr/local
) by running:
git clone https://github.com/ranganathanlab/bmDCA.git
cd bmDCA
./autogen.sh --prefix=/usr/local && \
make -j4 && \
make install
cd ..
Depending on your platform, the make install
command may fail due to
permissions issues. To remedy this you can either run sudo make install
instead, or you can specify a different installation directory that does not
require administrator privileges. The latter option is particularly useful when
working on remote system not under your control.
Should you want to specify a local directory, for example $HOME/.local
, run:
./autogen.sh --prefix=${HOME}/.local && \
make -j4 && \
make install
You can replace the value to the right of --prefix=
with any other path.
Note, that you should check that it is on your system PATH.
In the event you with to uninstall bmDCA
, simply run sudo make uninstall
or make uninstall
as appropriate.
Test the installation by running in the terminal:
bmdca
If the installation worked correctly, this will print the usage information, e.g.:
bmdca usage:
(e.g. bmdca -i <input MSA> -r -d <directory> -c <config file>)
-i: input MSA (FASTA format)
-d: destination directory
-r: re-weighting flag
-n: numerical MSA
-w: sequence weights
-c: config file
-h: print usage (i.e. this message)
-f: force a restart of the inference loop
Though bmdca
should now be easily invoked from within MSYS2, one can update
the system PATH variable to make the binaries accessible system-wide, such as
from the command prompt or other terminal emulators. To update the PATH:
- Type 'env' in the start search bar.
- Click 'Edit the system environment variables'.
- Click on 'Environment Variables...' toward the bottom of the window that opens.
- Select 'Path' in one of the two selection windows (either 'User variables' or 'System variables' is fine)
- Once 'Path' is highlighted, click 'Edit...'
- Enter the
/usr/local/bin
as a new PATH entry. You can either:- Click 'New' in the new window and enter the path to
/usr/local/bin
in the MSYS2 installation folder (default:C:\msys64\usr\local\bin
). - Click the 'Browse...' button and navigate to the
C:\msys64\usr\local\bin
directory.
- Click 'New' in the new window and enter the path to
- When the new entry is added, click 'OK' on all the opened windows to set all the changes. You will need to close and re-open terminals for the changes to be reflected.
This step will take an input multiple sequence alignment (MSA) and a config file specifying learning parameters and options and then run an inference loop to fit values to a Potts model for amino acid frequencies at positions (Potts fields) and pairs of frequencies at pairs of positions (Potts couplings).
The command line flags are:
-i
: input MSA, FASTA format-d
: directory where output files are written-r
: (optional) flag to compute re-weighting coefficients for each sequence in the alignment, with the goal to not unduly bias inference by highly similar sequences arising from the phylogeny (default:false
)-c
: (optional) config file for bmDCA run hyperparameters, such asexample/bmdca.conf
-t
: threshold for computing default sequence weights (default:0.8
)-n
: input MSA, numerical format-w
: file containing sequence weights-h
: print usage information-f
: force a restart of inference loop (i.e., start at step 1)
If -r
is not specified, each sequence will be equally weighted, and if no
config file is supplied, the run will default to hyperparameters hard-coded in
the initializeParameters()
function defined in src/run.cpp
. The default
number of iterations for the Boltzmann machine is 2000.
The mapping from amino acids to integers is defined in the following way. Amino acids are ordered as in the following string "-ACDEFGHIKLMNPQRSTVWY". They are then mapped to the integer corresponding to their position in the string, minus one. The gap symbol is mapped to 0, A is mapped to 1, etc...
Important: The MSA processing function does not handle gaps represented by '.' characters.
To learn a FASTA-formatted multiple sequence alignment (with re-weighting) and a config file:
bmdca -i <input_alignment.fasta> -d <output_directory> -r -c <config_file.conf>
If you already have a numerically-formatted alignment (gaps are 0) and set of per-sequence weights, run:
bmdca -n <numerical_alignment.txt> -w <sequence_weights.txt>
-d <output_directory> -c <config_file.conf>
Take, for example, this command:
bmdca -i <input_alignment.fasta> -d <output_directory> -r -c <config_file.conf>
If the run is stopped before the maximum number of steps is reached, simply invoke the same command again to restart it:
bmdca -i <input_alignment.fasta> -d <output_directory> -r -c <config_file.conf>
The inference loop will pick up from where it left off.
IMPORTANT To guarantee that inferences loops produce the same results irrespective of whether they were stopped and restarted or ran continuously, bmDCA will check that the hyperparameters used previously and presently match. The only fields that may be adjusted before restarting are:
save_parameters
step_max
- the max number of steps (increase to continue a loop after it ends)error_max
- the convergence threshold (lower it to continue)
The hyperparameters used by any given run are stored in the bmdca_params.conf
file. For boolean configuration options, it is not to specify 1
or 0
instead of true
and false
.
Use a Monte-Carlo sampler to draw sequences from the model specified by the learned parameters.
Run:
bmdca_sample -p <parameters.txt> -d <output_directory> \
-o <output_file.txt> -n <number_of_sequences> \
-r <number_of_indep_sampling_runs> -c <config_file.conf>
If instead, you save parameters in binary format, run:
bmdca_sample -p <parameters_h.bin> -P <parameters_J.bin> \
-d <output_directory> -o <output_file.txt> \
-n <number_of_sequences> -r <number_of_indep_sampling_runs> \
-c <config_file.conf>
The command line flags are:
-p
: input parameters, text format or fields (h) parameters file, binary format-P
: (optional) couplings (J) parameters file, binary format-d
: directory where output files are written-c
: (optional) config file for bmDCA run hyperparameters, e.g.example/bmdca.conf
-o
: name of the output file for the sequences-n
: number of sequences to sample in each independent run (default: 1000)-r
: number of independent sequencing runs (default: 10)
Note, you only need to specify the -p
option if the bmdca
output is stored
in text files. For binary parameters, which are stored in two _h_%d.bin
and
_J_%d.bin
files, pass the fields (h) file to -p
and couplings (J) file to
-P
.
If using the output_binary=true
flag during the inference step, the output
will be stored in an Armadillo-specific binary format. While this allows for
reproducible outputs in stopped-and-restarted inference runs, the format is not
accessible for other programs. You can use the arma2ascii
tool to convert
binary-stored outputs to ASCII.
To convert parameters:
arma2ascii -p <parameters_h.bin> -P <parameters_J.bin>
To convert stats files:
arma2ascii -s <MC_stat_file.bin>
The output fill be stored in a .txt
file.
Inference and sampling runs can be configured using a text file (see
example/bmdca.conf
). The fields in the file are as follows:
lambda_reg1
- L2 regularization strength for fields, h (default: 0.01)lambda_reg2
- L2 regularization strength for couplings, J (default: 0.01)step_max
- maximum number of iterations for Boltzmann learning process (default: 2000)error_max
- error convergence criterion for stopping (default: 1e-05)save_parameters
- save parameters everysave_parameters
number of steps (default: 100)save_best_steps
- save steps that yield the lowest RMSE of the gradient (default: false)random_seed
- initial seed for the random number generator (default: 1)use_reparametrization
- use the re-parametrized model for inference (default: true)epsilon_0_h
- initial learning rate for fields (default: 0.01)epsilon_0_J
- initial learning rate for couplings (default: 0.001)adapt_up
- multiple by which to increase Potts (J and h) gradient (default: 1.5)adapt_down
- multiple by which to decrease Potts (J and h) gradient (default: 0.6)min_step_h
- minimum learning rate for h (default: 0.001)max_step_h
- maximum learning rate for h (default: 2.5)min_step_J
- minimum learning rate for J (default: 1e-05)max_step_J_N
- maximum learning rate for J, scaled by effective number of sequences (default: 2.5)error_min_update
- threshold for differences in MSA and MCMC frequencies above which parameters (J and h) are updated (default: -1)t_wait_0
- initial burn-in time (default: 10000)delta_t_0
- initial wait time between sampling sequences (default: 100)check_ergo
- flag to check MCMC sample energies and autocorrelations, without which wait and burn-in times are not updated (default: true)adapt_up_time
- multiple to increase MCMC wait/burn-in time (default: 1.5)adapt_down_time
- multiple to decrease MCMC wait/burn-in time (default 0.6)step_important_max
- maximum number of importance sampling steps (default: 1, i.e.importance sampling disabled)coherence_min
- (default=.9999)M
- number of sequences to sample for each MCMC replicate (default: 1000)count_max
- number of independent MCMC replicates (default: 10)init_sample
- flag for whether of not to use seed sequence for initializing the MCMC (default: false)init_sample_file
- file containing the MCMC seed sequences (default: "")sampler
- sampler mode, 'mh' for Metropolis-Hastings and 'z-sqrt' or 'z-barker' for Zanella, 2019. 'z-sqrt' corresponds to a balancing function ofsqrt(t)
, and 'z-barker' corresponds tot/(1+t)
. (default: "mh")output_binary
- flag to output data in binary format, which is faster and more precise (default: true)
random_seed
- initial seed for the random number generator (default: 1)t_wait_0
- initial burn-in time (default: 100000)delta_t_0
- initial wait time between sampling sequences (default: 1000)check_ergo
- flag to check MCMC sample energies and autocorrelations, without which wait and burn-in times are not updated (default: true)adapt_up_time
- multiple to increase MCMC wait/burn-in time (default: 1.5)adapt_down_time
- multiple to decrease MCMC wait/burn-in time (default 0.6)sampler
- sampler mode, 'mh' for Metropolis-Hastings and 'z-sqrt' or 'z-barker' for Zanella, 2019. 'z-sqrt' corresponds to a balancing function ofsqrt(t)
, and 'z-barker' corresponds tot/(1+t)
. (default: "mh")temperature
- temperature at which to sample sequences (default: 1.0)
bmdca
will output files during the course of its run:
bmdca_params.conf
: a list of the hyperparameters used in the learning procedure.energy_%d.dat
: mean and std dev over replicates for sample sequence energies at each step of the Markov chainergo_%d.dat
: set of autocorrelation calculations for sampled sequences used for deciding whether to increase/decrease MCMC wait intervals and burn-in times- correlation of sequences 1 wait interval apart.
- correlation of sequences M/10 wait intervals apart. (M = # sequences)
- cross correlation of sequences
- standard deviation of correlations 1 wait intervals apart
- standard deviation of correlations M/10 intervals apart
- standard deviation of cross correlations
- combined deviation of cross and autocorrelations (1 wait interval)
- combined deviation of cross and autocorrelations (M/10 wait intervals)
- combined deviation of autocorrelations 1 and M/10 intervals apart
MC_energies_%d.txt
: energies of each MCMC sequence, grouped by replicateMC_samples_%d.txt
: sequences sampled from MCMC, grouped by replicatemsa_numerical.txt
: numerical representation on input MSAmy_energies_cfr_%d.txt
: statistics of energies over replicates, used for deciding whether to increase/decrease MCMC wait intervals and burn-in times:- number of replicates
- average over replicates of the energies of starting MCMC sequences
- standard deviation over replicates of energies of starting MCMC sequences
- number of replicates
- average over replicates of the energies of ending MCMC sequences
- standard deviation over replicates of energies of sending MCMC sequences
my_energies_cfr_err_%d.txt
: additional energies statistics- average over replicates of the energies of starting MCMC sequences
- average over replicates of the energies of ending MCMC sequences
- combined std dev of energies for staring and ending MCMC sequences
my_energies_end_%d.txt
: energies of ending MCMC sequence for each replicatemy_energies_start_%d.txt
: energies of starting MCMC sequence for each replicateoverlap_%d.txt
: overlap of pairs of MCMC sequences- number of steps apart (in units of wait time)
- mean overlap for all sequences %d steps apart
- standard deviation of overlaps for all sequences %d steps apart
parameters_%d.txt
: learned Potts model parameters (J and h)parameters_h_%d.bin
andparameters_J_%d.bin
: learned Potts model parameters (J and h), stored in arma binary format (seeoutput_binary=true
flag from config file).rel_ent_grad_align_1p.txt
: relative entropy gradient for each amino acid at each positionsequence_weights.txt
: weights for each sequence, either a number between 0 and 1 based on sequence similarity or 1 if re-weighting was not specifiedstat_align_1p
: table of frequencies for each amino acid at each position in the MSAstat_align_2p
: table of frequencies for pairs of amino acids at each pair of positions in the MSA (due to symmetry, only the 'upper triangle' of positions is stored)stat_MC_1p_%d.txt
: table of frequencies for each amino acid at each position of the set of MCMC-sampled sequences.stat_MC_1p_sigma_%d.txt
: table of standard deviation of frequencies over replicates for each amino acid at each position of the set of MCMC-sampled sequences.stat_MC_2p_%d.txt
: table of frequencies for pairs of amino acids at each pair of positions from the set of MCMC-sampled sequencesstat_MC_2p_sigma_%d.txt
: table of standard deviation over replicates of frequencies for pairs of amino acids at each pair of positions from the set of MCMC-sampled sequences
The final outputs will be stored with a _final
suffix in the file name before
the file extension. For example, the final learned parameters will be stored in
parameters_final.txt
. Use this file or the latest learned parameters for
sampling synthetic sequences.
Depending how many times you configure bmdca
to save steps to disk, the total
data generated can be substantial ( > 1 Gb). At present, the only way to
disable writing of a particular log file is to comment out the code in the
Sim::run()
function defined in src/run.cpp
.
This file is a space-delimited file, e.g.:
4914 53 21
0 2 10 10 13 16 1 7 6 13 2 1 12 19 17 17 15 19 20 5 18 6 18 18 6 15 2 12 15 5 19 20 6 6 2 7 6 12 9 12 16 5 1 16 4 4 4 2 11 15 18 2 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 5 6 20 9 6 2 6 6 12 9 12 12 5 3 17 3 6 17 2 17 16 10 2 9
The first line is:
- Number of sequences (M)
- Number of positions (N)
- Size of amino acid alphabet (all AAs + 1 for gaps) (Q)
By default, bmdca
will save learned parameters in binary format. These files
(parameters_h_%d.bin
and parameters_J_%d.bin
) cannot be directly viewed by
a text editor. To view the contents, convert the files to ASCII by using the
provided arma2ascii
tool.
See the above usage section for how to use arma2ascii
. Parameters converted
by the program will match the format for parameters generated when
output_binary=false
is specified in the config file. See the below section
for details.
Learned parameters saved in text files are called parameters_%d.txt
. They
contain the parameters for both J and h, formatted as follows:
J [position index i] [position index j] [amino acid index a] [amino acid index b]
.
.
.
h [position index i] [amino acid index a]
.
.
.
The position indices go from 0 to N-1 (N = # positions), and the amino acid indices go from 0 to 20 (21 amino acids total, including gaps). 0 corresponds to a gap.
The sequence statistics files (e.g. stat_align_1p.txt
and
stat_align_2p.txt
) have a different format.
For 1 position (1p) frequencies:
[position index] [amino acid frequencies (21)]
.
.
.
where [amino acid frequencies (21)]
is a row of frequencies for each of the
21 positions.
For 2 position (2p) frequencies:
[position index i] [position index j] [amino acid frequencies (21x21)]
.
.
.
where [amino acid frequencies (21x21)]
is a row that corresponds to the
frequencies of the 21x21
pairs of amino acids at positions i and j.
For users of shared resources: OpenMP will default to the number of
available cores, so if the bmDCA programs are run on a shared resource, say a
cluster, all cores will be engaged, starving other processes of resources or
getting you booted off the system. To prevent this, use the OMP_NUM_THREADS
environmental variable.
You can either set it at runtime:
OMP_NUM_THREADS=4 bmdca -i ...
Or, you can set it globally, for example in your shell rc file.
export OMP_NUM_THREADS=4
The above examples will limit OpenMP to 4 threads.
You don't need to worry about this if submitting jobs through a workload
manager, such as Slurm or Sun Grid Engine. The manager will limit bmDCA to
the number of cores specified, so manipulating OMP_NUM_THREADS
is not needed.