swmutsel

Implementation of the site-wise mutation-selection model (swMutSel) described in Tamuri et al. (2012, 2014) and Tamuri and dos Reis (2022).

Primary language: Java. License: Apache License 2.0.

This is swMutSel, a program to estimate fitnesses of amino acids in protein-
coding genes using the evolution model of Halpern and Bruno (1998) and Tamuri et
al. (2012, 2014). The program takes as input an alignment of protein-coding
gene sequences and a phylogeny (tree) of the sequences, and outputs the
fitnesses of each amino acid at each location in the protein-coding gene.


SYNOPSIS

       Analyse Data Using the SwMutSel Model:

       java -jar swmutsel.jar
           -name <run_name>
           -sequences <sequence_file_name>
           -tree <tree_file_name | tree_newick_string>
           -geneticcode <standard | vertebrate_mit | plastid>
           [-penalty mvn,<sigma> | dirichlet,<alpha>]
           [-kappa <kappa>]
           [-pi <T>,<C>,<A>,<G>]
           [-scaling <branch_scaling_factor>]
           [-fitness <site>,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V [-fitness ...], ...]
           [-fix mutation|branches|all [-fix mutation|branches|all], ...]
           [-threads <cpu_cores>]
           [-distributed -host <host>:<port> [-host <host>:<port>], ...]
           [-sites <site>|<site_range>]
           [-restart-opt <no_of_restarts> [-restart-int <n_iterations>]]
           [-clademodel clade_label,clade_label[,clade_label[,...]]]
           [-hessian]
           [-help]

       Simulate Data Using the SwMutSel Model:

       java -jar swmutsel.jar
           -simulate
           -name <run_name>
           -tree <tree_file_name | tree_newick_string>
           -geneticcode <standard | vertebrate_mit | plastid>
           -sites <number_of_sites>
           -kappa <kappa>
           -pi <T>,<C>,<A>,<G>
           -scaling <branch_scaling_factor>
           [-fitness A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V [-fitness ...], ...]
           [-fitnessfile <filename> [-fitnessfile <filename>], ...]
           [-clademodel <clade_labels>]
           [-shiftfrac <percentage>]

OPTIONS
   Required
       -n, -name
              Specifies name for the run. Output files are prefixed with this name.
              Be careful! The program will overwrite files with the same name.

       -s, -sequences
              Coding sequences alignment file name in PHYLIP format. Spaces are
              not allowed in sequence names.

       -t, -tree
              Newick-formatted tree file name. Alternatively, the tree string
              itself can be supplied, e.g. "-tree (A:0.1,(B:0.1,C:0.1));".
              Spaces are not allowed in the string.
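
              Because spaces are not allowed in the tree string, a tree copied
              from elsewhere may need its whitespace removed first. A minimal
              sketch, assuming taxon names themselves contain no spaces;
              strip_newick is a hypothetical helper, not part of swMutSel:

```shell
# Remove all whitespace (spaces, tabs, newlines) from a Newick string.
# ASSUMPTION: taxon names contain no spaces that need to be preserved.
strip_newick() {
    printf '%s' "$1" | tr -d ' \t\n\r'
}

tree=$(strip_newick "(A:0.1, (B:0.1, C:0.1));")
echo "$tree"
```

              When passing the result on the command line, quote it (for
              example -tree "$tree") so that the shell does not interpret the
              parentheses and the semicolon.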

       -gc, -geneticcode
              The genetic code for the coding sequences:

              -gc standard       : The Standard Code
              -gc vertebrate_mit : The Vertebrate Mitochondrial Code
              -gc plastid        : The Bacterial, Archaeal and Plant Plastid Code

   Model Parameters
       -p, -penalty
              The penalty to use for the penalised likelihood method. If not
              supplied, the usual (unpenalised) maximum likelihood method is used.
              Valid options are:

              -p mvn,<s>       : Multivariate normal penalty with variance 2*<s>^2
              -p dirichlet,<a> : Dirichlet-based penalty with shape <a>

       -k, -kappa
              The starting value for the transition/transversion rate ratio.
              If "-fix mutation" is given, the parameter is not estimated.
              DEFAULT: 1.0

       -pi
              The starting values for the nucleotide base frequencies. Must be
              comma-separated in the order T,C,A,G. The values are normalised
              to sum to 1. If "-fix mutation" is given, the frequencies are not
              estimated.
              DEFAULT: [0.25]
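
              The normalisation means that only the relative sizes of the four
              values matter. The arithmetic can be sketched as follows;
              normalise_pi is a hypothetical helper for illustration only,
              since the program normalises the values itself:

```shell
# Normalise four comma-separated values (order T,C,A,G) so they sum to 1.
# Illustration only: swMutSel performs this normalisation internally.
normalise_pi() {
    printf '%s\n' "$1" | awk -F, '{
        total = $1 + $2 + $3 + $4
        printf "%g,%g,%g,%g\n", $1/total, $2/total, $3/total, $4/total
    }'
}

normalise_pi 2,3,3,2
```

              Under this sketch, "-pi 2,3,3,2" and "-pi 0.2,0.3,0.3,0.2"
              describe the same starting frequencies.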

       -c, -scaling
              The starting value for the branch scaling factor (applied to all
              branches). If "-fix mutation" is given, the parameter is not
              estimated.
              DEFAULT: 1.0

       -f, -fitness
              Comma-separated fitness parameters in canonical amino acid order.
              It is recommended that you do not construct these by hand but
              rather use the output generated by the program itself.
              DEFAULT: [0]

   Optimisation
       -fix
              Indicates that the program should skip estimation of the
              mutational parameters, the branch lengths, or all parameters.
              For example, to estimate fitnesses only, use
              "-fix mutation -fix branches".

              -fix mutation          : Fix the values (k, pi, c) of the mutational
                                       model.

              -fix branches          : Fix the branch lengths on the tree.

              -fix all               : Calculate the log-likelihood only.
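
              As a sketch of how the "-fix" options combine with the model
              parameters, the helper below prints (rather than runs) a
              fitness-only command line. The function name, run name, input
              file names and parameter values are placeholders chosen for
              illustration:

```shell
# Print a fitness-only command line: mutation parameters and branch
# lengths are fixed, so only the site-wise fitnesses are estimated.
# ASSUMPTION: run name, file names and parameter values are placeholders.
fitness_only_cmd() {
    kappa=$1
    pi=$2
    echo java -jar swmutsel.jar -n fitness_only \
        -s aln.phy -t aln.tree -gc standard \
        -kappa "$kappa" -pi "$pi" \
        -fix mutation -fix branches
}

fitness_only_cmd 2.0 0.25,0.25,0.25,0.25
```

              Because the mutational parameters are fixed, the supplied
              "-kappa" and "-pi" values are used as given instead of serving
              as starting values.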

       -restart-opt
              Specifies the number of optimiser restarts for site-wise fitness
              parameter estimation. This is to prevent estimates becoming stuck
              at a local optimum. The program will restart fitness estimation,
              with random initial values, the specified number of times.
              DEFAULT: 1

       -restart-int
              Specifies how often to estimate fitness parameters with multiple
              restarts. Restarting the fitness parameter estimation is expensive
              and, in many cases, unnecessary. The value supplied here is the
              interval, in rounds, at which the robust (multiple-restart)
              fitness estimation is performed, where a single round is one
              iteration of mutation, branch length and fitness estimation.
              DEFAULT: 5

       -sites
              Specify a single site, or a range of sites, for site-wise fitness
              estimation. If you provide this option, you implicitly fix the
              mutation and branch length parameters, as if you had supplied
              "-fix mutation -fix branches".
              A range is specified using a dash e.g. "-sites 10-20" will estimate
              the site-wise fitnesses for all sites between site 10 and site 20,
              inclusive.
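
              One possible use of site ranges is to split an alignment into
              chunks that are analysed as separate runs. The sketch below
              prints one command per chunk without running it; site_ranges,
              the run names and the input file names are hypothetical:

```shell
# Print inclusive site ranges of at most $3 sites covering sites $1..$2.
# A chunk containing one site is printed as a single site number.
site_ranges() {
    first=$1
    last=$2
    size=$3
    start=$first
    while [ "$start" -le "$last" ]; do
        end=$((start + size - 1))
        if [ "$end" -gt "$last" ]; then end=$last; fi
        if [ "$start" -eq "$end" ]; then
            echo "$start"
        else
            echo "${start}-${end}"
        fi
        start=$((end + 1))
    done
}

# Echo (do not run) one command per range, each with its own run name.
for range in $(site_ranges 1 25 10); do
    echo java -jar swmutsel.jar -n "sites_${range}" \
        -s aln.phy -t aln.tree -gc standard -sites "$range"
done
```

              Each chunk gets a distinct run name because output files with
              the same name are overwritten. Since "-sites" fixes the mutation
              and branch length parameters, suitable "-kappa", "-pi" and
              "-scaling" values would normally be added to each command.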

   Parallelisation
       -T, -threads
              Specify the number of cores to use for multi-threaded operation.

       -D, -distributed
              Indicates that the program will run in distributed mode. This
              requires the initialisation of (usually) multiple slaves. Each
              slave has an associated IP address (or hostname) and port, which
              you supply using "-H".

       -H, -host
              If the program is running in distributed mode (using the "-D"
              option), supply each slave's host IP and port using
              "-H <slave_ip>:<port>".

USAGE

       Simplest example:

          java -jar swmutsel.jar -n test -s aln.phy -t aln.tree -gc standard

       Options can also be placed in a file, with each argument on its own
       line. For example:

          cat > test_options.txt
          -s
          aln.phy
          -t
          aln.tree
          -gc
          standard
          ^D

       Note that each space in the usual command line becomes a newline in the
       file. You can now run the program using:

          java -jar swmutsel.jar @test_options.txt -n test
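
       The options file can also be written non-interactively, which is
       convenient in scripts. A minimal sketch; write_options is a
       hypothetical helper, not part of swMutSel:

```shell
# Write each argument on its own line, the layout the @file syntax expects.
write_options() {
    file=$1
    shift
    printf '%s\n' "$@" > "$file"
}

write_options test_options.txt -s aln.phy -t aln.tree -gc standard
cat test_options.txt
```

       The resulting file has the same six lines as the interactive example
       above.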

       During parameter optimisation, the program will write checkpoint files.
       You can restart the program from a saved checkpoint, for example:

          java -jar swmutsel.jar @test_CHKPNT_9.txt -n test_restart
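
       To restart from the most recent checkpoint without looking up its
       number by hand, something like the following sketch could be used. It
       assumes that checkpoint files are named <run_name>_CHKPNT_<n>.txt, a
       pattern inferred from the example above; latest_checkpoint is a
       hypothetical helper:

```shell
# Print the checkpoint file with the highest number for a given run name.
# ASSUMPTION: checkpoints are named <run_name>_CHKPNT_<n>.txt.
latest_checkpoint() {
    best=""
    best_n=-1
    for f in "$1"_CHKPNT_*.txt; do
        [ -e "$f" ] || continue
        n=${f##*_CHKPNT_}
        n=${n%.txt}
        case $n in
            ''|*[!0-9]*) continue ;;   # skip names without a numeric suffix
        esac
        if [ "$n" -gt "$best_n" ]; then
            best_n=$n
            best=$f
        fi
    done
    if [ -n "$best" ]; then echo "$best"; fi
}

ckpt=$(latest_checkpoint test)
if [ -n "$ckpt" ]; then
    # Echo (do not run) the restart command.
    echo java -jar swmutsel.jar "@$ckpt" -n test_restart
else
    echo "no checkpoint found for run 'test'"
fi
```

       The comparison is numeric, so checkpoint 10 is chosen over checkpoint 9
       even though it sorts earlier alphabetically.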

CITATION

       Halpern AL and Bruno WJ. (1998) Evolutionary distances for protein-coding
       sequences: modeling site-specific residue frequencies. Molecular Biology
       and Evolution, 15: 910-917.

       Tamuri AU, dos Reis M and Goldstein RA. (2012) Estimating the distribution
       of selection coefficients from phylogenetic data using sitewise mutation-
       selection models. Genetics, 190: 1101-1115.

       Tamuri AU, Goldman N and dos Reis M. (2014) A penalized likelihood method
       for estimating the distribution of selection coefficients from
       phylogenetic data. Genetics, 197: 257-271.