This is swMutSel, a program to estimate fitnesses of amino acids in
protein-coding genes using the evolution model of Halpern and Bruno (1998)
and Tamuri et al. (2012, 2014). The program takes as input an alignment of
protein-coding gene sequences and a phylogeny (tree) of the sequences, and
outputs the fitnesses of each amino acid at each location in the
protein-coding gene.

SYNOPSIS

    Analyse Data Using the SwMutSel Model:

    java -jar swmutsel.jar -name <run_name>
                           -sequences <sequence_file_name>
                           -tree <tree_file_name | tree_newick_string>
                           -geneticcode <standard | vertebrate_mit | plastid>
                           [-penalty mvn,<sigma> | dirichlet,<alpha>]
                           [-kappa <kappa>]
                           [-pi <T>,<C>,<A>,<G>]
                           [-scaling <branch_scaling_factor>]
                           [-fitness <site>,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V [-fitness ...], ...]
                           [-fix mutation|branches|all [-fix mutation|branches|all], ...]
                           [-threads <cpu_cores>]
                           [-distributed -host <host>:<port> [-host <host>:<port>], ...]
                           [-sites <site>|<site_range>]
                           [-restart-opt <no_of_restarts> [-restart-int <n_iterations>]]
                           [-clademodel clade_label,clade_label[,clade_label[,...]]]
                           [-hessian]
                           [-help]

    Simulate Data Using the SwMutSel Model:

    java -jar swmutsel.jar -simulate
                           -name <run_name>
                           -tree <tree_file_name | tree_newick_string>
                           -geneticcode <standard | vertebrate_mit | plastid>
                           -sites <number_of_sites>
                           -kappa <kappa>
                           -pi <T>,<C>,<A>,<G>
                           -scaling <branch_scaling_factor>
                           [-fitness A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V [-fitness ...], ...]
                           [-fitnessfile <filename> [-fitnessfile <filename>], ...]
                           [-clademodel <clade_labels>]
                           [-shiftfrac <percentage>]

OPTIONS

  Required

    -n, -name
        Specifies name for the run. Output files are prefixed with this
        name. Be careful! The program will overwrite files with the same
        name.

    -s, -sequences
        Coding sequences alignment file name in PHYLIP format. Spaces are
        not allowed in sequence names.

    -t, -tree
        Newick-formatted tree file name. The tree string can be supplied
        instead e.g. "-tree (A:0.1,(B:0.1,C:0.1));". Spaces are not allowed
        in the string.

    -gc, -geneticcode
        The genetic code for the coding sequences:

        -gc standard       : The Standard Code
        -gc vertebrate_mit : The Vertebrate Mitochondrial Code
        -gc plastid        : The Bacterial, Archaeal and Plant Plastid Code

  Model Parameters

    -p, -penalty
        The penalty to use for the penalised likelihood method. If not
        supplied, the usual (unpenalised) maximum likelihood method is
        used. Valid options are:

        -p mvn,<s>       : Multivariate normal penalty with variance 2*<s>^2
        -p dirichlet,<a> : Dirichlet-based penalty with shape <a>

    -k, -kappa
        The starting parameter value for the transition-transversion rate
        ratio. If you specify "-fix mutation", the parameter will not be
        estimated.
        DEFAULT: 1.0

    -pi
        The starting parameter value for nucleotide base frequencies. Must
        be comma-separated with order T,C,A,G. The values are normalised to
        sum to 1. If you specify "-fix mutation", the parameter will not be
        estimated.
        DEFAULT: [0.25]

    -c, -scaling
        The starting parameter value for branch scaling factor (applied to
        all branches). If you specify "-fix mutation", the parameter will
        not be estimated.
        DEFAULT: 1.0

    -f, -fitness
        Comma-separated fitness parameters in canonical amino acid order.
        It is recommended that you do not construct these by hand but
        rather use the output generated by the program itself.
        DEFAULT: [0]

  Optimisation

    -fix
        Indicate whether you want the program to skip estimation of
        mutational parameters, branch lengths or all parameters. For
        example, if you want to calculate fitness only:
        "-fix mutation -fix branches"

        -fix mutation : Fix the values (k, pi, c) of the mutational model.
        -fix branches : Fix the branch lengths on the tree.
        -fix all      : Calculate the log-likelihood only.

    -restart-opt
        Specifies the number of optimiser restarts for site-wise fitness
        parameter estimation. This is to prevent estimates being stuck at a
        local optimum. The program will restart fitness estimation, with
        random initial values, the specified number of times.
        DEFAULT: 1

    -restart-int
        Specify how often to estimate fitness parameters with multiple
        restarts. Restarting the fitness parameter estimation is expensive
        and, in many cases, not necessary. The value supplied here defines
        how frequently to perform the robust fitness estimation, where a
        single round is one iteration of mutation, branch length and
        fitness estimation.
        DEFAULT: 5

    -sites
        Specify a single site, or a range of sites, for site-wise fitness
        estimation. If you provide this option, you implicitly fix the
        mutation and branch length parameters, as with
        "-fix mutation -fix branches". A range is specified using a dash
        e.g. "-sites 10-20" will estimate the site-wise fitnesses for all
        sites between site 10 and site 20, inclusive.

  Parallelisation

    -T, -threads
        Specify the number of cores to use for multi-threaded operation.

    -D, -distributed
        Indicate the program will run in distributed mode. This requires
        the initialisation of (usually) multiple slaves. Each slave will
        have an associated IP address (or hostname) and port (which you
        supply using "-H").

    -H, -host
        If the program is running in distributed mode (using the "-D"
        option), supply slaves' host IP and port using
        "-H <slave_ip>:<port>".

USAGE

    Simplest example:

        java -jar swmutsel.jar -n test -s aln.phy -t aln.tree -gc standard

    Options can also be placed in a file, but each argument should be on a
    new line. For example:

        cat > test_options.txt
        -s
        aln.phy
        -t
        aln.tree
        -gc
        standard
        ^D

    Note each space in a typical command-line argument becomes a newline.
    You can now run the program using:

        java -jar swmutsel.jar @test_options.txt -n test

    During parameter optimisation, the program will write checkpoint
    files. You can restart the program from a saved checkpoint, for
    example:

        java -jar swmutsel.jar @test_CHKPNT_9.txt -n test_restart

CITATION

    Halpern AL and Bruno WJ. (1998) Evolutionary distances for
    protein-coding sequences: modeling site-specific residue frequencies.
    Molecular Biology and Evolution, 15: 910-917.

    Tamuri AU, dos Reis M and Goldstein R. (2012) Estimating the
    distribution of selection coefficients from phylogenetic data using
    sitewise mutation-selection models. Genetics, 190: 1101-1115.

    Tamuri AU, Goldman N and dos Reis M. (2014) A penalized likelihood
    method for estimating the distribution of selection coefficients from
    phylogenetic data. Genetics, 197: 257-271.
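
EXAMPLES

    The options file described in USAGE can also be written without an
    interactive "cat ... ^D" session. The following is a minimal sketch,
    assuming a POSIX shell; aln.phy, aln.tree and test_options.txt are the
    placeholder names from USAGE, and the here-document is only one
    possible way to create the file.

```shell
# Write one command-line token per line, as the options-file format requires.
# The quoted 'EOF' delimiter stops the shell from expanding anything inside.
cat > test_options.txt <<'EOF'
-s
aln.phy
-t
aln.tree
-gc
standard
EOF

# Then run the program as shown in USAGE (needs swmutsel.jar and the inputs):
# java -jar swmutsel.jar @test_options.txt -n test
```

    Keeping the run name (-n) on the command line rather than in the file
    lets the same options file be reused for runs with different output
    prefixes.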