Taxon- and phylogenetic-based beta diversity measures.
EBD is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
EBD is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with EBD. If not, see http://www.gnu.org/licenses/.
EBD is a command-line program written in C++. To install EBD, download and uncompress it with the unzip command:
unzip 1_0_8.zip
To compile EBD on OSX or Linux simply type 'make' from within the source directory of EBD. The resulting executable will be in the bin directory. A precompiled executables for Windows is provided in the bin directory. Please note that under Windows, EBD must be run from the command-line (i.e., the DOS prompt).
Usage: ExpressBetaDiversity [OPTIONS]
Calculates taxon- and phylogenetic-based beta diversity measures.
Options:
-h, --help Produce help message.
-l, --list-calc List all supported calculators.
-u, --unit-tests Execute unit tests.
-t, --tree-file Tree in Newick format (if phylogenetic beta-diversity is desired).
-s, --seq-count-file Sequence count file.
-p, --output-prefix Output prefix (default = output).
-g, --clustering Hierarchical clustering method: UPGMA, SingleLinkage, CompleteLinkage, NJ (default = UPGMA).
-j, --jackknife Number of jackknife replicates to perform (default = 0).
-d, --seqs-to-draw Number of sequence to draw for jackknife replicates.
-z, --sample-size Print number of sequences in each sample.
-c, --calculator Desired calculator (e.g., Bray-Curtis, Canberra).
-w, --weighted Indicated if sequence abundance data should be used.
-m, --mrca Apply 'MRCA weightings' to each branch (experimental).
-r, --strict-mrca Restrict calculator to MRCA subtree.
-y, --count Use count data as opposed to relative proportions.
-x, --max-data-vecs Maximum number of profiles (data vectors) to have in memory at once (default = 1000).
-a, --all Apply all calculators and cluster calculators at the specified threshold.
-b, --threshold Correlation threshold for clustering calculators (default = 0.8).
-o, --output-file Output file for cluster of calculators (default = clusters.txt).
-v, --verbose Provide additional information on program execution.
Example of applying a specific calculator:
./ExpressBetaDiversity -t input.tre -s seq.txt -p bray_curtis -c Bray-Curtis -w
which will result in two output files, the raw dissimilarity matrix in bray_curtis.diss and a UPGMA hierarchical cluster tree in bray_curtis.tre.
Example of querying number of sequences in each sample:
./ExpressBetaDiversity -s seq.txt -z
which will result in the number of sequences in each sample being written to standard out.
Example of applying a specific calculator with jackknife replicates:
./ExpressBetaDiversity -t input.tre -s seq.txt -p bray_curtis -c Bray-Curtis -w -j 100 -d 500
which will result in two output files, the raw dissimilarity matrix in bray_curtis.diss and a UPGMA hierarchical cluster tree in bray_curtis.tre with jackknife support values.
Example of applying all calculators and clustering these based on their Pearson correlation:
./ExpressBetaDiversity -t input.tre -s seq.txt -a -b 0.9 -o clusters.txt
which will result in the output file clusters.txt (see file format below).
A set of unit tests is included to verify proper installation of the EBD software. The unit tests can be run with:
./ExpressBetaDiversity -u
The software should not be used if any of the unit tests fail.
EBD uses Newick formatted trees as input. Information on this tree format can be found at: http://evolution.genetics.washington.edu/phylip/newicktree.html. Here is a simple Newick tree with three leaf nodes labelled A, B, and C:
(A:1,(B:1,C:1):1);
Taxon-based beta-diversity is calculated if an input tree is not specified.
Sequence count information must be specified as a tab-delimited table where each row is a sample and each column is the name of a leaf node in the provided tree. Data must be provided for all leaf nodes in the tree. Consider the following example:
A B C
Sample1 1 2 3
Sample2 10 1 0
Sample3 0 0 1
The first row begins indicates each leaf node in the tree seperated by a tab. Please note that this line MUST start with a tab. The number of sequences associated with each leaf node is then indicated for each sample on a seperate row. In this example, the first sample is labelled 'Sample1' and contains 1 instance of sequence/OTU A, 2 instances of B, and 3 instances of C. Sample3 contains only instances of C, but note that zeros must be specified for the other sequence/OTU types.
Example input files are avaliable in the unit-tests directory.
The script convertToEBD.py in the scripts directory can be used to convert sparse or dense UniFrac-style OTU tables into the format required by EBD. The UniFrac format is used by many popular services including the UniFrac web services and QIIME. EBD uses a different input file format in order to efficently handle data sets consisting of thousands of samples. The script can be run as follows:
./convertToEBD.py <input file> <ouput file>
For reference, sparse UniFrac-style OTU tables look like this (3 columns tab delimited: sequence, sample, count).
leaf2 sample1 1
leaf3 sample1 1
leaf3 sample2 2
leaf2 sample2 1
leaf4 sample2 1
The resulting dissimilarity between samples is written as a tab-delimited, lower-triangular dissimilarity matrix with the first line indicating the number of samples. Consider the following output:
3
A
B 1
C 2 3
The first line indicates that there are 3 samples. The dissimilarity between samples A and B is 1, A and C is 2, and B and C is 3.
An EBD dissimilarity matrix can be converted to a full dissimilarity matrix using the convertToFullMatrix.py script in the scripts directory.
The clustering file indicates clusters of calculators which are correlated. The clustering threshold is specified by the user with the --threshold (-b) parameter. All calculators in a cluster will be at least as correlated as the specified threshold. Results are reported as follows:
Minimum r Calculators
[0.0] uChi-squared;
[0.86] Canberra;CS;uCanberra;uCS;uGower;uManhattan;
[0.91] uBray-Curtis;uSoergel;uKulczynski;
[0.81] Bray-Curtis;Kulczynski;Soergel;
...
Complete linkage cluster tree (branch lengths are 1 - Pearson's correlation):
((('Bray-Curtis':5.60596e-006,'Kulczynski':5.60596e-006):4.13975e-005 ...
The first line indicates the column headers. Each subsequent line indicates a cluster of calculators. The number within the brackets indicates that minimum Pearson's correlation between any pair of calculators in the cluster. A semicolon seperated list indicates which calculators are in the cluster. The last line of the file gives the complete linkage tree used to cluster measures. This can be copied into a seperate file and visualized in any program which can read a Newick tree file.
The dissimilarity matrix for calculator X is saved to the file 'X.cluster.diss' within the same directory as the EBD executable.
If you use EBD in your research, please cite:
Parks, D.H. and Beiko, R.G. 2013. Measures of phylogenetic differentiation provide robust and complementary insights into microbial communities. ISME J, 7:173-83.
Donovan Parks donovan.parks@gmail.com
Robert Beiko beiko@cs.dal.ca