A repository to implement UniFrac, but on functional profiles of metagenomic data.
To install the package, run the following command:
conda create -n fununifrac -c bioconda -c conda-forge -c conda --file requirements.txt
conda activate fununifrac
The underlying graph structure on which the earth mover's distance is based on is obtained from the KEGG database. The KEGG database extraction repo is used to create the edge-list of this graph. In particular: get_ko_hierarchy.py
Currently, to keep things simple, I am dealing with just subtrees of the KEGG database.
As such, you need to pick a parent BRITE node to start with. Later, we will extract the tree
of all the children of this parent node. For example, you can choose the BRITE: ko00001 which
will correspond to "KEGG Orthology (KO)". In the following, call the resulting tree the "KO tree".
In the end, you should have a file like kegg_ko_edge_df_br:ko00001.txt
.
To infer edge lengths on this tree, you will need a matrix of all pairwise distances between the KEGG leaves of the KO tree. This can be accomplished via:
- Sketch all the KOs (note, this will take forever, so just use the pre-computed ones on the server:
#!/usr/bin/env bash
set -e
set -u
CSVINPUT=/data/shared_data/KEGG_data/PartitionByKO/data/KOs_for_sketch_fromfile.txt
OUTDIR=/data/shared_data/KEGG_data/sourmash_sketches/output_KOs/KOs_sketched
#for scaleFactor in 10 100 500 1000
for scaleFactor in 1
do
/usr/bin/time sourmash sketch fromfile -p protein,k=5,k=7,k=11,abund,scaled=${scaleFactor} -o ${OUTDIR}/KOs_sketched_scaled_${scaleFactor}.sig.zip $CSVINPUT
done
- Compute the pairwise distances between all the KOs
#!/usr/bin/env bash
set -e
set -u
DATADIR=/data/shared_data/KEGG_data/sourmash_sketches/output_KOs/KOs_sketched
for kSize in 5 7 11 15
do
sourmash compare --protein --no-dna -o ${DATADIR}/compare_k_${kSize} --ani -k ${kSize} -p 50 ${DATADIR}/KOs_sketched_scaled_10.sig.zip
done
At the end of this, you will have a pairwise distance matrix for all the KOs, along with a text file telling you which
KO corresponds to which row/column in the matrix. Eg. of such files are: KOs_sketched_scaled_10_compare_5
and KOs_sketched_scaled_10_compare_5.labels.txt
.
In order to infer the edge lengths, we need to know which edges are traversed between all pairs of KOs. For example, if we have a binary tree with 2 leaves:
1
L1 / \ L2
2 3
and edge lengths L1
and L2
respectively, then the distance between the two leaves is L1 + L2
, resulting in a
matrix equation of the form:
[1, 1] [L1;L2] = [d(a,b)]
where d(a,b)
is the distance between the two leaves a
and b
derived in the previous section.
The coefficient matrix of all edges traversed between all pairs of KOs is obtained by running:
./scripts/./graph_to_path_matrix.py -e <edge list: kegg_ko_edge_df_br:ko00001.txt> -d <distance matrix:
KOs_sketched_scaled_10_compare_5> -o <output directory> -b <BRITE: ko00001>
The resulting matrix will have a name such as: ko00001_KOs_sketched_scaled_10_compare_5_A.npz
and will be a
massive (233M rows, ~20K columns) sparse matrix called A
below.
We will employ a regularized, randomized nonnegative least squares approach to infer the edge lengths.
In practice, we've observed that the matrix A
is rank deficient, yet over-determined (more rows than columns).
Thus there is no unique solution to Ax=y
where y
is the vector of pairwise distances between all KOs. We can,
however, find a unique "shortest length" solution by adding a regularization term to the objective function:
min ||Ax - y||^2 + lambda * ||x||^2
s.t. x>=0
where lambda
is a regularization parameter. This can be cast as a NNLS problem via:
min ||A'x - y'||^2
s.t. x>=0
where
A' = [A; sqrt(lambda)*ones(1, n)]
y' = [y; 0]
We can then solve this problem via a randomized approach, selecting enough rows of A'
to ensure that the rank is
large enough. After a bunch of iterates, the average is then taken. An example of running the code is:
scripts/./create_edge_lengths.py -e kegg_ko_edge_df_br\:ko00001.txt -d KOs_sketched_scaled_10_compare_5 -A
ko00001_KOs_sketched_scaled_10_compare_5_A.npz -b ko00001 -n 100 -f 10 -r 1 --force -o kegg_ko_edge_df_br\:ko00001.txt
The output will be an edge list with an additional column specifying the length of that edge. Eg. kegg_ko_edge_df_br:ko00001.txt
To run the tests, you will need to install the pytest
package (via conda). Then, you can run the tests via:
cd test
pytest -v .