very different weighted unifrac values for qiime2 versus phyloseq

Question

very different weighted unifrac values for qiime2 versus phyloseq

nick-youngblut opened this issue 6 years ago · 31 comments

A couple of members of my lab and I have been getting very different results when using qiime2 versus phyloseq for calculating weighted unifrac values. It seems that the default parameters used by these tools (qiime diversity core-metrics-phylogenetic versus phyloseq::distance) generate very different weighted unifrac values, even when pre-rarefying the dataset (since qiime2 automatically rarefies).

I've already posted this issue with a reproducible example with the GlobalPatterns dataset on the qiime2 forum.

Any idea why there's such a big difference between the qiime2 and phyloseq defaults for calculating weighted unifrac values? Moreover, which method is generating "correct" weighted unifrac values? At least from analyzing my own microbiome dataset, the conclusions are completely different depending on if I use the qiime2- versus phyloseq-generated weighted unifrac values.

Answer 1 · 2018-06-18T20:39:30.000Z

Rarefaction in Phyloseq vs rarefaction in qiime 2?

Answer 2 · 2018-06-19T06:31:40.000Z

The OTU table was already rarefied before beta-diversity calculation with either phyloseq or qiime2. Qiime2 rarefies without replacement, so using a rarefying sampling depth set to the same value as the sample sums for the pre-rarefied OTU table is equivalent to not rarefying (at least, it should be).

Answer 3 · 2018-07-20T17:45:47.000Z

Hi,

I would guess that you have been using an un-rooted phylogenetic tree when calculating the Unifrac distances. To calculate the UniFrac distances you need a rooted tree, and I'm 95% sure that Qiime2 and phyloseq are using different ways to root the phylogenetic tree.

Best,
Martin

Answer 4 · 2018-07-20T17:59:39.000Z

I believe that all of the trees that I've been using for testing are with rooted phylogenies. I've posted a Jupyter notebook on the qiime2 forum if you'd like to see an example with the GlobalPatterns dataset.

Answer 5 · 2018-08-06T06:52:31.000Z

Dear Nick

Did you hear anything from the Phyloseq guys or the Qiime2 team?
We are having a similar issue.

Answer 6 · 2018-08-06T07:10:09.000Z

I think that Greg Caporaso was going to look into it, but I haven't heard anything about it. I just ended up using qiime2 instead of phyloseq, because it generated more realistic results for my dataset.

Answer 7 · 2019-05-04T20:08:16.000Z

I have exactly the same issue. Moreover, there are two different ways in phyloseq to get wunifrac distance matrix, and both give different results (UniFrac(X, weighted=TRUE, normalized=TRUE) vs. distance(X, method="wunifrac")).

Answer 8 · 2019-05-04T20:16:54.000Z

@siberianhigh how do the two phyloseq methods compare to the weighted Unifrac that QIIME2 is producing?

Answer 9 · 2019-05-04T20:23:13.000Z

@cjfields they both are different from qiime (and much more similar to each other than to qiime). I still prefer QIIME1, but in my experience there is no difference with QIIME2.

Answer 10 · 2019-05-04T20:29:08.000Z

Moreover, "Depending on the method argument, distance() wraps one of UniFrac, DPCoA, JSD, vegdist, betadiver, designdist, or dist." So, I don't understand the origin of these differences..

Answer 11 · 2019-09-19T13:16:40.000Z

My lab has been dealing with this as well. I tried to make a reproducible test based off skbio's python Unifrac (which is what QIIME runs under the hood, I think), and I actually can't get it to run in phyloseq.

skbio documentation: http://scikit-bio.org/docs/0.4.2/generated/generated/skbio.diversity.beta.unweighted_unifrac.html

Code adapted to phyloseq

# Reproducible example taken from skbio's Unifract function
#   http://scikit-bio.org/docs/0.4.2/generated/generated/skbio.diversity.beta.unweighted_unifrac.html

library(phyloseq)
library(ape)

# Make OTU table
u_counts = c(1, 0, 0, 4, 1, 2, 3, 0)
v_counts = c(0, 1, 1, 6, 0, 1, 0, 0)
table=data.frame(u=u_counts, v=v_counts, row.names=paste("OTU", 1:8, sep=""))
tree = read.tree(text="(((((OTU1:0.5,OTU2:0.5):0.5,OTU3:1.0):1.0):0.0,(OTU4:0.75,(OTU5:0.5,((OTU6:0.33,OTU7:0.62):0.5,OTU8:0.5):0.5):0.5):1.25):0.0)root;")
mydata = phyloseq(otu_table(table, taxa_are_rows=TRUE), phy_tree(tree))

# Unifrac; unweighted should be 0.37
UniFrac(mydata, weighted=FALSE)

The result of this is Error in node.desc[i - ntip, ] : subscript out of bounds . This appears to be from line 616 in distance-methods.R (starting at 614 for context):

	# Loop over internal nodes, summing their descendants to get that nodes count
	for(i in ord.node){
	  edge_array[i,] <- colSums(edge_array[node.desc[i-ntip,], , drop=FALSE], na.rm = TRUE)
	}

I don't know the guts of how UniFrac is calculated, but it seems that the code is making an assumption that isn't always satisfied. (That there are fewer tips than internal nodes?)

Given that I also ran into an independent error (#1215) using data phyloseq had filtered, I think it's fair to say that the phyloseq UniFrac implementation still has some bugs in it. Unfortunately, I don't know phylogenetic trees and UniFrac well enough to suggest fixes or I'd try to do so. The QIIME developers claims to benchmark their UniFrac against the original PyCogent implementation every build, so it's probably the safer bet to use QIIME for UniFrac until this is resolved. (If need be, you can import the resulting distance matrix into R for downstream analysis.)

Answer 12 · 2019-09-19T13:43:42.000Z

@wallacelab have you tried other UniFrac implementations within R, such as unifrac from the picante package? I'm wondering whether the simplest short-term fix would be to switch to this if the results are comparable to QIIME2's (and if QIIME2's implementation has been tested).

Answer 13 · 2019-10-08T14:14:25.000Z

@cjfields I finally got time to test this, comparing unifrac from phyloseq, rbiom, picante, and GUniFrac

# ##########
# Install needed libraries
# ##########

install.packages(c('devtools', 'picante', 'phyloseq', 'rbiom'))
devtools::install_github("cmmr/rbiom")


# ############
# Benchmark dataset (very simple)
# ############

# Make OTU table
u_counts = c(1, 0, 0, 4, 1, 2, 3, 0)
v_counts = c(0, 1, 1, 6, 0, 1, 0, 0)
counts=as.matrix(data.frame(u=u_counts, v=v_counts, row.names=paste("OTU", 1:8, sep="")))
tree = read.tree(text="(((((OTU1:0.5,OTU2:0.5):0.5,OTU3:1.0):1.0):0.0,(OTU4:0.75,(OTU5:0.5,((OTU6:0.33,OTU7:0.62):0.5,OTU8:0.5):0.5):0.5):1.25):0.0)root;")


# ########
# Generate UniFrac distances of everything
# ########

# Phyloseq
library(phyloseq)
phyloseq_data = phyloseq(otu_table(counts, taxa_are_rows=TRUE), tree)
phyloseq_weighted = phyloseq::UniFrac(phyloseq_data, weighted=TRUE)
phyloseq_unweighted = phyloseq::UniFrac(phyloseq_data, weighted=FALSE)

# rbiom
rbiom_weighted = rbiom::unifrac(counts, weighted=TRUE, tree=tree)
rbiom_unweighted = rbiom::unifrac(counts, weighted=FALSE, tree=tree)

# GUniFrac
gunifracs = GUniFrac::GUniFrac(t(counts), tree=tree, alpha=1)$unifracs
guni_weighted = gunifracs[, , "d_1"]
guni_unweighted = gunifracs[, , "d_UW"]

# picante (unweighted unifrac only, and definitely the slowest)
picante_unweighted = picante::unifrac(t(counts), tree)

# Weighted values
rbiom_weighted
guni_weighted

# Unweighted values
rbiom_unweighted
guni_unweigthed
picante_unweighted

Short version is that phyloseq still fails on this, but the other three give identical unweighted unifrac distances (0.3692308). The weighted value is different from rbiom (1.543434) to GUniFrac (0.3275034), which I suspect is due to the latter normalizing it, but I haven't had a chance to confirm.

Still working to check with QIIME, but I think your idea of just using a different library is probably a good one. (Note: picante only does unweighted, I think, and in my tests was significantly slower than the others, so I'd recommend rbiom or GUniFrac. )

Answer 14 · 2019-10-09T19:32:01.000Z

Okay, I did some more digging to compare to QIIME.

First, bash script to get sample data and run UniFrac (data is QIIME's own Moving Pictures tutorial):

# Download data
wget https://docs.qiime2.org/2018.11/data/tutorials/moving-pictures/table-deblur.qza # OTU Table
wget https://docs.qiime2.org/2018.11/data/tutorials/moving-pictures/rooted-tree.qza # Tree

# Variables
orig_table=table-deblur.qza
filtered_table=table-filtered.qza
tree=rooted-tree.qza

# Remove OTUs not present in phylogeny
qiime phylogeny filter-table \
    --i-table $orig_table \
    --i-tree $tree \
    --o-filtered-table $filtered_table

# Calculate both weighted and unweighted unifrac
for metric in weighted_unifrac unweighted_unifrac; do

    qiime diversity beta-phylogenetic \
        --i-table $filtered_table \
        --i-phylogeny $tree \
        --p-metric $metric \
        --o-distance-matrix $metric.qza \
        --verbose

    # Export to text
    qiime tools export \
        --input-path $metric.qza \
        --output-path $metric

    # Move distance matrix file to more sensible location
    mv $metric/distance-matrix.tsv $metric.tsv
    rmdir $metric
done


# Export table and tree for comparison with R functions
qiime tools export --input-path $filtered_table --output-path ./
qiime tools export --input-path $tree --output-path ./
 
# Convert biom table to text (less likely to have import issues)
biom convert -i feature-table.biom -o feature-table.biom.txt --to-tsv

Now, R script to compare UniFrac measures:


# Install needed packages
install.packages(c('devtools', 'picante', 'phyloseq', 'rbiom', 'ape'))
devtools::install_github("cmmr/rbiom")

# Load QIIME output files
counts=as.matrix(read.delim("feature-table.biom.txt", header=T, row.names=1, skip=1))
tree=ape::read.tree("tree.nwk")
ape::is.rooted(tree) # Confirm tree is already rooted


# ########
# Generate UniFrac distances of everything
# ########

# QIIME
qiime_weighted = as.dist(read.delim("weighted_unifrac.tsv", row.names=1))
qiime_unweighted = as.dist(read.delim("unweighted_unifrac.tsv", row.names=1))

# Phyloseq
library(phyloseq)
phyloseq_data = phyloseq(otu_table(counts, taxa_are_rows=TRUE), tree)
phyloseq_weighted = phyloseq::UniFrac(phyloseq_data, weighted=TRUE)
phyloseq_unweighted = phyloseq::UniFrac(phyloseq_data, weighted=FALSE)

# rbiom
rbiom_weighted = rbiom::unifrac(counts, weighted=TRUE, tree=tree)
rbiom_unweighted = rbiom::unifrac(counts, weighted=FALSE, tree=tree)

# GUniFrac
gunifracs = GUniFrac::GUniFrac(t(counts), tree=tree, alpha=1)$unifracs
guni_weighted = gunifracs[, , "d_1"]
guni_unweighted = gunifracs[, , "d_UW"]

# picante (unweighted unifrac only, and definitely the slowest)
picante_unweighted = picante::unifrac(t(counts), tree)


# ###########
# Compare methods
# ###########

compare_dists = function(mydists){
    comparison = matrix(NA, nrow=length(mydists), ncol=length(mydists), dimnames=list(names(mydists), names(mydists)))
    for(i in 1:length(mydists)){
        for(j in i:length(mydists)){
            comparison[i,j] = cor(mydists[[i]], mydists[[j]])
        }
    }
    return(comparison)
}

weighted = list(qiime=qiime_weighted, phyloseq = phyloseq_weighted, rbiom = rbiom_weighted, gunifrac = as.dist(guni_weighted))
unweighted = list(qiime=qiime_unweighted, phyloseq = phyloseq_unweighted, rbiom = rbiom_unweighted, gunifrac = as.dist(guni_unweighted), picante=picante_unweighted)
compare_dists(weighted)
compare_dists(unweighted)

Of note, phyloseq gives gives the following warning:

Warning message:
In matrix(tree$edge[order(tree$edge[, 1]), ][, 2], byrow = TRUE,  :
  data length [895] is not a sub-multiple or multiple of the number of rows [448]

The final matrix of comparisons (= correlations among the distance matrices) is:

> compare_dists(weighted)
         qiime  phyloseq     rbiom  gunifrac
qiime        1 0.3009552 1.0000000 0.9757110
phyloseq    NA 1.0000000 0.3009552 0.4489365
rbiom       NA        NA 1.0000000 0.9757110
gunifrac    NA        NA        NA 1.0000000

> compare_dists(unweighted)
         qiime  phyloseq     rbiom  gunifrac   picante
qiime        1 0.9791214 0.9998755 0.9998755 1.0000000
phyloseq    NA 1.0000000 0.9799686 0.9799686 0.9791214
rbiom       NA        NA 1.0000000 1.0000000 0.9998755
gunifrac    NA        NA        NA 1.0000000 0.9998755
picante     NA        NA        NA        NA 1.0000000

So in short, rbiom has a perfect correlation with QIIME for weighted unifrac and an almost perfect one for unweighted. (Not sure why the difference.) picante has a perfect correlation for unweighted.

So it seems if you have to pick one, pick rbiom, and if you can pick two, rbiom for weighted and picante for unweighted. (Of course, this is all aside from the fact that 4-5 different packages provide nearly as many different answers for what is supposed to be a deterministic calculation, but that's another issue.)

Answer 15 · 2019-10-10T14:10:51.000Z

Update: I realized I hadn't rarefied the data before doing UniFrac. Rarefaction makes the Unweighted Unifac equal across everything (except phyloseq), but weighted still only matches for rbiom:

         qiime  phyloseq     rbiom  gunifrac
qiime        1 0.5098069 1.0000000 0.9906098
phyloseq    NA 1.0000000 0.5098069 0.5729023
rbiom       NA        NA 1.0000000 0.9906098
gunifrac    NA        NA        NA 1.0000000

> compare_dists(unweighted)
         qiime phyloseq    rbiom gunifrac  picante
qiime        1 0.975154 1.000000 1.000000 1.000000
phyloseq    NA 1.000000 0.975154 0.975154 0.975154
rbiom       NA       NA 1.000000 1.000000 1.000000
gunifrac    NA       NA       NA 1.000000 1.000000
picante     NA       NA       NA       NA 1.000000

So at least as far as matching QIIME goes, rbiom seems the most reliable. (Incidentally, I also tried this with the benchmark data skbio uses for unit testing, and in that case everything but phyloseq matches 100% for both metrics. Looks like there's some complication in real data that the benchmark isn't capturing.)

Answer 16 · 2019-10-10T15:52:58.000Z

@wallacelab I'm able to reproduce this, great walkthrough! I suspect the fast UniFrac method implemented in phyloseq has issues with weighting.

It does look like UniFrac distance is supposedly tested?

https://github.com/joey711/phyloseq/blob/5c532790722d839ed2a6e991d67fa6c96f8c9745/tests/testthat/test-distance.R

Answer 17 · 2019-10-10T20:26:02.000Z

Yes, and that testing seems to work. (I plugged the data into my comparison and it came out fine; code at end):

> compare_dists(weighted)
           gp500_wufu gp500_wuf  phyloseq     rbiom  gunifrac
gp500_wufu          1 0.9765762 0.9765762 1.0000000 0.9765762
gp500_wuf          NA 1.0000000 1.0000000 0.9765762 1.0000000
phyloseq           NA        NA 1.0000000 0.9765762 1.0000000
rbiom              NA        NA        NA 1.0000000 0.9765762
gunifrac           NA        NA        NA        NA 1.0000000

> compare_dists(unweighted)
          gp500_uuf phyloseq rbiom gunifrac picante
gp500_uuf         1        1     1        1       1
phyloseq         NA        1     1        1       1
rbiom            NA       NA     1        1       1
gunifrac         NA       NA    NA        1       1
picante          NA       NA    NA       NA       1

It looks like some programs have different defaults for normalization, but otherwise everything matches. So it looks like the test data lacks whatever feature is causing issues for phyloseq. (I suspect it's with the phylogenetic tree having an odd number of total edges, given the error it spits out.) Notably, the Global Patterns tree used in the unit test has 998 edges (even) and works, but the Moving Pictures one has 1515 (odd) and fails.

Code for above comparison:

# # Install needed packages
# install.packages(c('devtools', 'picante', 'phyloseq', 'rbiom', 'ape'))
# devtools::install_github("cmmr/rbiom")

library(phyloseq)

# Sample Global Patterns subset data from the phyloseq unit tests
treeFile = system.file("extdata", "GP_tree_rand_short.newick.gz", package="phyloseq")
GP500File = system.file("extdata", "GP_otu_table_rand_short.txt.gz", package = "phyloseq")
GP500 = import_qiime(GP500File, treefilename = treeFile)

# Now import the results with read.table()
gp500_uuf = read.csv(system.file("extdata", "gp500-uuf.csv", package = "phyloseq"), header = FALSE, fill = TRUE)
gp500_wuf = read.csv(system.file("extdata", "gp500-wuf.csv", package = "phyloseq"), header = FALSE, fill = TRUE)
gp500_wufu = read.csv(system.file("extdata", "gp500-wufu.csv", package = "phyloseq"), header = FALSE, fill = TRUE)
# Add the sample names
colnames(gp500_uuf) <- rownames(gp500_uuf) <- colnames(gp500_wuf) <- rownames(gp500_wuf) <- colnames(gp500_wufu) <- rownames(gp500_wufu) <- sample_names(GP500)
# Coerce to Distance Matrices for comparison `"dist"` class
gp500_wufu <- as.dist(gp500_wufu)
gp500_wuf <- as.dist(gp500_wuf)
gp500_uuf <- as.dist(gp500_uuf)


# ########
# Generate UniFrac distances of everything
# ########

counts = otu_table(GP500)
tree = phy_tree(GP500)


# Phyloseq
phyloseq_data = phyloseq(otu_table(counts, taxa_are_rows=TRUE), tree)
phyloseq_weighted = phyloseq::UniFrac(phyloseq_data, weighted=TRUE)
phyloseq_unweighted = phyloseq::UniFrac(phyloseq_data, weighted=FALSE)

# rbiom
rbiom_weighted = rbiom::unifrac(counts, weighted=TRUE, tree=tree)
rbiom_unweighted = rbiom::unifrac(counts, weighted=FALSE, tree=tree)

# GUniFrac
gunifracs = GUniFrac::GUniFrac(t(counts), tree=tree, alpha=1)$unifracs
guni_weighted = gunifracs[, , "d_1"]
guni_unweighted = gunifracs[, , "d_UW"]

# picante (unweighted unifrac only, and definitely the slowest)
picante_unweighted = picante::unifrac(t(counts), tree)


# ###########
# Compare methods
# ###########

compare_dists = function(mydists){
    comparison = matrix(NA, nrow=length(mydists), ncol=length(mydists), dimnames=list(names(mydists), names(mydists)))
    for(i in 1:length(mydists)){
        for(j in i:length(mydists)){
            comparison[i,j] = cor(mydists[[i]], mydists[[j]])
        }
    }
    return(comparison)
}

weighted = list(gp500_wufu=gp500_wufu, gp500_wuf=gp500_wuf, phyloseq = phyloseq_weighted, rbiom = rbiom_weighted, gunifrac = as.dist(guni_weighted))
unweighted = list(gp500_uuf=gp500_uuf, phyloseq = phyloseq_unweighted, rbiom = rbiom_unweighted, gunifrac = as.dist(guni_unweighted), picante=picante_unweighted)
compare_dists(weighted)
compare_dists(unweighted)

Answer 18 · 2019-10-11T20:26:06.000Z

@wallacelab odd, this almost sounds like a fence-post error but only for weighted UniFrac. I see how it fell through the cracks; definitely worth adding a simple test case for catching this.

Answer 19 · 2019-11-09T22:13:08.000Z

Hi @wallacelab and @cjfields, if you want to compare phyloseq's weighted unifrac calculation to rbiom's, you should set normalized = FALSE. I.e.,

phyloseq_weighted = phyloseq::UniFrac(phyloseq_data, weighted=TRUE, normalized = FALSE)

Then running @wallacelab 's code gives near perfect similarity between phyloseq and rbiom's weighted calculations.

Rbiom is computing the quantity u in DOI.org/10.1128/AEM.01996-06, whereas I believe phyloseq with the normalized = TRUE option (the default) is returning u / D and with normalized = FALSE is returning u.

(Please let me know if I've missed something here)

Answer 20 · 2019-12-28T03:54:38.000Z

Ping. Interested in any updates here. Is this just a normalization difference?

Answer 21 · 2019-12-30T16:14:10.000Z

Ping. Interested in any updates here. Is this just a normalization difference?

No, it's not. Different normalization defaults are a minor part of it, but the bigger issue seems to be that Phyloseq is making assumptions about the data without checking if those assumptions are true.

The two outstanding problems I see are:

As mentioned in #936 by @PandengWang, phyloseq assumes a dichotomous tree, but doesn't seem to have anything to check if the tree actually is dichotomous or to correct it if it is. (His post includes quick code to use ape to coerce a non-dichotomous tree into one.) This seems to be the worse problem, because the code will run (with a warning) but generate highly incorrect results since the parent-child relationships get scrambled. This problem seems to be the one that created this Issue in the first place.

phyloseq still throws a mysterious error while trying to run very basic SKBio test data (reproduced below for convenience). (See my above post on 19 Sept 2019 for the line of code that seems to be causing it.)

# Reproducible example taken from skbio's Unifract function
#   http://scikit-bio.org/docs/0.4.2/generated/generated/skbio.diversity.beta.unweighted_unifrac.html

library(phyloseq)
library(ape)

# Make OTU table
u_counts = c(1, 0, 0, 4, 1, 2, 3, 0)
v_counts = c(0, 1, 1, 6, 0, 1, 0, 0)
table=data.frame(u=u_counts, v=v_counts, row.names=paste("OTU", 1:8, sep=""))
tree = read.tree(text="(((((OTU1:0.5,OTU2:0.5):0.5,OTU3:1.0):1.0):0.0,(OTU4:0.75,(OTU5:0.5,((OTU6:0.33,OTU7:0.62):0.5,OTU8:0.5):0.5):0.5):1.25):0.0)root;")
mydata = phyloseq(otu_table(table, taxa_are_rows=TRUE), phy_tree(tree))

# Unifrac; unweighted should be 0.37
UniFrac(mydata, weighted=FALSE)

> Error in node.desc[i - ntip, ] : subscript out of bounds

So no, not just normalization. (Though it would be nice to include a message to the user when UniFrac is run so they know what the default is.)

Answer 22 · 2020-03-09T19:11:16.000Z

@wallacelab do you have a comparison of the speeds for each implementation of unifrac that you list above? I'm processing a large OTU table right now with GUniFrac, and it's been running for ~72 hours, so I'm looking for a faster implementation.

Answer 23 · 2020-03-10T14:55:10.000Z

@nick-youngblut I didn't, but it's not too hard to test. (see https://www.alexejgossmann.com/benchmarking_r/ for benchmarking tutorial)

# # Install needed packages
# install.packages(c('devtools', 'picante', 'phyloseq', 'rbiom', 'ape'))
# devtools::install_github("cmmr/rbiom")
# devtools::install_github("eddelbuettel/rbenchmark")

library(phyloseq)

# Sample Global Patterns subset data from the phyloseq unit tests
treeFile = system.file("extdata", "GP_tree_rand_short.newick.gz", package="phyloseq")
GP500File = system.file("extdata", "GP_otu_table_rand_short.txt.gz", package = "phyloseq")
GP500 = import_qiime(GP500File, treefilename = treeFile)

# ########
# Benchmark different methods
# ########

counts = otu_table(GP500)
tree = phy_tree(GP500)
phyloseq_data = phyloseq(otu_table(counts, taxa_are_rows=TRUE), tree)

# benchmark
rbenchmark::benchmark("phyloseq"={phyloseq::UniFrac(phyloseq_data, weighted=TRUE)},
                      "rbiom" = {rbiom::unifrac(counts, weighted=TRUE, tree=tree)},
                      "gunifrac" = {GUniFrac::GUniFrac(t(counts), tree=tree, alpha=1)},
                      replications=10,
                      columns=c("test", "replications", "elapsed", "relative", "user.self", "sys.self"))

Which results (on my computer) in

      test replications elapsed relative user.self sys.self
3 gunifrac           10  98.359  322.489    98.292    0.036
1 phyloseq           10   0.755    2.475     0.755    0.000
2    rbiom           10   0.305    1.000     0.105    0.080

So looks like rbiom is fastest (~2.5x faster than phyloseq and ~300x faster than gunifrac)

Answer 24 · 2020-03-10T15:01:22.000Z

I can back up those results. With an OTU table of approx. 6000 x 3500, GUniFrac didn't finish after 72 hours, while rbiom (with 4 threads) finished in < 1 hour. I didn't try phyloseq due to the questionable results that it can generate.

Answer 25 · 2021-04-01T15:46:21.000Z

Any updates on this?

Answer 26 · 2021-04-02T02:05:03.000Z

For phyloseq unifrac I wonder if this will solve any of the discrepancies?

I was having issues with edges in the tree and doing this resolved them. But, never validated answers against other programs calculating unifrac

tree = read.tree("FinalRFiles/SalAMPtree.nwk")

This fixes issue with tree$edge issue and unifrac calculations later

see #936

new_tre <- ape::multi2di(tree)

Which is mentioned above, but sounds like this may be a bad fix?

Answer 27 · 2021-07-01T17:21:36.000Z

I cannot add much to this thread but all I can say is I have a massive dataset of vertebrate gut microbiomes and the phyloseq unifrac ordination leads to a drastically different biological conclusion from the qiime2 ordination. I made to sure to rarefy in phyloseq as well. Based off this thread I will be sticking with the qiime2 output but I hope this gets fixed!

Answer 28 · 2022-02-02T15:14:18.000Z

all I can say is I have a massive dataset of vertebrate gut microbiomes and the phyloseq unifrac ordination leads to a drastically different biological conclusion from the qiime2 ordination

It's getting close to 4 years since my original post of this issue, and it seems that qiime (qiime2 now) is still the way to go for calculating (phylogenetic) beta diversity.

It might be best to include large warnings in the documentation that qiime2 and phyloseq can generate very different beta diversity values

Answer 29 · 2022-02-02T15:24:15.000Z

@nick-youngblut I agree re: warnings. Unfortunately (on the surface at least) it appears there has been very little additional maintenance or development on phyloseq, though hopefully that will change!

Answer 30 · 2022-02-04T17:18:32.000Z

Hi @nick-youngblut - could you comment if this is happening with unweighted unifrac also, or only weighted unifrac. I have noticed that weighted unifrac in phyloseq seems very different in biological conclusion from other measures such as bray-curtis.

Answer 31 · 2022-02-04T17:22:15.000Z

@CarlyMuletzWolz see the above posts on unweighted unifrac