vegdist() in disagreement with designdist()
aloboa opened this issue · 2 comments
Given:
> x <- rbind(c(0,1,1,1), c(0,1,0,1))
> x
     [,1] [,2] [,3] [,4]
[1,]    0    1    1    1
[2,]    0    1    0    1
and the derived 2x2 frequency table
> xf
     [,1] [,2]
[1,]    2    1
[2,]    0    1
which corresponds to
a b
c d
I get:
> vegdist(xf, method="jaccard")
1
2 0.6666667
which is in agreement with the definition a/(a+b+c)
> 2/(2+1+0)
[1] 0.6666667
But I do not get the same result using designdist():
> designdist(xf, method="a/(a+b+c)", abcd=TRUE)
1
2 0.5
> designdist(xf, method="(A+B-2*J)/(A+B-J)", abcd=FALSE)
1
2 0.5
Am I not understanding designdist() or is there a problem with that function?
The Jaccard distance is not defined as a / (a + b + c + d). It is (b + c) / (a + b + c). You are not supposed to form the two-way table and then run that through vegdist(). The correct way is to pass the actual data, which in your case is x:
> vegdist(x, method = "jaccard", binary = TRUE)
1
2 0.3333333
which corresponds with (1 + 0) / (2 + 1 + 0) = (b + c) / (a + b + c):
> (1 + 0) / (2 + 1 + 0)
[1] 0.3333333
and which corresponds with designdist():
> designdist(x, method = "(A+B-2*J)/(A+B-J)")
1
2 0.3333333
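For what it's worth, the same numbers can be reproduced by hand in base R. This is only a sketch: `jaccard_abc` is a hypothetical helper written for this thread, not a vegan function, and it assumes that presence is decided by `> 0`. It counts a, b and c directly from two rows of x and returns both the similarity and the distance.

```r
# Two sites (rows) by four species (columns), as in the question
x <- rbind(c(0, 1, 1, 1), c(0, 1, 0, 1))

# Hypothetical helper (not part of vegan): Jaccard terms for two rows
jaccard_abc <- function(u, v) {
  u <- u > 0                 # presence/absence of each species in row 1
  v <- v > 0                 # presence/absence of each species in row 2
  both  <- sum(u & v)        # a: species present in both rows
  only1 <- sum(u & !v)       # b: species present only in row 1
  only2 <- sum(!u & v)       # c: species present only in row 2
  total <- both + only1 + only2
  c(a = both, b = only1, c = only2,
    similarity = both / total,
    distance   = (only1 + only2) / total)
}

res <- jaccard_abc(x[1, ], x[2, ])
print(res)
```

Here a = 2, b = 1 and c = 0, so the similarity is 2/3 and the distance is 1/3, the latter being the value reported for x above.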
I don't actually know what coefficient a/(a+b+c+d) yields, but it certainly isn't the Jaccard distance, nor is it the simple matching coefficient, and my copy of Legendre & Legendre (where I have to look these things up) is at home just now so I can't ask it.
Thanks, but note that, in the code, I am not actually using a/(a+b+c+d) but a/(a+b+c) (which is 2/(2+1+0)). It was a typo in the text, which I have corrected.
And you are totally right that I was confusing similarity with distance (1 - a/(a+b+c) = (b+c)/(a+b+c)), which actually clarifies the issue.
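Spelled out as a short derivation, using the a, b, c notation from this thread and the counts a = 2, b = 1, c = 0 obtained from x, the similarity and the distance relate as follows:

```latex
\begin{align*}
  d_J &= 1 - s_J
       = 1 - \frac{a}{a+b+c}
       = \frac{(a+b+c) - a}{a+b+c}
       = \frac{b+c}{a+b+c}, \\[4pt]
  s_J &= \frac{2}{2+1+0} = \frac{2}{3},
  \qquad
  d_J = \frac{1+0}{2+1+0} = \frac{1}{3} = 1 - \frac{2}{3}.
\end{align*}
```

The symbols s_J and d_J are just labels chosen here for the Jaccard similarity and distance; they do not appear elsewhere in the thread.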