About Pearson similarity
KunlinY opened this issue · 7 comments
Hi! When I looked into the implementation of Pearson similarity in Similarity.h, I was confused by the following code:
https://github.com/gasevi/pyreclab/blob/master/algorithms/Similarity.h#L92
```cpp
for( ind = v1.begin() ; ind != end ; ++ind )
{
    double rv1 = *ind;
    double rv2 = v2.get( ind.index() );
    if( rv2 > 0 )
    {
        devp += ( rv1 - mean1 ) * ( rv2 - mean2 );
        dev1 += ( rv1 - mean1 ) * ( rv1 - mean1 ); // Why put this line here??
        dev2 += ( rv2 - mean2 ) * ( rv2 - mean2 );
    }
}
```
If the index of rv1 is not present in v2, then dev1 is not incremented. Is something wrong here?
Hi @KunlinY ,
first of all, the variables devp, dev1 and dev2 follow directly from the Pearson similarity definition: devp is the numerator and sqrt( dev1 ) * sqrt( dev2 ) is the denominator. I am not sure why dev1 seems wrong to you.
Regarding your final question, to calculate the Pearson similarity we only consider the terms that are not zero, i.e., the dimensions that have been rated by both users. Take a look at the paper "Collaborative Filtering Recommender Systems", J. Ben Schafer et al., page 12 (302). It says: "The Pearson correlation coefficient is calculated by comparing ratings for all items rated by both the target user and the neighbor".
So you might ask: why check rv2 and not rv1? ... It is just a design choice.
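For reference, the quantity being discussed can be written out as below. This is a sketch under assumptions: $I$ is taken to be the set of items rated by both users, $r_{1,i}$ and $r_{2,i}$ correspond to `rv1` and `rv2`, and $\bar{r}_1$, $\bar{r}_2$ correspond to `mean1` and `mean2` (the snippet does not show how those means are computed).

$$
\mathrm{sim}(1,2) = \frac{\sum_{i \in I} (r_{1,i} - \bar{r}_1)(r_{2,i} - \bar{r}_2)}{\sqrt{\sum_{i \in I} (r_{1,i} - \bar{r}_1)^2}\,\sqrt{\sum_{i \in I} (r_{2,i} - \bar{r}_2)^2}}
$$

The numerator is `devp`, and the two sums under the square roots are `dev1` and `dev2`. All three run over the same index set $I$.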
Can I ask you why you need so much information ?
Best,
GSV.
Thanks for your quick reply! I see what you mean, @gasevi. However, I would like to know why dev1 increases only when rv2 is not zero, i.e., when user2 has rated item ind. I think the calculation of dev1 should be independent of the value of rv2, so the dev1 increment might need to be moved out of the if condition.
In my view, the code should be changed to:
    for( ind = v1.begin() ; ind != end ; ++ind )
    {
        double rv1 = *ind;
        double rv2 = v2.get( ind.index() );
        dev1 += ( rv1 - mean1 ) * ( rv1 - mean1 ); // moved from IF block
        if( rv2 > 0 )
        {
            devp += ( rv1 - mean1 ) * ( rv2 - mean2 );
            // dev1 += ( rv1 - mean1 ) * ( rv1 - mean1 ); // Why put this line here??
            dev2 += ( rv2 - mean2 ) * ( rv2 - mean2 );
        }
    }
Hi @KunlinY ,
I think you are not considering that the iterator of v1 only visits nonzero elements, so rv1 always takes a valid rating value. Does it make sense to you now?
Best,
GSV.
True. But when rv2 is 0 but rv1 not, dev1 misses the increment. I know that zero values are omitted here for efficiency, and that is totally fine. However, in the Pearson algorithm, all the values in v1 should be summed, regardless of whether rv2 is 0.
"But when rv2 is 0 but rv1 not, dev1 misses the increment" ... Right! That is because the deviations are only calculated when BOTH values ( rv1 AND rv2 ) are nonzero.
"The Pearson correlation coefficient is calculated by comparing ratings for all items rated by BOTH the target user and the neighbor"
Turns out that I misunderstood the Pearson similarity :(
Appreciate your detailed answer!!!
This is a special case because the zeros are not real zero values; they mark "no information", so you can't use them to calculate anything. That's why both values must be nonzero at the same time for the result to make sense.
I think this is not related to the Pearson definition itself, but to how it is applied to sparse vectors.
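To make the difference between the two readings in this thread concrete, here is a minimal standalone sketch. It is not the pyreclab code: `SparseRatings`, `pearsonCoRated` and `pearsonAllOfV1` are hypothetical names, a `std::map` stands in for the library's sparse vector, a missing key plays the role of the "no information" zero, and returning 0 for a zero denominator is an assumption of this sketch.

```cpp
#include <cmath>
#include <map>

// Hypothetical sparse vector: item id -> rating. A missing key means "not rated".
using SparseRatings = std::map<int, double>;

// Variant discussed as correct in the thread: all three sums run only over
// items rated by BOTH users. mean1 and mean2 are supplied by the caller.
double pearsonCoRated( const SparseRatings& v1, double mean1,
                       const SparseRatings& v2, double mean2 )
{
    double devp = 0.0, dev1 = 0.0, dev2 = 0.0;
    for( const auto& entry : v1 )
    {
        auto it = v2.find( entry.first );
        if( it == v2.end() )
        {
            continue; // user 2 has no rating for this item: skip it entirely
        }
        double rv1 = entry.second;
        double rv2 = it->second;
        devp += ( rv1 - mean1 ) * ( rv2 - mean2 );
        dev1 += ( rv1 - mean1 ) * ( rv1 - mean1 );
        dev2 += ( rv2 - mean2 ) * ( rv2 - mean2 );
    }
    double denom = std::sqrt( dev1 ) * std::sqrt( dev2 );
    return denom > 0.0 ? devp / denom : 0.0; // assumption: 0 when undefined
}

// Variant proposed earlier in the thread: dev1 is accumulated over every
// item in v1, while devp and dev2 still use only the co-rated items.
double pearsonAllOfV1( const SparseRatings& v1, double mean1,
                       const SparseRatings& v2, double mean2 )
{
    double devp = 0.0, dev1 = 0.0, dev2 = 0.0;
    for( const auto& entry : v1 )
    {
        double rv1 = entry.second;
        dev1 += ( rv1 - mean1 ) * ( rv1 - mean1 ); // outside the co-rated check
        auto it = v2.find( entry.first );
        if( it == v2.end() )
        {
            continue;
        }
        double rv2 = it->second;
        devp += ( rv1 - mean1 ) * ( rv2 - mean2 );
        dev2 += ( rv2 - mean2 ) * ( rv2 - mean2 );
    }
    double denom = std::sqrt( dev1 ) * std::sqrt( dev2 );
    return denom > 0.0 ? devp / denom : 0.0; // assumption: 0 when undefined
}
```

With user 1 rating items {1: 5, 2: 3, 3: 1}, user 2 rating items {1: 4, 2: 2}, and both means equal to 3, the co-rated variant gives 2 / (2 · √2) ≈ 0.7071, while the second variant gives 2 / (√8 · √2) = 0.5. The extra term in dev1 comes from item 3, which user 2 never rated, so the denominator mixes sums over different index sets.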