Bergvca/string_grouper

Function for simply calculating pairwise similarity without matching

Closed this issue · 3 comments

Hi,
Is there a function that can simply calculate pairwise (cosine) similarity (for already constructed pairs, no need to match)? If not, please add this function.
In some cases, I already have pairs (e.g. var1 and var2 in the same data file) and want to calculate their pairwise similarity. For instance, I first use String Grouper to match two variables and get a file of pairs. Then I modify the two string variables (for example, by stripping more prefixes and suffixes). I need the similarity of the two modified string variables so that I can compare the two similarity scores.
With String Grouper's current functions, I have to perform the matching and then merge the matches onto my pairs:
matches = match_strings(data["var1"], data["var2"], min_similarity=0.80)
This requires much more computation than what I actually need.

Hi @KiraJYQiu,

> This requires much more computation than what I actually need.

You are right. In computational terms, you only need the diagonal elements of the product of two sparse matrices of the same shape (strictly, one matrix times the transpose of the other). See PR #40 for a possible solution.
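To make the "diagonal only" point concrete, here is a minimal, library-free sketch. The small matrices `A` and `B` are made-up, already-normalised row vectors (row i of `A` is paired with row i of `B`); they are not taken from string_grouper or from PR #40.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

# Hypothetical unit-length row vectors: A[i] is paired with B[i].
A = [[1.0, 0.0, 0.0],
     [0.6, 0.8, 0.0],
     [0.0, 0.0, 1.0]]
B = [[1.0, 0.0, 0.0],
     [0.8, 0.6, 0.0],
     [0.0, 1.0, 0.0]]

# Full matching: every row of A against every row of B (n * n dot products).
full = [[dot(a, b) for b in B] for a in A]

# Pairwise similarity: only row i against row i (n dot products),
# which is exactly the diagonal of the full product.
pairwise = [dot(a, b) for a, b in zip(A, B)]
```

With SciPy sparse matrices, the same row-by-row product can usually be written as an element-wise multiply followed by a row sum, e.g. `A.multiply(B).sum(axis=1)`, which avoids building the full product.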

I'm not sure exactly how you do it, but I feel I should warn you in advance (ignore this paragraph if it is not relevant): modifying the string variables generally changes the corpus used to vectorize the strings whose pair-similarities are to be computed. The new similarity scores of the modified strings would then be based on a different corpus from the one behind the original similarity scores, so the old and new scores should not be compared directly. Instead, all the original strings should be included together with the modified strings in the pairwise similarity computation. I hope this helps.
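The corpus point can be sketched with a small hand-rolled TF-IDF over character trigrams. Everything here is an illustrative assumption: the helper names (`fit_idf`, `vectorize`, `pairwise`), the smoothed IDF formula, and the example strings are hypothetical and are not string_grouper's implementation. The idea is only that both sets of pairs are scored with IDF weights fitted once on the combined corpus.

```python
import math
from collections import Counter

def ngrams(s, n=3):
    """Character n-grams of a string."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def fit_idf(corpus, n=3):
    """Smoothed inverse document frequency for every n-gram in the corpus."""
    df = Counter()
    for doc in corpus:
        df.update(set(ngrams(doc, n)))
    total = len(corpus)
    return {g: math.log((1 + total) / (1 + c)) + 1 for g, c in df.items()}

def vectorize(s, idf, n=3):
    """L2-normalised TF-IDF vector stored as a dict (a sparse row)."""
    tf = Counter(ngrams(s, n))
    vec = {g: c * idf[g] for g, c in tf.items() if g in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else {}

def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * v.get(g, 0.0) for g, w in u.items())

def pairwise(left, right, idf):
    """Similarity of left[i] with right[i] only; no cross-matching."""
    return [cosine(vectorize(a, idf), vectorize(b, idf))
            for a, b in zip(left, right)]

# Hypothetical noisy and cleaned versions of the same pairs.
var1 = ["acme corp engineer", "globex inc manager"]
var2 = ["acme corporation", "globex incorporated"]
var1_clean = ["acme corp", "globex inc"]
var2_clean = ["acme corporation", "globex incorporated"]

# Fit once on originals AND modified strings so both score sets share weights.
shared_idf = fit_idf(var1 + var2 + var1_clean + var2_clean)
old_scores = pairwise(var1, var2, shared_idf)
new_scores = pairwise(var1_clean, var2_clean, shared_idf)

# Fitting on the cleaned strings alone uses different weights, so these
# scores are not on the same footing as old_scores.
separate_idf = fit_idf(var1_clean + var2_clean)
separate_scores = pairwise(var1_clean, var2_clean, separate_idf)
```

Because `old_scores` and `new_scores` come from the same `shared_idf`, a change between them reflects the string edits rather than a shift in the underlying weights.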

Thank you for your reply.
First, for your solution in #40, I just want to make sure that compute_pairwise_similarities uses the same options as other functions in string_grouper. By default it should also exclude "[,-./]|\s" before calculating similarity so that their algorithms are consistent.

Second, I am aware of what I am doing. I generate new variables that hold the modified versions of the original strings. For some databases that I need to link via fuzzy matching, the entity names can be very noisy (e.g. the employer name in FEC data includes occupations/job titles). Removing this noise with a few lines of code can't be perfectly accurate: it removes what should be removed but can also remove what should be kept. So I may need two similarity scores (for the noisier and the cleaner strings) for comparison. Thanks again for raising these concerns in such detail.

> I just want to make sure that compute_pairwise_similarities uses the same options as other functions in string_grouper. By default it should also exclude "[,-./]|\s" before calculating similarity so that their algorithms are consistent.

Yes, it does, as you can confirm by inspecting the source code.
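For reference, a small sketch of what the pattern quoted above removes when applied with Python's `re` module. The helper `strip_ignored` and the sample string are hypothetical; how string_grouper applies the pattern internally is best checked in its source, as suggested.

```python
import re

# Pattern quoted in this thread. Inside the character class, ",-." is a
# range covering ",", "-" and ".", followed by "/"; "\s" matches whitespace.
DEFAULT_REGEX = r"[,-./]|\s"

def strip_ignored(s, pattern=DEFAULT_REGEX):
    """Remove every character matched by the pattern (hypothetical helper)."""
    return re.sub(pattern, "", s)

cleaned = strip_ignored("Mc-Donald's, Inc./USA")
untouched = strip_ignored("A+B")  # "+" is outside the class, so it stays
```

Here `cleaned` is `"McDonald'sIncUSA"`: commas, hyphens, periods, slashes and whitespace are dropped, while the apostrophe is kept.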

All the best as you clean up your databases!