Bergvca/string_grouper

Duplicate (but swapped) right and left

mustafa0x opened this issue · 4 comments

Data

world-universities.txt

id,country,name,url
1,AD,University of Andorra,http://www.uda.ad/
2,AE,Abu Dhabi University,http://www.adu.ac.ae/
3,AE,Ajman University of Science & Technology,http://www.ajman.ac.ae/
4,AE,Alain University of Science and Technology,http://www.alainuniversity.ac.ae/
5,AE,Al Ghurair University,http://www.agu.ae/

Code

import pandas as pd
from string_grouper import match_strings

# Load the university list and match every name against all the others
data = pd.read_csv('world-universities.txt')
matches = match_strings(data['name'])

# Hide self-matches (a name matched with itself)
matches[matches.left_side != matches.right_side].head()

Output

                             left_side                          right_side  similarity
11      American University of Sharjah               University of Sharjah    0.800178
36               University of Sharjah      American University of Sharjah    0.800178
43  Aria Institute of Higher Education  Rana Institute of Higher Education    0.844736
76  Rana Institute of Higher Education  Aria Institute of Higher Education    0.844736
85                     Academy of Arts            National Academy of Arts    0.800615

Issue

Rows 1 and 2 are the same pair with left and right swapped, and likewise rows 3 and 4. How can I avoid these duplicates?

Also: retaining ID

The records have an ID column. How can I retain it in the final output?

Thank you!

Hi @mustafa0x ,

For the first issue: this is by design. If you want to deduplicate all the matches, you probably want to look at the group_similar_strings function.
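If the goal is only to drop the mirrored rows while keeping the pairs for review, a pandas-only filter may be enough. The sketch below is not a string_grouper feature. It assumes the matches DataFrame has the left_side, right_side, and similarity columns shown in the output above, and it rebuilds a small stand-in for that DataFrame from the values in the issue so that it runs on its own. The name unique_pairs is illustrative.

```python
import pandas as pd

# Toy stand-in for the match_strings output (values copied from the issue's output).
matches = pd.DataFrame({
    "left_side": [
        "American University of Sharjah",
        "University of Sharjah",
        "Aria Institute of Higher Education",
        "Rana Institute of Higher Education",
    ],
    "right_side": [
        "University of Sharjah",
        "American University of Sharjah",
        "Rana Institute of Higher Education",
        "Aria Institute of Higher Education",
    ],
    "similarity": [0.800178, 0.800178, 0.844736, 0.844736],
})

# Keep only the orientation where left sorts before right.
# This also drops self-matches, because a string is never less than itself.
unique_pairs = matches[matches["left_side"] < matches["right_side"]].reset_index(drop=True)
print(unique_pairs)
```

One caveat: if two different records share exactly the same name, this comparison cannot tell them apart from a self-match and removes them as well.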

For the second question, this is currently not built in. However, you could use the code in the following comment to get the same additional columns:

#8 (comment)
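The code from the linked comment is not reproduced here. As a rough alternative, the IDs can be looked up by name with plain pandas after matching. This is a minimal sketch that assumes every name in the input is unique. The IDs 101 and 102 and the column names left_id and right_id are made up for illustration.

```python
import pandas as pd

# Hypothetical input records; the IDs are invented for this example.
data = pd.DataFrame({
    "id": [101, 102],
    "name": ["American University of Sharjah", "University of Sharjah"],
})

# Toy stand-in for one row of the match_strings output.
matches = pd.DataFrame({
    "left_side": ["American University of Sharjah"],
    "right_side": ["University of Sharjah"],
    "similarity": [0.800178],
})

# Build a name -> id lookup (assumes names are unique) and map both sides.
ids = data.set_index("name")["id"]
with_ids = matches.assign(
    left_id=matches["left_side"].map(ids),
    right_id=matches["right_side"].map(ids),
)
print(with_ids)
```

If names are not unique, a name alone does not identify a record, so this lookup is ambiguous and a different key would be needed.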

Thank you for the reply!

Similar strings will be manually reviewed to verify that they are indeed duplicates, so group_similar_strings doesn't seem to be the right solution.

Thanks!

Have you had a look at the StringGrouper class?

https://github.com/Bergvca/string_grouper#the-stringgrouper-class

This describes a workflow that might be of use for you.

Thank you.