Duplicate (but swapped) right and left
mustafa0x opened this issue · 4 comments
Data
id,country,name,url
1,AD,University of Andorra,http://www.uda.ad/
2,AE,Abu Dhabi University,http://www.adu.ac.ae/
3,AE,Ajman University of Science & Technology,http://www.ajman.ac.ae/
4,AE,Alain University of Science and Technology,http://www.alainuniversity.ac.ae/
5,AE,Al Ghurair University,http://www.agu.ae/
Code
import pandas as pd
import numpy as np
from string_grouper import match_strings, match_most_similar, group_similar_strings, StringGrouper
data = pd.read_csv('world-universities.txt')
matches = match_strings(data['name'])
matches[matches.left_side != matches.right_side].head()
Output
    left_side                           right_side                          similarity
11  American University of Sharjah      University of Sharjah               0.800178
36  University of Sharjah               American University of Sharjah      0.800178
43  Aria Institute of Higher Education  Rana Institute of Higher Education  0.844736
76  Rana Institute of Higher Education  Aria Institute of Higher Education  0.844736
85  Academy of Arts                     National Academy of Arts            0.800615
Issue
The first and second rows of the output are the same pair with the sides swapped, and likewise the third and fourth rows. How can I avoid these duplicates?
Also: retaining ID
The records have an ID column. How can I retain it in the final output?
Thank you!
Hi @mustafa0x ,
For the first issue - this is done by design. If you want to deduplicate all the matches you probably want to look at the group_similar_strings function.
For the second question, this is currently not built in. However, you could use the code in the following comment to get the same additional columns:
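The linked comment is not reproduced in this thread. As a rough sketch of one alternative (not necessarily what that comment does), the IDs can be looked up by name with plain pandas after matching. This assumes names are unique in the data. The matched pair and its similarity value below are made up for illustration, and the column names left_id and right_id are hypothetical.

```python
import pandas as pd

# Two records taken from the sample data in the question.
data = pd.DataFrame(
    {
        "id": [3, 4],
        "name": [
            "Ajman University of Science & Technology",
            "Alain University of Science and Technology",
        ],
    }
)

# Stand-in for the DataFrame returned by match_strings.
# The pair and the similarity value are illustrative, not real output.
matches = pd.DataFrame(
    {
        "left_side": ["Ajman University of Science & Technology"],
        "right_side": ["Alain University of Science and Technology"],
        "similarity": [0.85],
    }
)

# Build a name -> id lookup. Assumption: names are unique; if they are not,
# only the first id per name is kept here.
name_to_id = data.drop_duplicates("name").set_index("name")["id"]

# Attach the id of each side as extra columns (hypothetical column names).
with_ids = matches.assign(
    left_id=matches["left_side"].map(name_to_id),
    right_id=matches["right_side"].map(name_to_id),
)
print(with_ids)
```

If two different records share exactly the same name, a lookup by name cannot tell them apart, so this only works as a stopgap.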
Thank you for the reply!
Similar strings will be manually reviewed to verify that they are indeed duplicates, so group_similar_strings doesn't seem to be the right solution.
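For a manual review like this, one row per pair is enough. A minimal sketch of one way to get that with plain pandas, assuming the left_side and right_side columns shown in the output above: keep only the rows where the left string sorts before the right string. The sample rows are copied from that output; the name unique_pairs is just illustrative.

```python
import pandas as pd

# Sample rows copied from the output in the question; in practice `matches`
# would be the DataFrame returned by match_strings.
matches = pd.DataFrame(
    {
        "left_side": [
            "American University of Sharjah",
            "University of Sharjah",
            "Aria Institute of Higher Education",
            "Rana Institute of Higher Education",
        ],
        "right_side": [
            "University of Sharjah",
            "American University of Sharjah",
            "Rana Institute of Higher Education",
            "Aria Institute of Higher Education",
        ],
        "similarity": [0.800178, 0.800178, 0.844736, 0.844736],
    },
    index=[11, 36, 43, 76],
)

# A strict "<" comparison drops rows where both sides are equal and keeps
# exactly one of each swapped pair (the one whose left string sorts first).
unique_pairs = matches[matches["left_side"] < matches["right_side"]]
print(unique_pairs)
```

One caveat: because the comparison is on the strings themselves, two distinct records with exactly the same name would also be filtered out, so those would need to be checked separately.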
Thanks!
Have you had a look at the StringGrouper class?
https://github.com/Bergvca/string_grouper#the-stringgrouper-class
That section describes a workflow that might be useful to you.
Thank you.