Indices not matching original data.
Closed this issue · 1 comments
@mnijhuis-dnb
Thank you so much for this library. I'm very new to Python and this is one of my first projects. Your library was extremely clear, useful and easy to understand and follow. It's great work so I just wanted to mention that first.
I've spent the better part of a month building a database/matching process. I'm attempting to match a list of names from a database table called company_directory against names in an Excel file (which are imported via a custom method). Everything seems to be working correctly, and the matched names appear to be the right matches. However, the index is always off from the original data. I can't find any consistent pattern in the discrepancy (i.e. it's not off by the same number in every instance).
I'm not sure if this is a known error or something I'm doing wrong on my end, but this is the absolute last piece of the puzzle for me, so if I can figure this out, it'll essentially complete my project. Any assistance would go such a long way. I'd be happy to pay hourly to set up a screenshare to walk through it as well if that's preferred, as I don't want to take advantage of anyone's time.
Thank you so much!!
def match_names_to_db(fileloc, user, pw, host, db):
    # pull the names to be matched from database
    db_pull = DatabaseUpdater(user=user, pw=pw, host=host, db=db)
    database_names = db_pull.fetch_columns_from_table(table_name='company_directory', column_names=['id', 'company_name'])
    database_names.set_index('id', inplace=True)

    # get names to be matched from Excel file
    tracker_names = data_frame_from_xlsx_range(fileloc, 'tracker_names_to_match')
    tracker_names_unchanged = tracker_names.copy(deep=True)

    # initialize and run name matcher
    matcher = NameMatcher(top_n=50, lowercase=True, punctuations=True, remove_ascii=True, legal_suffixes=True,
                          common_words=True, number_of_matches=5)
    matcher.set_distance_metrics(['overlap',
                                  'weighted_jaccard',
                                  'ratcliff_obershelp',
                                  'fuzzy_wuzzy_token_sort',
                                  'editex',
                                  'discounted_levenshtein'])
    matcher.load_and_process_master_data('company_name', database_names, transform=True)
    matches = matcher.match_names(to_be_matched=tracker_names, column_matching='Tracker_Name')

    # sort the database returned by NameMatcher
    matches.to_excel('test_with_db_pull1.xlsx')
Thank you for pointing out the error! In the current version the match index is a positional index running from 0 to the number of rows, not the index of your original data. To correct it, you could use the following line, assuming the matches you got from the code are called matches and your original data is called data:
matches.match_index = data.index[matches['match_index'].astype(int)]
In a future version this should be fixed.
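To illustrate the workaround above, here is a minimal, self-contained sketch using made-up data. The frames, company names, and ids are hypothetical stand-ins for the real master data and matcher output. The loop over every column whose name starts with match_index is an assumption about how the output may be laid out when several matches per name are requested, so check the actual column names before relying on it.

```python
import pandas as pd

# Hypothetical master data: the index holds database ids, not 0..n-1.
data = pd.DataFrame(
    {"company_name": ["Acme Corp", "Globex", "Initech"]},
    index=pd.Index([101, 205, 309], name="id"),
)

# Hypothetical matcher output: match_index holds row positions (as floats).
matches = pd.DataFrame(
    {"original_name": ["acme", "initech"], "match_index": [0.0, 2.0]}
)

# Translate each positional index into the corresponding label of `data`.
# Assumption: every index column starts with "match_index".
for col in [c for c in matches.columns if c.startswith("match_index")]:
    positions = matches[col].astype(int).to_numpy()
    matches[col] = data.index[positions].to_numpy()
```

After this runs, match_index contains the database ids (101 and 309 in this toy example), so the matches can be joined back to the original table on id.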