Including an ID column in String Grouper output

Question

Including an ID column in String Grouper output

Closed this issue 4 years ago · 6 comments

Firstly, thank you for developing the String Grouper function. It's fast!

I have a question regarding the functionality of String Grouper, hopefully it will be possible to do this.

Requirement

A very common scenario is to have a column (series) of values like, for example, the name of a company together with a column with IDs for each data row.

I have seen that String Group does (well and very fast) matching of a column of string data to find closely matched values.

What we would like to be able to achieve is to have the IDs related to the string value in a row considered when doing the string matches so they are returned in the String Grouper output Pandas dataframe.

Example

Assume a data file, 'accounts.csv', with the following data:

account_id,name
A000123,ABC Enterprises Inc.
B000987,XYZ Co.
C000456,ABC Enterprises Incorporated
D000345,XYX Company
E000678,XYX Co
F000345,ABC Enterprises
G000876,The Big Group

If a typical import is done to a Pandas dataframe we will get:

accounts = pd.read_csv('accounts.csv')
accounts

	account_id	name
0	A000123	ABC Enterprises Inc.
1	B000987	XYZ Co.
2	C000456	ABC Enterprises Incorporated
3	D000345	XYX Company
4	E000678	XYX Co
5	F000345	ABC Enterprises
6	G000876	The Big Group

If the account_id is made the index column we will get:

accounts = pd.read_csv('accounts.csv', index_col='account_id')
accounts

account_id	name
A000123	ABC Enterprises Inc.
B000987	XYZ Co.
C000456	ABC Enterprises Incorporated
D000345	XYX Company
E000678	XYX Co
F000345	ABC Enterprises
G000876	The Big Group

This, we believe, would be the better option as the index value is fixed. I note also that Pandas reports this as a DataFrame even though it has only one value column.

String Grouper function

Currently I use the first import method accounts = pd.read_csv('accounts.csv') and then the following code

matches = match_strings(accounts['name'], min_similarity = 0.79)
dupes = matches[matches.left_side != matches.right_side]
dupes

which returns something like this

	left_side	right_side	similarity
0	ABC Enterprises Inc.	ABC Enterprises Incorporated	0.912345
3	ABC Enterprises	ABC Enterprises Inc.	0.922345
8	ABC Enterprises Incorporated	ABC Enterprises	0.892345

etc.

The Pandas index column value is dynamically assigned.

Question

Would it be possible to have the match_strings function (or an addition function) return the IDs of the row the string value is in, so:

	id	left_side	id	right_side	similarity
0	A000123	ABC Enterprises Inc.	C000456	ABC Enterprises Incorporated	0.912345
3	F000345	ABC Enterprises	A000123	ABC Enterprises Inc.	0.922345
8	C000456	ABC Enterprises Incorporated	F000345	ABC Enterprise s	0.892345

or even dispense with the text values and just return the ID, because the ID in the resulting String Grouper dataframe can always be linked to the ID of the original dataframe to retrieve the text value in the relevant row, so:

	left_side	right_side	similarity
0	A000123	C000456	0.912345
3	F000345	A000123	0.922345
8	C000456	F000345	0.892345

To do this would mean the ID values would have to be 'included' in the processing. That is a dataframe, not a series, would be passed to String Grouper, even though the IDs would not be evaluated. I see that the current function uses a Pandas Series. I don't have enough knowledge/experience in the code to know if this is possible to pass a dataframe and 'take along' the ID matching the string of the row - would it be?

Current work around

I use Pandas merge to join the two dataframes:

result = pd.merge(dupes, accounts, left_on="left_side", right_on="name")
result

to give me the following columns:

left_side | right_side | similarity | account_id | name

from which I then make a series with unique account_id:

account_id_list = pd.Series(result.account_id.unique())

Answer 1 · 2021-02-02T20:36:18.000Z

Hi @justasojourner,

Thanks for the great write-up and work around. Yes adding an ID field is possible and indeed, a lot of people seem to have this question. I also think the most obvious way is to use the index of the input series for this. There should be no extra calculations needed.

Right now the StringGrouper object is dropping and resetting the existing indices.Internally it keeps track of a "matches list" that can be seen as a big sparse matrix with column indices and row indices of the matches. These are mapped back to the actual strings in the function get_matches, which accesses them by index.

I think that (but have not tested) your achieved by:

Not resetting the indices in the constructor
Updating the get_matches to access by position (e.g. .iloc) and not drop indices
Adding the indices of the left_side and right_side Series to the resulting output.

Answer 2 · 2021-02-02T21:16:00.000Z

Hi @Bergvca

Thanks for getting back. I was me emailing about this also.

I have been discussing this with a friend who has training in mathematics and data science and he actually put together (coded) the principle of using two Pandas columns. The code is working. I hacked his code into the Python package and added functions to init and called it and it works.

It would be great to extend String Grouper with this functionality as a selectable option. It would be good for you obviously to review and incorporate as you believe best. In his code he did not use the index but another column in the DataFrame which means the DataFrame has to have a two column structure. If the index was setup with the ID values then the function could chose which column to use for text matching.

How would you like to get the code? Doing a pull might not be the best, I can drop it into a message here.

Answer 3 · 2021-02-02T21:24:44.000Z

Screenshot using code with IDs. My friend made a new function 'match_strings_with_ids'.

(personal data blanked out In screenshot)

Answer 4 · 2021-02-02T21:53:48.000Z

I think the best setup (of course what I would love to have) would be a config setting say

DEFAULT_USE_INDEX: bool = False

where the ID column values from the import are the Series/DataFrame index.

And then in the function call pass the DataFrame column to do matches on (as it is now) and set the option.

Answer 5 · 2021-02-03T15:43:21.000Z

Hi @Bergvca

I have had further discussion with my friend about extending string_grouper.py — the match_strings function, the StringGrouperConfig class and the StringGrouper class.

He thought it would be difficult to have/use IDs in the Pandas DataFrame index in the StringGrouper class. I suggested adding another option via **kwargs.

Defined Constant
DEFAULT_ID_COL: str = str() # No named column used for ID by default

(the constant could possibly be left out as it has nothing to start with)

class StringGrouperConfig(NamedTuple):
id_col: str = DEFAULT_ID_COL

Example call - where 'accounts' is the pd.DataFrame
matches = match_strings(accounts['name'], id_col = accounts['cust_id'], min_similarity = 0.79)

Main thing being it would be cleaner to extend the function and class rather than making extra ones. Anyway he is interested to do the coding (his previous code works, I think though this above is better as it is flexible) and is doing it now.

If he is successful, and it works in my testing, would you be happy for it to be submitted via a Pull request for you to review?

Answer 6 · 2021-02-08T21:06:56.000Z

Merged and new version uploaded to pypi:

https://pypi.org/project/string-grouper/0.2.2/