Including an ID column in String Grouper output
Closed this issue · 6 comments
Firstly, thank you for developing the String Grouper function. It's fast!
I have a question regarding the functionality of String Grouper, hopefully it will be possible to do this.
Requirement
A very common scenario is to have a column (series) of values like, for example, the name of a company together with a column with IDs for each data row.
I have seen that String Group does (well and very fast) matching of a column of string data to find closely matched values.
What we would like to be able to achieve is to have the IDs related to the string value in a row considered when doing the string matches so they are returned in the String Grouper output Pandas dataframe.
Example
Assume a data file, 'accounts.csv', with the following data:
account_id,name
A000123,ABC Enterprises Inc.
B000987,XYZ Co.
C000456,ABC Enterprises Incorporated
D000345,XYX Company
E000678,XYX Co
F000345,ABC Enterprises
G000876,The Big Group
If a typical import is done to a Pandas dataframe we will get:
accounts = pd.read_csv('accounts.csv')
accounts
account_id | name | |
---|---|---|
0 | A000123 | ABC Enterprises Inc. |
1 | B000987 | XYZ Co. |
2 | C000456 | ABC Enterprises Incorporated |
3 | D000345 | XYX Company |
4 | E000678 | XYX Co |
5 | F000345 | ABC Enterprises |
6 | G000876 | The Big Group |
If the account_id
is made the index column we will get:
accounts = pd.read_csv('accounts.csv', index_col='account_id')
accounts
account_id | name |
---|---|
A000123 | ABC Enterprises Inc. |
B000987 | XYZ Co. |
C000456 | ABC Enterprises Incorporated |
D000345 | XYX Company |
E000678 | XYX Co |
F000345 | ABC Enterprises |
G000876 | The Big Group |
This, we believe, would be the better option as the index value is fixed. I note also that Pandas reports this as a DataFrame even though it has only one value column.
String Grouper function
Currently I use the first import method accounts = pd.read_csv('accounts.csv')
and then the following code
matches = match_strings(accounts['name'], min_similarity = 0.79)
dupes = matches[matches.left_side != matches.right_side]
dupes
which returns something like this
left_side | right_side | similarity | |
---|---|---|---|
0 | ABC Enterprises Inc. | ABC Enterprises Incorporated | 0.912345 |
3 | ABC Enterprises | ABC Enterprises Inc. | 0.922345 |
8 | ABC Enterprises Incorporated | ABC Enterprises | 0.892345 |
etc.
The Pandas index column value is dynamically assigned.
Question
Would it be possible to have the match_strings
function (or an addition function) return the IDs of the row the string value is in, so:
id | left_side | id | right_side | similarity | |
---|---|---|---|---|---|
0 | A000123 | ABC Enterprises Inc. | C000456 | ABC Enterprises Incorporated | 0.912345 |
3 | F000345 | ABC Enterprises | A000123 | ABC Enterprises Inc. | 0.922345 |
8 | C000456 | ABC Enterprises Incorporated | F000345 | ABC Enterprise s | 0.892345 |
or even dispense with the text values and just return the ID, because the ID in the resulting String Grouper dataframe can always be linked to the ID of the original dataframe to retrieve the text value in the relevant row, so:
left_side | right_side | similarity | |
---|---|---|---|
0 | A000123 | C000456 | 0.912345 |
3 | F000345 | A000123 | 0.922345 |
8 | C000456 | F000345 | 0.892345 |
To do this would mean the ID values would have to be 'included' in the processing. That is a dataframe, not a series, would be passed to String Grouper, even though the IDs would not be evaluated. I see that the current function uses a Pandas Series. I don't have enough knowledge/experience in the code to know if this is possible to pass a dataframe and 'take along' the ID matching the string of the row - would it be?
Current work around
I use Pandas merge to join the two dataframes:
result = pd.merge(dupes, accounts, left_on="left_side", right_on="name")
result
to give me the following columns:
left_side | right_side | similarity | account_id | name
from which I then make a series with unique account_id:
account_id_list = pd.Series(result.account_id.unique())
Hi @justasojourner,
Thanks for the great write-up and work around. Yes adding an ID field is possible and indeed, a lot of people seem to have this question. I also think the most obvious way is to use the index of the input series for this. There should be no extra calculations needed.
Right now the StringGrouper object is dropping and resetting the existing indices.Internally it keeps track of a "matches list" that can be seen as a big sparse matrix with column indices and row indices of the matches. These are mapped back to the actual strings in the function get_matches
, which accesses them by index.
I think that (but have not tested) your achieved by:
- Not resetting the indices in the constructor
- Updating the
get_matches
to access by position (e.g..iloc
) and not drop indices - Adding the indices of the left_side and right_side Series to the resulting output.
Hi @Bergvca
Thanks for getting back. I was me emailing about this also.
I have been discussing this with a friend who has training in mathematics and data science and he actually put together (coded) the principle of using two Pandas columns. The code is working. I hacked his code into the Python package and added functions to init and called it and it works.
It would be great to extend String Grouper with this functionality as a selectable option. It would be good for you obviously to review and incorporate as you believe best. In his code he did not use the index but another column in the DataFrame which means the DataFrame has to have a two column structure. If the index was setup with the ID values then the function could chose which column to use for text matching.
How would you like to get the code? Doing a pull might not be the best, I can drop it into a message here.
I think the best setup (of course what I would love to have) would be a config setting say
DEFAULT_USE_INDEX: bool = False
where the ID column values from the import are the Series/DataFrame index.
And then in the function call pass the DataFrame column to do matches on (as it is now) and set the option.
Hi @Bergvca
I have had further discussion with my friend about extending string_grouper.py — the match_strings function, the StringGrouperConfig class and the StringGrouper class.
He thought it would be difficult to have/use IDs in the Pandas DataFrame index in the StringGrouper class. I suggested adding another option via **kwargs.
Defined Constant
DEFAULT_ID_COL: str = str() # No named column used for ID by default
(the constant could possibly be left out as it has nothing to start with)
class StringGrouperConfig(NamedTuple):
id_col: str = DEFAULT_ID_COL
Example call - where 'accounts' is the pd.DataFrame
matches = match_strings(accounts['name'], id_col = accounts['cust_id'], min_similarity = 0.79)
Main thing being it would be cleaner to extend the function and class rather than making extra ones. Anyway he is interested to do the coding (his previous code works, I think though this above is better as it is flexible) and is doing it now.
If he is successful, and it works in my testing, would you be happy for it to be submitted via a Pull request for you to review?
Merged and new version uploaded to pypi: