Fuzzy Matching with Alteryx

Some notes on fuzzy matching with Alteryx.

General Fuzzy Matching Points

Limit the number of matches fully you run the match function on with careful key generation.
However, make sure your key generation isn't throwing away legitimate matches.

Matching Multiple Sparsely Populated Fields

Whilst Alteryx allows us to specify multiple fields to match on, it expects each field to be populated, and throws away results where only some fields match. This isn't what we want when we're trying to match some kinds of data. For example, we might have a fully populated first and last name, but only a sparsely populated date of birth and postcode. A match against first and last name and either postcode or DoB would be sufficient for us, but the fuzzy match tool expects both.

To get around this, we can do multiple fuzzy matches, and aggregate the results. For each field of interest, we could preform an independent fuzzy match. Then we join the results together (using both output IDs as keys). Once joined, we can see for each aggregated match, what component fields matched and what didn't and start defining rules about which ones are more important, which we can live without etc. The scores for the different field can also feed into a scoring mechanism used to generate an overall score (like the overall matchscore generated by the fuzzy match node).

Choose a Hero Field

If a given field you are matching can only take certain values (birthdays and postcodes might be an example) then as your data set grows bigger the number of matchces will grow exponentially (see the Birthday Problem). This means that for large enough data sets it can rapidly become unweildy to match fields individually - the number of spurious matches on these fields will get too large before you have the chance to filter it down. A workaround for this is to pick a "Hero" field that must match for any other match to be attempted. It should be a field that has no nulls, and where matches carry some weight (i.e. if your hero field matches, it's probably worth checking the rest) - I find surname works well for this. You can then create a fuzzy match combining the hero field and each other field of interest.

Surname and DoB
Surname and Postcode
...etc

Then aggregate the results as before. This has the advantage that you can even opt to leave out matching the hero field on it's own (so long as you don't mind saying goodbye to matches only on your hero field.

Deal-Breaker Fields

Of course, mismatches betweeen fields can be as informative as matches. For example when inspecting a potential match for two people, we might not care that the postcodes don't match (maybe they moved), but if the dates of birth are in different decades then we probably have a problem. If the deal-breaker field is fully populated, just use it as part of your matching key.

When this becomes more difficult is where our deal-breaker field is sparsely populated. An explicit mismatch between dates of birth is surely bad, but if we have one null's to contend with then we can't use the field as part of the key. Instead what we can do is re-join the field after matching, and calculate whether the resulting matches explicitly disagree on that field. If they do we can then filter them out.

Fields with Multiple Values per ID

Some facts about a record might have a many-to-one relationship. For example, one person might have multiple postcodes (home address, work address, out of date address etc) - a match on any might help inform our fuzzy matching strategy, whilst mismatching ones can simply be discarded. This is easy to incorporate into our match-aggregating strategy - simply try each combination and aggregate using the common ID. We can simply take the fact of there being any match to inform our scoring, or weight the score accordingly for multiple matches across different values for the same ID.

Scoring Matches Across Multiple Fields

With matches generated across all our fields we can start to think about how to decide which matches are good and which are bad. One way to approach this would be from a probability standpoint. If matches on our different fields are independent (i.e. assuming that matches happen by chance, a match on postcode wouldn't make a match on surname any more or less likely), then the combined probability of several matches is the product of the individual probabilities.

The difficulty with this method is a) the difficulty of assessing those probabilities; and b) the range of values this yeilds - the product of several unlikely probabilities will be vanishingly small and not very friendly to look at. Instead lets express these products using a log scale, making that product into a sum. This means in simple terms that we can just add the scores together to come up with an overall score, using a weighting to take into account their relative probabilities. Rather an overwrought way of coming to the conclusion that we could probably have just chosen in the first place, but it's nice to know we can justify it in some way, and to come to an understanding of what our scores mean: If we rank our scores from 1-100, and adding 100 makes us doubly sure of our results, then adding 200 makes 4x more sure, and 300 8x more sure and so on.

Matching Nicknames

Alteryx provides a common nicknames database for matching nicknames with others. To get the best out of this however we need to disable generating keys for the field in question. The reason for this is that because key matches are assesed before the match function, and many nicknames will fail to generate matching keys.

Consider the following:

ID	First Name	Last Name
1	Rob	Smith
2	Robert	Smith

Using Double metaphone for key generation will result in the followiong keys (the last name "Smith" generates two, but I've just included one for brevity).

ID	Key (First Name)	Key (Last Name
1	RP	SM0
2	RPRT	SM0

These will fail to match if we use both keys. Thus if we want to match "Robert" and "Rob" we should only generate keys for the last name. This happens to pair well with the two-field strategy from above, as we can use a second field to limit the number of pairs tested - we don't need to test every possible pair of names for nicknames, just onese where we have some confidence (given by another field in this case) that there might be a match.

Creating Your Own Common Nicknames Database

TODO

Matching Multi-Word Strings

Matching strings with multiple words can be approached in several ways. For something like a company name, it's important to match most if not all of the name, though some words are more important than others. For example, we would like "Bob's Burger Company Limited" to match "Bob's Burgers". But we probably would like to avoid "AAA Taxis Limited" matching "ABC Taxis Limited", even though this has more words in common than the previous example. Below are a few strategies that have different pros and cons.

Method 1: One key per string, test whole string.

Generate one key per string (using Double Metaphone or similar) and test it using the whole string, just like we would for a single word.

Pros: Fast and Efficient, generates only strong matches. Cons: Easily thrown by insertions/transpositions/deletions between words.

Method 2: One key per word, test whole string.

If we generate one key per word, it's much easier to catch pairs that have been rearranged or added to, e.g. "The White Company Limited" and "White Company Limited", which would not have yeilded matching keys using Method 1.

Pros: Still pretty fast. Genrates a bigger pool of matches to test. Cons: See above, also requires testing more pairs.

Method 3: One key per word, test every word.

This method catches rearangements, but is much slower as the pool of pairs expands exponentially compared to the above. Good for short groups of words where order isn't too important (for example a collection of names where some are written "First, Last" and some "Last, First".

Pros: Isn't thrown by word order or insertions, but much slower.

Method 4: Bag of words analysis.

Bag of words analysis involves counting the number of matching words between to strings. For this we might not use the fuzzy match tool at all, instead relying on exact matches between words and counting the total for each pair of strings. For this method it can be helpful to use a second field to generate keys for matching. This is more sutable for strings with many words.

Tweak 1: One key per uncommon word

One problem with methods two and three, is the number of keys they generate, and thus the overhead of testing a large number of pairs. We can avoid some of this by only generating keys for uncommon words (skipping "Limited", "Company", "Ltd", "Investments" when matching companies for example). The fuzzy match tool has an option to do this in the key generation options, but you can also pre-generate keys and filter out the common ones separately if you want to use a longer list or a dynamic method for generating your common words list.

Teak 2: Sort words alphabetically, test whole string.

This gets around the problem of word order changing when testing strings against one another. This is great for dealing with pairs like "The Judean Peoples Front" and "The Peoples Front of Judea" that feature roughly the same content with only minor insertions.

Note: The Judean People's front shouldn't actually be confused with the Peoples front of Judea.

Matching Stategies for Specific Domains

UK Companies

I've used companies a lot as an example above. The main difficulty witht company names is the high frequency of words like "limited", "company", "holdings" and the like. You can exclude these terms when maching, but often this results in very little being left (there are an inordinate number of companies with amazingly dull names like "ABC Holdings Limited". I find it useful to try a few strategies for the same company name field and take the best result, whilst trying to keep the thresholds for matches high so as not to get too many spurious matches.

Word frequency for UK companies can be generated from open data provided by Companies House.

Phone Numbers

Phone numbers are easy - use the preset, which takes the last 10 digits of the number in reverse order. Job done.

First and Middle Names

See above advice on nicknames. The nice thing about these is that they come paired with surnames, so whilst First Names and Middle Names may do silly things like get shortened or swap around, we can try a few different strategies whilst restricting the number of pairs we have to try using the surname as a key.

Postcodes

I've pretty much stuck with exact matches for postcodes. You could of course start doing things like splitting them up into their components, but I'm not too interested in doing so.

BenSimonds/FuzzyMatchAlteryx