WWCode Data Science: NLP Fuzzy Match Algorithms
Fuzzy string matching is a technique for finding strings that match a pattern approximately rather than exactly. It has many applications. This talk covers a few of the algorithms used for approximate string matching.
Link to the Jupyter notebook.
Outline of the talk:
- Introduction to fuzzy matching
- Applications of fuzzy matching
- Algorithms used for fuzzy matching
- Levenshtein distance algorithm
- Damerau-Levenshtein distance algorithm
- Bitap algorithm
- n-gram algorithm
- Implementation of fuzzy matching on real data
- Other fuzzy matching algorithms
- Record Linkage Toolkit: a library for linking records within or between data sources, with tools for deduplication.
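As a rough illustration of the first two distance algorithms in the outline, here is a minimal pure-Python sketch. It is not the notebook's code: the function names `levenshtein` and `osa_distance` are chosen here for illustration, and the second function implements the optimal string alignment variant of Damerau-Levenshtein, which is assumed to be close enough for a first look.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


def osa_distance(a: str, b: str) -> int:
    """Levenshtein distance that also counts a swap of two adjacent
    characters as one edit (optimal string alignment variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # Adjacent transposition, e.g. "ab" -> "ba"
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

For a swapped pair such as "ab" and "ba", `levenshtein` counts two edits while `osa_distance` counts one, which is the practical difference between the two algorithms for typo-style errors.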
Libraries used:
- Jellyfish: Refer here for more information
- Fuzzywuzzy: Refer here for more information
- Fuzzy_match: Refer here for more information
Implementation on Real Data
Download the data from Kaggle here.
The CSV file is also available here.
The data contains two columns for room type descriptions. Column 1 is the description from Expedia, and column 2 is the associated room type in Booking.com.
Aim: compare and match these two columns so that the result reflects a human-like understanding that the matched entries refer to the same room type.
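One way to approach this kind of column matching is character n-gram similarity. The sketch below is an assumption-laden illustration, not the notebook's implementation: the helper names (`ngrams`, `jaccard`, `best_match`) are hypothetical, the choice of trigrams with Jaccard similarity is just one option, and the sample strings are made up for the example rather than taken from the Kaggle file.

```python
def ngrams(text: str, n: int = 3) -> set:
    """Set of character n-grams of the lowercased, whitespace-normalized text."""
    s = " ".join(text.lower().split())
    s = f" {s} "  # pad so word boundaries at the ends also form n-grams
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the two n-gram sets, between 0.0 and 1.0."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0  # two empty strings are treated as identical
    return len(ga & gb) / len(ga | gb)


def best_match(query: str, candidates: list, n: int = 3) -> tuple:
    """Return (candidate, score) for the most similar candidate."""
    return max(((c, jaccard(query, c, n)) for c in candidates),
               key=lambda pair: pair[1])


# Made-up room descriptions for illustration only (not from the dataset).
expedia_rooms = ["Deluxe Room, 1 King Bed", "Standard Room, 2 Double Beds"]
booking_rooms = [
    "Deluxe King Room",
    "Standard Double Room with Two Double Beds",
    "Suite",
]

for room in expedia_rooms:
    match, score = best_match(room, booking_rooms)
    print(f"{room!r} -> {match!r} (similarity {score:.2f})")
```

In practice a similarity threshold would be needed to reject weak matches, and the right cutoff depends on the data.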
Snapshot of the data:
References:
- Levenshtein, Vladimir I. "Binary codes capable of correcting deletions, insertions, and reversals." In Soviet physics doklady, vol. 10, no. 8, pp. 707-710. 1966.
- Damerau, Fred J. "A technique for computer detection and correction of spelling errors." Communications of the ACM 7, no. 3 (1964): 171-176.
- Cayrol, M., Farreny, H. and Prade, H. (1982), 'Fuzzy Pattern Matching', Kybernetes, Vol. 11 No. 2, pp. 103-116.
- Ukkonen, Esko. "Algorithms for approximate string matching." Information and control 64, no. 1-3 (1985): 100-118.
- GeeksforGeeks - applications of fuzzy string matching
- GeeksforGeeks - Bitap algorithm
- Stanford slides on n-gram
- DataCamp tutorial - fuzzy string matching
- Levenshtein distance theory
- Article on record linking and fuzzy matching
- Medium post on Levenshtein distance
- Stack Overflow - n-gram similarity
Feel free to reach out if you have any questions.