Primary language: Jupyter Notebook. License: MIT.

WWCode Data Science: NLP Fuzzy Match Algorithms

Fuzzy string matching is a technique for finding strings that match approximately rather than exactly. It has many applications. This talk covers a few of the algorithms used for approximate string matching.

Link to the Jupyter notebook.

YouTube Link


Outline of the talk:    

  • Introduction to fuzzy matching
  • Applications of fuzzy matching
  • Algorithms used for fuzzy matching
    • Levenshtein distance algorithm
    • Damerau-Levenshtein distance algorithm
    • Bitap algorithm
    • n-gram algorithm
  • Implementation of fuzzy matching on real data
  • Other fuzzy matching algorithms
  • Record Linkage Toolkit: a library for linking records within or between data sources, with tools for deduplication and record linkage
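To make the two edit-distance algorithms in the outline concrete, here is a minimal pure-Python sketch. It is not taken from the notebook. The function names `levenshtein` and `osa_distance` are chosen for illustration, and the second function implements the restricted "optimal string alignment" variant of Damerau-Levenshtein, which is an assumption about the variant meant here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn a into b."""
    # prev[j] is the distance between the prefix of a seen so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep a matching character
            ))
        prev = curr
    return prev[-1]


def osa_distance(a: str, b: str) -> int:
    """Levenshtein distance that also allows swapping two adjacent
    characters as a single edit (optimal string alignment variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,
                d[i][j - 1] + 1,
                d[i - 1][j - 1] + cost,
            )
            # Adjacent transposition: "ab" in a lines up with "ba" in b.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

The two functions differ only on swapped neighbours: for "ca" and "ac", `levenshtein` counts two edits while `osa_distance` counts one.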

Libraries used:

  • Jellyfish: Refer here for more information
  • Fuzzywuzzy: Refer here for more information
  • Fuzzy_match: Refer here for more information

Implementation on Real Data

Download the data here from Kaggle.

The CSV file is also here.

The data contains two columns for room type descriptions. Column 1 is the description from Expedia, and column 2 is the associated room type in Booking.com.

Aim: compare and match these two columns so that the result reflects a human-like understanding that the matched entries refer to the same room type.
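One way to approach this kind of column matching is character n-gram similarity. The sketch below is an illustration only, not the notebook's implementation. The helper names (`char_ngrams`, `ngram_similarity`, `best_match`) are hypothetical, and the sample strings are made up to resemble room descriptions; they are not rows from the Kaggle file.

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Set of character n-grams of the lowercased, whitespace-normalised text."""
    cleaned = " ".join(text.lower().split())
    if len(cleaned) < n:
        # Too short for a full n-gram: use the whole string as one token.
        return {cleaned} if cleaned else set()
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the two n-gram sets, between 0.0 and 1.0."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def best_match(query: str, candidates: list, n: int = 3) -> tuple:
    """Return (candidate, score) for the candidate most similar to query."""
    scored = ((c, ngram_similarity(query, c, n)) for c in candidates)
    return max(scored, key=lambda pair: pair[1])


# Made-up examples in the style of the two columns (not real dataset rows).
column_1 = ["Deluxe Room, 1 King Bed", "Standard Room, 2 Queen Beds"]
column_2 = ["Deluxe King Room", "Standard Queen Room"]

for description in column_1:
    match, score = best_match(description, column_2)
    print(f"{description!r} -> {match!r} (similarity {score:.2f})")
```

Jaccard similarity over n-gram sets ignores word order, which helps when the two sources phrase the same room type differently. In practice a minimum score threshold would be needed so that poor matches are rejected rather than forced.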

Snapshot of the data:

[Image: snapshot of the data]


References:

  1. Levenshtein, Vladimir I. "Binary codes capable of correcting deletions, insertions, and reversals." Soviet Physics Doklady 10, no. 8 (1966): 707-710.
  2. Damerau, Fred J. "A technique for computer detection and correction of spelling errors." Communications of the ACM 7, no. 3 (1964): 171-176.
  3. Cayrol, M., Farreny, H., and Prade, H. "Fuzzy pattern matching." Kybernetes 11, no. 2 (1982): 103-116.
  4. Ukkonen, Esko. "Algorithms for approximate string matching." Information and Control 64, no. 1-3 (1985): 100-118.
  5. GeeksforGeeks: applications of fuzzy string matching
  6. GeeksforGeeks: Bitap algorithm
  7. Stanford slides on n-grams
  8. DataCamp tutorial: fuzzy string matching
  9. Levenshtein distance theory
  10. Article on record linking and fuzzy matching
  11. Medium post on Levenshtein distance
  12. Stack Overflow: n-gram similarity

Feel free to reach out if you have any questions.