WWCode Data Science: NLP Fuzzy Match Algorithms

Fuzzy string matching is a technique for finding strings that match approximately rather than exactly. Fuzzy matching has many applications. This talk covers a few of the algorithms used for approximate string matching.

Link to the Jupyter notebook.

Outline of the talk:    

  • Introduction to fuzzy matching
  • Applications of fuzzy matching
  • Algorithms used for fuzzy matching
    • Levenshtein distance algorithm
    • Damerau-Levenshtein distance algorithm
    • Bitmap algorithm
    • n-gram algorithm
  • Implementation of fuzzy matching on real data
  • Other fuzzy matching algorithms
  • Record Linkage Toolkit: a library for linking records within or between data sources, with tools for deduplication
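
As a rough illustration of the first algorithm in the outline, here is a minimal pure-Python sketch of Levenshtein distance, using only the standard library. The function name `levenshtein` is chosen for this example and is not taken from the notebook or from any of the libraries listed below.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] = distance between the empty prefix of a and b[:j]
    previous = list(range(len(b) + 1))

    for i, char_a in enumerate(a, start=1):
        # current[0] = distance between a[:i] and the empty string
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete char_a
                    current[j - 1] + 1,                   # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or keep
                )
            )
        previous = current

    return previous[-1]


print(levenshtein("kitten", "sitting"))
print(levenshtein("ab", "ba"))
```

Note that plain Levenshtein distance counts a swap of two adjacent characters ("ab" to "ba") as two edits. Damerau-Levenshtein distance treats such a transposition as a single edit.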

Libraries used:

  • Jellyfish: Refer here for more information
  • Fuzzywuzzy: Refer here for more information
  • Fuzzy_match: Refer here for more information

Implementation on Real Data

Download the data here from Kaggle.

The CSV file is also here.

The data contains two columns for room type descriptions. Column 1 is the description from Expedia, and column 2 is the associated room type in Booking.com.

Aim: compare and match these two columns so that the matched entries are ones a human would recognize as the same room type.
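
One way this kind of matching could be sketched is with character n-grams: split each description into overlapping three-character pieces and score a pair by how many pieces they share (Jaccard similarity). This is only an illustrative sketch, not necessarily the approach used in the notebook. The helper names (`ngrams`, `ngram_similarity`, `best_match`) and the sample room descriptions are made up for this example and are not taken from the Kaggle data.

```python
def ngrams(text: str, n: int = 3) -> set:
    """Set of character n-grams of text, lowercased and whitespace-normalized."""
    normalized = " ".join(text.lower().split())
    # Pad with one space on each side so word boundaries at the ends count.
    padded = f" {normalized} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the two n-gram sets, between 0.0 and 1.0."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def best_match(query: str, candidates: list, n: int = 3) -> str:
    """Candidate with the highest n-gram similarity to query."""
    return max(candidates, key=lambda c: ngram_similarity(query, c, n))


# Hypothetical room descriptions, for illustration only.
query = "Deluxe Room, 1 King Bed"
candidates = ["Standard Twin Room", "Deluxe King Room", "Family Suite"]

for candidate in candidates:
    print(candidate, round(ngram_similarity(query, candidate), 2))
print("Best match:", best_match(query, candidates))
```

Because the score depends on shared character sequences rather than exact word order, "Deluxe Room, 1 King Bed" and "Deluxe King Room" score well above unrelated descriptions even though their wording differs.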

Snapshot of the data:



Feel free to reach out if you have any questions.