
Performed entity resolution/record linkage using different types of word embedding techniques on E-Commerce datasets.


Empirical Study of Entity Resolution Using Word Embedding


Abstract

Entity Resolution was traditionally handled with Deterministic Linkage Methods and Probabilistic Linkage Methods, which compare identifiers from a pair of records and link the records when the identifiers agree. Modern machine learning techniques such as string matching, string distance (e.g., Jaccard similarity), and graphical models have improved the performance of many record linkage systems, but most of these solutions still rely on some variation of string matching to link record pairs. This project focuses on solving the problem with word embeddings, which capture a word's meaning and represent it in a vector space. By using word semantics, a system can link records through the meaning of their words rather than by matching the characters of their strings. This helps with cases where distinct words or phrases have related meanings, and with polysemy, where the same word carries different meanings in different contexts; word embeddings can capture these distinctions while string matching cannot. Word embeddings therefore offer a more general approach that can be applied to Entity Resolution problems in many different fields with strong performance.

Proposed Approach

Overview

Unsupervised Approach

Note: The training corpus mentioned in the following steps is generated by concatenating every product's Name/Title and Description into a single string. Thus, each line of the training corpus corresponds to one product's Name/Title and Description separated by a space, as sketched below.
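
A minimal sketch of this preprocessing, assuming the products are available as a CSV file with hypothetical name and description columns (the actual field names and file layout in the notebooks may differ):

```python
import pandas as pd

# Hypothetical input: one product per row, with "name" and "description" columns.
products = pd.read_csv("products.csv")

# One line per product: Name/Title and Description separated by a space.
lines = (
    products["name"].fillna("").astype(str).str.strip()
    + " "
    + products["description"].fillna("").astype(str).str.strip()
)

with open("corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines.str.lower()))
```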

1. TF-IDF

Generate TF-IDF features by fitting a TF-IDF vectorizer on the training corpus with max_features = 300, then transform each product's Name/Title into a TF-IDF feature vector.
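
A minimal sketch using scikit-learn's TfidfVectorizer; the corpus_lines and titles lists below are illustrative placeholders for the corpus and product titles built as described above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder data: in the study these come from the product tables.
corpus_lines = ["ipod nano 8gb portable media player", "canon eos camera body only"]
titles = ["ipod nano 8gb", "canon eos"]

# Fit the vectorizer on the combined Name/Title + Description corpus,
# keeping at most 300 terms as features.
vectorizer = TfidfVectorizer(max_features=300)
vectorizer.fit(corpus_lines)

# Transform each product's Name/Title into a TF-IDF feature vector.
title_vectors = vectorizer.transform(titles)  # sparse TF-IDF matrix
```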

2. Random Index

An initial index vector is generated for every word that occurs in the training corpus. Index vectors have dimension 300, with elements of +1 or 0 drawn using a Gaussian distribution. Context vectors are then produced by scanning through the training corpus: each time a word occurs within a sliding context window of size k, the index vector of each context word is added to the context vector of the word in question \cite{Sahlgren20051}.

For each product's Name/Title, we map each word in the string to its corresponding context vector and average these vectors to obtain a single vector representation for the product.
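
A minimal sketch of this procedure, using a placeholder corpus, an illustrative window size, and a simple sparse +1/0 index-vector construction in place of the Gaussian-based scheme described above (the notebook's exact construction may differ):

```python
import numpy as np
from collections import defaultdict

corpus_lines = ["ipod nano 8gb portable media player", "canon eos camera body only"]  # placeholder
DIM, WINDOW = 300, 2      # vector dimension and context window size k (window is illustrative)
rng = np.random.default_rng(0)

def new_index_vector():
    # Sparse index vector: a handful of randomly placed +1 entries, 0 elsewhere.
    v = np.zeros(DIM)
    v[rng.choice(DIM, size=10, replace=False)] = 1.0
    return v

index_vecs = defaultdict(new_index_vector)           # one fixed index vector per word
context_vecs = defaultdict(lambda: np.zeros(DIM))    # accumulated context vectors

for line in corpus_lines:
    tokens = line.split()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)):
            if j != i:
                # Add the context word's index vector to this word's context vector.
                context_vecs[word] += index_vecs[tokens[j]]

def product_vector(title):
    # Average the context vectors of the words in a product's Name/Title.
    vecs = [context_vecs[w] for w in title.split() if w in context_vecs]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)
```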

3. FastText Unsupervised CBOW + SkipGram
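
A minimal sketch of training unsupervised fastText embeddings on the corpus file from the note above and embedding a hypothetical product title; the hyperparameters shown are illustrative, not necessarily those used in the notebooks:

```python
import fasttext

# Train unsupervised fastText embeddings on the combined corpus.
# model can be "cbow" or "skipgram"; dim=300 matches the other approaches.
model = fasttext.train_unsupervised("corpus.txt", model="skipgram", dim=300)

# Embed a product's Name/Title as the fastText sentence vector
# (the average of its normalized word vectors).
title_vec = model.get_sentence_vector("apple ipod nano 8gb silver")
```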

Pre-Trained Approach

4. FastText CBOW (Common Crawl + Wikipedia)

5. FastText SkipGram (Wikipedia)

6. FastText CBOW/SkipGram + Random Indexing
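
For steps 4 and 5, the pre-trained vectors linked under "Library and Pre-Trained Embedding" below can be loaded with the fastText Python bindings. A minimal sketch, assuming the English models cc.en.300.bin (Common Crawl + Wikipedia, CBOW) and wiki.en.bin (Wikipedia, SkipGram) have already been downloaded:

```python
import fasttext

# Pre-trained CBOW vectors trained on Common Crawl + Wikipedia (step 4).
cbow_model = fasttext.load_model("cc.en.300.bin")

# Pre-trained SkipGram vectors trained on Wikipedia (step 5).
skipgram_model = fasttext.load_model("wiki.en.bin")

# Represent a product's Name/Title by its fastText sentence vector,
# i.e., the average of its normalized pre-trained word vectors.
title_vec = skipgram_model.get_sentence_vector("apple ipod nano 8gb silver")
```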

Further Experiments

7. Combined Ranked List from the Two Best Approaches

8. Concatenated Word Embeddings of the Two Best Approaches
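
Steps 7 and 8 can be sketched as follows, using random placeholder embeddings in place of the vectors produced by the two best approaches; averaging the cosine similarities follows the "Average Cosine Similarity" notebooks listed below, and all variable names here are illustrative:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder embeddings: rows are products from the two tables being linked,
# under the two best approaches A and B (replace with the real vectors).
rng = np.random.default_rng(0)
left_a, right_a = rng.normal(size=(5, 300)), rng.normal(size=(7, 300))  # approach A
left_b, right_b = rng.normal(size=(5, 300)), rng.normal(size=(7, 300))  # approach B

# Step 7: combine the ranked lists by averaging the cosine-similarity
# matrices of the two approaches, then rank candidate pairs by the average.
combined_sim = (cosine_similarity(left_a, right_a) + cosine_similarity(left_b, right_b)) / 2.0
ranked_matches = np.argsort(-combined_sim, axis=1)  # best candidate first per left record

# Step 8: concatenate the two embeddings per product and compute a single
# cosine similarity in the concatenated space.
concat_sim = cosine_similarity(np.hstack([left_a, left_b]), np.hstack([right_a, right_b]))
```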

9. Blocking Using Manufacturer & Price on the Top 3 Approaches

Experimental Results

Library and Pre-Trained Embedding

FastText Python Installation:
https://fasttext.cc/docs/en/support.html#building-fasttext-python-module

FastText Pre-Trained Word Vectors:
CBOW: https://fasttext.cc/docs/en/crawl-vectors.html
SkipGram: https://fasttext.cc/docs/en/pretrained-vectors.html

Code

Unsupervised Approaches

  1. TF-IDF
    1. TF-IDF.ipynb
  2. Random Indexing
    1. Random Indexing.ipynb
  3. FastText Unsupervised (CBOW/SkipGram)
    1. FastText(Unsupervised).ipynb

Pre-Trained Approaches

  1. FastText Wikipedia (CBOW)
    1. FastText (Wiki CBOW).ipynb
  2. FastText Wikipedia (SkipGram)
    1. FastText (Wiki SkipGram).ipynb
  3. FastText Wikipedia (CBOW/SkipGram) + Random Indexing
    1. Random Indexing (Pre-Trained CBOW).ipynb
    2. Random Indexing (Pre-Trained SkipGram).ipynb

Further Experiments

  1. Combination of Ranked List (Average Cosine Similarity)
    1. TF-IDF + Pre-Trained (SkipGram) Cosine Similarity.ipynb
    2. Pre-Trained + Unsupervised (SkipGram) Cosine Similarity.ipynb
  2. Word Embedding Concatenation
    1. TF-IDF + Pre-Trained (SkipGram) Vector Concat.ipynb
    2. Pre-Trained + Unsupervised (SkipGram) Vector Concat.ipynb
  3. Blocking using Manufacturer and Price Fields
    1. FastText (Wiki SkipGram) + Blocking (-0.25).ipynb
    2. FastText (Wiki SkipGram) + Blocking (-3).ipynb
    3. TF-IDF + Pre-Trained (SkipGram) Vector Concat + Blocking (-0.25).ipynb
    4. TF-IDF + Pre-Trained (SkipGram) Vector Concat + Blocking (-3).ipynb