- Empirical Study of Entity Resolution Using Word Embedding
Entity Resolution was traditionally solved with Deterministic Linkage Methods and Probabilistic Linkage Methods, which match identifiers from a pair of records and link the records when those identifiers agree. With modern machine learning techniques applied to the problem, solutions such as string matching, string distance (e.g., Jaccard similarity), and graphical models have often improved the performance of record linkage systems. Most of these solutions still rely on some variation of string matching to link record pairs. This project instead approaches the problem with word embeddings, which capture a word's meaning and represent it in a vector space. Using word semantics, a machine can link records by the meanings of their words instead of by matching the letters of the strings. This helps with problems such as synonymy, where different words or phrases have related meanings, and polysemy, where the same word carries different meanings: word embeddings can encode these semantic relationships, while string matching cannot. Word embeddings can therefore be a more general approach, applicable to Entity Resolution problems in many different fields with strong performance.
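As an illustration (not taken from the project code), the Jaccard similarity mentioned above compares the token sets of two records and links a pair when the overlap is high:

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

# Two product titles that refer to the same entity still score only 0.6,
# because "(64gb)" and "64GB" do not match letter-for-letter -- the kind
# of surface mismatch that word embeddings aim to get past.
print(jaccard_similarity("Apple iPhone 11 64GB", "apple iphone 11 (64gb)"))
```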
Note: The training corpus mentioned in the following steps is generated by concatenating every product's Name/Title and Description into a single string. Thus, every line of the training corpus corresponds to one product's Name/Title and Description, separated by a space.
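A minimal sketch of that corpus construction (the product records here are placeholders, not the project's data):

```python
# Hypothetical product records with the Name/Title and Description fields.
products = [
    {"title": "Canon EOS 450D", "description": "12MP digital SLR camera"},
    {"title": "Nikon D60", "description": "10.2MP DSLR body"},
]

# One corpus line per product: Name/Title and Description joined by a space.
corpus_lines = [f"{p['title']} {p['description']}" for p in products]
print(corpus_lines[0])
```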
Generate TF-IDF features by fitting a TF-IDF vectorizer with max features = 300 on the training corpus, then transform each product's Name/Title into a TF-IDF feature vector.
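This step can be sketched with scikit-learn's `TfidfVectorizer` (the corpus and titles below are placeholders; only the `max_features=300` setting comes from the description above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpus: one line per product (Name/Title + Description).
corpus = [
    "Canon EOS 450D 12MP digital SLR camera",
    "Nikon D60 10.2MP DSLR body",
]
titles = ["Canon EOS 450D", "Nikon D60"]

vectorizer = TfidfVectorizer(max_features=300)
vectorizer.fit(corpus)                        # fit on the full training corpus
title_vectors = vectorizer.transform(titles)  # TF-IDF vectors for Names/Titles only
print(title_vectors.shape)                    # (num products, vocabulary size <= 300)
```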
Initial index vectors are generated for every word that occurs in the training corpus. The vectors have dimension 300, with elements of +1 or 0 generated using a Gaussian distribution. Context vectors are then produced by scanning through the training corpus: each time a word occurs within a sliding context window of size k, the index vector of each context word is added to the context vector of the word in question. \cite{Sahlgren20051}
For each product's Name/Title, we transform each word in the string into its corresponding context vector and average these vectors to obtain a vector representation of the product.
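The two Random Indexing steps above can be sketched as follows. Note the sparsity of the index vectors and the window size are illustrative assumptions, not the project's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, k = 300, 2  # vector dimension from the description; window size assumed

# Placeholder training corpus, one tokenized line per product.
corpus = [
    "canon eos 450d digital slr camera".split(),
    "nikon d60 dslr camera body".split(),
]
vocab = sorted({w for line in corpus for w in line})

# Random index vector per word (sparsity here is an assumption).
index_vec = {w: rng.choice([-1, 0, 0, 0, 1], size=dim) for w in vocab}
context_vec = {w: np.zeros(dim) for w in vocab}

# Scan the corpus: add each neighbour's index vector (within window k)
# to the context vector of the word in question.
for line in corpus:
    for i, w in enumerate(line):
        for j in range(max(0, i - k), min(len(line), i + k + 1)):
            if j != i:
                context_vec[w] += index_vec[line[j]]

def product_vector(title: str) -> np.ndarray:
    """Average the context vectors of the words in a product's Name/Title."""
    words = [w for w in title.lower().split() if w in context_vec]
    return np.mean([context_vec[w] for w in words], axis=0)

print(product_vector("Canon EOS 450d").shape)  # (300,)
```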
FastText Python Installation:
https://fasttext.cc/docs/en/support.html#building-fasttext-python-module
FastText Pre-Trained Word Vectors:
CBOW: https://fasttext.cc/docs/en/crawl-vectors.html
SkipGram: https://fasttext.cc/docs/en/pretrained-vectors.html
- TF-IDF
- TF-IDF.ipynb
- Random Indexing
- Random Indexing.ipynb
- FastText Unsupervised (CBOW/SkipGram)
- FastText(Unsupervised).ipynb
- FastText Wikipedia (CBOW)
- FastText (Wiki CBOW).ipynb
- FastText Wikipedia (SkipGram)
- FastText (Wiki SkipGram).ipynb
- FastText Wikipedia (CBOW/SkipGram) + Random Indexing
- Random Indexing (Pre-Trained CBOW).ipynb
- Random Indexing (Pre-Trained SkipGram).ipynb
- Combination of Ranked List (Average Cosine Similarity)
- TF-IDF + Pre-Trained (SkipGram) Cosine Similarity.ipynb
- Pre-Trained + Unsupervised (SkipGram) Cosine Similarity.ipynb
- Word Embedding Concatenation
- TF-IDF + Pre-Trained (SkipGram) Vector Concat.ipynb
- Pre-Trained + Unsupervised (SkipGram) Vector Concat.ipynb
- Blocking using Manufacturer and Price Fields
- FastText (Wiki SkipGram) + Blocking (-0.25).ipynb
- FastText (Wiki SkipGram) + Blocking (-3).ipynb
- TF-IDF + Pre-Trained (SkipGram) Vector Concat + Blocking (-0.25).ipynb
- TF-IDF + Pre-Trained (SkipGram) Vector Concat + Blocking (-3).ipynb
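The two combination strategies in the notebooks above can be sketched with toy vectors standing in for the real embeddings (all names and values below are illustrative): (1) average the cosine similarities produced by two embedding models, and (2) concatenate the two embeddings and score once.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy embeddings of the same record pair under two different models.
tfidf_a, tfidf_b = np.array([1.0, 0.0, 1.0]), np.array([1.0, 1.0, 0.0])
ft_a, ft_b = np.array([0.2, 0.9]), np.array([0.1, 0.8])

# (1) Combination of ranked lists: average the two cosine similarities.
avg_sim = (cosine(tfidf_a, tfidf_b) + cosine(ft_a, ft_b)) / 2

# (2) Vector concatenation: one similarity on the joined vectors.
concat_sim = cosine(np.concatenate([tfidf_a, ft_a]),
                    np.concatenate([tfidf_b, ft_b]))
print(avg_sim, concat_sim)
```

Either score can then feed the same thresholding/blocking logic used in the notebooks.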