Semantic clustering of text using K-means on a custom-built contextual word co-occurrence matrix, decomposed using SVD. The truncated length of the word vectors is estimated by plotting the explained variance of the decomposed matrix.
Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews
The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.
Number of reviews : 568,454
Number of users : 256,059
Number of products : 74,258
Timespan: Oct 1999 - Oct 2012
Number of Attributes/Columns in data: 10
- Id
- ProductId - unique identifier for the product
- UserId - unique identifier for the user
- ProfileName
- HelpfulnessNumerator - number of users who found the review helpful
- HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
- Score - rating between 1 and 5
- Time - timestamp for the review
- Summary - brief summary of the review
- Text - text of the review
The code below cleans the review text by removing HTML tags and punctuation, stores the result as a new column in the database, and writes it to disk. Part 2 then uses this output to measure the accuracy of KNN with 10-fold cross-validation on the vectorized input data for each of four featurizations: BoW, TF-IDF, W2V, and TF-IDF-weighted W2V.
- Duplicate reviews sharing the same UserId and timestamp were found and removed.
- Discrepancies in HelpfulnessDenominator were found and cleaned.
- The final.sqlite database is used for further processing, such as text-to-vector operations.
- Preprocessing is a one-time effort, whereas the training and visualization steps require multiple runs. It is therefore prudent to keep the preprocessing step independent so that it does not have to be rerun each time.
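The cleaning described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the original notebook code: the helper names `clean_text` and `clean_reviews` and the output column `CleanedText` are hypothetical, the rule used for the HelpfulnessDenominator discrepancy (numerator larger than denominator) is an assumption, and stop-word removal is left out for brevity.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")          # HTML tags such as <br />
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # anything that is not a letter or whitespace


def clean_text(text):
    """Lower-case the text, strip HTML tags and punctuation, collapse whitespace."""
    text = TAG_RE.sub(" ", text.lower())
    text = NON_LETTER_RE.sub(" ", text)
    return " ".join(text.split())


def clean_reviews(rows):
    """rows: iterable of dicts keyed by the dataset's column names."""
    seen = set()
    cleaned = []
    for row in rows:
        key = (row["UserId"], row["Time"])
        if key in seen:
            continue  # duplicate review: same user and same timestamp
        # Assumed discrepancy rule: the numerator cannot exceed the denominator.
        if row["HelpfulnessNumerator"] > row["HelpfulnessDenominator"]:
            continue
        seen.add(key)
        cleaned.append({**row, "CleanedText": clean_text(row["Text"])})
    return cleaned
```

The cleaned rows could then be written back to a SQLite table so that later steps read from disk instead of repeating the preprocessing.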
The preprocessing step produces the final.sqlite file after data preparation and cleaning. The review text is now free of punctuation, HTML markup, and stop words.
The goal is to find clusters of semantically related words from Amazon reviews using a contextual word co-occurrence matrix. The co-occurrence matrix is factorized using SVD and truncated to an estimated k chosen on the basis of maximum explained variance.
- Found the top features based on TF-IDF featurization.
- Created a word co-occurrence matrix with neighbourhood = 5.
- Decomposed the word co-occurrence matrix using SVD to obtain the matrix U.
- Found the best value of ’k’ based on the explained variance of U (same as in PCA).
- Applied TruncatedSVD on U to obtain the word vectors (U reduced to ’k’ components).
- Ran K-means clustering on the standardized word vectors to find clusters.
- Took one word, found the cluster to which it belongs, and found the most similar words using the cosine similarity metric.
- Drew a word cloud based on cosine similarity. Repeated this and the previous step for a couple of words.
- Analyzed the word vector clusters obtained.
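The matrix construction and decomposition steps above might look roughly like the sketch below. It is an illustration under stated assumptions rather than the notebook's code: the function names are hypothetical, plain `numpy.linalg.svd` stands in for the TruncatedSVD step, the share of squared singular values is used as a stand-in for explained variance, and the 0.95 threshold is an arbitrary example value.

```python
import numpy as np


def cooccurrence_matrix(docs, vocab, window=5):
    """Count how often two vocabulary words occur within `window` tokens of each other."""
    index = {w: i for i, w in enumerate(vocab)}
    mat = np.zeros((len(vocab), len(vocab)), dtype=np.float64)
    for doc in docs:
        tokens = doc.split()
        for pos, word in enumerate(tokens):
            if word not in index:
                continue
            lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
            # Neighbours on both sides of the current token, excluding the token itself.
            for other in tokens[lo:pos] + tokens[pos + 1:hi]:
                if other in index:
                    mat[index[word], index[other]] += 1
    return mat


def choose_k(singular_values, threshold=0.95):
    """Smallest k whose leading components reach `threshold` of the total squared singular values."""
    explained = singular_values ** 2
    cumulative = np.cumsum(explained) / explained.sum()
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(singular_values))


def word_vectors(mat, threshold=0.95):
    """Decompose the matrix and keep the first k columns of U, scaled by the singular values."""
    u, s, _ = np.linalg.svd(mat, full_matrices=False)
    k = choose_k(s, threshold)
    return u[:, :k] * s[:k], k
```

In practice `vocab` would be the list of top TF-IDF features, and the cumulative curve computed in `choose_k` is what one would plot to pick k by eye.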
Three user-defined functions are written for the following:
- Elbow method to find K
- Analyzing the clusters
- Generating similarity word clouds
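The functions listed above could be sketched as follows. This is an assumption-laden illustration, not the notebook's implementation: the names `elbow_inertias` and `similar_words` are hypothetical, scikit-learn's `KMeans` is assumed as the clustering implementation, and the word cloud drawing itself is omitted so that only the similarity ranking is shown.

```python
import numpy as np
from sklearn.cluster import KMeans


def elbow_inertias(vectors, k_values, seed=0):
    """Within-cluster sum of squares for each candidate K; plot these to look for the elbow."""
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=seed).fit(vectors).inertia_
        for k in k_values
    }


def similar_words(word, vocab, vectors, labels, top_n=10):
    """Words from the same cluster as `word`, ranked by cosine similarity to it."""
    idx = vocab.index(word)
    members = np.flatnonzero(labels == labels[idx])
    # Normalise rows so that a dot product equals the cosine similarity.
    norms = np.maximum(np.linalg.norm(vectors, axis=1, keepdims=True), 1e-12)
    unit = vectors / norms
    sims = unit[members] @ unit[idx]
    ranked = [(vocab[m], float(s)) for m, s in zip(members, sims) if m != idx]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)[:top_n]
```

Here `labels` would be the output of `KMeans(...).fit_predict(vectors)` on the standardized word vectors, and the ranked (word, similarity) pairs are the kind of input a word cloud can be sized by.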
The words meaningfully similar to ’food’ in cluster 15 are irregular, wholefood, brunch, bitterish, grind, robusto, dogfoodanalyst, dinner, cocunut, etc. Thus, the 15th cluster contains food and food-related words in general.
The words meaningfully similar to ’alkalin’ in cluster 16 are nonalcohol, stimuli, vitamin, needl, lecithin, technivorm, latent, microbiolog, benzocain, sunburn, etc. Thus, the 16th cluster contains chemical and medicine-related words.
The words meaningfully similar to ’gourmet’ in cluster 1 are pillsburi, chia, blueberri, dragonfruit, naturopath, maker, healthi, masterpiec, halal, semi, reput, fast, fatti, pilchard, etc. Thus, the 1st cluster contains gourmet and food/health-related words.
- There are clusters where a semantic relation can be noticed. For instance, the words hypoglecemia and hysterectomi are grouped together (both are medical words).
- The words grouped together with the word ’alkalin’ in cluster 16 are nonalcohol, stimuli, vitamin, needl, lecithin, technivorm, latent, microbiolog, benzocain, sunburn, etc. Thus, the 16th cluster contains chemical and medicine-related words.
- The words meaningfully similar to ’food’ in cluster 15 are irregular, wholefood, brunch, bitterish, grind, robusto, dogfoodanalyst, dinner, cocunut, etc. Thus, the 15th cluster contains food and food-related words in general.
- The words grouped with ’gourmet’ in cluster 1 are pillsburi, chia, blueberri, dragonfruit, naturopath, maker, healthi, masterpiec, halal, semi, reput, fast, fatti, pilchard, etc. Thus, the 1st cluster contains gourmet and food/health-related words.
- Thus, using the factor-decomposed word co-occurrence matrix, semantically related words are clustered from the Amazon reviews.