Most of the data on the web comes in some free-form format: it is both unstructured and unlabeled, which makes it challenging to process and analyze. Labeling it by hand, one document, one snippet, one comment at a time, is labor-intensive, time-consuming and not something anyone looks forward to. This project demonstrates techniques that perform unsupervised learning on a large corpus of unstructured text:
- Determine the cluster topics using NMF (Non-negative Matrix Factorization). The topics themselves are just numbered components; each one comes with the most common words associated with that cluster, and it is up to us to give the topic a meaningful name. That is the only step involving human judgement, and it is optional: it only helps with readability and with judging the quality of the unsupervised topic modelling (a scikit-learn sketch follows this list).
- After the clustering, assign a topic to each comment in the corpus (see the assignment sketch below).
- Finally, analyze each comment for sentiment polarity and intensity of emotion using VADER (Valence Aware Dictionary for sEntiment Reasoning), an NLTK library (see the VADER sketch below).
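As a rough sketch of the first step, scikit-learn's NMF can be run on a TF-IDF matrix built from the comments. The file name, column name and number of topics below are illustrative assumptions, not settings from this project:

```python
# Minimal sketch of NMF topic modelling with scikit-learn.
# Assumptions: comments live in a "comment" column of comments.csv,
# and 5 topics is an arbitrary choice for illustration.
import pandas as pd
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

comments = pd.read_csv("comments.csv")["comment"].dropna()

# TF-IDF turns the raw text into the non-negative matrix NMF needs
vectorizer = TfidfVectorizer(stop_words="english", max_df=0.95, min_df=2)
tfidf = vectorizer.fit_transform(comments)

# Factorize into a document-topic matrix (W) and a topic-word matrix (H)
nmf = NMF(n_components=5, random_state=42)
W = nmf.fit_transform(tfidf)
H = nmf.components_

# Print the top words for each numbered topic; naming them is the human step
words = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(H):
    top_words = [words[i] for i in topic.argsort()[-10:][::-1]]
    print(f"Topic {topic_idx}: {', '.join(top_words)}")
```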
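Topic assignment then falls straight out of the same factorization: the document-topic matrix W scores every comment against every topic, and the highest-scoring topic wins. Continuing the sketch above:

```python
# Continuing the NMF sketch above: assign each comment the topic
# with the highest weight in the document-topic matrix W.
assigned = pd.DataFrame({
    "comment": comments.values,
    "topic": W.argmax(axis=1),
})
print(assigned.head())
```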
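For the sentiment step, NLTK exposes VADER through SentimentIntensityAnalyzer; the compound score summarizes polarity in [-1, 1] and its magnitude reflects intensity. A minimal sketch (the example sentence is, of course, made up):

```python
# Minimal sketch of VADER sentiment scoring with NLTK.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

# polarity_scores returns neg/neu/pos proportions plus a compound score in [-1, 1]
scores = sia.polarity_scores("The new update is absolutely fantastic!!!")
print(scores)
```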
For those who like enterprise BI tools, here is a very modest section on running and distributing a Python script in Power BI.
The benefit of Power BI is the instant interactivity: you can check the quality of the sentiment analysis in real time. If your data source refreshes its content, you can set Power BI to refresh on a pre-determined schedule and, boom, you have real-time data at your fingertips.
To make Power BI Python-ready, go here, skip the first part and scroll down to the "Enable Python scripting" section. That's where you learn how to properly connect Power BI with Python (Power BI can't pick up the Python installation on its own, so you have to point it to the Python home directory in the Power BI "Options" menu). Well, a small price to pay...
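Once Python scripting is enabled, a script added through Get Data > Python script works on plain pandas DataFrames: any DataFrame the script defines is offered to Power BI as a table you can load. A hedged sketch of what a sentiment-scoring script might look like (the file name and column name are assumptions for illustration):

```python
# Sketch of a script for Power BI's "Get Data > Python script" step.
# Every pandas DataFrame defined here is offered to Power BI as a table.
# Assumes comments.csv with a "comment" column, and that the VADER
# lexicon has already been downloaded (see the NLTK sketch above).
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
df = pd.read_csv("comments.csv")
df["compound"] = df["comment"].apply(lambda c: sia.polarity_scores(str(c))["compound"])
```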
To enable matplotlib.pyplot visualizations through Power BI, go here.
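Inside a Python visual, Power BI passes the fields you drop on the visual to the script as a pandas DataFrame named dataset, and whatever matplotlib renders with plt.show() becomes the visual. A minimal sketch, assuming topic and compound are among those fields:

```python
# Sketch of a Power BI Python visual script.
# Power BI exposes the visual's fields as a DataFrame named "dataset";
# "topic" and "compound" are assumed field names for illustration.
import matplotlib.pyplot as plt

avg = dataset.groupby("topic")["compound"].mean()
avg.plot(kind="bar", title="Average VADER compound score per topic")
plt.xlabel("Topic")
plt.ylabel("Compound sentiment")
plt.show()  # Power BI captures the matplotlib figure as the visual
```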