Semantic Word Clouds
Link to our project: http://swc-flask-aws-environement.eba-hzdjd3su.ap-northeast-2.elasticbeanstalk.com/
Project presentation: https://docs.google.com/presentation/d/1a484mO3MLXM_jfHvXKSjxntxaqrfESYLEZOqo_uvWfQ/edit#slide=id.gaf72c2d91d_0_27
You’re in the right place if you want to make better word clouds. Research on word cloud design suggests that the best word clouds group similar words together, with space between groups and a distinct color for each group. Building those groups by hand can be a pain, so we give you a head start. Computers are super-quick at this kind of thing but far from perfect, so simply drag and drop words between clusters until you’re satisfied.
How it works:

- Identify keywords with TF-IDF. TF-IDF lets us focus on words that appear unusually often in the user's text while ignoring words that are common everywhere. It does this by comparing the frequency of a word in the text submitted by the user against the logarithmically adjusted frequency of that word in a reference corpus (the Reuters corpus in our case); this scoring is sketched after this list.
- Get the word embedding for each keyword from a pre-trained GloVe model (a loading sketch also appears below).
- Run k-means clustering on the word embeddings: k points are randomly selected as initial cluster centroids, each word vector is assigned to the cluster of its nearest centroid, the centroids are recomputed as the average of the word vectors in each cluster, and the vectors are reassigned. This repeats for a fixed number of iterations or until the positions of the centroids stop changing (see the final sketch below).
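
Here is a minimal sketch of the TF-IDF scoring described above, assuming NLTK's Reuters corpus is available. The function and variable names are illustrative, not the project's actual `build_idf_csv.py`:

```python
import math
from collections import Counter

from nltk.corpus import reuters  # reference corpus (requires nltk.download("reuters"))

def build_idf():
    # Document frequency: in how many Reuters articles does each word appear?
    doc_ids = reuters.fileids()
    df = Counter()
    for doc_id in doc_ids:
        df.update({w.lower() for w in reuters.words(doc_id)})
    # Logarithmically adjusted inverse document frequency.
    return {w: math.log(len(doc_ids) / df[w]) for w in df}

def top_keywords(text, idf, n=40):
    tokens = [w.lower() for w in text.split() if w.isalpha()]
    tf = Counter(tokens)
    # Words missing from the reference corpus get the maximum IDF,
    # so rare words are kept rather than discarded.
    max_idf = max(idf.values())
    scores = {w: count * idf.get(w, max_idf) for w, count in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]
```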
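
Loading the pre-trained GloVe vectors is just a matter of parsing the plain-text file; the path below assumes the 100-dimensional file from the glove.6B download mentioned in the setup steps:

```python
import numpy as np

def load_glove(path="glove.6B.100d.txt"):
    # Each line holds a word followed by its vector components, space-separated.
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *components = line.rstrip().split(" ")
            vectors[word] = np.asarray(components, dtype=np.float32)
    return vectors

# Look up an embedding for each keyword, skipping out-of-vocabulary words:
#   glove = load_glove()
#   X = np.array([glove[w] for w in keywords if w in glove])
```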
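
And a from-scratch version of the k-means loop, matching the description above (random initial centroids, assign, recompute, repeat until the centroids stop moving). In practice a library implementation such as scikit-learn's `KMeans` does the same job:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly pick k word vectors as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each vector to the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its cluster
        # (keeping the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return labels
```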
To run it locally:

- Clone this repo
- Download and unzip the pre-trained word vectors from http://nlp.stanford.edu/data/glove.6B.zip into the back-end directory
- Run build_idf_csv.py to build a table of reference Inverse Document Frequency (IDF) values
- Run convert_format.py
- Run ws.py to start the server
- Open index.html in the front-end directory