Simple Word Analysis

A program that analyses the text on a website, outputs a count of each word, and attempts to cluster the words into n buckets of similar words. See the notebook to get an idea of how it works.

Installation

Clone the repository and install the requirements with pip or Anaconda (recommended). You need Python 3. The analysis depends on a pre-trained English language model from spaCy, which needs to be downloaded and installed separately.

Conda

conda install --file requirements.txt
python -m spacy download en_core_web_md

Pip

pip install -r requirements.txt
python -m spacy download en_core_web_md

Usage

The Python script takes three positional arguments, in this order:

Number of clusters to put words into
Path where the output is to be placed
URL of the website with the text to analyse

The model is very sensitive to the number of clusters, so try to pick a number that matches the number of word types you expect at the URL. A good default is 8. An extra cluster is always created to hold words that are not recognised in the pre-trained model vocabulary.

Example

python word-analysis.py 5 example_output https://www.bbc.co.uk/news
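
As an illustration only, the three positional arguments could be parsed along the following lines. This is a minimal sketch using the standard library argparse module; the names build_parser, n_clusters, output_path and url are assumptions made for the example and are not taken from word-analysis.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three positional arguments (names are hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Count and cluster the words found on a web page."
    )
    # Order matters: positional arguments are read in the order they are added.
    parser.add_argument("n_clusters", type=int,
                        help="number of clusters to put words into")
    parser.add_argument("output_path",
                        help="path where the output is to be placed")
    parser.add_argument("url",
                        help="URL of the website with the text to analyse")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.n_clusters, args.output_path, args.url)
```

Declaring the cluster count with type=int means a non-numeric first argument is rejected before any text is downloaded.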

Output

A CSV file with the columns:

word
word count
cluster number
distance to cluster centre
Decomposition of the word vector into 2-dimensional space using Principal Component Analysis (PCA)
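
To make the last two columns concrete, the sketch below shows one way they could be derived with NumPy once every word has a vector and a cluster number. It assumes the cluster centre is the mean of the vectors in that cluster and computes PCA through a singular value decomposition. The function name derived_columns is hypothetical, and the project's own code may compute these values differently.

```python
import numpy as np


def derived_columns(vectors, labels):
    """Return (distance to cluster centre, 2-D PCA coordinates) for each word.

    Assumption: a cluster centre is the mean of the vectors assigned to it.
    """
    vectors = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)

    # Euclidean distance from each vector to the centre of its own cluster.
    distances = np.empty(len(vectors))
    for cluster in np.unique(labels):
        members = labels == cluster
        centre = vectors[members].mean(axis=0)
        distances[members] = np.linalg.norm(vectors[members] - centre, axis=1)

    # PCA: centre the data, then project onto the first two principal axes.
    centred = vectors - vectors.mean(axis=0)
    _, _, components = np.linalg.svd(centred, full_matrices=False)
    coords = centred @ components[:2].T
    return distances, coords


if __name__ == "__main__":
    toy_vectors = [[0, 0, 0], [2, 0, 0], [10, 10, 0], [10, 12, 0]]
    toy_labels = [0, 0, 1, 1]
    dist, xy = derived_columns(toy_vectors, toy_labels)
    print(dist)
    print(xy.shape)
```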

An HTML file containing an interactive chart to explore the data.

See the output obtained when running the example above

Future Work

Tests and Evaluation

Tests could be written with a framework such as pytest to test each component as the project evolves. Error handling and logging should be added before the code is used in production. Because the input to the program is a URL whose content changes continually, a dedicated test website may have to be set up whose text does not change and whose word counts are known.
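
A test of this kind might look like the sketch below, which checks word counts against a fixed piece of text instead of a live URL. The count_words helper is a hypothetical stand-in written for this example; the project's own counting step uses spaCy and may tokenise differently. pytest collects any function whose name starts with test_, so no pytest import is needed here.

```python
import re
from collections import Counter


def count_words(text: str) -> Counter:
    """Hypothetical stand-in for the counting step: lower-case, split on non-letters."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def test_count_words_on_fixed_text():
    # The text is fixed, so the expected counts are known in advance.
    counts = count_words("The cat saw the other cat.")
    assert counts == {"the": 2, "cat": 2, "saw": 1, "other": 1}
```

Saved as a file such as test_counts.py (an assumed name), this would be run with the pytest command.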

Evaluating the clusters currently requires human judgement. If the tool is to be used on a specific domain, a training data set could be created giving the expected outputs for given texts. Accuracy could then be used as an evaluation metric, and the problem becomes a supervised one instead of an unsupervised one. The model could also then be improved by fine-tuning on a domain-specific corpus.
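
Cluster numbers are arbitrary, so they have to be mapped to the expected labels before accuracy can be computed. The sketch below shows one possible approach, assuming each cluster is assigned the most common expected label among its words. The function name cluster_accuracy and the example labels are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_accuracy(cluster_ids, expected_labels):
    """Accuracy after mapping each cluster to its most common expected label."""
    if len(cluster_ids) != len(expected_labels) or not expected_labels:
        raise ValueError("inputs must be non-empty and of equal length")

    # Collect the expected labels seen in each cluster.
    labels_by_cluster = defaultdict(list)
    for cluster, label in zip(cluster_ids, expected_labels):
        labels_by_cluster[cluster].append(label)

    # Assumption: a cluster "means" whichever label occurs most often in it.
    majority = {
        cluster: Counter(labels).most_common(1)[0][0]
        for cluster, labels in labels_by_cluster.items()
    }

    correct = sum(
        majority[cluster] == label
        for cluster, label in zip(cluster_ids, expected_labels)
    )
    return correct / len(expected_labels)


if __name__ == "__main__":
    clusters = [0, 0, 0, 1, 1]
    expected = ["animal", "animal", "food", "food", "food"]
    print(cluster_accuracy(clusters, expected))  # 4 of 5 words match: 0.8
```

Note that this score can be inflated by using many small clusters, so it is best compared across runs with the same number of clusters.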

Database output

SQLAlchemy can be used to output to a relational database. To evaluate cluster fits, you would want to store the website text and final clusters every time the program is run. A document store such as MongoDB would be more appropriate here than SQL.
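
As a rough illustration of storing one record per run, the sketch below uses the standard library sqlite3 module in place of SQLAlchemy or MongoDB, purely to keep the example self-contained. The clusters are stored as a JSON string, which mimics the document-style layout suggested above. The table name runs, its columns and the save_run function are all assumptions.

```python
import json
import sqlite3


def save_run(conn: sqlite3.Connection, url: str, page_text: str, clusters: dict) -> int:
    """Store the page text and final clusters for one run; return the new row id."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS runs (
            id INTEGER PRIMARY KEY,
            run_at TEXT DEFAULT CURRENT_TIMESTAMP,
            url TEXT NOT NULL,
            page_text TEXT NOT NULL,
            clusters_json TEXT NOT NULL
        )
        """
    )
    cursor = conn.execute(
        "INSERT INTO runs (url, page_text, clusters_json) VALUES (?, ?, ?)",
        (url, page_text, json.dumps(clusters)),
    )
    conn.commit()
    return cursor.lastrowid


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")  # in-memory database for the demo
    run_id = save_run(connection, "https://www.bbc.co.uk/news",
                      "the cat saw the dog", {"cat": 0, "dog": 0, "saw": 1})
    print(run_id)
```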

Website and API

The core code can be wrapped in a Flask API or a cloud lambda function. A Docker container would be needed to deploy the code with all data science dependencies and the spaCy language model. A front-end website can also be built once there is an API that accepts a POST request with the given parameters and returns JSON with the data. Plotly was used to make the plot in Python. Plotly also has a JavaScript library.

Wordcloud

A wordcloud is a popular way to visualise word counts. The Python wordcloud library is easy to use.
