This is the code base used for analysis of blog posts from WordPress. The analysis consists of: 1)pre-processing which places the json events into a Python Pandas dataframe; 2) temporal sequence analysis which pulls out distinct temporal event patterns and does a K-Means clustering of these; 3) a text similarity analysis which pulls out the similar posts; and 4) and graph based analysis which pulls out subgraphs using interactions around posts and blog follows.
This distribution supports either MacOS or Linux. The MacOSX assumes that Python Anaconda package manager is installed. This is a simplifying assumption that allows me to sidestep some of the arduous install involved in getting all the packages (e.g. numpy, scipy, snap.py) together in one environment.
The Linux install makes no such assumption.
To put everything together, running
make deps check
will create the python virtualenv under anaconda, download snap.py and run a small set of tests.
On Linux
run
sudo make deps
then
make check
The distribution as mentioned is split into four components. The Python modules that implement these pieces are stored in
the analysis
subdirectory. Each module has a main class which provides a build method that constructs a set of internal lists and tables that are used the
analysis. A class implements additional functionality -- K-Means clustering or graph rendering -- that are specific to its analysis.
preprocessing.py. Parses the json of the original records and places them into a dataframe and other structures that are used in the subsequent analyses. The main class is Preprocessor which parses the blog json data and creates and internal Pandas dataframe used by subsequent analyses. An instance of Preprocessor is created by supplying the path to the json file containing the blog event information.
activity_patterns.py. Teases out the temporal patterns and does the temporal pattern clustering. The main class is ActivityAnalyzer. An instance of
ActivityAnalyzer is created by providing an instance of Preprocessor that has run its build
method. Besides build, ActivityAnalyzer provides
a compute_clusters method
that computes the set of temporal actvity clusters. This is accessed by the cluster member variable computed during each invocation of compute_clusters.
nlp.py. Constructs the tfidf vectors and computes the pairwise distances for the posts. The main class is PostSimilarityAnalyzer. Instances of the class are instantiated by
providing an instance of Preprocessor that has run the build
method. The get_similarity_group_counter
method provides a table storing posts grouped by similarity.
graph_analysis.py. This contains routines to create the allegrographdb nodes and it also does the community detection. The main class is GraphAnalyzer. A GraphAnalyzer instance
is created by providing an instance of ActivityAnalyzer on which the build
method has been run. GraphAnalyzer provides a build
method that computes the core graph nodes, edges and freatures. The display_summary_graph
method displays the larger graphs provided in the writeup which show all the nodes. The display_community_graph
method displays the subgraph of the communitities detected around the paragraph-length blog posts. The method write_triples
takes the name of a file to which rdf triple representation of the graph is stored in. The build
method must be run to write the triples and generate the display. The member modularity
stores a metric of the quality of the community subgraphs identified during build
. The member variable sig_communities_by_id
stores the nodes associated with each subgraph computed during the build
method invocation. Analysis based upon this data is given in the writeup.
The code can be invoked in an IPython Notebook to facilitate display of the graphs.
After running
make deps check
start a notebook by running
make notebook
As an example, the complete graph is displayed in the notebook using the following operations
import analysis
from analysis import (activity_patterns, graph_analysis, nlp, preprocessing)
from analysis.preprocessing import Preprocessor
from analysis.activity_patterns import ActivityAnalyzer
from analysis.nlp import PostSimilarityAnalyzer
from analysis.graph_analysis import GraphAnalyzer
import numpy as np
import snap
import matplotlib.pyplot as plt
get_ipython().magic(u'matplotlib inline')
# In[3]:
preproc = Preprocessor('./data/wrangler_test.json')
# In[4]:
patterns = ActivityAnalyzer(preproc)
patterns.build()
# In[5]:
text_proc = PostSimilarityAnalyzer(preproc)
text_proc.build()
# In[6]:
graph_proc = GraphAnalyzer(patterns)
graph_proc.build()
# In[9]:
graph_proc.diplay_summary_graph()