The scripts in dm_graphs.py provide a way to explore the relationships between words in a distributional model. They rely on the NetworkX package. Please see the paper for more context. For an interactive demo, go here.
This work was presented as a poster at the ADS workshop. Bibtex:
@unpublished{miltenburg2015exploring,
Author = {Emiel van Miltenburg},
Date-Added = {2016-07-17 09:22:11 +0000},
Date-Modified = {2016-07-17 16:43:09 +0000},
Note = {Presented as a poster at the Advances in Distributional Semantics workshop, collocated with IWCS. GitHub page: \url{https://github.com/evanmiltenburg/dm-graphs}},
Title = {Exploring and visualizing distributional models using graphs},
Year = {2015}}
The file googlenews.py shows how to use the dm_graphs module. Just create an iterable that contains tuples (u, v, w) corresponding to edges between u and v with weight w. The weight in this case is the cosine similarity between u and v. Then create a graph, and fill it with the data. The rest of the code shows how to make the network easier to visualize by using dm_graphs.graph_reduce(G) and dm_graphs.MST_pathfinder(G). These return a sparser version of the network.
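As an illustration of the input format, here is a minimal sketch that produces (u, v, w) tuples from word vectors. The toy vectors and the helper names `cosine` and `similarity_edges` are assumptions made for this example; they are not part of dm_graphs or googlenews.py.

```python
import math
from itertools import combinations

# Hypothetical toy vectors; a real distributional model would supply these.
vectors = {
    "cat": [1.0, 0.0, 1.0],
    "dog": [1.0, 0.0, 0.8],
    "car": [0.0, 1.0, 0.1],
}


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_edges(vectors):
    """Yield one (u, v, w) tuple for every unordered pair of words."""
    for u, v in combinations(sorted(vectors), 2):
        yield (u, v, cosine(vectors[u], vectors[v]))


edges = list(similarity_edges(vectors))

# With NetworkX installed, tuples in this shape can be loaded with:
#   G = networkx.Graph()
#   G.add_weighted_edges_from(edges)
```

For a realistic vocabulary the number of pairs grows quadratically, so in practice you would probably only emit the nearest neighbors of each word rather than every pair.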
The folder googlenews-demo contains some output files (.gexf) that you can open with Gephi. The PDF files contain visualizations of this data. For these, I used YifanHu's MultiLevel layout algorithm, followed by ForceAtlas2. The colors were randomly generated by Gephi. I used the modularity detection function to detect clusters, and then applied partition coloring by modularity class.
Main functions
MST_pathfinder() is an implementation of the MST-pathfinder algorithm (Quirin et al. 2008).
graph_reduce() reduces the graph by only including the edges that link each node to its top-n similar neighbors. It also has an optional restriction such that every edge should have a weight above a particular threshold.
maxmax_transform() is an implementation of phase 1 of the MaxMax algorithm (Hope & Keller 2013). It transforms a weighted undirected graph into a directed graph.
maxmax_clusters() is a function inspired by the MaxMax algorithm that produces soft clusters of words. Clusters should correspond to word senses.
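To make the top-n idea behind graph_reduce() concrete, here is a rough sketch that works on plain (u, v, w) tuples instead of a NetworkX graph. The function name `top_n_reduce` and its parameters are hypothetical, and it assumes that an edge is kept when it is in the top n of either endpoint; the actual behavior and signature of graph_reduce() may differ.

```python
from collections import defaultdict


def top_n_reduce(edges, n=2, threshold=None):
    """Keep an edge if it is among the n heaviest edges of either endpoint.

    edges: iterable of (u, v, w) tuples describing an undirected graph.
    threshold: if given, edges with w <= threshold are dropped first.
    (Keeping the union over both endpoints is an assumption of this sketch.)
    """
    if threshold is not None:
        edges = [(u, v, w) for u, v, w in edges if w > threshold]
    else:
        edges = list(edges)

    # Collect the edges that touch each node.
    incident = defaultdict(list)
    for u, v, w in edges:
        incident[u].append((u, v, w))
        incident[v].append((u, v, w))

    kept = set()
    for node_edges in incident.values():
        # Sort this node's edges by weight, heaviest first, and keep n of them.
        ranked = sorted(node_edges, key=lambda e: e[2], reverse=True)
        kept.update(ranked[:n])
    return sorted(kept)


example = [
    ("a", "b", 0.9),
    ("a", "c", 0.8),
    ("a", "d", 0.3),
    ("b", "c", 0.7),
    ("c", "d", 0.2),
]
reduced = top_n_reduce(example, n=1)
reduced_strict = top_n_reduce(example, n=1, threshold=0.5)
```

Note that the threshold can leave a node without any edges at all: in `reduced_strict`, node "d" disappears because both of its edges fall below 0.5.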
Graph analysis
main_graph() returns the largest connected component.
graph_analysis() returns some statistics about the graph, including a suggested partition based on the Louvain method.
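For readers unfamiliar with the term, the sketch below finds the largest connected component of an edge list with a breadth-first search. It is a standalone illustration, not the code of main_graph(); the name `largest_component` is hypothetical. With NetworkX, `max(nx.connected_components(G), key=len)` yields the same node set.

```python
from collections import defaultdict, deque


def largest_component(edges):
    """Return the set of nodes in the largest connected component."""
    neighbours = defaultdict(set)
    for u, v, _w in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    seen = set()
    best = set()
    for start in neighbours:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for other in neighbours[node]:
                if other not in component:
                    component.add(other)
                    queue.append(other)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


toy_edges = [("a", "b", 1.0), ("b", "c", 1.0), ("x", "y", 1.0)]
main_nodes = largest_component(toy_edges)
```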
Utilities
add_partition_data() adds partition data from the analysis to the graph.
invert_weights() changes the weight on every edge to 1 - weight.
rank_reweight() reweights the edges based on the similarity ranks of the nodes.
remove_weights() removes the weights from the graph. Useful if you don't want Gephi to make the edges thicker.
write_functions() displays the available methods to write out the graphs.
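The two simplest utilities can be pictured as follows. This sketch operates on (u, v, w) tuples rather than on a NetworkX graph, and the names `inverted` and `unweighted` are hypothetical stand-ins, so treat it as a description of the effect rather than of the module's code.

```python
def inverted(edges):
    """Replace each weight w by 1 - w, turning similarities into distances."""
    return [(u, v, 1 - w) for u, v, w in edges]


def unweighted(edges):
    """Drop the weight from each edge, leaving plain (u, v) pairs."""
    return [(u, v) for u, v, _w in edges]


sample = [("cat", "dog", 0.75), ("cat", "car", 0.5)]
distances = inverted(sample)
pairs = unweighted(sample)
```

Inverting is presumably useful when an algorithm expects distances, where a smaller value means a closer pair, instead of similarities.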