/dm-graphs

Scripts to explore and visualize distributional semantic models using graphs.

Primary LanguagePythonApache License 2.0Apache-2.0

dm-graphs

The scripts in dm_graphs.py provide a way to explore the relationships between words in a distributional model. It relies on the NetworkX package. Please see the paper for more context. For an interactive demo, go here.

This work was presented as a poster at the ADS workshop. Bibtex:

  @unpublished{miltenburg2015exploring,
  	Author = {Emiel van Miltenburg},
  	Date-Added = {2016-07-17 09:22:11 +0000},
  	Date-Modified = {2016-07-17 16:43:09 +0000},
  	Note = {Presented as a poster at the Advances in Distributional Semantics workshop, collocated with IWCS. GitHub page: \url{https://github.com/evanmiltenburg/dm-graphs}},
  	Title = {Exploring and visualizing distributional models using graphs},
  	Year = {2015}}

Basic usage

The file googlenews.py shows how to use the dm_graphs module. Just create an iterable that contains tuples (u,v,w) corresponding to edges between u and v with weight w. The weight in this case is the cosine similarity between u and v. Then create a graph, and fill it with the data. The rest of the code shows how to make the network easier to visualize by using dm_graphs.graph_reduce(G) and dm_graphs.MST_pathfinder(G). These return a sparser version of the network.

The folder googlenews-demo contains some output files (.gexf) that you can open with Gephi. The pdf files contain visualizations of this data. For these, I used YifanHu's MultiLevel layout algorithm, followed by ForceAtlas2. The colors were randomly generated by Gephi. I used the modularity detection function to detect clusters, and then applied partition coloring by modularity class.

Features

Main functions

  • MST_pathfinder() is an implementation of MST-pathfinder algorithm (Quirin et al. 2008).
  • graph_reduce() reduces the graph by only including the edges that link each node to its top-n similar neighbors. It also has an optional restriction such that every edge should have a weight above a particular threshold.
  • maxmax_transform() is an implementation of phase 1 of the MaxMax algorithm (Hope & Keller 2013). It transforms a weighted undirected graph into a directed graph.
  • maxmax_clusters() is a function inspired by the MaxMax algorithm that produces soft clusters of words. Clusters should correspond to word senses.

Graph analysis

  • main_graph() returns the largest connected component.
  • graph_analysis() returns some statistics about the graph, including a suggested partition based on the Louvain method.

Utilities

  • add_partition_data() adds partition data from the analysis to the graph.
  • invert_weights() changes weights on all edges to 1-weight.
  • rank_reweight() reweights the edges based on the similarity ranks of the nodes.
  • remove_weights() removes the weights from the graph. Useful if you don't want Gephi to make the edges thicker.
  • write_functions() displays the available methods to write out the graphs.