wordtree

This Python library generates word tree diagrams. A word tree diagram shows how often different phrases containing a specific keyword occur in a corpus. For example, here's the keyword "dog" in the Amazon Pet Supplies review corpus:

Installation

Graphviz must be installed on your machine. See the Graphviz website for instructions. Then install the library with pip:

pip install wordtree

Example usage

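import wordtree
# The corpus is a list of strings, one per document.
documents = ["hello world", "world is my oyster"]
# Count the phrases containing "world" and build the diagram in one call.
g = wordtree.search_and_draw(corpus=documents, keyword="world")
g.render()  # creates a file world.dv.png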

API documentation

This library has two main functions: search, which counts phrase ("N-gram") frequencies in a corpus, and draw, which generates a word tree diagram from those N-gram frequencies. search_and_draw combines the two.

search

Required arguments:

corpus: list of strings to search through
keyword: single word that sits at the center of the word tree

Optional arguments:

max_n: maximum size of an N-gram to consider, e.g. max_n = 5 means only show phrases up to 5 words in length
tokenize: a function from a string to a list of strings that cuts a document into words. By default, it splits on spaces.
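
Because the default tokenizer only splits on spaces, punctuation stays attached to words ("world," and "world" count as different tokens). The sketch below shows one possible replacement. The name simple_tokenize and its lowercasing rule are hypothetical choices for illustration, not part of the library:

```python
import re

def simple_tokenize(text):
    """Hypothetical tokenizer: lowercase the text, then keep runs of
    letters, digits, and apostrophes, dropping all other punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

print(simple_tokenize("Hello, world! The world's my oyster."))
```

Going by the argument description above, a function like this would be passed as tokenize = simple_tokenize. Note that lowercasing also means the keyword itself should be given in lowercase.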

Returns:

ngrams: list of N-grams as tuples
frequencies: parallel list of frequency counts, one for each N-gram
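
To make the shape of these return values concrete, here is a minimal, self-contained sketch of keyword N-gram counting. It is not the library's implementation: the function name count_keyword_ngrams is hypothetical, and it assumes that an N-gram qualifies whenever the keyword appears anywhere inside it and that single-word N-grams are included. The real search may select N-grams differently.

```python
from collections import Counter

def count_keyword_ngrams(corpus, keyword, max_n=5,
                         tokenize=lambda s: s.split(" ")):
    """Illustrative only: count every N-gram (1 <= N <= max_n) that
    contains the keyword, and return two parallel lists."""
    counts = Counter()
    for document in corpus:
        tokens = tokenize(document)
        for n in range(1, max_n + 1):
            # Slide a window of n tokens across the document.
            for start in range(len(tokens) - n + 1):
                ngram = tuple(tokens[start:start + n])
                if keyword in ngram:
                    counts[ngram] += 1
    ngrams = list(counts)
    frequencies = [counts[ngram] for ngram in ngrams]
    return ngrams, frequencies

ngrams, frequencies = count_keyword_ngrams(
    ["hello world", "world is my oyster"], "world", max_n=2)
for ngram, frequency in zip(ngrams, frequencies):
    print(ngram, frequency)
```

The two lists are index-aligned, so zip(ngrams, frequencies) pairs each N-gram tuple with its count.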

draw