word-association-graph

A simple function that plots a word association graph linking the nouns in a given text to the adjectives and verbs in the same text.

  • The input text is a string of sentences ending in periods. If the text contains no periods, no plot is produced.
  • The output is a plot of the nouns in the text connected to the adjectives and verbs as they appear in the text.
  • k is the 'spread factor': the lower the k, the smaller the intra-cluster spread, and vice versa.
  • The nodes are sized according to their degree.
  • Nodes are colored red if they are nouns, yellow if they are adjectives, and blue if they are verbs.
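
The bullets above can be illustrated with a minimal, standard-library sketch. It is not the code in word_assoc_graph: the function name build_association_graph, the tag names NOUN/ADJ/VERB, and the rule that a noun is linked to every adjective and verb in the same sentence are all assumptions made for illustration, and the input is assumed to be already part-of-speech tagged.

```python
from collections import defaultdict
from itertools import product

# Colors follow the scheme listed above; the tag names are hypothetical.
COLOR_BY_TAG = {"NOUN": "red", "ADJ": "yellow", "VERB": "blue"}


def build_association_graph(tagged_sentences):
    """Return (edges, degree, color) for sentences given as lists of (word, tag)."""
    edges = set()
    color = {}
    for sentence in tagged_sentences:
        nouns = [word for word, tag in sentence if tag == "NOUN"]
        others = [word for word, tag in sentence if tag in ("ADJ", "VERB")]
        for word, tag in sentence:
            if tag in COLOR_BY_TAG:
                # The first tag seen for a word decides its color.
                color.setdefault(word, COLOR_BY_TAG[tag])
        # Assumption: a noun is linked to every adjective/verb in its sentence.
        edges.update((n, o) for n, o in product(nouns, others) if n != o)

    # A node's degree is the number of distinct edges touching it;
    # a plot could scale node size by this value.
    degree = defaultdict(int)
    for noun, other in edges:
        degree[noun] += 1
        degree[other] += 1
    return edges, dict(degree), color


sentences = [
    [("Sanger", "NOUN"), ("coined", "VERB"), ("name", "NOUN")],
    [("largest", "ADJ"), ("Wikipedia", "NOUN"), ("comprises", "VERB"), ("articles", "NOUN")],
]
edges, degree, color = build_association_graph(sentences)
print(sorted(edges))
print(degree["coined"], color["coined"])  # 2 blue
```

How k affects the picture depends on the layout the function uses, which this README does not state. In a force-directed layout such as NetworkX's spring_layout, for example, k sets the optimal distance between nodes, which would match the 'spread factor' description.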

How to use:

# Download word_assoc_graph.py into your working directory. Then:
from word_assoc_graph import plot_word_associations

import re

## The first paragraph of Wikipedia's article on itself
text = "Wikipedia was launched on January 15, 2001, by Jimmy Wales and Larry Sanger.[10] Sanger coined its name,[11][12] as a portmanteau of wiki[notes 3] and 'encyclopedia'. Initially an English-language encyclopedia, versions in other languages were quickly developed. With 5,748,461 articles,[notes 4] the English Wikipedia is the largest of the more than 290 Wikipedia encyclopedias. Overall, Wikipedia comprises more than 40 million articles in 301 different languages[14] and by February 2014 it had reached 18 billion page views and nearly 500 million unique visitors per month.[15] In 2005, Nature published a peer review comparing 42 science articles from Encyclopædia Britannica and Wikipedia and found that Wikipedia's level of accuracy approached that of Britannica.[16] Time magazine stated that the open-door policy of allowing anyone to edit had made Wikipedia the biggest and possibly the best encyclopedia in the world and it was testament to the vision of Jimmy Wales.[17] Wikipedia has been criticized for exhibiting systemic bias, for presenting a mixture of 'truths, half truths, and some falsehoods',[18] and for being subject to manipulation and spin in controversial topics.[19] In 2017, Facebook announced that it would help readers detect fake news by suitable links to Wikipedia articles. YouTube announced a similar plan in 2018."

text = re.sub(r"\[.*?\]", "", text) ## Remove square brackets and anything inside them
# You can do more processing (stopword removal, stemming, lemmatization, etc.) if you want

plot_word_associations(text, k=0.5, font_size=26)
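
The comment above mentions optional preprocessing such as stopword removal. Below is a rough, standard-library sketch of what that step could look like. The preprocess helper and its tiny STOPWORDS set are hypothetical and not part of word_assoc_graph; a real stopword list would be much longer. Periods are preserved because the function relies on them to delimit sentences.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "of", "and", "by", "on", "in", "to", "was", "its", "as"}


def preprocess(text):
    """Strip bracketed citations and stopwords, keeping sentence-ending periods."""
    text = re.sub(r"\[.*?\]", "", text)  # drop [10], [notes 3], ...
    kept = []
    for word in text.split():
        if word.strip(".,'\"").lower() in STOPWORDS:
            # If the stopword ended a sentence, move its period to the previous word.
            if word.endswith(".") and kept and not kept[-1].endswith("."):
                kept[-1] += "."
            continue
        kept.append(word)
    return " ".join(kept)


sample = "Sanger coined its name,[11][12] as a portmanteau of wiki[notes 3] and 'encyclopedia'."
print(preprocess(sample))
```

The cleaned string could then be passed to plot_word_associations in place of the raw text.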


'''
A longer text.
Article link: https://www.kenyonreview.org/2008/10/a-big-long-paragraph-that-none-of-us-edit/
I replaced all the special-character double quotes with single quotes when assigning it to 'text'
'''

text = """A week ago a friend invited a couple of other couples over for dinner. Eventually, the food (but not the wine) was cleared off the table for what turned out to be some fierce Scrabbling. Heeding the strategy of going for the shorter, more valuable word over the longer cheaper word, our final play was 'Bon,' which–as luck would have it!–happens to be a Japanese Buddhist festival, and not, as I had originally asserted while laying the tiles on the board, one half of a chocolate-covered cherry treat. Anyway, the strategy worked. My team only lost by 53 points instead of 58. Just the day before, our host had written of the challenges of writing short. In journalism–my friend's chosen trade, and mostly my own, too–Mark Twain's observation undoubtedly applies: 'I didn't have time to write a short letter, so I wrote a long one instead.' The principle holds across genres, in letters, reporting, and other writing. It's harder to be concise than to blather. (Full disclosure, this blog post will clock in at a blather-esque 803 words.) Good writing is boiled down, not baked full of air like a soufflé. No matter how yummy soufflés may be. Which they are. Yummy like a Grisham novel. Lately, I've been noticing how my sentences have a tendency to keep going when I write them onscreen. This goes for concentrated writing as well as correspondence. (Twain probably believed that correspondence, in an ideal world, also demands concentration. But he never used email.) Last week I caught myself packing four conjunctions into a three-line sentence in an email. That's inexcusable. Since then, I have tried to eschew conjunctions whenever possible. Gone are the commas, the and's, but's, and so's; in are staccato declaratives. Better to read like bad Hemingway than bad Faulkner. Length–as we all know, and for lack of a more original or effective way of saying it–matters. But (ahem), it's also a matter of how you use it. Style and length are technically two different things.
Try putting some prose onscreen, though, and they mix themselves up pretty quickly. This has much to do with the time constraints we claim to feel in the digital age. We don't have time to compose letters and post them anymore–much less pay postage, what with all the banks kinda-sorta losing our money these days–so we blast a few emails. We don't have time to talk, so we text. We don't have time to text to specific people, so we update our Facebook status. We don't have time to write essays, so we blog. I'm less interested by the superficial reduction of words–i.e. the always charming imho or c u l8r–than the genres in which those communications occur: blogs, texts, tweets, emails. All these interstitial communiques, do they really reflect super brevity that would make Twain proud? Or do they just reflect poorly stylized writing that desperately seeks a clearer form? I rather think the latter. Clive Thompson wrote last month in the NYT Magazine that constant digital updates, after a day, can begin 'to feel like a short story; follow it for a month, and it's a novel.' He was right to see the bits as part of a larger whole. The words now flying through our digital pipes & ether more or less tend to resemble parts of bigger units, perhaps even familiar genres. But stories and novels have definite conclusions; they also have conventional lengths. Quick, how long is the conventional blog, when you add up all of its posts and comments? How long is the longest email thread you send back and forth on a single topic? Most important: What exactly are we writing when we're doing all of this writing? I won't pretend to coin a whole new term here; I still think the best we can muster is a more fitting analogue. And if we must find an analogue in an existing literary unit, I propose the paragraph. Our constant writing has begun to feel like a neverending digital paragraph.
Not a tight, stabbing paragraph from The Sun Also Rises or even a graceful, sometimes-slinking, sometimes-soaring paragraph from Absalom! Absalom!, I mean a convoluted, haphazard, meandering paragraph, something like Kerouac's original draft of On the Road–only taped together by bytes. And 1 percent as interesting. Paragraphs, particularly those that wrap from one page to the next, inherently possess a necessary suspension that tightens the reader's focus yet breaks down the narrative into digestable sections. Just like emails or blogs or texts. The mental questions while reading all of these feel the same: 'Is this the last line or is there more?' 'Is the writer really trying to say something here, or just setting up a larger point?' 'Does this part have the information I'm looking for?' ('Can I skip ahead?') David F. Smydra Jr. is a reporter, writer, and editor living in Silicon Valley. He occasionally posts similar bursts of media fancy here."""

text = re.sub("['\",]", "", text) ## Remove quotes and commas; periods are kept as sentence delimiters

plot_word_associations(text)
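
The re.sub call above strips only single quotes, double quotes, and commas. If the goal is to drop every punctuation mark except the period, one possible sketch uses str.translate with string.punctuation. The helper name strip_punctuation_except_period is hypothetical and not part of word_assoc_graph.

```python
import string

# Every ASCII punctuation character except the period.
PUNCTUATION_TO_DROP = string.punctuation.replace(".", "")


def strip_punctuation_except_period(text):
    """Delete ASCII punctuation but keep periods as sentence delimiters."""
    return text.translate(str.maketrans("", "", PUNCTUATION_TO_DROP))


print(strip_punctuation_except_period("It's harder, (really) to be concise. Which they are!"))
```

Two caveats: string.punctuation covers ASCII only, so the en dashes in the sample text would survive, and deleting hyphens joins compounds such as 'blather-esque' into a single token. Replacing those characters with spaces instead may suit some texts better.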
