
Feb. 21, 2018 PuPPY talk on Lexicon Mining and Semiotic Squares

Lexicon Mining, Language Visualization and Semiotic Squares in Python

February 21, 2018 Talk to the Puget Sound Python Programming Group

Please see Kessler-Puppy-2018-02-21.pptx for some introductory slides and a brief survey of the psychological literature on the importance of function words in lexicon mining.

The two notebooks used are written in Python 3.6. Please run

$ pip install scattertext spacy gensim

before using them.

The first notebook, Class-Association-Scores.ipynb, demonstrates how to use Scattertext to visualize term-category associations. The notebook motivates and introduces the "Fightin' Words" formula, the Log-Odds-Ratio with an Informative Dirichlet Prior (Monroe et al. 2008). It goes on to discuss Scaled F-Score and the Dense Rank Difference. The data comes from Pang et al., 2002.
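For readers who want a feel for the "Fightin' Words" score before opening the notebook, here is a minimal sketch of the z-scored log-odds ratio with a Dirichlet prior, following the formula in Monroe et al. 2008. It is not taken from the notebook or from Scattertext: the function name `log_odds_z_scores`, the toy sentences, and the choice to use the pooled counts of both categories as the prior are all assumptions made for illustration. An informative prior is usually estimated from a larger background corpus.

```python
import math
from collections import Counter


def log_odds_z_scores(counts_a, counts_b, prior_counts):
    """Z-scored log-odds ratio of each term, category A vs. category B.

    Hypothetical helper for illustration. prior_counts supplies the
    Dirichlet pseudo-counts and must be positive for every term scored.
    """
    n_a = sum(counts_a.values())       # tokens in category A
    n_b = sum(counts_b.values())       # tokens in category B
    a_0 = sum(prior_counts.values())   # total prior mass
    scores = {}
    for term, a_w in prior_counts.items():
        y_a = counts_a.get(term, 0)
        y_b = counts_b.get(term, 0)
        # Smoothed log-odds of the term within each category.
        log_odds_a = math.log((y_a + a_w) / (n_a + a_0 - y_a - a_w))
        log_odds_b = math.log((y_b + a_w) / (n_b + a_0 - y_b - a_w))
        # Approximate variance of the difference of the two log-odds.
        variance = 1.0 / (y_a + a_w) + 1.0 / (y_b + a_w)
        scores[term] = (log_odds_a - log_odds_b) / math.sqrt(variance)
    return scores


# Toy data (assumption): one tiny "document" per category.
positive = Counter("a great great film with a moving story".split())
negative = Counter("a dull film with a dull dull story".split())
prior = positive + negative  # assumption: pooled counts stand in for a background corpus

scores = log_odds_z_scores(positive, negative, prior)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Terms at the front of `ranked` lean toward the first category and terms at the end lean toward the second. Terms used equally often in both categories, such as "film" here, score zero, which is the property that keeps frequent but uninformative words out of the extremes.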

The second notebook, Explore-Headlines.ipynb, shows how to use Scattertext to visualize the interactions between a number of document categories. The example uses headlines posted to Facebook accounts by a variety of publishers in 2016. The data is taken verbatim from Max Woolf's data set, available at https://github.com/minimaxir/clickbait-cluster under the MIT license.

I've also included a notebook, Toxic-Comments, exploring toxic comment classification from a recent Kaggle competition.


