The slides are available here.
Code for the Feb. 26, 2023 talk "Scatterchron: visualizing diachronic or multi-class corpora in whole and parts" is available at BBC One Year News.ipynb.
An interactive version is available on nbviewer.
Ensure you are using Scattertext version >= 0.2.0, and Python 3.11 or higher.
- Recommended: install conda (https://docs.conda.io/projects/conda/en/stable/user-guide/install/index.html#installing-conda-on-a-system-that-has-other-python-installations-or-packages)
- Create a new virtual environment
$ conda create -n st311 python=3.11
- Activate it
$ source activate st311
- Install spaCy
- https://spacy.io/usage
- Install Scattertext (do not use conda to install Scattertext; ensure you have version 0.2.0 or higher)
$ pip3 install -U scattertext
- If you would like to use UMAP
$ pip3 install umap-learn
- If you would like to run all of Part 2 of the tutorial and do not mind having GPL-3 software, install ScattertextVL (for viral license)
$ pip3 install -U scattertextvl
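After installing, it can be worth confirming that the environment meets the stated minimums (Python 3.11 and Scattertext 0.2.0). The following is a minimal sketch using only the standard library; the helper names `parse_version`, `meets_minimum`, and `check_environment` are illustrative and are not part of Scattertext. The version parser is deliberately simple and ignores pre-release suffixes such as `rc1`.

```python
import sys
from importlib import metadata


def parse_version(text):
    """Turn a string like '0.2.1' into (0, 2, 1).

    Non-numeric suffixes (e.g. '0rc1') are cut off, and the result is
    padded with zeros to at least three components.
    """
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def check_environment(min_python=(3, 11), min_scattertext="0.2.0"):
    """Return a list of human-readable problems (empty if all is well)."""
    problems = []
    if sys.version_info[:2] < min_python:
        problems.append(
            "Python %d.%d found; %d.%d or higher is required"
            % (sys.version_info[0], sys.version_info[1], *min_python)
        )
    try:
        installed = metadata.version("scattertext")
    except metadata.PackageNotFoundError:
        problems.append("scattertext is not installed in this environment")
    else:
        if not meets_minimum(installed, min_scattertext):
            problems.append(
                "scattertext %s found; %s or higher is required"
                % (installed, min_scattertext)
            )
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

Running the script inside the activated `st311` environment prints one warning per unmet requirement and nothing if both minimums are satisfied.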
- The documentation is vignette-based. Many features are undocumented. The code is still in beta. Breaking changes can be made at any time!
- With this in mind, don't be afraid to look through the code, make changes, and get your hands dirty.
- Test case coverage could be a lot higher. Breaking changes may have been made that didn't trigger test case failures.
- The visualization framework is written in JavaScript and D3 v4. Browsers do not consistently implement the same JavaScript standard, and their implementations can shift from version to version. In other words, you may have to modify the JavaScript code to fix your visualization.
- Introducing Scattertext
Part 1 of the tutorial is available at Keyness Workshop Tutorial Part 1 - It's good to be flawed.ipynb
An interactive version is available on nbviewer.
- The Rotten Tomatoes Corpus
- Creating text-based corpora
- Counting terms
- Visualizing term counts
- How the visualization works
- Customizing the visualization; text colors
- Scoring terms
- Visualizing term scores
- Using Scattertext to train Gensim word embeddings
- Visualizing projections of word embeddings
- Visualizing how similar words are used across categories
- Dispersion metrics
- Residual Dispersion
- Du's Eta for term scoring
Part 2 of the tutorial is available at Keyness Tutorial Part 2 - Integrating External Lexicons, Feature Sets and Topics
An interactive version is available on nbviewer.
- Visualizing Empath lexicons
- Making use of a topic model's output
- Making use of the Biber Feature Set via MTFE (Le Foll et al 2023)
- Making use of the USAS Feature Set
- Making use of Roget's thesaurus
Part 3 of the tutorial is available at Keyness Workshop Tutorial Part 3 - Reading Doyle over Time and Pages.ipynb
An interactive version is available on nbviewer.
- Segmenting long documents into evenly sized chunks while respecting sentence boundaries (SentenceSequenceSegmenter)
- Offset-based feature identification for non-textual features, such as part-of-speech tag sequences
- Timeline-based visualizations
- One time-step per page in a novel
- Clustering time-steps together
- Looking at the evolution of Doyle's style through part-of-speech tag sequences
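To make the first item in the Part 3 list concrete, here is a rough sketch of the general idea of splitting a long document into roughly equal chunks without cutting a sentence in half. It does not use or reproduce Scattertext's SentenceSequenceSegmenter; the function names, the regex-based sentence splitter, and the `target_words` parameter are all assumptions made for illustration. The notebook relies on spaCy for sentence boundaries, which will be far more reliable than the naive splitter used here.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace.

    This is a stand-in for a real sentence boundary detector.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def segment_by_sentences(text, target_words=50):
    """Greedily pack whole sentences into segments of about target_words.

    A segment is closed when adding the next sentence would push it past
    target_words. A single sentence longer than target_words becomes its
    own segment rather than being split.
    """
    segments = []
    current = []
    current_len = 0
    for sentence in split_sentences(text):
        n_words = len(sentence.split())
        if current and current_len + n_words > target_words:
            segments.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        segments.append(" ".join(current))
    return segments


if __name__ == "__main__":
    sample = ("One two three. Four five six. "
              "Seven eight nine. Ten eleven twelve.")
    for i, segment in enumerate(segment_by_sentences(sample, target_words=6)):
        print(i, segment)
```

Because the packing is greedy, segment lengths are only approximately even; the final segment in particular may be shorter than the rest.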