This code measures several kinds of linguistic diversity using similarity-sensitive Hill numbers (SSHNs). Adapted from the study of species diversity in ecology, SSHNs characterize the effective number of species in a population. In NLP, the species are the linguistic units of interest (e.g. words, parse trees, etc.) and the population is a corpus of documents. For example, a "token semantic diversity" of 9 can be read as the corpus containing effectively 9 distinct semantic concepts.
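To make the "effective number" reading concrete, here is a minimal sketch of a similarity-sensitive Hill number as it is usually defined in the ecology literature: with relative abundances `p`, a pairwise similarity matrix `Z`, and an order `q`, the value is `(sum_i p_i * (Zp)_i^(q-1))^(1/(1-q))`, with the `q = 1` case taken as a limit. The function name and the toy inputs are illustrative assumptions; this is not necessarily how `textdiversity` computes its scores internally.

```python
import math


def similarity_sensitive_hill_number(p, Z, q=1.0):
    """Effective number of species for abundances p and similarity matrix Z.

    Hypothetical helper for illustration; not part of textdiversity.
    """
    n = len(p)
    # "Ordinariness" of each unit: its expected similarity to a random draw.
    zp = [sum(Z[i][j] * p[j] for j in range(n)) for i in range(n)]
    if abs(q - 1.0) < 1e-12:
        # q = 1 is defined as the limit of the general formula.
        return math.exp(-sum(pi * math.log(zi) for pi, zi in zip(p, zp) if pi > 0))
    total = sum(pi * zi ** (q - 1.0) for pi, zi in zip(p, zp) if pi > 0)
    return total ** (1.0 / (1.0 - q))


# Three equally frequent units (toy data, not taken from any corpus).
p = [1 / 3, 1 / 3, 1 / 3]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
all_same = [[1.0 for _ in range(3)] for _ in range(3)]

# Fully distinct units count as about 3 effective species.
print(round(similarity_sensitive_hill_number(p, identity), 6))
# Units that are all identical to each other count as about 1.
print(round(similarity_sensitive_hill_number(p, all_same), 6))
```

The more similar the units are to one another, the closer the score falls toward 1, which is why a corpus of paraphrases can have many unique words but low semantic diversity.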
import pandas as pd
from textdiversity import (
    TokenSemantics, DocumentSemantics, AMR,  # semantics
    DependencyParse, ConstituencyParse,      # syntactical
    PartOfSpeechSequence,                    # morphological
    Rhythmic                                 # phonological
)
corpus1 = ['one massive earth', 'an enormous globe', 'the colossal world'] # unique words
corpus2 = ['basic human right', 'you were right', 'make a right'] # lower unigram diversity
metrics = [
    TokenSemantics(), DocumentSemantics(), AMR(),
    DependencyParse(), ConstituencyParse(),
    PartOfSpeechSequence(),
    Rhythmic()
]
results = []
for metric in metrics:
    # Each metric is callable on a list of documents and returns one score.
    results.append({
        "metric": metric.__class__.__name__,
        "corpus1": metric(corpus1),
        "corpus2": metric(corpus2)
    })
df = pd.DataFrame(results)

| | metric | corpus1 | corpus2 |
|---|---|---|---|
| 0 | TokenSemantics | 7.42473 | 7.99136 |
| 1 | DocumentSemantics | 1.18850 | 1.65927 |
| 2 | AMR | 2.76379 | 1.71087 |
| 3 | DependencyParse | 1.00000 | 1.88204 |
| 4 | ConstituencyParse | 1.00001 | 1.88989 |
| 5 | PartOfSpeechSequence | 1.17621 | 1.80000 |
| 6 | Rhythmic | 1.36364 | 1.81230 |
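One way to read these scores is to ask, per metric, which corpus comes out as more diverse. The sketch below does this over a list of dicts shaped like `results`, with the values copied from the table; the helper name `more_diverse` and the keys `corpus1`/`corpus2` are illustrative assumptions.

```python
# Scores copied from the table above, in the same shape as `results`.
results = [
    {"metric": "TokenSemantics", "corpus1": 7.42473, "corpus2": 7.99136},
    {"metric": "DocumentSemantics", "corpus1": 1.18850, "corpus2": 1.65927},
    {"metric": "AMR", "corpus1": 2.76379, "corpus2": 1.71087},
    {"metric": "DependencyParse", "corpus1": 1.00000, "corpus2": 1.88204},
    {"metric": "ConstituencyParse", "corpus1": 1.00001, "corpus2": 1.88989},
    {"metric": "PartOfSpeechSequence", "corpus1": 1.17621, "corpus2": 1.80000},
    {"metric": "Rhythmic", "corpus1": 1.36364, "corpus2": 1.81230},
]


def more_diverse(row):
    """Return the name of the corpus with the higher score (hypothetical helper)."""
    return "corpus1" if row["corpus1"] > row["corpus2"] else "corpus2"


summary = {row["metric"]: more_diverse(row) for row in results}
for metric, winner in summary.items():
    print(f"{metric:22s} {winner}")
```

On these numbers, corpus1 scores higher only on AMR. Its three documents share the same syntactic shape, which is consistent with its syntactic scores sitting near 1.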
pip install textdiversity
Some of the text diversity measures rely on software that must be installed separately. For your particular OS, follow the instructions below:
- Windows
  - Phoneme Diversity
    - Phonemizer
      - Visit: https://github.com/espeak-ng/espeak-ng/releases
      - Install either espeak-ng-X64.msi or espeak-ng-X86.msi (this project used Release 1.51)
      - Add the Phonemizer environment variable ([reference])
        - Navigate: Windows Key > Edit System Environment Variables > Environment Variables
        - Add New Variable:
          - Variable Name: PHONEMIZER_ESPEAK_LIBRARY
          - Variable Value: C:\Program Files\eSpeak NG\libespeak-ng.dll
  - Syntactical Diversity
    - Dependency Parse Tree Visualization (non-core functionality)
      - UNKNOWN SETUP STEPS
- Linux (Debian/Ubuntu, where apt-get is available)
  - Phoneme Diversity
    - Phonemizer: sudo apt-get install espeak-ng
  - Syntactical Diversity
    - Dependency Parse Tree Visualization (non-core functionality): sudo apt-get install python3-dev graphviz libgraphviz-dev pkg-config
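After installing, a quick standard-library check can indicate whether the external tools are visible to Python. This is a rough sketch, not part of `textdiversity`: the function name is hypothetical, and it assumes the installers expose executables named `espeak-ng` and `dot` (Graphviz) on `PATH`, which may not hold on every system. On Windows in particular, the `PHONEMIZER_ESPEAK_LIBRARY` variable is the more relevant signal.

```python
import os
import shutil


def check_external_dependencies():
    """Report which optional external tools appear to be available."""
    espeak_lib = os.environ.get("PHONEMIZER_ESPEAK_LIBRARY")
    return {
        # Assumption: the espeak-ng executable is on PATH after installation.
        "espeak-ng": shutil.which("espeak-ng") is not None,
        # True only if the variable is set and points to an existing file.
        "PHONEMIZER_ESPEAK_LIBRARY": bool(espeak_lib) and os.path.isfile(espeak_lib),
        # Assumption: Graphviz installs its `dot` executable on PATH.
        "graphviz dot": shutil.which("dot") is not None,
    }


status = check_external_dependencies()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'not found'}")
```

A "not found" result does not necessarily mean a measure will fail; it only flags something worth checking before running the phoneme or visualization features.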