This repo contains Python scripts for analyzing the linguistic diversity of text collections, measuring lexical, semantic, and syntactic diversity.
- Install the required dependencies using the provided `requirements.txt`:

  ```shell
  pip install -r requirements.txt
  ```
- Prepare the input data. Input text files should be placed in subdirectories under the `data/` directory. The folder structure should be as follows:

  ```
  data/
  ├── story/outputs/
  ├── dialogue/outputs/
  ├── summary/outputs/
  ├── translation/outputs/
  └── wiki/outputs/
  ```

  - Input files should contain plain text.
  - Each line in an input file corresponds to one sample.
  - Sentences within the same sample are separated by `<newline>` markers.
  - The scripts assume UTF-8 encoding for the text files.
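The input conventions above can be sketched as a small loader. This is a minimal illustration based only on this README; the function name and the exact trimming behavior are assumptions, not code from the repo.

```python
# Sketch of reading an input file in the format described above:
# one sample per line, sentences separated by "<newline>" markers.
def load_samples(path):
    """Return a list of samples; each sample is a list of sentences."""
    samples = []
    with open(path, encoding="utf-8") as f:   # files are assumed UTF-8
        for line in f:                        # each line is one sample
            line = line.strip()
            if not line:
                continue
            # split the sample into sentences on the <newline> marker
            sentences = [s.strip() for s in line.split("<newline>") if s.strip()]
            samples.append(sentences)
    return samples
```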
- `semantic_diversity.py`:
  - Computes sentence embeddings using transformer models and calculates cosine similarity to measure semantic diversity.
  - Results are saved to `sem.txt`.
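The embedding-based measure can be illustrated with a toy sketch. The actual script embeds sentences with a transformer model; here the embeddings are stand-in vectors, and the "1 minus mean pairwise cosine similarity" aggregation is an assumption about how similarity is turned into a diversity score.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def semantic_diversity(embeddings):
    """1 - mean pairwise cosine similarity: higher means more diverse."""
    n = len(embeddings)
    sims = [cosine(embeddings[i], embeddings[j])
            for i in range(n) for j in range(i + 1, n)]
    return 1.0 - sum(sims) / len(sims)
```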
- `syntactic_diversity.py`:
  - Parses sentences to generate syntactic dependency graphs and computes graph kernel similarities to assess syntactic diversity.
  - Results are saved to `syn.txt`.
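The idea behind the syntactic measure can be sketched with a simplified stand-in: represent each parse as a set of dependency edges and compare samples with the Jaccard index. The real script builds dependency graphs with a parser and computes graph-kernel similarities; this edge-set version only illustrates the "compare structures pairwise, then invert" pattern and is not the repo's implementation.

```python
def edge_similarity(edges_a, edges_b):
    """Jaccard overlap between two sets of dependency edges."""
    a, b = set(edges_a), set(edges_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def syntactic_diversity(parses):
    """1 - mean pairwise edge-set similarity: higher means more diverse.

    Each parse is a collection of (head, label, child) edge tuples.
    """
    n = len(parses)
    sims = [edge_similarity(parses[i], parses[j])
            for i in range(n) for j in range(i + 1, n)]
    return 1.0 - sum(sims) / len(sims)
```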
- `lexical_diversity.py`:
  - Tokenizes text and calculates the Type-Token Ratio (TTR) for unigrams, bigrams, and trigrams. The average TTR is reported as lexical diversity.
  - Results are saved to `unique-n.txt`.
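The averaged-TTR computation can be sketched in a few lines. Whitespace tokenization and the helper names are assumptions for illustration; the repo's script may tokenize differently.

```python
def ngram_ttr(tokens, n):
    """Type-Token Ratio over n-grams: unique n-grams / total n-grams."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0

def lexical_diversity(tokens):
    """Average TTR over unigrams, bigrams, and trigrams."""
    return sum(ngram_ttr(tokens, n) for n in (1, 2, 3)) / 3
```

For example, the token sequence `a b a b` has unigram TTR 2/4, bigram TTR 2/3, and trigram TTR 2/2, so its lexical diversity is the mean of those three ratios.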