Scripts to process Word Usage Graphs (WUGs).
If you use this software for academic research, please cite these papers:
- Dominik Schlechtweg, Nina Tahmasebi, Simon Hengchen, Haim Dubossarsky, Barbara McGillivray. 2021. DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.
- Dominik Schlechtweg. 2023. Human and Computational Measurement of Lexical Semantic Change. PhD thesis. University of Stuttgart.
Find WUG data sets on the WUGsite.
Under scripts/ we provide a pipeline for creating and clustering graphs and for extracting data from them (e.g. change scores). Assuming you are working on a UNIX-based system, first make the scripts executable with
chmod 755 scripts/*.sh
Then run one of the following commands for Usage-Usage Graphs (UUGs) and Usage-Sense Graphs (USGs) respectively:
bash -e scripts/run_uug.sh
bash -e scripts/run_usg.sh
For the alternative pipeline with multiple possible clustering algorithms (Correlation Clustering, Chinese Whispers, Louvain method) and custom plotting functionalities, instead run:
bash -e scripts/run_uug2.sh
There are two scripts for external use with the DURel annotation tool, which allow specifying the input directory and other parameters from the command line (find usage examples in test.sh):
bash -e scripts/run_system.sh $dir ...
bash -e scripts/run_system2.sh $dir ...
Attention: the script modifies graphs iteratively, i.e., the current run depends on the previous run; it therefore deletes previously written data to avoid this dependence.
We recommend running the scripts within a virtual environment with Python 3.10. Install the required packages by running pip install -r requirements.txt. Important: The script uses simple test parameters; to improve the clustering, load parameters_opt.sh in run_uug.sh or run_usg.sh.
When installing, please check whether pygraphviz was installed correctly. There have been recurring errors with pygraphviz installation across operating systems. If an error occurs, you can check this page for solutions: https://pygraphviz.github.io/documentation/stable/install.html#providing-path-to-graphviz
On Linux, installing graphviz through the package manager is recommended.
- data2join.py: joins annotated data
- data2annotators.py: extracts mapping from users to (anonymized) annotators
- data2agr.py: computes agreement on full data
- use2graph.py: adds uses to graph
- sense2graph.py: adds senses to graph, for usage-sense graphs
- sense2node.py: adds sense annotation data to nodes, if available
- judgments2graph.py: adds judgments to graph
- graph2cluster.py: clusters graph
- extract_clusters.py: extracts clusters from graph
- graph2stats.py: extracts statistics from graph, including change scores
- graph2plot.py: plots interactive graph in 2D
Please find the parameters for the current optimized WUG versions in parameters_opt.sh. Note that the parameters for the SemEval versions in parameters_semeval.sh will only roughly reproduce the published versions, because of non-deterministic clustering and small changes in the cleaning and clustering procedures.
For annotating and plotting your own graphs we recommend using the DURel Tool.
- misc/usim2data.sh: downloads USim data and converts it to WUG format
- misc/make_release.sh: creates data for publication from pipeline output (compare to the format of published data sets on the WUGsite)
- durel_system/: contains files relevant for the DURel Annotation System
For usage-usage graphs:
- uses: find examples at
test_uug/data/*/uses.csv
- judgments: find examples at
test_uug/data/*/judgments.csv
For usage-sense graphs:
- uses: find examples at
test_usg/data/*/uses.csv
- senses: find examples at
test_usg/data/*/senses.csv
- judgments: find examples at
test_usg/data/*/judgments.csv
Note: The column 'identifier' in each uses.csv should identify each word usage uniquely across all words.
The uses.csv files must contain one use per line with the following fields specified as header and separated by tab:
<lemma>\t<pos>\t<date>\t<grouping>\t<identifier>\t<description>\t<context>\t<indexes_target_token>\t<indexes_target_sentence>\n
The CSV files should include one empty line at the end. You can use this example as a guide (ignore additional columns). The files can contain additional columns with more information such as language, lemmatization, etc.
Find information on the individual fields below:
- lemma: the lemma form of the target word in the respective word use
- pos: the POS tag if available (else put space character)
- date: the date of the use if available (else put space character)
- grouping: any string assigning uses to groups (e.g. time-periods, corpora or dialects)
- identifier: an identifier unique to each use across lemmas. We recommend this format: filename-sentenceno-tokenno
- description: any additional information on the use if available (else put space character)
- context: the text of the use. This will be shown to annotators.
- indexes_target_token: the character indexes of the target token in context (Python list ranges as used in slicing, e.g. 17:25)
- indexes_target_sentence: the character indexes of the target sentence (containing the target token) in context (e.g. 0:30 if context contains only one sentence, or 10:45 if it contains additional surrounding sentences). The part of the context beyond the specified character range will be marked as background in gray.
The judgments.csv files must contain one use pair judgment per line with the following fields specified as header and separated by tab:
<identifier1>\t<identifier2>\t<annotator>\t<judgment>\t<comment>\t<lemma>\n
The CSV files should include one empty line at the end. You can use this example as a guide (ignore additional columns). The files can contain additional columns with more information such as the round of annotation, etc.
Find information on the individual fields below:
- identifier1: identifier of the first use in the use pair (must correspond to identifier in uses.csv)
- identifier2: identifier of the second use in the use pair
- annotator: annotator name
- judgment: annotator judgment on graded scale (e.g. 1 for unrelated, 4 for identical)
- comment: the annotator's comment (if any)
- lemma: the lemma form of the target word in both uses
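As a sketch of the judgments.csv layout, the snippet below writes two judgments on the same use pair and averages them per pair; the identifiers, annotator names, and scores are invented for illustration:

```python
# Sketch of writing and aggregating a minimal judgments.csv (tab-separated).
import csv
from collections import defaultdict

header = ["identifier1", "identifier2", "annotator", "judgment", "comment", "lemma"]
rows = [
    ("corpus1-0001-3", "corpus2-0001-3", "annotator1", "4", " ", "bank"),
    ("corpus1-0001-3", "corpus2-0001-3", "annotator2", "3", " ", "bank"),
]

with open("judgments.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)

# Average the graded judgments per use pair:
scores = defaultdict(list)
with open("judgments.csv", newline="", encoding="utf-8") as f:
    for r in csv.DictReader(f, delimiter="\t"):
        scores[(r["identifier1"], r["identifier2"])].append(float(r["judgment"]))

for pair, js in scores.items():
    print(pair, sum(js) / len(js))  # -> ('corpus1-0001-3', 'corpus2-0001-3') 3.5
```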
Find further research on WUGs in these papers:
- Andrey Kutuzov, Samia Touileb, Petter Mæhlum, Tita Enstad, and Alexandra Wittemann. 2022. NorDiaChange: Diachronic Semantic Change Dataset for Norwegian. In Proceedings of the Thirteenth Language Resources and Evaluation Conference.
- Anna Aksenova, Ekaterina Gavrishina, Elisei Rykov, and Andrey Kutuzov. 2022. RuDSI: Graph-based Word Sense Induction Dataset for Russian. In Proceedings of TextGraphs-16: Graph-based Methods for Natural Language Processing.
- Frank D. Zamora-Reina, Felipe Bravo-Marquez, Dominik Schlechtweg. 2022. LSCDiscovery: A shared task on semantic change discovery and detection in Spanish. In Proceedings of the 3rd International Workshop on Computational Approaches to Historical Language Change.
- Gioia Baldissin, Dominik Schlechtweg, Sabine Schulte im Walde. 2022. DiaWUG: A Dataset for Diatopic Lexical Semantic Variation in Spanish. In Proceedings of the 13th Language Resources and Evaluation Conference.
- Dominik Schlechtweg, Enrique Castaneda, Jonas Kuhn, Sabine Schulte im Walde. 2021. Modeling Sense Structure in Word Usage Graphs with the Weighted Stochastic Block Model. In Proceedings of *SEM 2021: The Tenth Joint Conference on Lexical and Computational Semantics.
- Sinan Kurtyigit, Maike Park, Dominik Schlechtweg, Jonas Kuhn, Sabine Schulte im Walde. 2021. Lexical Semantic Change Discovery. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
- Serge Kotchourko. 2021. Optimizing Human Annotation of Word Usage Graphs in a Realistic Simulation Environment. Bachelor thesis.
- Benjamin Tunc. 2021. Optimierung von Clustering von Wortverwendungsgraphen. Bachelor thesis.
@inproceedings{Schlechtweg2021dwug,
title = {{DWUG}: A large Resource of Diachronic Word Usage Graphs in Four Languages},
author = {Schlechtweg, Dominik and Tahmasebi, Nina and Hengchen, Simon and Dubossarsky, Haim and McGillivray, Barbara},
booktitle = {Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
publisher = {Association for Computational Linguistics},
address = {Online and Punta Cana, Dominican Republic},
pages = {7079--7091},
url = {https://aclanthology.org/2021.emnlp-main.567},
year = {2021}
}
@phdthesis{Schlechtweg2023measurement,
author = "Schlechtweg, Dominik",
title = "Human and Computational Measurement of Lexical Semantic Change",
school = "University of Stuttgart",
address = "Stuttgart, Germany",
url = {http://dx.doi.org/10.18419/opus-12833},
year = 2023
}