Scripts to process Word Usage Graphs (WUGs).
If you use this software for academic research, please cite these papers:
- Dominik Schlechtweg, Nina Tahmasebi, Simon Hengchen, Haim Dubossarsky, Barbara McGillivray. 2021. DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.
- Dominik Schlechtweg. 2023. Human and Computational Measurement of Lexical Semantic Change. PhD thesis. University of Stuttgart.
Find WUG data sets on the WUGsite.
Under scripts/ we provide a pipeline for creating and clustering graphs and for extracting data from them (e.g. change scores). Assuming you are working on a UNIX-based system, first make the scripts executable with
chmod 755 scripts/*.sh
Then run one of the following commands for Usage-Usage Graphs (UUGs) and Usage-Sense Graphs (USGs) respectively:
bash -e scripts/run_uug.sh
bash -e scripts/run_usg.sh
For the alternative pipeline with multiple possible clustering algorithms (Correlation Clustering, Chinese Whispers, Louvain method) and custom plotting functionalities, instead run:
bash -e scripts/run_uug2.sh
There are two scripts for external use with the DURel annotation tool, which allow specifying the input directory and other parameters from the command line (find usage examples in test.sh):
bash -e scripts/run_system.sh $dir ...
bash -e scripts/run_system2.sh $dir ...
Attention: the script modifies graphs iteratively, i.e., the current run depends on the previous run; it therefore deletes previously written data to avoid this dependence.
We recommend running the scripts within a virtual environment with Python 3.10. Install the required packages by running pip install -r requirements.txt. Important: The script uses simple test parameters; to improve the clustering, load parameters_opt.sh in run_uug.sh or run_usg.sh.
When installing, please check whether pygraphviz was installed correctly. There have been recurring errors with pygraphviz installation across operating systems. If an error occurs, you can check this page for solutions: https://pygraphviz.github.io/documentation/stable/install.html#providing-path-to-graphviz
On Linux, installing graphviz through the package manager is recommended.
- data2join.py: joins annotated data
- data2annotators.py: extracts mapping from users to (anonymized) annotators
- data2agr.py: computes agreement on full data
- use2graph.py: adds uses to graph
- sense2graph.py: adds senses to graph, for usage-sense graphs
- sense2node.py: adds sense annotation data to nodes, if available
- judgments2graph.py: adds judgments to graph
- graph2cluster.py: clusters graph
- extract_clusters.py: extracts clusters from graph
- graph2stats.py: extracts statistics from graph, including change scores
- graph2plot.py: plots interactive graph in 2D
Please find the parameters for the current optimized WUG versions in parameters_opt.sh. Note that the parameters for the SemEval versions in parameters_semeval.sh will only roughly reproduce the published versions, because of non-deterministic clustering and small changes in the cleaning and clustering procedures.
For annotating and plotting your own graphs we recommend using the DURel Tool.
- misc/usim2data.sh: downloads USim data and converts it to WUG format
- misc/make_release.sh: creates data for publication from pipeline output (compare to the format of published data sets on the WUGsite)
- durel_system/: contains files relevant for the DURel Annotation System
For usage-usage graphs:
- uses: find examples at
test_uug/data/*/uses.csv
- judgments: find examples at
test_uug/data/*/judgments.csv
For usage-sense graphs:
- uses: find examples at
test_usg/data/*/uses.csv
- senses: find examples at
test_usg/data/*/senses.csv
- judgments: find examples at
test_usg/data/*/judgments.csv
Note: The column 'identifier' in each uses.csv should identify each word usage uniquely across all words.
The uses.csv files must contain one use per line with the following fields specified as header and separated by tab:
<lemma>\t<pos>\t<date>\t<grouping>\t<identifier>\t<description>\t<context>\t<indexes_target_token>\t<indexes_target_sentence>\n
The CSV files should include one empty line at the end. You can use this example as a guide (ignore additional columns). The files can contain additional columns with more information such as language, lemmatization, etc.
Find information on the individual fields below:
- lemma: the lemma form of the target word in the respective word use
- pos: the POS tag if available (else put space character)
- date: the date of the use if available (else put space character)
- grouping: any string assigning uses to groups (e.g. time-periods, corpora or dialects)
- identifier: an identifier unique to each use across lemmas. We recommend this format: filename-sentenceno-tokenno
- description: any additional information on the use if available (else put space character)
- context: the text of the use. This will be shown to annotators.
- indexes_target_token: the character indexes of the target token in context (Python list ranges as used in slicing, e.g. 17:25)
- indexes_target_sentence: the character indexes of the target sentence (containing the target token) in context (e.g. 0:30 if context contains only one sentence, or 10:45 if it contains additional surrounding sentences). The part of the context beyond the specified character range will be marked as background in gray.
The judgments.csv files must contain one use pair judgment per line with the following fields specified as header and separated by tab:
<identifier1>\t<identifier2>\t<annotator>\t<judgment>\t<comment>\t<lemma>\n
The CSV files should include one empty line at the end. You can use this example as a guide (ignore additional columns). The files can contain additional columns with more information such as the round of annotation, etc.
Find information on the individual fields below:
- identifier1: identifier of the first use in the use pair (must correspond to identifier in uses.csv)
- identifier2: identifier of the second use in the use pair
- annotator: annotator name
- judgment: annotator judgment on graded scale (e.g. 1 for unrelated, 4 for identical)
- comment: the annotator's comment (if any)
- lemma: the lemma form of the target word in both uses
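As a sketch of the judgments.csv layout, the snippet below writes two judgments on the same use pair and averages them per pair; the identifiers, annotator names, and scores are invented for illustration:

```python
# Sketch of writing and aggregating a minimal judgments.csv (tab-separated).
import csv
from collections import defaultdict

header = ["identifier1", "identifier2", "annotator", "judgment", "comment", "lemma"]
rows = [
    ("corpus1-0001-3", "corpus2-0001-3", "annotator1", "4", " ", "bank"),
    ("corpus1-0001-3", "corpus2-0001-3", "annotator2", "3", " ", "bank"),
]

with open("judgments.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)

# Average the graded judgments per use pair:
scores = defaultdict(list)
with open("judgments.csv", newline="", encoding="utf-8") as f:
    for r in csv.DictReader(f, delimiter="\t"):
        scores[(r["identifier1"], r["identifier2"])].append(float(r["judgment"]))

for pair, js in scores.items():
    print(pair, sum(js) / len(js))  # -> ('corpus1-0001-3', 'corpus2-0001-3') 3.5
```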
Find further research on WUGs in these papers:
- Andrey Kutuzov, Samia Touileb, Petter Mæhlum, Tita Enstad, and Alexandra Wittemann. 2022. NorDiaChange: Diachronic Semantic Change Dataset for Norwegian. In Proceedings of the Thirteenth Language Resources and Evaluation Conference.
- Anna Aksenova, Ekaterina Gavrishina, Elisei Rykov, and Andrey Kutuzov. 2022. RuDSI: Graph-based Word Sense Induction Dataset for Russian. In Proceedings of TextGraphs-16: Graph-based Methods for Natural Language Processing.
- Frank D. Zamora-Reina, Felipe Bravo-Marquez, Dominik Schlechtweg. 2022. LSCDiscovery: A shared task on semantic change discovery and detection in Spanish. In Proceedings of the 3rd International Workshop on Computational Approaches to Historical Language Change.
- Gioia Baldissin, Dominik Schlechtweg, Sabine Schulte im Walde. 2022. DiaWUG: A Dataset for Diatopic Lexical Semantic Variation in Spanish. In Proceedings of the 13th Language Resources and Evaluation Conference.
- Dominik Schlechtweg, Enrique Castaneda, Jonas Kuhn, Sabine Schulte im Walde. 2021. Modeling Sense Structure in Word Usage Graphs with the Weighted Stochastic Block Model. In Proceedings of *SEM 2021: The Tenth Joint Conference on Lexical and Computational Semantics.
- Sinan Kurtyigit, Maike Park, Dominik Schlechtweg, Jonas Kuhn, Sabine Schulte im Walde. 2021. Lexical Semantic Change Discovery. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
- Serge Kotchourko. 2021. Optimizing Human Annotation of Word Usage Graphs in a Realistic Simulation Environment. Bachelor thesis.
- Benjamin Tunc. 2021. Optimierung von Clustering von Wortverwendungsgraphen. Bachelor thesis.
@inproceedings{Schlechtweg2021dwug,
title = {{DWUG}: A large Resource of Diachronic Word Usage Graphs in Four Languages},
author = {Schlechtweg, Dominik and Tahmasebi, Nina and Hengchen, Simon and Dubossarsky, Haim and McGillivray, Barbara},
booktitle = {Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
publisher = {Association for Computational Linguistics},
address = {Online and Punta Cana, Dominican Republic},
pages = {7079--7091},
url = {https://aclanthology.org/2021.emnlp-main.567},
year = {2021}
}
@phdthesis{Schlechtweg2023measurement,
author = "Schlechtweg, Dominik",
title = "Human and Computational Measurement of Lexical Semantic Change",
school = "University of Stuttgart",
address = "Stuttgart, Germany",
url = {http://dx.doi.org/10.18419/opus-12833},
year = 2023
}