Data Sets and Models for Evaluation of Lexical Semantic Change Detection.
If you use this software for academic research, please cite these papers:
- Dominik Schlechtweg, Anna Hätty, Marco del Tredici, and Sabine Schulte im Walde. 2019. A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 732-746, Florence, Italy. ACL.
- Jens Kaiser, Sinan Kurtyigit, Serge Kotchourko, and Dominik Schlechtweg. 2021. Effects of Pre- and Post-Processing on type-based Embeddings in Lexical Semantic Change Detection. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics.
Also make sure you give appropriate credit to the below-mentioned software this repository depends on.
Parts of the code rely on DISSECT, gensim, numpy, scikit-learn, scipy, VecMap.
The scripts should be run directly from the main directory. If you wish to do otherwise, you may have to change the path passed to sys.path.append('./modules/') in the scripts. All scripts can be run directly from the command line:
python3 representations/count.py <corpDir> <outPath> <windowSize>
e.g.
python3 representations/count.py corpora/test/corpus1/ test_matrix1 1
To see the usage of a script, run it with the help option -h, e.g.:
python3 representations/count.py -h
We recommend running the scripts within a virtual environment with Python 3.7.4. Install the required packages by running pip install -r requirements.txt. (See also error sources.)
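If you prefer to run the scripts from another directory, one option is to resolve the modules folder relative to the script file instead of the working directory. The following is only a sketch of that idea, not how the scripts are currently written; the helper name modules_dir is hypothetical, and it assumes the scripts live one level below the main directory (as representations/count.py does).

```python
import os
import sys


def modules_dir(script_path, relative="../modules"):
    """Return the absolute path of the modules folder, relative to a script's location."""
    script_dir = os.path.dirname(os.path.abspath(script_path))
    return os.path.normpath(os.path.join(script_dir, relative))


def add_modules_to_path(script_path):
    """Append the resolved modules folder to sys.path (hypothetical helper)."""
    path = modules_dir(script_path)
    if path not in sys.path:
        sys.path.append(path)
    return path
```

Inside a script, add_modules_to_path(__file__) would then take the place of sys.path.append('./modules/').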
A standard model of LSC detection executes three consecutive steps:
- learn semantic representations from corpora (representations/)
- align representations (alignment/)
- measure change (measures/)
As an example, consider a very simple model (CNT+CI+CD) going through these steps:
- learn count vectors from each corpus to compare (representations/count.py)
- align them by intersecting their columns (alignment/ci_align.py)
- measure change with cosine distance (measures/cd.py)
You can apply this model to the testing data using the following commands:
python3 representations/count.py corpora/test/corpus1/ test_matrix1 1
python3 representations/count.py corpora/test/corpus2/ test_matrix2 1
python3 alignment/ci_align.py test_matrix1 test_matrix2 test_matrix1_aligned test_matrix2_aligned
python3 measures/cd.py -s testsets/test/targets.tsv test_matrix1_aligned test_matrix2_aligned test_results.tsv
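To make the three steps of CNT+CI+CD concrete, here is a minimal, self-contained sketch on toy data. It does not use the repository's scripts or matrix formats; the function names, the symmetric context window, and the dict-based sparse vectors are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def count_vectors(sentences, window_size=1):
    """Map each word to a Counter of context words within a symmetric window."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window_size)
            hi = min(len(tokens), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def intersect_columns(vectors1, vectors2):
    """Keep only the context words (columns) that occur in both spaces."""
    columns1 = {c for vec in vectors1.values() for c in vec}
    columns2 = {c for vec in vectors2.values() for c in vec}
    shared = columns1 & columns2

    def restrict(vectors):
        return {w: {c: n for c, n in vec.items() if c in shared}
                for w, vec in vectors.items()}

    return restrict(vectors1), restrict(vectors2)


def cosine_distance(vec1, vec2):
    """1 - cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(n * vec2.get(c, 0) for c, n in vec1.items())
    norm1 = math.sqrt(sum(n * n for n in vec1.values()))
    norm2 = math.sqrt(sum(n * n for n in vec2.values()))
    if norm1 == 0 or norm2 == 0:
        return float("nan")  # undefined for an all-zero vector
    return 1.0 - dot / (norm1 * norm2)


# Toy corpora, one sentence per string.
corpus1 = ["the mouse ate the cheese", "a mouse ate cheese"]
corpus2 = ["the mouse clicked the icon", "a mouse ate cheese"]

aligned1, aligned2 = intersect_columns(count_vectors(corpus1), count_vectors(corpus2))
for target in ["mouse", "cheese"]:
    print(target, round(cosine_distance(aligned1[target], aligned2[target]), 3))
```

A larger distance for a target suggests that its contexts differ more between the two corpora.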
Input Format: All the scripts in this repository can handle two types of matrix input formats:
- sparse scipy matrices stored in npz format
- dense matrices stored in word2vec plain text format
To learn more about how matrices are loaded and stored, check out modules/utils_.py.
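For orientation, the dense word2vec plain text format consists of a header line with the number of rows and dimensions, followed by one line per word with its vector values. The sketch below parses that layout with the standard library only; it is not the loader in modules/utils_.py, and the function name read_word2vec_text is hypothetical.

```python
def read_word2vec_text(lines):
    """Parse word2vec plain text: a 'rows dims' header, then 'word v1 ... vd' per line."""
    lines = iter(lines)
    n_rows, n_dims = (int(x) for x in next(lines).split())
    vectors = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != n_dims:
            raise ValueError("unexpected vector length for %r" % word)
        vectors[word] = values
    if len(vectors) != n_rows:
        raise ValueError("header announces %d rows, found %d" % (n_rows, len(vectors)))
    return vectors


example = ["2 3", "walk 0.1 0.2 0.3", "run 1 0 -1"]
print(read_word2vec_text(example))
```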
The scripts assume a corpus format of one sentence per line in UTF-8 encoded (optionally zipped) text files. You can specify either a file path or a folder. In the latter case the scripts will iterate over all files in the folder.
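The corpus format described above can be read with a small generator like the following. This is a sketch rather than the repository's own reader: the name iter_sentences is hypothetical, and it assumes that compressed files are gzip files recognizable by a .gz suffix and that tokens are separated by whitespace.

```python
import gzip
import os


def iter_sentences(path):
    """Yield one sentence (a list of tokens) per line from a file or a folder of files."""
    if os.path.isdir(path):
        files = [os.path.join(path, name) for name in sorted(os.listdir(path))]
        files = [f for f in files if os.path.isfile(f)]
    else:
        files = [path]
    for file_path in files:
        # Assumption: gzip-compressed files end in .gz; everything else is plain text.
        opener = gzip.open if file_path.endswith(".gz") else open
        with opener(file_path, "rt", encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip empty lines
                    yield tokens
```

Because it is a generator, the corpus is streamed line by line and never has to fit into memory as a whole.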
Pre-training can be utilized when working with small target corpora or when additional semantic information, not contained in the target corpus, is desired.
To pre-train SGNS models, use representations/sgns.py to create embeddings on the chosen pre-training corpus (saved as a .model file). Afterwards, alignment/sgns_vi.py or alignment/sgns_vi_l2normalize.py may be used to refine the pre-trained model on the target corpus. See Alignment for differences between the two scripts.
Name | Code | Type | Comment |
---|---|---|---|
Count | representations/count.py | VSM | |
PPMI | representations/ppmi.py | VSM | |
SVD | representations/svd.py | VSM | |
RI | representations/ri.py | VSM | |
SGNS | representations/sgns.py | VSM | |
SCAN | repository | TPM | - different corpus input format |
Table: VSM=Vector Space Model, TPM=Topic Model
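As an illustration of one of the representations listed above, PPMI (positive pointwise mutual information) reweights raw co-occurrence counts by how much more often a word and a context co-occur than expected by chance, and clips negative values to zero. The sketch below is plain PPMI on a dict of counts; it is not representations/ppmi.py and omits any smoothing or shifting options an actual implementation may offer. The function name ppmi and the data layout are assumptions.

```python
import math


def ppmi(counts):
    """Turn co-occurrence counts {word: {context: count}} into PPMI weights."""
    total = sum(n for row in counts.values() for n in row.values())
    word_totals = {w: sum(row.values()) for w, row in counts.items()}
    context_totals = {}
    for row in counts.values():
        for c, n in row.items():
            context_totals[c] = context_totals.get(c, 0) + n
    weighted = {}
    for w, row in counts.items():
        weighted[w] = {}
        for c, n in row.items():
            if n <= 0:
                continue
            # PMI = log( P(w, c) / (P(w) * P(c)) )
            pmi = math.log(n * total / (word_totals[w] * context_totals[c]))
            if pmi > 0:  # keep only positive values; the rest stay implicit zeros
                weighted[w][c] = pmi
    return weighted


toy_counts = {"cat": {"purr": 3, "the": 1}, "dog": {"bark": 3, "the": 1}}
print(ppmi(toy_counts))
```

In the toy example, the uninformative context "the" receives no weight, while "purr" and "bark" keep a positive one.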
Name | Code | Applicability | Comment |
---|---|---|---|
CI | alignment/ci_align.py | Count, PPMI | |
SRV | alignment/srv_align.py | RI | - consider using more powerful TRIPY |
OP | alignment/map_embeddings.py | SVD, RI, SGNS | - drawn from VecMap - for OP- and OP+ see scripts/ |
VI | alignment/sgns_vi.py | SGNS | - bug fixes 27/12/19 (see script for details) |
| | alignment/sgns_vi_l2normalize.py | SGNS | - additional length normalization between initialization and training, improvements over VI detailed in Kaiser et al. 2021 |
WI | alignment/wi.py | Count, PPMI, SVD, RI, SGNS | - consider using the more advanced Temporal Referencing |
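The idea behind WI can be sketched as follows, assuming it follows the word injection scheme described in Schlechtweg et al. (2019): occurrences of the target words in one corpus are replaced by a marked variant, the two corpora are mixed, and a single space is learned on the mixture, so that the marked and unmarked forms can be compared directly. This is not alignment/wi.py; the function name, the "_" suffix, and the shuffling are assumptions.

```python
import random


def inject_words(corpus1, corpus2, targets, suffix="_", seed=0):
    """Mark target words in corpus1, then mix both corpora into one list of sentences."""
    targets = set(targets)
    tagged = [[tok + suffix if tok in targets else tok for tok in sent]
              for sent in corpus1]
    mixed = tagged + [list(sent) for sent in corpus2]
    random.Random(seed).shuffle(mixed)  # fixed seed keeps the sketch reproducible
    return mixed


sentences1 = [["the", "mouse", "ran"]]
sentences2 = [["click", "the", "mouse"]]
print(inject_words(sentences1, sentences2, ["mouse"]))
```

After learning vectors on the mixed corpus, the change of "mouse" would be measured between the vectors of "mouse_" and "mouse".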
Name | Code | Applicability | Comment |
---|---|---|---|
CD | measures/cd.py | Count, PPMI, SVD, RI, SGNS | |
LND | measures/lnd.py | Count, PPMI, SVD, RI, SGNS | |
JSD | - | SCAN | |
FD | measures/freq.py | from corpus | - log-transform with measures/trsf.py - get difference with measures/diff.py |
TD | measures/typs.py | Count | as above |
HD | measures/entropy.py | Count | as above |
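The FD row above chains three steps: get frequencies from each corpus, log-transform them, and take the difference. A minimal sketch of that chain is shown below. It does not call measures/freq.py, measures/trsf.py or measures/diff.py; normalizing by corpus size, using the natural logarithm, and subtracting corpus 1 from corpus 2 are assumptions, as are the function names.

```python
import math
from collections import Counter


def log_frequencies(sentences, targets):
    """Log-transformed relative frequency of each target that occurs in the corpus."""
    counts = Counter(tok for sent in sentences for tok in sent)
    total = sum(counts.values())
    return {t: math.log(counts[t] / total) for t in targets if counts[t] > 0}


def frequency_difference(sentences1, sentences2, targets):
    """Difference of log frequencies (corpus 2 minus corpus 1) per target."""
    logf1 = log_frequencies(sentences1, targets)
    logf2 = log_frequencies(sentences2, targets)
    # Targets missing from either corpus are skipped, since log(0) is undefined.
    return {t: logf2[t] - logf1[t] for t in targets if t in logf1 and t in logf2}


older = [["a", "b", "a", "c"]]
newer = [["a", "b", "b", "b"]]
print(frequency_difference(older, newer, ["a", "b", "c"]))
```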
Name | Code | Applicability | Comment |
---|---|---|---|
SOT | postprocessing/sot.py | VSM | |
MC+PCR | postprocessing/pcr.py | VSM | |
Find detailed notes on model performances and optimal parameter settings in these papers.
The evaluation framework of this repository is based on the comparison of a set of target words across two corpora. Hence, models can be evaluated on a triple (dataset, corpus1, corpus2), where the dataset provides gold values for the change of target words between corpus1 and corpus2.
Dataset | Language | Corpus 1 | Corpus 2 | Download | Comment |
---|---|---|---|---|---|
DURel | German | DTA18 | DTA19 | Dataset, Corpora | - version from Schlechtweg et al. (2019) at testsets/durel/ |
SURel | German | SDEWAC | COOK | Dataset, Corpora | - version from Schlechtweg et al. (2019) at testsets/surel/ |
SemCor LSC | English | SEMCOR1 | SEMCOR2 | Dataset, Corpora | |
SemEval Eng | English | CCOHA 1810-1860 | CCOHA 1960-2010 | Dataset, Corpora | |
SemEval Ger | German | DTA 1800-1899 | BZND 1946-1990 | Dataset, Corpora | |
SemEval Lat | Latin | LatinISE -200-0 | LatinISE 0-2000 | Dataset, Corpora | |
SemEval Swe | Swedish | Kubhist2 1790-1830 | Kubhist2 1895-1903 | Dataset, Corpora | |
RuSemShift1 | Russian | RNC 1682-1916 | RNC 1918-1990 | Dataset, Corpora | |
RuSemShift2 | Russian | RNC 1918-1990 | RNC 1991-2016 | Dataset, Corpora | |
RuShiftEval1 | Russian | RNC 1682-1916 | RNC 1918-1990 | Dataset, Corpora | |
RuShiftEval2 | Russian | RNC 1918-1990 | RNC 1991-2016 | Dataset, Corpora | |
RuShiftEval3 | Russian | RNC 1682-1916 | RNC 1991-2016 | Dataset, Corpora | |
DIACR-Ita | Italian | Unità 1945-1970 | Unità 1990-2014 | Dataset, Corpora | |
We provide several evaluation pipelines that download the corpora and evaluate the models on (most of) the above-mentioned datasets; see pipelines.
Name | Code | Applicability | Comment |
---|---|---|---|
Spearman correlation | evaluation/spr.py | DURel, SURel, SemCor LSC, SemEval*, Ru* | - outputs rho (column 3) and p-value (column 4) |
Average Precision | evaluation/ap.py | SemCor LSC, SemEval*, DIACR-Ita | - outputs AP (column 3) and random baseline (column 4) |
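To illustrate what the two evaluation measures above compute, here is a standard-library sketch of Spearman's rho (for graded gold values) and Average Precision (for binary gold labels). It is not evaluation/spr.py or evaluation/ap.py: it computes neither the p-value nor the random baseline, it does not read or write the tab-separated files, and all function names are hypothetical.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(gold, predicted):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rg, rp = ranks(gold), ranks(predicted)
    n = len(rg)
    mg, mp = sum(rg) / n, sum(rp) / n
    cov = sum((a - mg) * (b - mp) for a, b in zip(rg, rp))
    var_g = sum((a - mg) ** 2 for a in rg)
    var_p = sum((b - mp) ** 2 for b in rp)
    if var_g == 0 or var_p == 0:
        return float("nan")  # undefined if one ranking is constant
    return cov / (var_g * var_p) ** 0.5


def average_precision(gold_labels, predicted):
    """AP of the ranking by descending predicted score against binary gold labels."""
    order = sorted(range(len(predicted)), key=lambda i: -predicted[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if gold_labels[i]:
            hits += 1
            precisions.append(hits / rank)  # precision at each relevant item
    return sum(precisions) / len(precisions) if precisions else 0.0


print(spearman([1, 2, 3, 4], [0.1, 0.4, 0.3, 0.9]))
print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```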
Consider uploading your results for DURel as a submission to the shared task Lexical Semantic Change Detection in German, for SemEval* to SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection and for RuShiftEval to RuShiftEval.
Under scripts/ you will find an example of a full evaluation pipeline for the models on two small test corpora. Assuming you are working on a UNIX-based system, first make the scripts executable with
chmod 755 scripts/*.sh
Then run
bash -e scripts/run_test.sh
The script first reads the two gzipped test corpora corpora/test/corpus1/ and corpora/test/corpus2/. Then it produces model predictions for the targets in testsets/test/targets.tsv and writes them under results/. It finally writes the Spearman correlation between each model's predictions and the gold rank (testsets/test/gold.tsv) under the respective folder in results/. Note that the gold values for the test data are meaningless, as they were randomly assigned.
We also provide a script for each dataset that runs all the models on it, including the necessary downloads. To use them, run any of
bash -e scripts/run_durel.sh
bash -e scripts/run_surel.sh
bash -e scripts/run_semcor.sh
bash -e scripts/run_semeval*.sh
You may want to change the parameters in scripts/parameters_durel.sh, etc. (e.g. vector dimensionality, iterations), as running the scripts on the full parameter set may take several days and require a large amount of disk space.
- September 1, 2019: Python scripts were updated from Python 2 to Python 3.
- December 27, 2019: bug fixes in alignment/sgns_vi.py (see script for details)
- March 23, 2020: updates in representations/ri.py and alignment/srv_align.py (see scripts for details)
- If you are on a Windows system and get error messages like [bash] $'\r': command not found, consider removing trailing '\r' characters with sed -i 's/\r$//' scripts/*.sh
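If sed is not available, the same clean-up can be approximated in Python. The sketch below rewrites files in place, replacing Windows line endings with UNIX ones; the function name strip_carriage_returns is hypothetical, and unlike the sed command it only touches carriage returns that are directly followed by a newline.

```python
import glob


def strip_carriage_returns(pattern):
    """Rewrite every file matching the glob pattern with CRLF line endings turned into LF."""
    changed = []
    for path in glob.glob(pattern):
        with open(path, "rb") as handle:
            data = handle.read()
        fixed = data.replace(b"\r\n", b"\n")
        if fixed != data:  # only rewrite files that actually contain CRLF
            with open(path, "wb") as handle:
                handle.write(fixed)
            changed.append(path)
    return changed
```

Calling strip_carriage_returns("scripts/*.sh") from the main directory would cover the same files as the sed command above.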
@inproceedings{Schlechtwegetal19,
title = {{A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains}},
author = {Dominik Schlechtweg and Anna H\"{a}tty and Marco del Tredici and Sabine {Schulte im Walde}},
booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
year = {2019},
address = {Florence, Italy},
publisher = {Association for Computational Linguistics},
pages = {732--746},
doi = {10.18653/v1/P19-1072}
}
@inproceedings{Kaiser2021effects,
title = "Effects of Pre- and Post-Processing on type-based Embeddings in Lexical Semantic Change Detection",
author = "Kaiser, Jens and Kurtyigit, Sinan and Kotchourko, Serge and Schlechtweg, Dominik",
booktitle = "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics",
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics"
}