Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

LSCDetection

General

Data Sets and Models for Evaluation of Lexical Semantic Change Detection.

If you use this software for academic research, please cite the papers listed under BibTex.

Also make sure you give appropriate credit to the below-mentioned software this repository depends on.

Parts of the code rely on DISSECT, gensim, numpy, scikit-learn, scipy, VecMap.

Usage

The scripts should be run directly from the main directory. If you wish to run them from elsewhere, you may have to change the path passed to sys.path.append('./modules/') in the scripts. All scripts can be run directly from the command line:

    python3 representations/count.py <corpDir> <outPath> <windowSize>

For example:

    python3 representations/count.py corpora/test/corpus1/ test_matrix1 1

To see the usage of any script, run it with the help option -h, e.g.:

    python3 representations/count.py -h

We recommend running the scripts within a virtual environment with Python 3.7.4. Install the required packages by running pip install -r requirements.txt. (See also error sources.)
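
If you need to launch a script from another working directory, one option is to derive the module path from the script's own location rather than from the current directory. The following is a minimal sketch, assuming the script sits one level below the main directory (as representations/count.py does); the helper name modules_dir is hypothetical and not part of the repository:

```python
import os
import sys


def modules_dir(script_path):
    """Return the absolute path of modules/ for a script one level below the main directory."""
    script_dir = os.path.dirname(os.path.abspath(script_path))
    return os.path.join(os.path.dirname(script_dir), "modules")


# Inside a script you would pass __file__ instead of this literal path.
path = modules_dir("representations/count.py")
if path not in sys.path:
    sys.path.append(path)
```

With this in place of sys.path.append('./modules/'), the import no longer depends on the directory the script is started from.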

Models

A standard model of LSC detection executes three consecutive steps:

  1. learn semantic representations from corpora (representations/)
  2. align representations (alignment/)
  3. measure change (measures/)

As an example, consider a very simple model (CNT+CI+CD) going through these steps:

  1. learn count vectors from each corpus to compare (representations/count.py)
  2. align them by intersecting their columns (alignment/ci_align.py)
  3. measure change with cosine distance (measures/cd.py)

You can apply this model to the testing data using the following commands:

    python3 representations/count.py corpora/test/corpus1/ test_matrix1 1
    python3 representations/count.py corpora/test/corpus2/ test_matrix2 1

    python3 alignment/ci_align.py test_matrix1 test_matrix2 test_matrix1_aligned test_matrix2_aligned

    python3 measures/cd.py -s testsets/test/targets.tsv test_matrix1_aligned test_matrix2_aligned test_results.tsv
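
To make the three steps concrete, here is a small pure-Python sketch of the idea behind CNT+CI+CD. It is an illustration only and not the code in count.py, ci_align.py or cd.py: the function names, the whitespace tokenization, the symmetric window and the toy sentences are all assumptions.

```python
import math
from collections import Counter, defaultdict


def count_vectors(sentences, window_size):
    """Step 1: count co-occurrences within a symmetric window, per target word."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, target in enumerate(tokens):
            start = max(0, i - window_size)
            end = min(len(tokens), i + window_size + 1)
            for j in range(start, end):
                if j != i:
                    vectors[target][tokens[j]] += 1
    return vectors


def intersect_columns(vectors1, vectors2):
    """Step 2: keep only the context words (columns) present in both spaces."""
    cols1 = {c for vec in vectors1.values() for c in vec}
    cols2 = {c for vec in vectors2.values() for c in vec}
    return sorted(cols1 & cols2)


def cosine_distance(vec1, vec2, columns):
    """Step 3: cosine distance between two vectors over the shared columns."""
    a = [vec1.get(c, 0) for c in columns]
    b = [vec2.get(c, 0) for c in columns]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return float("nan") if norm == 0 else 1.0 - dot / norm


# Toy corpora (invented), one sentence per string.
corpus1 = ["the plane flew over the field", "a plane landed on the field"]
corpus2 = ["the plane is a flat surface", "every plane is a surface"]

m1 = count_vectors(corpus1, window_size=1)
m2 = count_vectors(corpus2, window_size=1)
shared = intersect_columns(m1, m2)
change = cosine_distance(m1["plane"], m2["plane"], shared)
```

A larger value of change means the target word's shared-context profile differs more between the two corpora.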

Input Format: All the scripts in this repository can handle two types of matrix input formats:

To learn more about how matrices are loaded and stored, see modules/utils_.py.

The scripts assume a corpus format of one sentence per line in UTF-8 encoded (optionally zipped) text files. You can specify either a file path or a folder. In the latter case the scripts will iterate over all files in the folder.
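
A reader for this corpus format could look roughly like the sketch below. It is not the repository's loader: the function name iter_sentences is hypothetical, compression is assumed to be gzip and is detected from a .gz file extension, and tokens are assumed to be separated by whitespace.

```python
import gzip
import os


def iter_sentences(path):
    """Yield each non-empty line as a list of tokens, from one file or all files in a folder."""
    if os.path.isdir(path):
        candidates = [os.path.join(path, name) for name in sorted(os.listdir(path))]
        files = [f for f in candidates if os.path.isfile(f)]  # sub-folders are skipped
    else:
        files = [path]
    for file_path in files:
        # Assumption: gzipped files end in ".gz"; everything else is plain text.
        opener = gzip.open if file_path.endswith(".gz") else open
        with opener(file_path, "rt", encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:
                    yield tokens
```

Because the function is a generator, a large corpus is streamed line by line instead of being loaded into memory at once.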

Pre-Training

Pre-training can be utilized when working with small target corpora or if additional semantic information, not contained in the target corpus, is desired.

To pre-train SGNS models, use representations/sgns.py to create embeddings on the chosen pre-training corpus (saved as a .model file). Afterwards, alignment/sgns_vi.py or alignment/sgns_vi_l2normalize.py may be used to refine the pre-trained model on the target corpus. See Alignment for differences between the two scripts.

Semantic Representations

| Name | Code | Type | Comment |
|---|---|---|---|
| Count | representations/count.py | VSM | |
| PPMI | representations/ppmi.py | VSM | |
| SVD | representations/svd.py | VSM | |
| RI | representations/ri.py | VSM | |
| SGNS | representations/sgns.py | VSM | |
| SCAN | repository | TPM | different corpus input format |

Table: VSM=Vector Space Model, TPM=Topic Model

Alignment

| Name | Code | Applicability | Comment |
|---|---|---|---|
| CI | alignment/ci_align.py | Count, PPMI | |
| SRV | alignment/srv_align.py | RI | consider using more powerful TRIPY |
| OP | alignment/map_embeddings.py | SVD, RI, SGNS | drawn from VecMap; for OP- and OP+ see scripts/ |
| VI | alignment/sgns_vi.py | SGNS | bug fixes 27/12/19 (see script for details) |
| | alignment/sgns_vi_l2normalize.py | SGNS | additional length normalization between initialization and training; improvements over VI detailed in Kaiser et al. 2021 |
| WI | alignment/wi.py | Count, PPMI, SVD, RI, SGNS | consider using the more advanced Temporal Referencing |

Measures

| Name | Code | Applicability | Comment |
|---|---|---|---|
| CD | measures/cd.py | Count, PPMI, SVD, RI, SGNS | |
| LND | measures/lnd.py | Count, PPMI, SVD, RI, SGNS | |
| JSD | - | SCAN | |
| FD | measures/freq.py | from corpus | log-transform with measures/trsf.py; get difference with measures/diff.py |
| TD | measures/typs.py | Count | as above |
| HD | measures/entropy.py | Count | as above |
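
The three-stage recipe given for FD (frequency, log-transform, difference) can be sketched as follows. This is an illustration under assumptions, not the code in measures/freq.py, measures/trsf.py or measures/diff.py: the function names are hypothetical, and both the normalization by corpus size and the use of an absolute difference are choices made here that the table does not specify.

```python
import math
from collections import Counter


def relative_frequencies(sentences):
    """Relative frequency of every token in a corpus given as a list of sentences."""
    counts = Counter(token for sentence in sentences for token in sentence.split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def frequency_difference(freqs1, freqs2, target):
    """Absolute difference of log-transformed frequencies; target must occur in both corpora."""
    return abs(math.log(freqs2[target]) - math.log(freqs1[target]))


# Toy corpora (invented).
freqs1 = relative_frequencies(["a b a c", "a d"])
freqs2 = relative_frequencies(["a b c d"])
fd = frequency_difference(freqs1, freqs2, "a")
```

A target that is missing from either corpus raises a KeyError in this sketch, since its log frequency is undefined.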

Post-Processing

| Name | Code | Applicability | Comment |
|---|---|---|---|
| SOT | postprocessing/sot.py | VSM | |
| MC+PCR | postprocessing/pcr.py | VSM | |

Parameter Settings

Find detailed notes on model performances and optimal parameter settings in these papers.

Evaluation

The evaluation framework of this repository is based on the comparison of a set of target words across two corpora. Hence, models can be evaluated on a triple (dataset, corpus1, corpus2), where the dataset provides gold values for the change of target words between corpus1 and corpus2.

Datasets

| Dataset | Language | Corpus 1 | Corpus 2 | Download | Comment |
|---|---|---|---|---|---|
| DURel | German | DTA18 | DTA19 | Dataset, Corpora | version from Schlechtweg et al. (2019) at testsets/durel/ |
| SURel | German | SDEWAC | COOK | Dataset, Corpora | version from Schlechtweg et al. (2019) at testsets/surel/ |
| SemCor LSC | English | SEMCOR1 | SEMCOR2 | Dataset, Corpora | |
| SemEval Eng | English | CCOHA 1810-1860 | CCOHA 1960-2010 | Dataset, Corpora | |
| SemEval Ger | German | DTA 1800-1899 | BZND 1946-1990 | Dataset, Corpora | |
| SemEval Lat | Latin | LatinISE -200-0 | LatinISE 0-2000 | Dataset, Corpora | |
| SemEval Swe | Swedish | Kubhist2 1790-1830 | Kubhist2 1895-1903 | Dataset, Corpora | |
| RuSemShift1 | Russian | RNC 1682-1916 | RNC 1918-1990 | Dataset, Corpora | |
| RuSemShift2 | Russian | RNC 1918-1990 | RNC 1991-2016 | Dataset, Corpora | |
| RuShiftEval1 | Russian | RNC 1682-1916 | RNC 1918-1990 | Dataset, Corpora | |
| RuShiftEval2 | Russian | RNC 1918-1990 | RNC 1991-2016 | Dataset, Corpora | |
| RuShiftEval3 | Russian | RNC 1682-1916 | RNC 1991-2016 | Dataset, Corpora | |
| DIACR-Ita | Italian | Unità 1945-1970 | Unità 1990-2014 | Dataset, Corpora | |

We provide several evaluation pipelines, downloading the corpora and evaluating the models on (most of) the above-mentioned datasets, see pipelines.

Metrics

| Name | Code | Applicability | Comment |
|---|---|---|---|
| Spearman correlation | evaluation/spr.py | DURel, SURel, SemCor LSC, SemEval*, Ru* | outputs rho (column 3) and p-value (column 4) |
| Average Precision | evaluation/ap.py | SemCor LSC, SemEval*, DIACR-Ita | outputs AP (column 3) and random baseline (column 4) |
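
For readers unfamiliar with the two metrics, the sketch below computes them in plain Python. It is not the code in evaluation/spr.py or evaluation/ap.py: the function names and toy values are assumptions, the p-value and the random baseline are omitted, and tied scores in the Average Precision ranking are not treated specially, so results may differ from library implementations when ties occur.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(gold, predicted):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rg, rp = average_ranks(gold), average_ranks(predicted)
    mg, mp = sum(rg) / len(rg), sum(rp) / len(rp)
    cov = sum((a - mg) * (b - mp) for a, b in zip(rg, rp))
    var_g = sum((a - mg) ** 2 for a in rg)
    var_p = sum((b - mp) ** 2 for b in rp)
    return cov / math.sqrt(var_g * var_p)  # undefined if either input is constant


def average_precision(gold_labels, scores):
    """Mean of precision@k over the ranks k at which a changed word (label 1) appears."""
    ranked = sorted(zip(scores, gold_labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else float("nan")


# Toy gold values and model predictions for four target words (invented).
gold_graded = [0.1, 0.4, 0.35, 0.8]
gold_binary = [0, 1, 0, 1]
predictions = [0.2, 0.5, 0.6, 0.9]

rho = spearman_rho(gold_graded, predictions)
ap = average_precision(gold_binary, predictions)
```

Spearman correlation needs graded gold values, while Average Precision needs binary change labels, which is why the two metrics apply to different datasets in the table.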

Consider uploading your results for DURel as a submission to the shared task Lexical Semantic Change Detection in German, for SemEval* to SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection and for RuShiftEval to RuShiftEval.

Pipelines

Under scripts/ you will find an example of a full evaluation pipeline for the models on two small test corpora. Assuming you are working on a UNIX-based system, first make the scripts executable with

    chmod 755 scripts/*.sh

Then run

    bash -e scripts/run_test.sh

The script first reads the two gzipped test corpora corpora/test/corpus1/ and corpora/test/corpus2/. Then it produces model predictions for the targets in testsets/test/targets.tsv and writes them under results/. It finally writes the Spearman correlation between each model's predictions and the gold rank (testsets/test/gold.tsv) under the respective folder in results/. Note that the gold values for the test data are meaningless, as they were randomly assigned.

We also provide a script for each dataset that runs all the models on it, including the necessary downloads. To do so, run one of

    bash -e scripts/run_durel.sh
    bash -e scripts/run_surel.sh
    bash -e scripts/run_semcor.sh
    bash -e scripts/run_semeval*.sh

You may want to change the parameters in scripts/parameters_durel.sh, etc. (e.g. vector dimensionality, iterations), as running the scripts on the full parameter set may take several days and require a large amount of disk space.

Important Changes

  • September 1, 2019: Python scripts were updated from Python 2 to Python 3.
  • December 27, 2019: bug fixes in alignment/sgns_vi.py (see script for details)
  • March 23, 2020: updates in representations/ri.py and alignment/srv_align.py (see scripts for details)

Error Sources

  • If you are on a Windows system and get error messages like [bash] $'\r': command not found, consider removing the trailing '\r' characters with sed -i 's/\r$//' scripts/*.sh

BibTex

@inproceedings{Schlechtwegetal19,
    title = {{A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains}},
    author = {Dominik Schlechtweg and Anna H\"{a}tty and Marco del Tredici and Sabine {Schulte im Walde}},
    booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
    year = {2019},
    address = {Florence, Italy},
    publisher = {Association for Computational Linguistics},
    pages = {732--746},
    doi = {10.18653/v1/P19-1072}
}
@inproceedings{Kaiser2021effects,
    title = {Effects of Pre- and Post-Processing on type-based Embeddings in Lexical Semantic Change Detection},
    author = {Kaiser, Jens and Kurtyigit, Sinan and Kotchourko, Serge and Schlechtweg, Dominik},
    booktitle = {Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics},
    year = {2021},
    address = {Online},
    publisher = {Association for Computational Linguistics}
}