LSCDetection

General
Usage
Models
Parameter Settings
Evaluation
Important Changes
Error Sources
BibTex

General

Data Sets and Models for Evaluation of Lexical Semantic Change Detection.

If you use this software for academic research, please cite these papers:

Dominik Schlechtweg, Anna Hätty, Marco del Tredici, and Sabine Schulte im Walde. 2019. A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 732-746, Florence, Italy. ACL.
Jens Kaiser, Sinan Kurtyigit, Serge Kotchourko, Dominik Schlechtweg. 2021. Effects of Pre- and Post-Processing on type-based Embeddings in Lexical Semantic Change Detection. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics.

Also make sure you give appropriate credit to the below-mentioned software this repository depends on.

Parts of the code rely on DISSECT, gensim, numpy, scikit-learn, scipy, VecMap.

Usage

The scripts should be run directly from the main directory. If you wish to run them from elsewhere, you may have to change the path passed to sys.path.append('./modules/') in the scripts. All scripts can be run directly from the command line:
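
If you need to run a script from another working directory, one option is to resolve the modules folder relative to the script file instead of the current directory. The following is only a sketch: the helper name add_modules_to_path is hypothetical and not part of this repository, and it assumes the script sits one level below the repository root (as representations/count.py does).

```python
import os
import sys


def add_modules_to_path(script_file, modules_subdir="modules"):
    """Append <repo root>/<modules_subdir> to sys.path.

    Hypothetical helper: assumes the script lives one directory below
    the repository root, e.g. representations/count.py.
    """
    script_dir = os.path.dirname(os.path.abspath(script_file))
    repo_root = os.path.dirname(script_dir)
    modules_path = os.path.join(repo_root, modules_subdir)
    if modules_path not in sys.path:
        sys.path.append(modules_path)
    return modules_path
```

Inside a script this would be called as add_modules_to_path(__file__) in place of the hard-coded relative path.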

python3 representations/count.py <corpDir> <outPath> <windowSize>

e.g.

python3 representations/count.py corpora/test/corpus1/ test_matrix1 1

The usage of each script is shown when running it with the help option -h, e.g.:

python3 representations/count.py -h

We recommend running the scripts within a virtual environment with Python 3.7.4. Install the required packages by running pip install -r requirements.txt. (See also Error Sources.)

Models

A standard model of LSC detection executes three consecutive steps:

learn semantic representations from corpora (representations/)
align representations (alignment/)
measure change (measures/)

As an example, consider a very simple model (CNT+CI+CD) going through these steps:

learn count vectors from each corpus to compare (representations/count.py)
align them by intersecting their columns (alignment/ci_align.py)
measure change with cosine distance (measures/cd.py)

You can apply this model to the testing data using the following commands:

    python3 representations/count.py corpora/test/corpus1/ test_matrix1 1
    python3 representations/count.py corpora/test/corpus2/ test_matrix2 1

    python3 alignment/ci_align.py test_matrix1 test_matrix2 test_matrix1_aligned test_matrix2_aligned

    python3 measures/cd.py -s testsets/test/targets.tsv test_matrix1_aligned test_matrix2_aligned test_results.tsv
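
To make the three steps concrete, here is a toy, pure-Python sketch of what CNT+CI+CD computes conceptually. It is not the code used by the scripts above: the function names are illustrative, vectors are plain dictionaries rather than matrices, and tokenization by whitespace is an assumption.

```python
import math
from collections import Counter, defaultdict


def count_vectors(sentences, window=1):
    """Map each word to a Counter of context words within a symmetric window."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def intersect_columns(vectors1, vectors2):
    """Keep only the context words (columns) that occur in both spaces."""
    cols1 = {c for vec in vectors1.values() for c in vec}
    cols2 = {c for vec in vectors2.values() for c in vec}
    shared = cols1 & cols2

    def restrict(vectors):
        return {w: {c: n for c, n in vec.items() if c in shared}
                for w, vec in vectors.items()}

    return restrict(vectors1), restrict(vectors2)


def cosine_distance(vec1, vec2):
    """Return 1 - cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(n * vec2.get(c, 0) for c, n in vec1.items())
    norm1 = math.sqrt(sum(n * n for n in vec1.values()))
    norm2 = math.sqrt(sum(n * n for n in vec2.values()))
    if norm1 == 0 or norm2 == 0:
        return float("nan")  # undefined for an all-zero vector
    return 1.0 - dot / (norm1 * norm2)


# Toy corpora, one sentence per string.
space1 = count_vectors(["the mouse ate cheese", "the mouse ran"], window=1)
space2 = count_vectors(["a mouse clicked", "the mouse ate"], window=1)
aligned1, aligned2 = intersect_columns(space1, space2)
change = cosine_distance(aligned1["mouse"], aligned2["mouse"])
```

A larger value of change indicates that the target word's contexts differ more between the two corpora.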

Input Format: All the scripts in this repository can handle two types of matrix input formats:

sparse scipy matrices stored in npz format
dense matrices stored in word2vec plain text format

To learn more about how matrices are loaded and stored check out modules/utils_.py.
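
As a rough illustration of the two formats, the sketch below reads a dense matrix from word2vec plain text (a header line with row and column counts, then one word and its values per line) and a sparse matrix from an npz file. It assumes the npz file was written with scipy.sparse.save_npz; how this repository stores row and column labels alongside the matrix is not covered here, so check modules/utils_.py for the actual behavior. The function names are illustrative.

```python
import numpy as np
from scipy.sparse import load_npz


def read_word2vec_text(path):
    """Parse word2vec plain text into (list of words, dense numpy matrix)."""
    with open(path, encoding="utf-8") as f:
        n_rows, n_dims = map(int, f.readline().split())
        words, rows = [], []
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            parts = line.rstrip("\n").split(" ")
            words.append(parts[0])
            rows.append([float(x) for x in parts[1:1 + n_dims]])
    matrix = np.array(rows)
    if matrix.shape != (n_rows, n_dims):
        raise ValueError("header does not match the matrix found in the file")
    return words, matrix


def read_sparse_npz(path):
    """Load a scipy sparse matrix, assuming it was saved with save_npz."""
    return load_npz(path)
```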

The scripts assume a corpus format of one sentence per line in UTF-8 encoded (optionally zipped) text files. You can specify either a file path or a folder. In the latter case the scripts will iterate over all files in the folder.
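
A minimal sketch of reading such a corpus is given below. It is an illustration under stated assumptions, not the repository's reader: compressed files are assumed to be gzip files recognizable by a .gz extension, tokens are assumed to be whitespace-separated, and iter_sentences is a hypothetical name.

```python
import gzip
import os


def iter_sentences(path):
    """Yield token lists from one file, or from every file in a folder."""
    if os.path.isdir(path):
        candidates = sorted(os.path.join(path, name) for name in os.listdir(path))
        files = [p for p in candidates if os.path.isfile(p)]
    else:
        files = [path]
    for file_path in files:
        # Assumption: gzip-compressed files carry a .gz extension.
        opener = gzip.open if file_path.endswith(".gz") else open
        with opener(file_path, "rt", encoding="utf-8") as f:
            for line in f:
                tokens = line.split()
                if tokens:  # skip empty lines
                    yield tokens
```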

Pre-Training

Pre-training can be utilized when working with small target corpora or when additional semantic information, not contained in the target corpus, is desired.

To pre-train SGNS models use representations/sgns.py to create embeddings on the chosen pre-training corpus (saved as a .model file). Afterwards alignment/sgns_vi.py or alignment/sgns_vi_l2normalize.py may be used to refine the pre-trained model on the target corpus. See Alignment for differences between the two scripts.

Semantic Representations

Name	Code	Type	Comment
Count	`representations/count.py`	VSM
PPMI	`representations/ppmi.py`	VSM
SVD	`representations/svd.py`	VSM
RI	`representations/ri.py`	VSM
SGNS	`representations/sgns.py`	VSM
SCAN	repository	TPM	- different corpus input format

Table: VSM=Vector Space Model, TPM=Topic Model

Alignment

Name	Code	Applicability	Comment
CI	`alignment/ci_align.py`	Count, PPMI
SRV	`alignment/srv_align.py`	RI	- consider using more powerful TRIPY
OP	`alignment/map_embeddings.py`	SVD, RI, SGNS	- drawn from VecMap - for OP- and OP+ see `scripts/`
VI	`alignment/sgns_vi.py`	SGNS	- bug fixes 27/12/19 (see script for details)
	`alignment/sgns_vi_l2normalize.py`	SGNS	- additional length normalization between initialization and training, improvements over VI detailed in Kaiser et al. 2021
WI	`alignment/wi.py`	Count, PPMI, SVD, RI, SGNS	- consider using the more advanced Temporal Referencing
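
For intuition about OP, the core of Orthogonal Procrustes alignment can be sketched with numpy as below. This shows only the textbook solution via singular value decomposition; it is not the VecMap implementation in alignment/map_embeddings.py, which includes further steps. The sketch assumes both matrices have the same shape and that row i refers to the same word in both.

```python
import numpy as np


def orthogonal_procrustes(A, B):
    """Return the orthogonal matrix W minimizing ||A @ W - B|| (Frobenius norm).

    A and B are (n_words, n_dims) matrices whose rows are assumed to be
    in the same word order.
    """
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt


# Toy check: B is a rotated copy of A, so the rotation should be recovered.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 5))
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # a random orthogonal matrix
B = A @ Q
W = orthogonal_procrustes(A, B)
A_aligned = A @ W  # A mapped into the space of B
```

After alignment, vectors from the first space can be compared directly with vectors from the second, for example with cosine distance.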

Measures

Name	Code	Applicability	Comment
CD	`measures/cd.py`	Count, PPMI, SVD, RI, SGNS
LND	`measures/lnd.py`	Count, PPMI, SVD, RI, SGNS
JSD	-	SCAN
FD	`measures/freq.py`	from corpus	- log-transform with `measures/trsf.py` - get difference with `measures/diff.py`
TD	`measures/typs.py`	Count	as above
HD	`measures/entropy.py`	Count	as above
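
The quantities behind TD and HD, and the log-transform-then-difference step named in the comments above, can be sketched as follows. The function names are illustrative, and any normalization, the base of the logarithm and the sign convention of the difference are assumptions here; consult the respective scripts for what they actually compute.

```python
import math


def type_count(context_counts):
    """Number of distinct context types, i.e. non-zero entries of a count vector."""
    return sum(1 for c in context_counts if c > 0)


def entropy(context_counts):
    """Shannon entropy (in bits) of a count vector normalized to probabilities."""
    total = sum(context_counts)
    return -sum((c / total) * math.log2(c / total) for c in context_counts if c > 0)


def log_difference(value1, value2):
    """Difference of log-transformed values, second minus first (assumed order)."""
    return math.log(value2) - math.log(value1)


# Toy count vectors for one target word in two corpora.
counts1 = [4, 0, 0, 0]
counts2 = [1, 1, 1, 1]
td = log_difference(type_count(counts1), type_count(counts2))
```

In this toy case the word occurs with one context type in the first corpus and four in the second, so td is positive.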

Post-Processing

Name	Code	Applicability	Comment
SOT	`postprocessing/sot.py`	VSM
MC+PCR	`postprocessing/pcr.py`	VSM
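
As a rough sketch of the second row, reading MC+PCR as mean centering followed by principal component removal (an interpretation of the abbreviation, not a description of postprocessing/pcr.py), the operation could look like this in numpy. The number of removed components k is a free parameter.

```python
import numpy as np


def mean_center(X):
    """Subtract the column-wise mean from every row vector."""
    return X - X.mean(axis=0, keepdims=True)


def remove_top_components(X, k):
    """Mean-center X, then project out its k strongest principal directions."""
    Xc = mean_center(X)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    top = Vt[:k]
    return Xc - Xc @ top.T @ top


rng = np.random.default_rng(1)
X = rng.normal(size=(20, 6)) + 3.0  # toy embedding matrix with a non-zero mean
Y = remove_top_components(X, k=2)
```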

Parameter Settings

Find detailed notes on model performances and optimal parameter settings in the papers cited under General.

Evaluation

The evaluation framework of this repository is based on the comparison of a set of target words across two corpora. Hence, models can be evaluated on a triple (dataset, corpus1, corpus2), where the dataset provides gold values for the change of target words between corpus1 and corpus2.

Datasets

Dataset	Language	Corpus 1	Corpus 2	Download	Comment
DURel	German	DTA18	DTA19	Dataset, Corpora	- version from Schlechtweg et al. (2019) at `testsets/durel/`
SURel	German	SDEWAC	COOK	Dataset, Corpora	- version from Schlechtweg et al. (2019) at `testsets/surel/`
SemCor LSC	English	SEMCOR1	SEMCOR2	Dataset, Corpora
SemEval Eng	English	CCOHA 1810-1860	CCOHA 1960-2010	Dataset, Corpora
SemEval Ger	German	DTA 1800-1899	BZND 1946-1990	Dataset, Corpora
SemEval Lat	Latin	LatinISE -200-0	LatinISE 0-2000	Dataset, Corpora
SemEval Swe	Swedish	Kubhist2 1790-1830	Kubhist2 1895-1903	Dataset, Corpora
RuSemShift1	Russian	RNC 1682-1916	RNC 1918-1990	Dataset, Corpora
RuSemShift2	Russian	RNC 1918-1990	RNC 1991-2016	Dataset, Corpora
RuShiftEval1	Russian	RNC 1682-1916	RNC 1918-1990	Dataset, Corpora
RuShiftEval2	Russian	RNC 1918-1990	RNC 1991-2016	Dataset, Corpora
RuShiftEval3	Russian	RNC 1682-1916	RNC 1991-2016	Dataset, Corpora
DIACR-Ita	Italian	Unità 1945-1970	Unità 1990-2014	Dataset, Corpora

We provide several evaluation pipelines that download the corpora and evaluate the models on (most of) the above-mentioned datasets; see Pipelines.

Metrics

Name	Code	Applicability	Comment
Spearman correlation	`evaluation/spr.py`	DURel, SURel, SemCor LSC, SemEval, Ru	- outputs rho (column 3) and p-value (column 4)
Average Precision	`evaluation/ap.py`	SemCor LSC, SemEval*, DIACR-Ita	- outputs AP (column 3) and random baseline (column 4)
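
The two metrics can be illustrated independently of the scripts. The sketch below uses scipy.stats.spearmanr for the rank correlation and a hand-written average precision for binary gold labels. The function names are illustrative, and treating the random baseline as the proportion of positive targets is an assumption about what that column contains.

```python
from scipy.stats import spearmanr


def average_precision(gold_labels, scores):
    """Average precision of a ranking by descending score, for 0/1 gold labels."""
    ranked = sorted(zip(scores, gold_labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else float("nan")


def random_baseline(gold_labels):
    """Proportion of positive targets (assumed meaning of the baseline)."""
    return sum(gold_labels) / len(gold_labels)


# Graded gold values vs. model predictions: Spearman correlation.
rho, p_value = spearmanr([0.1, 0.4, 0.3, 0.9], [1, 3, 2, 4])

# Binary gold values vs. model predictions: average precision.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
baseline = random_baseline([1, 0, 1, 0])
```

Note that tied scores are ranked in input order here, which may differ from how other implementations break ties.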

Consider uploading your results for DURel as a submission to the shared task Lexical Semantic Change Detection in German, for SemEval* to SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection and for RuShiftEval to RuShiftEval.

Pipelines

Under scripts/ you will find an example of a full evaluation pipeline for the models on two small test corpora. Assuming you are working on a UNIX-based system, first make the scripts executable with

chmod 755 scripts/*.sh

Then run

bash -e scripts/run_test.sh

The script first reads the two gzipped test corpora corpora/test/corpus1/ and corpora/test/corpus2/. Then it produces model predictions for the targets in testsets/test/targets.tsv and writes them under results/. It finally writes the Spearman correlation between each model's predictions and the gold rank (testsets/test/gold.tsv) under the respective folder in results/. Note that the gold values for the test data are meaningless, as they were randomly assigned.

We also provide a script for each dataset that runs all the models on it, including the necessary downloads. For this, run any of

bash -e scripts/run_durel.sh
bash -e scripts/run_surel.sh
bash -e scripts/run_semcor.sh
bash -e scripts/run_semeval*.sh

You may want to change the parameters in scripts/parameters_durel.sh, etc. (e.g. vector dimensionality, iterations), as running the scripts on the full parameter set may take several days and require a large amount of disk space.

Important Changes

September 1, 2019: Python scripts were updated from Python 2 to Python 3.
December 27, 2019: bug fixes in alignment/sgns_vi.py (see script for details)
March 23, 2020: updates in representations/ri.py and alignment/srv_align.py (see scripts for details)

Error Sources

If you are on a Windows system and get error messages like [bash] $'\r': command not found, consider removing the trailing '\r' characters with sed -i 's/\r$//' scripts/*.sh

BibTex

@inproceedings{Schlechtwegetal19,
	title = {{A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains}},
	author = {Dominik Schlechtweg and Anna H\"{a}tty and Marco del Tredici and Sabine {Schulte im Walde}},
    booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
	year = {2019},
	address = {Florence, Italy},
	publisher = {Association for Computational Linguistics},
	pages = {732--746},
    doi = {10.18653/v1/P19-1072}
}

@inproceedings{Kaiser2021effects,
    title = "Effects of Pre- and Post-Processing on type-based Embeddings in Lexical Semantic Change Detection",
    author = "Kaiser, Jens and Kurtyigit, Sinan and Kotchourko, Serge and Schlechtweg, Dominik",
    booktitle = "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics",
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics"
}
