WSD-Z-reweighting: A Python repository from suytingwan

This is the code repo for the ACL 2022 paper: Rare and Zero-shot Word Sense Disambiguation using Z-Reweighting

Envs:

python 3.7.6

torch 1.7.1

transformers 4.1.1

Data preparation

Before running the experiments, download the SemCor training data and the unified evaluation framework for WSD:

mkdir data
cd data
wget http://lcl.uniroma1.it/wsdeval/data/WSD_Evaluation_Framework.zip
unzip WSD_Evaluation_Framework.zip

How to prepare weights for Z-reweighting strategy

First, convert the XML data into CSV format:

cd ./preprocess
python transform.py

This generates the semcor.csv file used for training. Apply the same step to the dev and test datasets (senseval2, senseval3, etc.).
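
As a rough illustration of what this conversion involves, the sketch below parses a tiny inline sample in the XML layout used by the WSD Evaluation Framework (`wf` and `instance` elements inside `sentence`) and joins it with a gold key file. It is not the code in transform.py: the column names, the `;` separator for multiple sense keys, and the helper names are all assumptions made for this example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Tiny inline sample in the layout used by the WSD Evaluation Framework.
SAMPLE_XML = """<corpus lang="en" source="sample">
<text id="d000">
<sentence id="d000.s000">
<wf lemma="the" pos="DET">The</wf>
<instance id="d000.s000.t000" lemma="bank" pos="NOUN">bank</instance>
<wf lemma="close" pos="VERB">closed</wf>
</sentence>
</text>
</corpus>"""

# Gold key lines: "<instance id> <sense key> [<sense key> ...]".
SAMPLE_KEYS = "d000.s000.t000 bank%1:14:00::\n"

# Hypothetical column layout; the real semcor.csv may differ.
FIELDS = ["instance_id", "lemma", "pos", "target_index", "sentence", "sense_keys"]


def load_gold_keys(key_text):
    """Map each instance id to its list of gold sense keys."""
    gold = {}
    for line in key_text.splitlines():
        parts = line.split()
        if parts:
            gold[parts[0]] = parts[1:]
    return gold


def xml_to_rows(xml_text, gold):
    """Yield one row per annotated instance in the corpus."""
    root = ET.fromstring(xml_text)
    for sentence in root.iter("sentence"):
        tokens = [tok.text or "" for tok in sentence]
        for index, tok in enumerate(sentence):
            if tok.tag != "instance":
                continue  # plain <wf> tokens carry no sense annotation
            inst_id = tok.get("id")
            yield {
                "instance_id": inst_id,
                "lemma": tok.get("lemma"),
                "pos": tok.get("pos"),
                "target_index": index,
                "sentence": " ".join(tokens),
                "sense_keys": ";".join(gold.get(inst_id, [])),
            }


def rows_to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = list(xml_to_rows(SAMPLE_XML, load_gold_keys(SAMPLE_KEYS)))
print(rows_to_csv(rows))
```

Keeping the sentence text and the target token index in the same row means later steps can work from the CSV alone, without re-reading the XML.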

Sort words by frequency order.

Compute the polysemy distribution and the instance counts for words and senses, then set the K value and calculate the smoothed polysemy distribution:

python poly_power.py

This generates semcor_sense_count.json, semcor_synset_count.txt, and semcor_polysemy_K_{}.npy, where K = 50, 100, 200, 300, 400.
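
The idea behind this step can be sketched as follows: rank words by training frequency, count the distinct senses seen for each word, and smooth the resulting sequence with a window of size K. The exact smoothing used in poly_power.py is not described here, so the version below assumes a plain average over non-overlapping windows of K consecutive ranks. The toy data and function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (lemma, sense) training instances; real data comes from semcor.csv.
instances = [
    ("bank", "bank.n.1"), ("bank", "bank.n.1"), ("bank", "bank.n.2"),
    ("run", "run.v.1"), ("run", "run.v.2"), ("run", "run.v.3"), ("run", "run.v.1"),
    ("cat", "cat.n.1"), ("cat", "cat.n.1"),
    ("zebra", "zebra.n.1"),
]


def polysemy_by_rank(pairs):
    """Return (lemmas sorted by descending frequency, sense count per lemma)."""
    freq = Counter(lemma for lemma, _ in pairs)
    senses = defaultdict(set)
    for lemma, sense in pairs:
        senses[lemma].add(sense)
    # Sort by frequency, breaking ties alphabetically for determinism.
    ranked = sorted(freq, key=lambda lemma: (-freq[lemma], lemma))
    return ranked, [len(senses[lemma]) for lemma in ranked]


def smooth(values, k):
    """Average values over consecutive windows of k ranks (assumed scheme)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [
        sum(values[start:start + k]) / len(values[start:start + k])
        for start in range(0, len(values), k)
    ]


ranked, polysemy = polysemy_by_rank(instances)
print(ranked)                 # ['run', 'bank', 'cat', 'zebra']
print(polysemy)               # [3, 2, 1, 1]
print(smooth(polysemy, k=2))  # [2.5, 1.0]
```

A larger K gives a smoother curve at the cost of resolution across frequency ranks, which is presumably why several K values are generated.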

Use power law function to fit the polysemy distribution

Set lambda and assign weights to the training words in SemCor:

python power_law_fit.py

This first uses the thresholds to group words by the one-decimal score produced by the fitted power-law curve. Gamma is then set to adjust the weight of the training words in each group, and the weight file semcor_synset_weight_{K}_{gamma}.json is generated, where K = 50, 100, 200, 300, 400 and gamma = 1, 2. These weights are used in the Z-reweighting strategy.
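
For intuition, here is a minimal sketch of the fitting and grouping steps, assuming the curve has the form y = a * x^b and that it is fitted by ordinary least squares in log-log space. power_law_fit.py may use a different fitting routine. The gamma-based weight formula is not reproduced here. The input values and function names are hypothetical.

```python
import math
from collections import defaultdict


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    var_x = sum((lx - mean_x) ** 2 for lx in log_x)
    cov_xy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    b = cov_xy / var_x                    # slope of the log-log line
    a = math.exp(mean_y - b * mean_x)     # intercept mapped back from log space
    return a, b


def group_by_score(xs, a, b):
    """Group x values by the fitted score rounded to one decimal place."""
    groups = defaultdict(list)
    for x in xs:
        groups[round(a * x ** b, 1)].append(x)
    return dict(groups)


# Hypothetical smoothed polysemy values at frequency ranks 1..5 (exactly 8 / rank).
ranks = [1, 2, 3, 4, 5]
polysemy = [8.0, 4.0, 8.0 / 3.0, 2.0, 1.6]

a, b = fit_power_law(ranks, polysemy)
print(round(a, 3), round(b, 3))
print(group_by_score(ranks, a, b))
```

Because the toy data follows 8 / rank exactly, the fit recovers a close to 8 and b close to -1. On real data the one-decimal rounding places many neighboring ranks in the same group.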

How to run code

To train with the Z-reweighting strategy, run:

CUDA_VISIBLE_DEVICES=0,1 python biencoder_Z_reweighting.py \
    --data-path ./data \
    --postprocess-data-path ./preprocess \
    --K 300 \
    --gamma 2 \
    --ckpt bert_base_K300_gamma2_Z_reweighting \
    --encoder-name bert-base \
    --multigpu

Our trained checkpoint for K=300, gamma=2 with the Z-reweighting strategy can be downloaded at model. To evaluate it:

CUDA_VISIBLE_DEVICES=0,1 python biencoder_Z_reweighting.py \
    --data-path ./data \
    --postprocess-data-path ./preprocess \
    --ckpt bert_base_K300_gamma2_Z_reweighting \
    --encoder-name bert-base \
    --multigpu \
    --split ALL \
    --eval

MCS/LCS analysis

MCS and LCS are defined using the WordNet sense ranking: the top-ranked sense of a word is its MCS, and all other senses are LCS.

cd ./analysis
python analyze_mcs_lcs.py
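
The MCS/LCS split described above can be sketched as below. This is not the code in analyze_mcs_lcs.py: the sense inventory, the example triples, and the use of plain accuracy as the metric are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical sense inventory: senses listed in WordNet rank order per lemma.
ranked_senses = {
    "bank": ["bank.n.01", "bank.n.02", "bank.n.03"],
    "run": ["run.v.01", "run.v.02"],
}


def sense_bucket(lemma, gold_sense, inventory):
    """Return "MCS" if gold_sense is the lemma's top-ranked sense, else "LCS"."""
    return "MCS" if inventory[lemma][0] == gold_sense else "LCS"


def accuracy_by_bucket(examples, inventory):
    """Compute accuracy separately over MCS and LCS test instances."""
    correct, total = Counter(), Counter()
    for lemma, gold, predicted in examples:
        bucket = sense_bucket(lemma, gold, inventory)
        total[bucket] += 1
        correct[bucket] += int(gold == predicted)
    return {bucket: correct[bucket] / total[bucket] for bucket in total}


# Hypothetical (lemma, gold sense, predicted sense) triples.
examples = [
    ("bank", "bank.n.01", "bank.n.01"),
    ("bank", "bank.n.02", "bank.n.01"),
    ("run", "run.v.02", "run.v.02"),
    ("run", "run.v.01", "run.v.01"),
]
print(accuracy_by_bucket(examples, ranked_senses))  # {'MCS': 1.0, 'LCS': 0.5}
```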