This is the code repo for the ACL 2022 paper: Rare and Zero-shot Word Sense Disambiguation using Z-Reweighting
python 3.7.6
torch 1.7.1
transformers 4.1.1
Before running the experiments, download the SemCor training data and the unified evaluation framework for WSD:
mkdir data
cd data
wget http://lcl.uniroma1.it/wsdeval/data/WSD_Evaluation_Framework.zip
unzip WSD_Evaluation_Framework.zip
First, convert the data from XML format to CSV:
cd ./preprocess
python transform.py
This generates the semcor.csv file for training. Apply the same step to the dev and test datasets (senseval2, senseval3, etc.).
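As a rough illustration of this conversion step, the sketch below parses a tiny in-memory sample in the layout used by the WSD Evaluation Framework (sentence elements containing wf and instance tokens, plus a gold key file) and writes one CSV row per annotated instance. The column names and the row layout are assumptions made for illustration; the actual output of transform.py may differ.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Tiny in-memory stand-ins for the framework's *.data.xml and *.gold.key.txt files.
XML = """<corpus lang="en" source="semcor">
<text id="d000">
<sentence id="d000.s000">
<wf lemma="how" pos="ADV">How</wf>
<instance id="d000.s000.t000" lemma="long" pos="ADJ">long</instance>
<wf lemma="have" pos="VERB">has</wf>
<wf lemma="it" pos="PRON">it</wf>
<instance id="d000.s000.t001" lemma="be" pos="VERB">been</instance>
</sentence>
</text>
</corpus>"""
GOLD = "d000.s000.t000 long%3:00:02::\nd000.s000.t001 be%2:42:03::\n"


def read_gold(lines):
    """Map each instance id to its list of gold sense keys."""
    gold = {}
    for line in lines:
        parts = line.split()
        if parts:
            gold[parts[0]] = parts[1:]
    return gold


def xml_to_rows(root, gold):
    """Yield one row per annotated instance, with its sentence as context."""
    for sentence in root.iter("sentence"):
        context = " ".join(tok.text for tok in sentence)
        for position, tok in enumerate(sentence):
            if tok.tag == "instance":
                inst_id = tok.get("id")
                yield [inst_id, tok.get("lemma"), tok.get("pos"),
                       position, context, ";".join(gold[inst_id])]


rows = list(xml_to_rows(ET.fromstring(XML), read_gold(GOLD.splitlines())))

# Column names are hypothetical, chosen only for this sketch.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "lemma", "pos", "position", "sentence", "sense_keys"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```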
Compute the polysemy distribution and the instance counts for words and senses, then set the K value and calculate the smoothed polysemy distribution:
python poly_power.py
This generates semcor_sense_count.json, semcor_synset_count.txt, and semcor_polysemy_K_{}.npy, where K = 50, 100, 200, 300, 400.
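The counting part of this step can be sketched as follows. The input pairs are made up, and polysemy is taken here as the number of distinct senses seen per lemma in the training data, which is an assumption (the repo may take it from WordNet instead). The K-based smoothing is not reproduced, since its exact form is not described here.

```python
import json
from collections import Counter, defaultdict

# Hypothetical (lemma, sense_key) pairs standing in for rows of semcor.csv.
instances = [
    ("long", "long%3:00:02::"),
    ("long", "long%3:00:02::"),
    ("long", "long%3:00:01::"),
    ("be", "be%2:42:03::"),
    ("be", "be%2:42:06::"),
    ("be", "be%2:42:05::"),
    ("art", "art%1:06:00::"),
]

# Number of training instances per sense and per word.
sense_count = Counter(sense for _, sense in instances)
word_count = Counter(lemma for lemma, _ in instances)

# Polysemy of a word: number of distinct senses observed for it (assumption).
senses_per_word = defaultdict(set)
for lemma, sense in instances:
    senses_per_word[lemma].add(sense)
polysemy = {lemma: len(senses) for lemma, senses in senses_per_word.items()}

# Polysemy distribution: number of training instances at each polysemy value.
polysemy_distribution = Counter()
for lemma, n_instances in word_count.items():
    polysemy_distribution[polysemy[lemma]] += n_instances

print(json.dumps(sense_count, indent=2))
print(dict(polysemy_distribution))
```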
Set lambda and assign weights to the training words in SemCor:
python power_law_fit.py
This first uses the thresholds to group the words by the one-decimal score given by the fitted power-law curve. Then, for each group, gamma adjusts the weight of the training words. The script generates the weight file semcor_synset_weight_{K}_{gamma}.json, where K = 50, 100, 200, 300, 400 and gamma = 1, 2. The weights are used in the Z-reweighting strategy.
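To make the fitting and grouping idea concrete, here is a minimal sketch that fits y = a * x^(-b) by least squares in log-log space, rounds the normalized curve value to one decimal to form groups, and maps each group to a weight using gamma. The data is synthetic, and the mapping from score to weight, (2 - score)^gamma, is a placeholder assumption rather than the formula used in power_law_fit.py.

```python
import math
from collections import defaultdict


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-b) in log-log space; returns (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    slope = sxy / sxx
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope


# Synthetic polysemy distribution that follows 1000 * x**(-1) exactly.
polysemy_values = [1, 2, 3, 5, 10]
instance_counts = [1000.0 / x for x in polysemy_values]

a, b = fit_power_law(polysemy_values, instance_counts)


def one_decimal_score(x):
    """Fitted curve value at x, normalized by its value at x = 1."""
    return round(x ** (-b), 1)


# Group polysemy values by their one-decimal score.
groups = defaultdict(list)
for x in polysemy_values:
    groups[one_decimal_score(x)].append(x)

# Hypothetical weight rule: lower-scoring groups get larger weights,
# and gamma sharpens the difference. Not the repo's actual formula.
gamma = 2
group_weight = {score: (2.0 - score) ** gamma for score in groups}

for score in sorted(groups, reverse=True):
    print(score, groups[score], group_weight[score])
```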
To train with the Z-reweighting strategy, run:
CUDA_VISIBLE_DEVICES=0,1 python biencoder_Z_reweighting.py \
--data-path ./data \
--postprocess-data-path ./preprocess \
--K 300 \
--gamma 2 \
--ckpt bert_base_K300_gamma2_Z_reweighting \
--encoder-name bert-base \
--multigpu
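One plausible way for word-level weights to enter training is to scale each instance's cross-entropy loss by the weight of its target word. The sketch below shows that idea in plain Python. The scores, the weights, and the helper names are hypothetical, and the loss actually used in biencoder_Z_reweighting.py may be computed differently.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of candidate-sense scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_nll(scores, gold_index, weight):
    """Cross-entropy for one instance, scaled by the weight of its target word."""
    return -weight * math.log(softmax(scores)[gold_index])


# Hypothetical batch: (scores over candidate senses, gold sense index, target word).
batch = [
    ([2.0, 0.5], 0, "long"),
    ([0.1, 0.3, 1.2], 2, "be"),
]
# Hypothetical word weights, as would be read from a weight file.
word_weight = {"long": 1.0, "be": 2.25}

losses = [weighted_nll(scores, gold, word_weight[word])
          for scores, gold, word in batch]
batch_loss = sum(losses) / len(losses)
print(batch_loss)
```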
Our checkpoint trained with the Z-reweighting strategy (K=300, gamma=2) can be downloaded at model. To evaluate it, run:
CUDA_VISIBLE_DEVICES=0,1 python biencoder_Z_reweighting.py \
--data-path ./data \
--postprocess-data-path ./preprocess \
--ckpt bert_base_K300_gamma2_Z_reweighting \
--encoder-name bert-base \
--multigpu \
--split ALL \
--eval
The analysis uses the WordNet-based definition of MCS and LCS: the top-1 ranked sense of a word is its MCS, and all other senses are LCS.
cd ./analysis
python analyze_mcs_lcs.py
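The MCS/LCS split described above can be sketched as follows: an instance counts as MCS when its gold sense is the first sense listed for the word, and as LCS otherwise, with accuracy reported per bucket. The sense order and the predictions are made-up stand-ins. In practice the order would come from WordNet, and analyze_mcs_lcs.py may organize this differently.

```python
from collections import defaultdict

# Hypothetical sense order per lemma (first entry = top-1 ranked sense).
sense_order = {
    "long": ["long%3:00:02::", "long%3:00:01::"],
    "be": ["be%2:42:03::", "be%2:42:06::", "be%2:42:05::"],
}

# Hypothetical evaluation results: (lemma, gold sense, predicted sense).
results = [
    ("long", "long%3:00:02::", "long%3:00:02::"),  # MCS, correct
    ("long", "long%3:00:01::", "long%3:00:02::"),  # LCS, wrong
    ("be", "be%2:42:03::", "be%2:42:03::"),        # MCS, correct
    ("be", "be%2:42:05::", "be%2:42:05::"),        # LCS, correct
]


def bucket_of(lemma, gold):
    """Return 'MCS' if the gold sense is the top-1 ranked sense, else 'LCS'."""
    return "MCS" if sense_order[lemma][0] == gold else "LCS"


correct = defaultdict(int)
total = defaultdict(int)
for lemma, gold, predicted in results:
    bucket = bucket_of(lemma, gold)
    total[bucket] += 1
    correct[bucket] += int(gold == predicted)

accuracy = {bucket: correct[bucket] / total[bucket] for bucket in total}
print(accuracy)
```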