This repository contains the code and data for the ACL paper Superbizarre Is Not Superb: Derivational Morphology Improves BERT's Interpretation of Complex Words. The paper shows that a derivational input segmentation helps BERT understand the meaning of complex words, particularly if they did not appear during pretraining.
The code requires Python>=3.6, numpy>=1.18, torch>=1.2, and transformers>=2.5.
The three datasets used in the experiments can be found in data.
The datasets contain derivatives with corresponding semantic classes (sentiment and topicality).
Please refer to the paper for details about the datasets.
The labeling of the datasets is as follows:
- Amazon: 0 = negative (e.g., overpriced, crappy), 1 = positive (e.g., megafavorite, applausive)
- ArXiv: 0 = physics (e.g., semithermal, ozoneless), 1 = computer science (e.g., autoencoded, rankable)
- Reddit: 0 = entertainment (e.g., supervampires, spoilerful), 1 = knowledge (e.g., antirussian, immigrationism)
The datasets are provided as CSV files and as segmentation-specific pickled PyTorch datasets that can be loaded directly for model training.
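As a minimal sketch of working with the CSV files, the snippet below parses rows and decodes the numeric labels using the mapping listed above. The column names `derivative` and `label` are assumptions for illustration, not confirmed by the repository; check the actual CSV headers in data before use.

```python
import csv
import io

# Label meanings per dataset (taken from the list above).
LABEL_NAMES = {
    "amazon": {0: "negative", 1: "positive"},
    "arxiv": {0: "physics", 1: "computer science"},
    "reddit": {0: "entertainment", 1: "knowledge"},
}

def load_rows(csv_text, dataset):
    """Parse CSV rows into (derivative, label name) pairs.

    The column names 'derivative' and 'label' are hypothetical;
    adjust them to match the files in data.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    names = LABEL_NAMES[dataset]
    return [(row["derivative"], names[int(row["label"])]) for row in reader]

# Tiny inline example mimicking the assumed Amazon dataset format.
sample = "derivative,label\noverpriced,0\nmegafavorite,1\n"
print(load_rows(sample, "amazon"))
# [('overpriced', 'negative'), ('megafavorite', 'positive')]
```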
The repository also contains the code for generating the different segmentations in src.
To replicate the hyperparameter search for the learning rate, run the script start_hs.sh in src.
To train the models using different segmentations, run the script start_main.sh in src.
If you use the code or data in this repository, please cite the following paper:
```bibtex
@inproceedings{hofmann2021superbizarre,
    title = {Superbizarre Is Not Superb: Derivational Morphology Improves {BERT}{'}s Interpretation of Complex Words},
    author = {Hofmann, Valentin and Pierrehumbert, Janet and Sch{\"u}tze, Hinrich},
    booktitle = {Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics},
    year = {2021}
}
```