Code and data for "Superbizarre Is Not Superb: Derivational Morphology Improves BERT's Interpretation of Complex Words"

Superbizarre Is Not Superb

This repository contains the code and data for the ACL 2021 paper Superbizarre Is Not Superb: Derivational Morphology Improves BERT's Interpretation of Complex Words. The paper shows that segmenting the input along derivational morpheme boundaries helps BERT interpret the meaning of complex words, particularly words that did not appear during pretraining.
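To make the idea of a derivational segmentation concrete, here is a minimal toy sketch. It is not the segmentation code from src: the affix lists, the greedy stripping strategy, the minimum stem length, and the function name derivational_segment are all assumptions made for illustration.

```python
# Toy affix inventories (hypothetical; not the inventories used in the paper).
PREFIXES = ("super", "anti", "semi", "mega", "over")
SUFFIXES = ("less", "able", "ful", "ism")

MIN_STEM_LENGTH = 3  # assumed guard so that "superb" is not split into "super" + "b"


def derivational_segment(word, prefixes=PREFIXES, suffixes=SUFFIXES):
    """Split a word into [prefixes..., stem, suffixes...] by greedy affix stripping."""
    found_prefixes = []
    found_suffixes = []
    stem = word

    # Strip prefixes from the left, one at a time, while a long enough stem remains.
    stripped = True
    while stripped:
        stripped = False
        for prefix in prefixes:
            if stem.startswith(prefix) and len(stem) - len(prefix) >= MIN_STEM_LENGTH:
                found_prefixes.append(prefix)
                stem = stem[len(prefix):]
                stripped = True
                break

    # Strip suffixes from the right in the same way, keeping surface order.
    stripped = True
    while stripped:
        stripped = False
        for suffix in suffixes:
            if stem.endswith(suffix) and len(stem) - len(suffix) >= MIN_STEM_LENGTH:
                found_suffixes.insert(0, suffix)
                stem = stem[:-len(suffix)]
                stripped = True
                break

    return found_prefixes + [stem] + found_suffixes


print(derivational_segment("superbizarre"))  # ['super', 'bizarre']
print(derivational_segment("ozoneless"))     # ['ozone', 'less']
print(derivational_segment("superb"))        # ['superb']
```

The point of the sketch is only that the split falls on morpheme boundaries (super + bizarre), so the stem stays intact. Please refer to the paper and to src for the actual procedure.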

Dependencies

The code requires Python>=3.6, numpy>=1.18, torch>=1.2, and transformers>=2.5.
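As a convenience, the minimum versions above can be checked before running anything. The following is a small sketch, not part of the repository; the helper names parse_version and check_dependencies are hypothetical, and it assumes each package exposes a __version__ attribute.

```python
import importlib
import re
import sys

# Minimum versions as stated in this README.
MINIMUMS = {"numpy": (1, 18), "torch": (1, 2), "transformers": (2, 5)}


def parse_version(text):
    """Turn a version string such as '1.18.5' into a tuple of ints, e.g. (1, 18, 5)."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:  # stop at non-numeric pieces such as 'dev0'
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_dependencies(minimums=MINIMUMS):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < (3, 6):
        problems.append("Python >= 3.6 is required")
    for name, minimum in minimums.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            problems.append(f"{name} is not installed")
            continue
        installed = parse_version(getattr(module, "__version__", "0"))
        if installed < minimum:
            wanted = ".".join(str(n) for n in minimum)
            problems.append(f"{name} is older than {wanted}")
    return problems


if __name__ == "__main__":
    for problem in check_dependencies():
        print(problem)
```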

Data

The three datasets used in the experiments can be found in the data directory. Each dataset pairs derivatives with a binary semantic class: sentiment for Amazon and topicality for ArXiv and Reddit. Please refer to the paper for details about the datasets. The labeling of the datasets is as follows:

  • Amazon: 0 = negative (e.g., overpriced, crappy), 1 = positive (e.g., megafavorite, applausive)
  • ArXiv: 0 = physics (e.g., semithermal, ozoneless), 1 = computer science (e.g., autoencoded, rankable)
  • Reddit: 0 = entertainment (e.g., supervampires, spoilerful), 1 = knowledge (e.g., antirussian, immigrationism)
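For analysis scripts, the labeling above can be captured in a small lookup table. This is only a sketch: the dictionary keys and the helper label_name are hypothetical names and do not appear in the repository.

```python
# Label names as listed above; the lowercase dataset keys are an assumption.
LABEL_NAMES = {
    "amazon": {0: "negative", 1: "positive"},
    "arxiv": {0: "physics", 1: "computer science"},
    "reddit": {0: "entertainment", 1: "knowledge"},
}


def label_name(dataset, label):
    """Map a dataset name and a 0/1 label to its human-readable class name."""
    return LABEL_NAMES[dataset.lower()][int(label)]


print(label_name("Amazon", 1))  # positive
print(label_name("ArXiv", 0))   # physics
```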

The datasets are provided as CSV files and as segmentation-specific pickled PyTorch datasets that can be loaded directly for model training. The code for generating the different segmentations is in the src directory.
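The CSV files can be read with the standard library. The column layout is not documented in this README, so the sketch below assumes a header row with columns named derivative and label; both names are hypothetical and should be adapted to the actual files in data. The in-memory sample stands in for a real file and reuses two examples from the label list above.

```python
import csv
import io


def read_examples(handle, word_column="derivative", label_column="label"):
    """Read (word, label) pairs from an open CSV file with a header row.

    The default column names are assumptions, not the documented layout.
    """
    reader = csv.DictReader(handle)
    return [(row[word_column], int(row[label_column])) for row in reader]


# Stand-in for a real file; replace with open(path, newline="") for actual data.
sample = io.StringIO("derivative,label\noverpriced,0\nmegafavorite,1\n")
examples = read_examples(sample)
print(examples)  # [('overpriced', 0), ('megafavorite', 1)]
```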

Usage

To replicate the hyperparameter search for the learning rate, run the script start_hs.sh in src. To train the models using different segmentations, run the script start_main.sh in src.

Citation

If you use the code or data in this repository, please cite the following paper:

@inproceedings{hofmann2021superbizarre,
    title = {Superbizarre Is Not Superb: Derivational Morphology Improves {BERT}{'}s Interpretation of Complex Words},
    author = {Hofmann, Valentin and Pierrehumbert, Janet and Sch{\"u}tze, Hinrich},
    booktitle = {Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics},
    year = {2021}
}