/Irma_Tuinenga-Words_Made_Easy

Words Made Easy: A Comparative Study of Methods for English Lexical Simplification.

The code in this repository builds various systems for English Lexical Simplification, based on Masked Language Model (MLM) technology combined with additional methods. It focuses on the consecutive stages of generating, selecting, and ranking substitutes for given complex words, adhering to the requirements for the TSAR-2022 Shared Task on Multilingual Lexical Simplification.

The repository contains the following folders and files:

FOLDERS:

data:

This folder contains 2 subfolders: trial and test. The trial subfolder is used for development of the models; the test subfolder is used for evaluation. Both subfolders contain 4 files:

  • 2 files without annotations: the original file (tsar2022_en_XXXX_none.tsv) and the cleaned file (tsar2022_en_XXXX_none_no_noise.tsv).
  • 2 files including annotations (the gold files): the original file (tsar2022_en_XXXX_gold.tsv) and the cleaned file (tsar2022_en_XXXX_gold_no_noise.tsv).

XXXX = 'trial' or 'test' depending on the folder.
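
The naming scheme above can be sketched in a few lines of Python. This is only an illustration: the helper name `data_files` and the default root folder `data` are assumptions and do not come from the repository code.

```python
from pathlib import Path

# The two splits and the four file-name suffixes described above.
SPLITS = ("trial", "test")
SUFFIXES = ("none", "none_no_noise", "gold", "gold_no_noise")


def data_files(split, root="data"):
    """Return the four expected .tsv paths for one split (hypothetical helper)."""
    if split not in SPLITS:
        raise ValueError(f"split must be one of {SPLITS}, got {split!r}")
    # XXXX in the file name is replaced by the split name.
    return [
        Path(root) / split / f"tsar2022_en_{split}_{suffix}.tsv"
        for suffix in SUFFIXES
    ]


if __name__ == "__main__":
    for path in data_files("trial"):
        print(path.as_posix())
```

Calling `data_files("test")` yields the corresponding four paths under data/test.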

For more information about how the cleaned files were created, see 'datafiles_preprocessing.ipynb' under 'FILES' in this README.


predictions:

This folder contains 2 subfolders: trial and test. The trial subfolder contains tsv files with predictions (formatted according to the requirements of the TSAR-2022 Shared Task) for the trial file; the test subfolder contains the prediction files for the test file. The file names reflect the model characteristics. The predictions are generated by the evaluation files listed under 'FILES' in this README.
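
As a rough illustration of what one row in such a prediction file might look like, the sketch below assumes a tab-separated layout of sentence, complex word, then the ranked substitutes. That layout is an assumption here and should be checked against the shared task requirements; the function name `format_prediction_row` and the example sentence are hypothetical and do not come from the repository or its data.

```python
def format_prediction_row(sentence, complex_word, substitutes):
    """Build one tab-separated prediction row (assumed layout, hypothetical helper).

    Assumed column order: sentence, complex word, substitutes from best to worst.
    """
    fields = [sentence, complex_word, *substitutes]
    # A tab or newline inside a field would corrupt the row, so reject it early.
    for field in fields:
        if "\t" in field or "\n" in field:
            raise ValueError(f"field contains a tab or newline: {field!r}")
    return "\t".join(fields)


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    row = format_prediction_row(
        "The committee will scrutinize the proposal.",
        "scrutinize",
        ["examine", "inspect", "review"],
    )
    print(row)
```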


output:

This folder contains 2 subfolders: trial and test. The trial subfolder contains tsv files with the evaluation output (the 10 metrics used in the TSAR-2022 Shared Task) for the trial file; the test subfolder contains the evaluation output files for the test file. The file names reflect the model characteristics. The outputs are generated by the evaluation script 'tsar_eval.py' listed under 'FILES' in this README.


FILES:

The files are numbered in the order in which they should be run to reproduce the results.

1. requirements.txt:

The packages required to run the code in this repository.

2. datafiles_preprocessing.ipynb:
  • Preprocessing actions to remove unwanted characters from the sentences in the datafiles in data/trial and data/test. This was needed because the evaluation code returned errors for sentences that contained those characters. For example, see ./data/trial/tsar2022_en_trial_none: the third sentence starts with #34-3 ". Sentences that started with a similar structure had those characters removed, and the resulting files were stored in the data/trial and data/test folders with the suffix '_no_noise'. These files were used for evaluation with the evaluation script 'eval.py'. For verification purposes, the original files are also stored in the data/trial and data/test folders.

  • Preprocessing actions for the CEFR datasets used. For the availability of these datasets, see the note at the bottom of this README.
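
The noise removal described in the first bullet could look roughly like the sketch below. The regular expression is a guess based solely on the single example given above (a sentence starting with #34-3 "); the actual notebook may use a different rule, and the name `strip_noise` is hypothetical.

```python
import re

# Assumed noise shape: '#', digits, '-', digits, then optional whitespace
# and an optional stray double quote at the very start of the sentence.
NOISE_PREFIX = re.compile(r'^#\d+-\d+\s*"?\s*')


def strip_noise(sentence):
    """Remove the assumed leading noise pattern; other sentences pass through unchanged."""
    return NOISE_PREFIX.sub("", sentence)


if __name__ == "__main__":
    # Made-up sentence text, for illustration only.
    print(strip_noise('#34-3 " The weather was inclement.'))
```

Because the pattern is anchored at the start of the sentence, a '#' that appears later in a sentence is left alone.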

Evaluation files for trial and test set:

3. evaluations_SG_SS_phase1_6models_trial.ipynb.
4. evaluations_SG_SS_phase1+2_best2models_trial.ipynb.
5. evaluations_SR_Hyper-Hypo_trial.ipynb.
6. evaluations_SR_CEFR_trial.ipynb.
7. evaluations_SG_SS_phase1+2_test.ipynb.
8. evaluations_SR_test.ipynb.
9. post-eval_experiments_test.ipynb.

Scripts:

utils.py:

Utility script with functions used in the evaluation files for the trial set. This script also contains several elaborate print statements, created for verification purposes throughout the development stages of the code.

utils_test.py:

Utility script with functions used in the evaluation files for the test set. It is the same as utils.py, plus code for post-evaluation experiments and analyses. This script only contains print statements where required.

tsar_eval.py:

Pre-supplied script (with a small update that adds the missing 'encoding = utf-8' argument to correct a UnicodeDecodeError) to calculate the 10 evaluation metrics. This script was called from the command line to evaluate the different files in the predictions folder, and its outputs are stored in the output folder. The results have been recorded in the evaluation and post-evaluation files mentioned above.
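
To illustrate the kind of fix mentioned above, the sketch below reads a tab-separated file with an explicit UTF-8 encoding. Without the `encoding` argument, Python's `open` falls back to the platform default, which can raise a UnicodeDecodeError on non-ASCII text on some systems. The function `read_tsv` is a hypothetical helper and is not taken from tsar_eval.py.

```python
import csv


def read_tsv(path):
    """Read a tab-separated file as a list of rows, decoding it explicitly as UTF-8."""
    # encoding="utf-8" avoids relying on the platform default encoding.
    with open(path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps double quotes in sentences as literal characters.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [row for row in reader]
```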


Note:

Due to their restricted licenses, the CEFR datasets used are not included in the upload to this (public) cltl GitHub.