
USDP

Code for reproducing the experiments in the paper Unsupervised Sentence Simplification via Dependency Parsing.

Requirements

Python 3.6 or 3.7 is required.

cd USDP
pip install -r requirements.txt

Model Evaluator

The pre-trained models used in this experiment include:

  • Spacy + Benepar Parsing: nlp.pickle
  • SBERT sentence embeddings:
    • Monolingual paraphrase-mpnet-base-v2: evaluator.pickle
    • Multilingual distiluse-base-multilingual-cased-v2: mtlevaluator.pickle
  • Constituent-based 4-gram Kneser-Ney smoothing:
    • English: critic.pickle
    • Vietnamese: vncritic.pickle

1. Spacy model

nlp.pickle is the pickled Spacy object used for parsing. It can be obtained directly by installing Spacy, loading the model, and pickling the object:

pip install -U spacy
python -m spacy download en_core_web_sm

Then, in a Python session:

import spacy
from utils import write_pickle  # pickling helper provided in this repository

nlp = spacy.load("en_core_web_sm")   # load the English Spacy pipeline
write_pickle(nlp, 'nlp.pickle')      # save the pipeline as nlp.pickle
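
Note that the model list above describes nlp.pickle as Spacy + Benepar parsing. If the pickled pipeline also needs the constituency parser, the Benepar component presumably has to be attached before pickling. A minimal sketch, assuming the benepar package (pip install benepar) and its documented Spacy integration; the model name benepar_en3 is an assumption, not taken from this repository:

import benepar, spacy
from utils import write_pickle

benepar.download('benepar_en3')       # assumed Benepar model name
nlp = spacy.load('en_core_web_sm')
# Spacy v3 style; for Spacy v2 use: nlp.add_pipe(benepar.BeneparComponent('benepar_en3'))
nlp.add_pipe('benepar', config={'model': 'benepar_en3'})
write_pickle(nlp, 'nlp.pickle')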

2. SBERT model

The pre-trained SBERT models are available through the sentence-transformers library.

In a Python session:

from sentence_transformers import SentenceTransformer
from utils import write_pickle  # pickling helper provided in this repository

model = SentenceTransformer('paraphrase-mpnet-base-v2')   # monolingual SBERT model
write_pickle(model, 'evaluator.pickle')

If paraphrase-mpnet-base-v2 is no longer available, try all-mpnet-base-v2.
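
The multilingual evaluator mtlevaluator.pickle listed above is not covered by the snippet; by analogy (an illustrative assumption, not part of the original instructions), it can presumably be produced the same way with the multilingual model:

from sentence_transformers import SentenceTransformer
from utils import write_pickle

model = SentenceTransformer('distiluse-base-multilingual-cased-v2')  # multilingual SBERT model
write_pickle(model, 'mtlevaluator.pickle')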

3. Kneser-Ney smoothing model

To train the English Fluency model critic.pickle:

Step 1: Parsing

  • Obtain any corpus (~1M sentences) wherein each line is a sentence

  • Run the following command for constituent parsing, so that each line becomes a sequence of constituents; the parsed data will be pickled as train_ngram.data.

    • English:
    python run_ngram.py your-data-path train_ngram.data parse en
    
    • Vietnamese:
    python run_ngram.py your-data-path train_ngram.data parse vi
    

Step 2: Training

Run the following command to train a 4-gram Kneser-Ney model on train_ngram.data:

python run_ngram.py train_ngram.data critic.pickle train
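
run_ngram.py performs this training internally. For reference only, here is a minimal sketch of what 4-gram Kneser-Ney training on constituent sequences could look like, assuming NLTK's nltk.lm module and assuming train_ngram.data unpickles into a list of constituent-label sequences; this is not the repository's actual implementation:

import pickle
from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

# Assumed format: a list of constituent-label sequences, e.g. [['NP', 'VP', '.'], ...]
with open('train_ngram.data', 'rb') as f:
    sequences = pickle.load(f)

# Build padded 1- to 4-grams plus the vocabulary, then fit a 4-gram Kneser-Ney model
train_ngrams, vocab = padded_everygram_pipeline(4, sequences)
model = KneserNeyInterpolated(4)
model.fit(train_ngrams, vocab)

with open('critic.pickle', 'wb') as f:
    pickle.dump(model, f)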

Data

The data folder under evaluation contains the test data of:

  • TurkCorpus: turkcorpus.orig
  • PWKP: pwkp.test.orig
  • CP_Vietnamese-VLC (extracted): vndata.orig

TurkCorpus and PWKP come with their ground-truth references (extension .simp) and the outputs of competing models in the corresponding folders. All data is gratefully borrowed from EASSE and Under the Sea NLP.

Note that the outputs of RM+EX+LS+RO on PWKP were created by reproducing the experiment from Edit-Unsup-TS.

Running USDP

Phase 1: Structural Simplification

To reproduce USDP-Base on English data, simply run:

python run_generation.py evaluation/config_en_base.json

Change the path to evaluation/config_vn_base.json for Vietnamese simplification. Feel free to modify the parameters to experiment with other variants, such as USDP-Match.

Phase 2: Back Translation

Successfully completing Phase 1 will output sentences that are structurally simpler than the original ones. Lexical simplification and paraphrasing can be added on top by back-translating the outputs with any multilingual pre-trained machine translation system; in our experiments we simply use the Google Translate service.
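
For instance, here is a minimal back-translation sketch using pre-trained MarianMT models from the transformers library; the English-German pivot (Helsinki-NLP/opus-mt-en-de and opus-mt-de-en) is only an illustrative assumption, since the reported experiments used Google Translate:

from transformers import MarianMTModel, MarianTokenizer

def translate(sentences, model_name):
    # Translate a list of sentences with a pre-trained MarianMT model
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(sentences, return_tensors='pt', padding=True, truncation=True)
    outputs = model.generate(**batch)
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

simplified = ['The cat sat on the mat.']                      # Phase 1 outputs
pivot = translate(simplified, 'Helsinki-NLP/opus-mt-en-de')   # English -> German
back = translate(pivot, 'Helsinki-NLP/opus-mt-de-en')         # German -> English, paraphrased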

References

If you use the code or datasets in this repository, please cite our paper:

@article{vo2022unsupervised,
  title={Unsupervised Sentence Simplification via Dependency Parsing},
  author={Vo, Vy and Wang, Weiqing and Buntine, Wray},
  journal={arXiv preprint arXiv:2206.12261},
  year={2022}
}