
Code and models used in "MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases".


Multilingual Unsupervised Sentence Simplification

Code and pretrained models to reproduce experiments in "MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases".

Prerequisites

Linux with Python 3.6 or above (not yet compatible with Python 3.9). On Windows, you can use WSL with Ubuntu 20.04 LTS.
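To confirm that an interpreter falls inside the range described above, a quick check along these lines may help. This is a minimal sketch: the helper name `is_supported_python` is hypothetical and not part of MUSS, and it assumes that every version from 3.9 upward is unsupported, which the prerequisite only states explicitly for 3.9.

```python
import sys


def is_supported_python(version_info=None):
    """Return True if the version is >= 3.6 and < 3.9.

    The lower bound comes from the prerequisites; treating 3.9 and
    everything above it as unsupported is an assumption.
    """
    major, minor = (version_info or sys.version_info)[:2]
    return (3, 6) <= (major, minor) < (3, 9)
```

Calling `is_supported_python()` with no argument checks the running interpreter; passing a tuple such as `(3, 8)` checks an arbitrary version.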

Installing

git clone git@github.com:facebookresearch/muss.git
cd muss/
pip install -e .  # Install package
# Install required spaCy models (one model name per `spacy download` call)
for model in pt_core_news_md en_core_web_md fr_core_news_md es_core_news_md; do python -m spacy download "$model"; done
ulimit -n 100000  # Only needed if you train a new model

How to use

Some scripts might still contain a few bugs. If you notice anything wrong, feel free to open an issue or submit a pull request.

Simplify sentences from a file using pretrained models

First, download the template for the desired language into the resources/models folder. Pretrained models should be downloaded automatically, but you can also find them here:

muss_en_wikilarge_mined
muss_en_mined
muss_fr_mined
muss_es_mined
muss_pt_mined

Then run the command:

python scripts/simplify.py FILE_PATH_TO_SIMPLIFY --model-name MODEL_NAME

# English
python scripts/simplify.py scripts/examples.en --model-name muss_en_wikilarge_mined
# French
python scripts/simplify.py scripts/examples.fr --model-name muss_fr_mined
# Spanish
python scripts/simplify.py scripts/examples.es --model-name muss_es_mined
# Portuguese
python scripts/simplify.py scripts/examples.pt --model-name muss_pt_mined
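
The commands above can also be driven from Python, for example to process several files in a loop. The following is a minimal sketch rather than an official API: the names `MODELS`, `build_simplify_command` and `simplify_file` are hypothetical, and it assumes that it is run from the repository root and that scripts/simplify.py writes its results to standard output.

```python
import subprocess
import sys

# Hypothetical mapping from language code to the model names used above.
MODELS = {
    "en": "muss_en_wikilarge_mined",
    "fr": "muss_fr_mined",
    "es": "muss_es_mined",
    "pt": "muss_pt_mined",
}


def build_simplify_command(file_path, model_name, python=sys.executable):
    """Build the same command line as shown above, as an argument list."""
    return [python, "scripts/simplify.py", str(file_path), "--model-name", model_name]


def simplify_file(file_path, language):
    """Run scripts/simplify.py on one file and return whatever it prints.

    Assumes the current working directory is the repository root.
    """
    command = build_simplify_command(file_path, MODELS[language])
    result = subprocess.run(
        command,
        check=True,                # raise if the script exits with an error
        stdout=subprocess.PIPE,    # capture standard output
        universal_newlines=True,   # decode output as text (works on Python 3.6)
    )
    return result.stdout
```

For instance, `simplify_file("scripts/examples.fr", "fr")` would run the French example from the list above.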

Mine the data

To add a new language to this project, download the files for the target language from https://huggingface.co/edugp/kenlm/tree/main/wikipedia into the folder resources/models/language_models/wikipedia. These language models are used to filter high-quality sentences in the paraphrase mining phase.
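
Before starting a mining run for a new language, it may be worth checking that the language model files are where the scripts expect them. The sketch below is illustrative only: the helper `missing_language_model_files` is hypothetical, and the file naming pattern (`<lang>.arpa.bin`, `<lang>.sp.model`, `<lang>.sp.vocab`) is an assumption about the layout of the linked repository, so adjust it to the files you actually downloaded.

```python
from pathlib import Path

# Folder named in the instructions above, relative to the repository root.
LM_DIR = Path("resources/models/language_models/wikipedia")

# ASSUMPTION: file suffixes used in the linked repository; not confirmed by this README.
ASSUMED_SUFFIXES = (".arpa.bin", ".sp.model", ".sp.vocab")


def missing_language_model_files(language, lm_dir=LM_DIR):
    """Return the expected language model files that are not present in lm_dir."""
    expected = [Path(lm_dir) / f"{language}{suffix}" for suffix in ASSUMED_SUFFIXES]
    return [path for path in expected if not path.is_file()]
```

An empty list from `missing_language_model_files("it")` would indicate that all assumed files for Italian are in place.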

To run paraphrase mining, use the command below:

python scripts/mine_sequences.py

Train the model

python scripts/train_model.py NAME_OF_DATASET --language LANGUAGE
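
The installation steps raise the open-file limit with `ulimit -n 100000` for training. A check like the following can confirm that the limit is in effect for the current process before a long run is started. It is a minimal sketch using Python's standard `resource` module (Unix only); the helper name `open_file_limit_ok` is hypothetical and not part of MUSS.

```python
import resource

# Value taken from the `ulimit -n 100000` line in the installation steps.
REQUIRED_OPEN_FILES = 100000


def open_file_limit_ok(required=REQUIRED_OPEN_FILES):
    """Return True if the soft limit on open file descriptors is at least `required`."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # RLIM_INFINITY means "no limit"; its numeric value is platform dependent,
    # so it is compared explicitly instead of relying on `>=`.
    return soft == resource.RLIM_INFINITY or soft >= required
```

If `open_file_limit_ok()` returns `False`, rerun `ulimit -n 100000` in the same shell before launching the training script.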

Evaluate simplifications

Please head over to EASSE for sentence simplification evaluation.

License

The MUSS license is CC-BY-NC. See the LICENSE file for more details.

Authors

Citation

If you use MUSS in your research, please cite "MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases":

@article{martin2021muss,
  title={MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases},
  author={Martin, Louis and Fan, Angela and de la Clergerie, {\'E}ric and Bordes, Antoine and Sagot, Beno{\^\i}t},
  journal={arXiv preprint arXiv:2005.00352},
  year={2021}
}