Break long English sentences into simple sentences

Split-and-Rephrase

Overview

The aim is to split a complex sentence into a meaning-preserving sequence of shorter sentences.

An input sentence with more than two clauses is first broken into two sentences, each with no more than two clauses. These are passed to Hugging Face's pre-trained T5 model, fine-tuned on 300K sentences from the WebSplit v1.0 dataset, which splits them into multiple simple sentences. Each output sentence is then assigned a similarity score against the input sentence using a TF-IDF vectorizer, and sentences with low similarity scores are removed.
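The similarity-filtering step can be illustrated with a minimal sketch. This is not the repository's code: it uses a small standard-library TF-IDF with cosine similarity as a stand-in for the TF-IDF vectorizer mentioned above, and the function names (`tfidf_vectors`, `filter_by_similarity`), the tokenizer, and the `threshold` value are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF, so terms present in every document keep a nonzero weight.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_by_similarity(source, candidates, threshold=0.1):
    """Keep candidates whose similarity to the source meets the threshold.

    The threshold is a hypothetical value chosen for this sketch.
    """
    vectors = tfidf_vectors([source] + candidates)
    source_vec = vectors[0]
    return [
        cand for cand, vec in zip(candidates, vectors[1:])
        if cosine(source_vec, vec) >= threshold
    ]


source = "The cat sat on the mat while the dog barked at the mailman."
candidates = [
    "The cat sat on the mat.",
    "The dog barked at the mailman.",
    "Stocks fell sharply today.",
]
kept = filter_by_similarity(source, candidates)
```

In this example the third candidate shares no words with the input sentence, so its similarity is zero and it is dropped, while the first two are kept. A real threshold would need to be tuned against actual model outputs.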

The models and data can be found here, and the jars can be found here.

Example

cover

Future Work

[1] Synonym substitution. Preserving keywords is important, but the fine-tuned T5 model sometimes replaces words with their synonyms. Filtering the training data from the dataset could reduce this.
[2] Loss of important keywords. The output sometimes omits important dates, places, etc.

Credits

[1] rui-yan: split-and-rephrase
[2] shreyaUp: Sentence-Simplification