Text2UML: a Preprocessing Pipeline for NLP Applications focused on turning Requirements Texts into UML
This repository holds the datasets and code behind the results published in the paper "Preprocessing Requirements Documents for Automatic UML Modelling" by Martijn Schouten, Guus Ramackers and Suzan Verberne for the 27th International Conference on Natural Language & Information Systems.
In addition to the code for the preprocessing class, which can be found in the run_pipeline.py
file, this repository holds datasets that can be used for future research:
- We manually labelled requirements texts from the PURE dataset by Ferrari et al. (2017), marking whether words or word groups are attributes or classes. This new dataset contains almost 80,000 rows for training algorithms to distinguish classes from attributes in running text.
- We also manually labelled a new validation dataset that includes smaller, more focused texts for generating and validating UML models. These texts come from the training materials of a large U.S.-based software company.
- Our version of the Lindholmen dataset contains a cleaned and preprocessed overview of all classes and attributes in the Lindholmen dataset by Chaudron et al. (2017).
- To counterbalance the development focus of the Lindholmen dataset, this repository also contains a similar dataset of files with their (cleaned and normalised) classes and attributes, which we extracted from the MAR search engine by Lopez et al. (2020).