Breaking Bad? Dataset & Analysis

The code, datasets, and notebooks in this repository accompany the paper “Breaking Bad? Semantic Versioning and Impact of Breaking Changes in Maven Central” by Lina Ochoa, Thomas Degueule, Jean-Rémy Falleri, and Jurgen Vinju, published in the journal Empirical Software Engineering. Please refer to the companion webpage for more information, or to the Zenodo package to access the datasets.

Content

  • Scripts: The code and scripts used to generate the datasets of the replication study. In particular, the BuildDataset class contains the whole pipeline used to derive the datasets from the Maven Dependency Dataset (MDD) and the Maven Dependency Graph (MDG), and the scripts folder contains SQL queries used to explore the MDD along with R scripts used for sampling.
  • Datasets: All datasets generated by the previous code and scripts. All of them can be found in the data folder.
    • gen: contains all the CSV files generated by the BuildDataset pipeline: they contain the data used to answer our research questions.
    • annotations.csv: contains the annotations extracted from the top-1000 most popular libraries on Maven Central.
    • annotations-api.csv: contains the API-related annotations extracted from the top-1000 most popular libraries on Maven Central.
    • mdd-libraries.csv: contains the 148,253 artefacts extracted from the MDD.
    • version-suffixes.csv: contains the most popular version suffixes in the MDG.
  • Notebooks: The Jupyter notebooks in which we analyzed the datasets to answer the research questions of our study. These notebooks read the data from the datasets described above and rely on the R kernel and language. (Please refer to the main article or the companion webpage for the definition of the research questions.)
    • Q1-MDD.ipynb: presents the analysis of research question Q1 for the MDD dataset.
    • Q1-MDG.ipynb: presents the analysis of research question Q1 for the MDG dataset.
    • Q2-MDD.ipynb: presents the analysis of research question Q2 for the MDD dataset.
    • Q2-MDG.ipynb: presents the analysis of research question Q2 for the MDG dataset.
    • Q3-MDD.ipynb: presents the analysis of research question Q3 for the MDD dataset.
    • Q3-MDG.ipynb: presents the analysis of research question Q3 for the MDG dataset.

Reproducing the Results

To reproduce our results, one should use the BuildDataset pipeline to re-generate the CSVs answering the research questions. Keep in mind that, while the results obtained should be identical for Q1 and Q2, they may differ slightly for Q3 as random sampling is performed.

To run the whole pipeline for both the Maven Dependency Dataset (MDD) and the Maven Dependency Graph (MDG), one must first download the MDG and load it into a local neo4j database. Then, the following lines in config.properties must be updated accordingly:

neo4j_host  = bolt://localhost:7687
neo4j_user  = neo4j
neo4j_pwd   = j4oen
# Should point to the local Rscript executable
rscript     = /usr/local/bin/Rscript
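
Before launching a long run, it can help to confirm that the rscript entry resolves to an executable. The following is a minimal sketch and not part of the repository: the helper names prop and check_rscript are hypothetical, and it assumes simple `key = value` lines, ignoring anything that follows a `#` on the value side.

```shell
# Hypothetical helper: print the value of one key from a properties file.
# $1 = key, $2 = file. Assumes "key = value" lines; drops a trailing "# ..." note.
prop() {
  sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*//p" "$2" \
    | sed -e 's/[[:space:]]*#.*$//' -e 's/[[:space:]]*$//' \
    | head -n 1
}

# Hypothetical helper: verify that the rscript entry points to an executable.
# $1 = path to config.properties
check_rscript() {
  path=$(prop rscript "$1")
  if [ -n "$path" ] && [ -x "$path" ]; then
    echo "ok: $path"
  else
    echo "missing or not executable: $path"
    return 1
  fi
}
```

Calling `check_rscript config.properties` from the folder that holds the file prints the resolved path, or returns a non-zero status if the path is empty or not executable.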

Once the neo4j database is imported and started, first remove the existing CSV files in gen (otherwise they will not be overwritten), then run the pipeline as follows:

$ cd code/cypher-queries/
$ mvn clean package
$ MAVEN_OPTS="-Xms32g -Xmx32g" mvn exec:java -Dexec.mainClass=mcr.BuildDataset -Dexec.args=-all # Adapt memory limits to your system
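
Since existing CSV files in gen are not overwritten, they have to be cleared before the commands above are run. Rather than deleting them, one option is to move them aside, which also keeps the previous Q3 results available for comparison. This is a sketch under assumptions: the function name backup_gen and the backup folder naming are hypothetical, and the location of gen must be passed explicitly.

```shell
# Hypothetical helper: move all CSV files out of a gen folder into a
# timestamped sibling folder, so the pipeline starts from an empty gen.
# $1 = path to the gen folder
backup_gen() {
  gen_dir=${1%/}
  backup_dir="$gen_dir-backup-$(date +%Y%m%d-%H%M%S)"
  # Expand the glob into the function's positional parameters.
  set -- "$gen_dir"/*.csv
  if [ ! -e "$1" ]; then
    echo "no CSV files in $gen_dir"
    return 0
  fi
  mkdir -p "$backup_dir"
  mv "$@" "$backup_dir"/
  echo "moved $# file(s) to $backup_dir"
}
```

For example, `backup_gen path/to/gen` leaves gen empty and places the old files in a folder named after gen with a `-backup-` timestamp suffix.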

Our code attempts to parallelize the analysis as much as possible and creates a memory-hungry Rascal interpreter for each thread, which may result in very high RAM consumption. A minimum of 32GB is advised. On a modern computer, the pipeline may take one or several days to complete.

The pipeline accepts different arguments to produce different datasets:

usage: buildDataset
  -all                    Build all datasets
  -cleanDB                Remove all non-(test|compile) dependencies
  -deltas                 Build the deltas.csv dataset
  -deltasRaemaekers       Build the deltas-raemaekers.csv dataset
  -detections             Build the detections.csv dataset
  -detectionsRaemaekers   Build the detections-raemaekers.csv dataset
  -mdg                    Build MDG's datasets
  -raemaekers             Build Raemaekers' datasets
  -upgrades               Builds the upgrades.csv dataset
  -upgradesRaemaekers     Builds the upgrades-raemaekers.csv dataset
  -versions               Build the versions.csv dataset
  -versionsRaemaekers     Build the versions-raemaekers.csv dataset
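
To rebuild only some of the datasets, the flags listed above can be passed in place of -all. The sketch below only prints the Maven command rather than running it. The wrapper name pipeline_cmd is hypothetical, and whether the pipeline accepts several flags in a single run is an assumption that has not been checked against BuildDataset.

```shell
# Hypothetical wrapper: check flag names against the usage listing above and
# print (not run) the matching Maven command.
pipeline_cmd() {
  for flag in "$@"; do
    case "$flag" in
      -all|-cleanDB|-mdg|-raemaekers) ;;
      -deltas|-detections|-upgrades|-versions) ;;
      -deltasRaemaekers|-detectionsRaemaekers) ;;
      -upgradesRaemaekers|-versionsRaemaekers) ;;
      *) echo "unknown flag: $flag" >&2; return 1 ;;
    esac
  done
  # Quote the arguments so that they reach exec.args as one string.
  printf '%s\n' "mvn exec:java -Dexec.mainClass=mcr.BuildDataset -Dexec.args=\"$*\""
}

pipeline_cmd -deltas -versions
```

The printed command can then be prefixed with the MAVEN_OPTS setting appropriate for the machine.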

When the analysis finishes, the produced CSV files can be found in the gen folder.

Academic Citation

You can freely use the content of this repository for your own research. You can either cite the replication study article using the following BibTeX:

@article{ochoa22breaking,
	author    = {Lina Ochoa and Thomas Degueule and Jean-Rémy Falleri and Jurgen Vinju},
	title     = {{Breaking Bad? Semantic Versioning and Impact of Breaking Changes in Maven Central}},
	journal   = {Empirical Software Engineering},
	volume    = {27},
	number    = {3},
	pages     = {1--42},
	year      = {2022},
	doi       = {10.1007/s10664-021-10052-y}
}

Alternatively, you can cite only the dataset and software package hosted on Zenodo:

@misc{emse2021breakingbaddata,
	title={{Breaking Bad? Semantic Versioning and Impact of Breaking Changes in Maven Central (Dataset)}},
	author={Ochoa, Lina and Degueule, Thomas and Falleri, Jean-Rémy and Vinju, Jurgen},
	doi={10.5281/zenodo.5221840},
	year={2021},
	publisher={Zenodo},
}

Contact

If you have any questions about our work or the content of this repository, do not hesitate to send us an email at l.m.ochoa.venegas <at> tue.nl or thomas.degueule <at> labri.fr.

License

This repository and all its content are licensed under the MIT License.
© 2021 Maracas