This repository contains the dataset and analysis scripts used to answer the research questions in the paper. It contains the Projects, Data_Collection, RQ1, RQ2, RQ3, RQ4, and Threats_to_validity folders. The details of each folder are described below.
The Projects folder contains the list of selected projects for SQL and NoSQL systems. Each file contains the name of each project and the corresponding GitHub URL.
The Data_Collection folder contains the raw dataset, the list of import statements, and the details of snapshots taken for both SQL and NoSQL systems.
This file contains the list of JNoSQL supported databases. It contains the name of the database and the corresponding links.
These files contain the list of import statements to identify SQL and NoSQL subject systems.
Both files contain the list of snapshots taken for all the selected subject systems. Each snapshot is identified by a commit SHA and assigned a corresponding version number incrementally from oldest to newest. In addition, each file contains the commit time of the snapshots.
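The numbering scheme described above can be sketched in a few lines of Python. This is only an illustration, not the script used to build the snapshot files: the field names `sha` and `commit_time`, the sample records, and the choice to start numbering at 1 are all assumptions.

```python
from datetime import datetime


def assign_versions(snapshots):
    """Return the snapshots sorted oldest-first, each with a 'version' number.

    Numbering starts at 1 here; that starting value is an assumption.
    """
    ordered = sorted(
        snapshots,
        key=lambda snap: datetime.fromisoformat(snap["commit_time"]),
    )
    return [{**snap, "version": i} for i, snap in enumerate(ordered, start=1)]


# Hypothetical snapshot records; the SHAs and timestamps are made up.
snapshots = [
    {"sha": "c3d4e5f", "commit_time": "2019-06-01T12:00:00"},
    {"sha": "a1b2c3d", "commit_time": "2017-01-15T09:30:00"},
    {"sha": "b2c3d4e", "commit_time": "2018-03-20T18:45:00"},
]

versioned = assign_versions(snapshots)
for snap in versioned:
    print(snap["version"], snap["sha"], snap["commit_time"])
```

Sorting on the parsed commit time, rather than on the raw string, keeps the ordering correct even if the timestamps are not all written in the same textual layout.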
The dataset files contain SATD comments identified from multiple snapshots of both SQL and NoSQL systems. It provides the comment location, and whether the comment is data-access or not. The datasets for all the research questions are derived from this raw dataset.
SQL_projects_commit_time_stat.csv, NSQL_projects_commit_time_stat.csv, and combined_projects_commit_time_stat.csv
These files report the mean, standard deviation, and 95% confidence interval of the commit time span for each SQL subject system, each NoSQL subject system, and the combination of the two, respectively.
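For readers unfamiliar with these statistics, the sketch below shows one common way to compute them in Python. It assumes that a commit time span is expressed as a number of days and that the interval uses the normal approximation (z = 1.96); neither detail is stated in this README, and the notebook in this folder may compute the interval differently. The function name and input values are hypothetical.

```python
import math
import statistics


def summarize_spans(spans_in_days, z=1.96):
    """Return mean, sample standard deviation, and an approximate 95% CI.

    The interval uses the normal approximation mean +/- z * std / sqrt(n),
    which is an assumption made for this sketch.
    """
    n = len(spans_in_days)
    mean = statistics.mean(spans_in_days)
    std = statistics.stdev(spans_in_days)  # sample standard deviation (n - 1)
    half_width = z * std / math.sqrt(n)
    return {
        "mean": mean,
        "std": std,
        "ci_low": mean - half_width,
        "ci_high": mean + half_width,
    }


# Made-up spans (in days) for a single hypothetical subject system.
spans = [10.0, 12.0, 14.0, 16.0, 18.0]
stats = summarize_spans(spans)
print(stats)
```

With few snapshots per project, a Student's t quantile would give a wider and more conservative interval than the fixed z value used here.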
This notebook shows how we generated the commit_time stats for our subject systems.
This R notebook shows the distribution of commit_time stats.
This folder contains the analysis scripts and data for answering RQ1.
This file contains the name of the subject systems, the latest version, the total number of commits, and the group being either SQL or NoSQL. This file is used to find the groupings and to draw the distribution of the total number of commits.
This file is used to generate the files RQ1_SQL_Diffusion_dataset.csv and RQ1_NoSQL_Diffusion_dataset.csv from the raw dataset located in the Data_Collection folder.
These datasets are subsets of the diffusion datasets grouped according to the total number of commits for each subject system.
RQ1 contains the EvolutionPlots.ipynb R notebook and the input datasets used to answer RQ1. The diffusion datasets contain the total number of data-access comments and regular (non-data-access) comments for SQL and NoSQL systems. The other datasets are extracted from the diffusion datasets according to each project's total number of commits.
This is an R notebook used to analyze the dataset and draw the plots reported in RQ1.
This folder contains the datasets and the analysis R script used to answer RQ2.
These datasets contain the T and the status of each unique comment. The T and status variables are inputs to the survival analysis, as defined in the paper.
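To show how a duration and a status flag typically feed a survival analysis, here is a minimal Kaplan-Meier estimator in plain Python. It is a sketch under assumptions, not the analysis in the RQ2 notebook: it assumes T is the observed lifetime of a comment and that status is 1 when the comment was removed and 0 when it was still present at the last snapshot (censored). The paper gives the authoritative definitions, and the input values are made up.

```python
def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct time where an event occurred.

    durations: observed time for each subject (T).
    events:    1 if the event was observed, 0 if censored (status).
    """
    event_times = sorted({t for t, e in zip(durations, events) if e == 1})
    survival = 1.0
    curve = []
    for t in event_times:
        # Subjects still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Subjects whose event happens exactly at time t.
        observed = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        survival *= 1.0 - observed / at_risk
        curve.append((t, survival))
    return curve


# Made-up example: five comments, two of them censored.
T = [2, 3, 3, 5, 8]
status = [1, 1, 0, 1, 0]

curve = kaplan_meier(T, status)
for time, prob in curve:
    print(time, round(prob, 3))
```

Censored comments never trigger a drop in the curve, but they still count in the at-risk set until their own T, which is why both variables are needed.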
Both files are subsets of the combined datasets, obtained by removing the regular (non-data-access) SATDs.
This file is an R notebook for all the analyses reported in RQ2.
This folder contains the topic modeling notebooks, data preparation notebook, and the manual analysis result file.
These notebooks contain the data cleaning and LDA topic modeling implementations for SQL and NoSQL systems, respectively.
This notebook is used to generate the sample dataset from the outputs of the LDA analysis. We used stratified random sampling to draw a statistically significant sample of comments.
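The general idea of stratified random sampling can be sketched as follows. The sketch assumes that each comment carries a single topic label that serves as its stratum and that every stratum is sampled at the same fraction; the README does not state how the strata or the sample sizes were chosen, so the `topic` field, the `fraction` parameter, and the sample records are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction of records from every stratum.

    records:  list of dicts.
    key:      field whose value defines the stratum.
    fraction: share of each stratum to keep (at least one record is kept).
    seed:     fixed seed so that the draw is reproducible.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[rec[key]].append(rec)

    sample = []
    for label in sorted(strata):
        members = strata[label]
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))  # sampling without replacement
    return sample


# Made-up comments spread evenly over three hypothetical topics.
comments = [{"id": i, "topic": i % 3} for i in range(30)]
sample = stratified_sample(comments, key="topic", fraction=0.2)
print(len(sample), sorted(rec["topic"] for rec in sample))
```

Sampling within each stratum guarantees that small topics are represented, which a simple random draw over all comments would not.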
This file contains the codebook used in the manual tagging of data-access SATDs. In addition, it contains the final labels after resolving disagreements between the authors.
This folder contains the tagged dataset (the labeled dataset from RQ3 tagged with introduction and removal commit goals) and the notebook files for the data analysis.
This file contains information related to the introduction and removal of each data-access SATD labeled in RQ3.
This file contains a Python notebook for the analysis reported in RQ4.
This is an R notebook to plot the introduction and removal commit time.
This folder contains the file IsDAC_check.csv corresponding to the manual checking described in the Threats to construct validity. isDAC=1 means that the file contains direct data-access code, and 0 means that it does not.
The notebooks in the repository are written in either Python 3 or R. Running them requires Python 3 and R kernels. We used Anaconda to manage the Python and R environments for running the notebooks.