- About the Project
- Phase 1: DVC project characteristics
- Phase 2: DVC coupling with source code artifacts
- Phase 3: Complexity evolution of ML pipelines
Mining software repositories related to datasets and machine learning traceability
In the following package, we share a list of mined repositories used to study the co-evolution between DVC (ML and data tracking) files and source code artifacts. We classified the DVC files into three classes:
- DVC-data: DVC files that only track data
- DVC-pipeline: DVC files that track a pipeline; they are characterized by the keywords "cmd" (the executed command) and "deps" (the stage dependencies)
- DVC-utility: DVC files within the .dvc folder, i.e., .dvc/config, .dvc/gitignore.
This is how DVC can be used inside a repository:
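For illustration, a DVC-pipeline stage file might look as follows. The "cmd" and "deps" keywords are the ones that distinguish the DVC-pipeline class; the file name and stage contents below are hypothetical:

```yaml
# Hypothetical DVC-pipeline stage file (train.dvc)
cmd: python3 train.py    # "cmd": the executed command
deps:                    # "deps": the stage dependencies
- path: train.py
- path: data/train.csv
outs:
- path: model.pkl
```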
We classified the source code artifacts into five categories (Source code, Test, Data, Gitignore, Others).
In the first part of this study, we analyzed 391 projects gathered from GitHub on 28 February 2020. We explore the usage of DVC in these repositories and their characteristics. The repositories are listed in the file
- The following figure shows how long these projects waited before trying DVC (from the day the repository was created until the first day using DVC).
- The following figure shows how long the projects have been using DVC since the first commit introducing a DVC file.
The following figure shows the remote storage used in these repositories.
- We plot the distribution of DVC file changes over each project's commits, chronologically, grouped into chunks of 10%.
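The chunking above can be sketched as follows. This is an illustrative sketch, not the study's actual plotting script; the input format (one list of changed file paths per commit) is an assumption:

```python
# Sketch (not the authors' script): bucket a project's commits chronologically
# into ten 10% chunks and count the DVC-file changes falling in each chunk.

def dvc_change_distribution(commits):
    """commits: chronologically ordered list of lists of changed file paths.
    Returns a list of 10 counts of DVC-file changes, one per 10% chunk."""
    counts = [0] * 10
    n = len(commits)
    for i, changed_files in enumerate(commits):
        chunk = min(i * 10 // n, 9)  # 0..9: position of the commit in history
        counts[chunk] += sum(1 for f in changed_files
                             if f.endswith(".dvc") or f.startswith(".dvc/"))
    return counts

if __name__ == "__main__":
    history = [["data.dvc"], ["train.py"], ["model.dvc", "train.py"],
               ["README.md"], [".dvc/config"]]
    print(dvc_change_distribution(history))  # -> [1, 0, 0, 0, 1, 0, 0, 0, 1, 0]
```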
In the second part of this study, we studied the coupling between the different DVC categories and source code artifacts at two levels: commit and pull request.
In the following, we present a sample of the script we used to compute the coupling between the DVC-pipeline category and the source code artifacts (source, test, data, gitignore, others):
- Step 1: Download all the projects used in the commit-level coupling analysis, listed in the file "Commit_projects.csv".
- Step 2: Provide, as an argument, the path of the directory where the projects were downloaded.
- Step 3: Execute the script: python3 coupled_commits.py <path_source_repository>
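The core of the commit-level coupling computation can be sketched as below. This is a simplified illustration, not the actual coupled_commits.py; the file-classification rules and input format are assumptions:

```python
# Sketch (not the authors' coupled_commits.py): count commits in which the
# DVC-pipeline category and each source-code artifact category change together.

def classify(path):
    """Hypothetical, simplified file classifier; the study's rules may differ."""
    if path.endswith(".dvc") or path in ("dvc.yaml", "dvc.lock"):
        return "dvc-pipeline"  # coarse: treats any DVC stage file as pipeline
    if "test" in path.lower():
        return "test"
    if path.endswith((".py", ".java", ".c", ".cpp")):
        return "source"
    if path.endswith((".csv", ".json", ".parquet")):
        return "data"
    if path.endswith(".gitignore"):
        return "gitignore"
    return "others"

def coupled_commits(commits):
    """commits: list of lists of changed file paths, one list per commit.
    Returns {category: number of commits where it co-changed with DVC-pipeline}."""
    coupling = {}
    for changed in commits:
        categories = {classify(p) for p in changed}
        if "dvc-pipeline" in categories:
            for cat in categories - {"dvc-pipeline"}:
                coupling[cat] = coupling.get(cat, 0) + 1
    return coupling
```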
We use a chi-squared (χ²) statistical test to validate the statistical significance of the coupling between changes to two categories A and B (for example, DVC-data and test). We present in the following a sample of the script we used to compute the statistical significance between the DVC-data category and the source code artifacts (source, test, data, gitignore, others).
- Step 1: Download all the projects used in the pull-request-level coupling analysis, listed in the file "Pull_request_projects.csv".
- Step 2: Provide, as an argument, the path of the directory where the projects were downloaded.
- Step 3: Execute the script: python3 significance_pr.py <path_source_repository>
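The chi-squared test on a 2x2 co-change contingency table can be sketched as follows. This is an illustration of the statistical test, not the actual significance_pr.py; the function name and table layout are assumptions:

```python
import math

# Sketch: chi-squared test of association between changes to category A
# (e.g. DVC-data) and category B (e.g. test) over a set of commits or PRs.

def chi2_2x2(a, b, c, d):
    """2x2 contingency table of co-change counts:
                    B changed   B unchanged
      A changed         a            b
      A unchanged       c            d
    Returns (chi-squared statistic, p-value) for 1 degree of freedom."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-squared distribution with 1 df:
    # P(X > x) = erfc(sqrt(x / 2))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

if __name__ == "__main__":
    chi2, p = chi2_2x2(10, 20, 30, 40)
    print(f"chi2 = {chi2:.4f}, p = {p:.4f}")
```

A small p-value (e.g. below 0.05) indicates that the co-changes of A and B are unlikely under independence, i.e., the coupling is statistically significant.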
The results of the coupling analysis, at both the commit and pull request levels, are shown in the following plots.
In the third part of this study, we studied the complexity evolution of the ML pipelines over time in a list of 25 projects.
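One simple way to quantify pipeline complexity at a given commit is to count the stages and their declared dependencies; iterating such a metric over a project's history yields the complexity evolution. The metric below is an illustration, not necessarily the one used in the study:

```python
# Sketch (hypothetical metric): pipeline complexity as the number of stages
# and dependency edges, given parsed DVC stage dicts (each may hold "deps").

def pipeline_complexity(stages):
    """stages: list of parsed DVC stage dicts, e.g. loaded from stage files.
    Returns counts of stages and of declared stage dependencies."""
    n_deps = sum(len(stage.get("deps", [])) for stage in stages)
    return {"stages": len(stages), "deps": n_deps}
```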