Structure of our replication package SANER_2021

Table of Contents

  • About The Project
  • Phase 1: DVC project characteristics
  • Phase 2: DVC coupling with source code artifacts
  • Phase 3: Complexity evolution of ML pipelines

About The Project

Mining software repositories related to datasets and machine learning traceability

In this package, we share a list of mined repositories to study the co-evolution between DVC files (used for ML and data tracking) and source code artifacts. We classified the DVC files into three classes:

  • DVC-data: DVC files that only track data
  • DVC-pipeline: DVC files that track a pipeline; they are characterised by the keywords "cmd" (the executed command) and "deps" (the stage dependencies) (see the classification sketch after this list)
  • DVC-utility: DVC files within the .dvc folder, e.g., .dvc/config, .dvc/.gitignore
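
As a rough illustration of this classification, the following Python sketch assigns a DVC file to one of the three classes. It is a minimal sketch under our own assumptions (the function name classify_dvc_file, the plain-text keyword check, and the toy inputs are illustrative), not the exact logic of our analysis scripts.

```python
from pathlib import Path

def classify_dvc_file(path: str, content: str) -> str:
    """Roughly classify a DVC file as DVC-utility, DVC-pipeline, or DVC-data."""
    # DVC-utility: files inside the .dvc folder (e.g. .dvc/config, .dvc/.gitignore)
    if ".dvc" in Path(path).parts:
        return "DVC-utility"
    # DVC-pipeline: characterised by the "cmd" and "deps" keywords
    if "cmd:" in content and "deps:" in content:
        return "DVC-pipeline"
    # DVC-data: remaining DVC files that only track data
    return "DVC-data"

# Toy examples (file contents are placeholders)
print(classify_dvc_file(".dvc/config", "[core]\nremote = myremote"))
print(classify_dvc_file("Dvcfile", "cmd: python train.py\ndeps:\n- data/train.csv"))
print(classify_dvc_file("data/images.dvc", "outs:\n- md5: abc123\n  path: images"))
```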

This is how DVC can be used inside a repository:

(Figure: DVC pipeline commands)

We classified the source code artifacts into five categories (Source code, Test, Data, Gitignore, Others).

Phase 1: DVC project characteristics

In the first part of this study, we analysed 391 projects gathered from GitHub on 28 February 2020. We explore the usage of DVC in these repositories and their characteristics. The list of repositories is provided in the accompanying file.

  • The following figure shows how long these projects waited before trying DVC (from the day the repository was created until the day DVC was first used).

(Figure: waiting period before applying DVC)

  • The following figure shows how long the projects have been using DVC since the first commit introducing a DVC file.

(Figure: duration of DVC usage)

  • The following figure shows the remote storage used in these repositories.

(Figure: remote storage types)

  • We plot the distribution of DVC file changes over each project's commit history, ordered chronologically and grouped into chunks of 10% of the commits (a binning sketch is shown below).
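
A minimal sketch of this binning, assuming we already hold, for each project, a chronologically ordered list of flags indicating whether each commit changed a DVC file (the function name and toy data below are illustrative):

```python
import numpy as np

def dvc_changes_per_decile(dvc_changed_flags):
    """Split a project's chronologically ordered commits into 10 equal chunks
    and count how many commits in each chunk changed a DVC file."""
    chunks = np.array_split(np.asarray(dvc_changed_flags, dtype=int), 10)
    return [int(chunk.sum()) for chunk in chunks]

# Toy history of 23 commits: DVC files change mostly at the beginning and the end
flags = [True, True, False, True] + [False] * 14 + [True] * 5
print(dvc_changes_per_decile(flags))
```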

Phase 2: DVC coupling with source code artifacts

In the second part of this study, we studied the coupling between the different categories of DVC files and source code artifacts at two levels: the commit level and the pull request level.

Description of the coupling and statistical-test scripts

In the following, we present a sample of the script we used to compute the coupling between the DVC-pipeline category and the source code artifacts (source, test, data, gitignore, others); an illustrative sketch follows the steps:

  • Step 1: Download all the projects used in the commit-level coupling analysis; they are listed in the file "Commit_projects.csv".

  • Step 2: Provide the path of the directory where the projects were downloaded as an argument.

  • Step 3: Execute the script: python3 coupled_commits.py <path_source_repository>
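
For reference, here is a minimal sketch of what such a commit-level coupling count can look like: it walks the commit history with PyDriller (an assumption, as are the helper names below) and counts the commits in which a DVC-pipeline file changes together with each source code artifact category. It is an illustration, not the exact content of coupled_commits.py.

```python
import sys
from collections import Counter
from pydriller import Repository  # assumption: PyDriller 2.x

def is_pipeline_file(path):
    # Simplified rule: real DVC-pipeline detection also checks the "cmd"/"deps" keywords
    return path is not None and path.endswith(".dvc")

def categorize(path):
    # Simplified artifact categories (source, test, data, gitignore, others)
    if path is None:
        return "others"
    lower = path.lower()
    if "test" in lower:
        return "test"
    if lower.endswith((".py", ".r", ".java")):
        return "source"
    if lower.endswith((".csv", ".json", ".parquet")):
        return "data"
    if lower.endswith(".gitignore"):
        return "gitignore"
    return "others"

coupled = Counter()
for commit in Repository(sys.argv[1]).traverse_commits():
    paths = [m.new_path or m.old_path for m in commit.modified_files]
    if any(is_pipeline_file(p) for p in paths):
        # Count every artifact category that changes together with a pipeline file
        for category in {categorize(p) for p in paths if not is_pipeline_file(p)}:
            coupled[category] += 1

print(coupled)
```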

We use a chi-squared (χ²) statistical test to validate the statistical significance of the coupling between changes to two categories A and B, for example DVC-data and Test. In the following we present a sample of the script we used to compute the statistical significance between the DVC-data category and the source code artifacts (source, test, data, gitignore, others); a small contingency-table sketch follows the steps.

  • Step 1: Download all the projects used in the pull-request-level coupling analysis; they are listed in the file "Pull_request_projects.csv".

  • Step 2: Provide the path of the directory where the projects were downloaded as an argument.

  • Step 3: Execute the script: python3 significance_pr.py <path_source_repository>
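
The test itself reduces to a chi-squared test on a 2x2 contingency table of co-changes. A minimal sketch using scipy.stats.chi2_contingency follows (the counts are placeholders, not results from the study):

```python
from scipy.stats import chi2_contingency

# 2x2 contingency table over pull requests (or commits):
#                     changed B    did not change B
#   changed A             30              10
#   did not change A      20              40
# Placeholder counts, not results from the study.
table = [[30, 10],
         [20, 40]]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
# A small p-value (e.g. < 0.05) means changes to A and B do not co-occur
# independently, i.e. the coupling is statistically significant.
```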

The coupling results are shown in the following plots for the commit-level and pull-request-level analyses.

Commit-level analysis (25 projects):

(Figure: internal coupling among DVC categories)

(Figure: coupling with source code artifacts)

Pull Request-level analysis (10 projects):

(Figure: internal coupling among DVC categories)

(Figure: coupling with source code artifacts)

Phase 3: Complexity evolution of ML pipelines

In the third part of this study, we studied the evolution of ML pipeline complexity over time in a list of 25 projects.

(Figure: Halstead vs. McCabe complexity)

(Figure: McCabe complexity)

(Figure: Halstead complexity)
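
As an illustration of how such complexity values can be obtained for the Python code behind a pipeline stage, the sketch below uses the radon library (McCabe cyclomatic complexity via cc_visit, Halstead metrics via h_visit). This is a sketch under our own assumptions; it is not necessarily the exact tooling or granularity behind the plots above.

```python
from radon.complexity import cc_visit
from radon.metrics import h_visit

# Toy source for a pipeline stage (placeholder, not code from the studied projects)
source = '''
def train(x, y, epochs):
    for epoch in range(epochs):
        if epoch % 10 == 0:
            print(epoch)
    return x, y
'''

# McCabe: one cyclomatic complexity value per function/method in the source
for block in cc_visit(source):
    print(block.name, "McCabe complexity:", block.complexity)

# Halstead: operator/operand based metrics for the same source
print("Halstead metrics:", h_visit(source))  # field layout depends on the radon version
```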