- About the Project
- Phase 1: DVC project characteristics
- Phase 2: DVC coupling with source code artifacts
- Phase 3: Complexity evolution of ML pipelines
Mining software repositories related to datasets and machine learning traceability
In the following package, we share a list of mined repositories used to study the co-evolution between DVC (ML and data tracking) files and source code artifacts. We classified the DVC files into three classes:
- DVC-data: DVC files that only track data
- DVC-pipeline: DVC files that track a pipeline; they are characterized by the keywords "cmd" (the executed command) and "deps" (the stage dependencies)
- DVC-utility: DVC files within the .dvc folder, i.e., .dvc/config, .dvc/gitignore.
This is how DVC can be used inside a repository:
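For illustration, a DVC-pipeline stage file might look as follows. The "cmd" and "deps" keywords are the ones that distinguish the DVC-pipeline class; the file name and stage contents below are hypothetical:

```yaml
# Hypothetical DVC-pipeline stage file (train.dvc)
cmd: python3 train.py    # "cmd": the executed command
deps:                    # "deps": the stage dependencies
- path: train.py
- path: data/train.csv
outs:
- path: model.pkl
```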
We classified the source code artifacts into five categories (Source code, Test, Data, Gitignore, Others).
In the first part of this study, we analyzed 391 projects gathered from GitHub on 28 February 2020. We explore the usage of DVC in these repositories and their characteristics. The repositories are listed in the file
- The following figure shows how long these projects waited before trying DVC (from the day the repository was created until the first day using DVC).
- The following figure shows how long the projects have been using DVC since the first commit introducing a DVC file.
The following figure shows the remote storage used in these repositories.
- We plot the distribution of DVC file changes over each project's commits, chronologically, grouped into chunks of 10%.
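The chunking above can be sketched as follows. This is an illustrative sketch, not the study's actual plotting script; the input format (one list of changed file paths per commit) is an assumption:

```python
# Sketch (not the authors' script): bucket a project's commits chronologically
# into ten 10% chunks and count the DVC-file changes falling in each chunk.

def dvc_change_distribution(commits):
    """commits: chronologically ordered list of lists of changed file paths.
    Returns a list of 10 counts of DVC-file changes, one per 10% chunk."""
    counts = [0] * 10
    n = len(commits)
    for i, changed_files in enumerate(commits):
        chunk = min(i * 10 // n, 9)  # 0..9: position of the commit in history
        counts[chunk] += sum(1 for f in changed_files
                             if f.endswith(".dvc") or f.startswith(".dvc/"))
    return counts

if __name__ == "__main__":
    history = [["data.dvc"], ["train.py"], ["model.dvc", "train.py"],
               ["README.md"], [".dvc/config"]]
    print(dvc_change_distribution(history))  # -> [1, 0, 0, 0, 1, 0, 0, 0, 1, 0]
```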
In the second part of this study, we studied the coupling between the different DVC categories and source code artifacts at two levels: commit and pull request.
In the following, we present a sample of the script we used to compute the coupling between the DVC-pipeline category and the source code artifacts (source, test, data, gitignore, others):
- Step 1: Download all the projects used in the commit-level coupling analysis, listed in the file "Commit_projects.csv".
- Step 2: Provide, as an argument, the path of the directory where the projects were downloaded.
- Step 3: Execute the script: python3 coupled_commits.py <path_source_repository>
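The core of the commit-level coupling computation can be sketched as below. This is a simplified illustration, not the actual coupled_commits.py; the file-classification rules and input format are assumptions:

```python
# Sketch (not the authors' coupled_commits.py): count commits in which the
# DVC-pipeline category and each source-code artifact category change together.

def classify(path):
    """Hypothetical, simplified file classifier; the study's rules may differ."""
    if path.endswith(".dvc") or path in ("dvc.yaml", "dvc.lock"):
        return "dvc-pipeline"  # coarse: treats any DVC stage file as pipeline
    if "test" in path.lower():
        return "test"
    if path.endswith((".py", ".java", ".c", ".cpp")):
        return "source"
    if path.endswith((".csv", ".json", ".parquet")):
        return "data"
    if path.endswith(".gitignore"):
        return "gitignore"
    return "others"

def coupled_commits(commits):
    """commits: list of lists of changed file paths, one list per commit.
    Returns {category: number of commits where it co-changed with DVC-pipeline}."""
    coupling = {}
    for changed in commits:
        categories = {classify(p) for p in changed}
        if "dvc-pipeline" in categories:
            for cat in categories - {"dvc-pipeline"}:
                coupling[cat] = coupling.get(cat, 0) + 1
    return coupling
```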
We use a chi-squared (χ²) statistical test to validate the statistical significance of the coupling between changes to two categories A and B (for example, DVC-data and test). We present in the following a sample of the script we used to compute the statistical significance between the DVC-data category and the source code artifacts (source, test, data, gitignore, others).
- Step 1: Download all the projects used in the pull-request-level coupling analysis, listed in the file "Pull_request_projects.csv".
- Step 2: Provide, as an argument, the path of the directory where the projects were downloaded.
- Step 3: Execute the script: python3 significance_pr.py <path_source_repository>
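The chi-squared test on a 2x2 co-change contingency table can be sketched as follows. This is an illustration of the statistical test, not the actual significance_pr.py; the function name and table layout are assumptions:

```python
import math

# Sketch: chi-squared test of association between changes to category A
# (e.g. DVC-data) and category B (e.g. test) over a set of commits or PRs.

def chi2_2x2(a, b, c, d):
    """2x2 contingency table of co-change counts:
                    B changed   B unchanged
      A changed         a            b
      A unchanged       c            d
    Returns (chi-squared statistic, p-value) for 1 degree of freedom."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-squared distribution with 1 df:
    # P(X > x) = erfc(sqrt(x / 2))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

if __name__ == "__main__":
    chi2, p = chi2_2x2(10, 20, 30, 40)
    print(f"chi2 = {chi2:.4f}, p = {p:.4f}")
```

A small p-value (e.g. below 0.05) indicates that the co-changes of A and B are unlikely under independence, i.e., the coupling is statistically significant.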
The results of the coupling analysis, at both the commit and pull request levels, are shown in the following plots.
In the third part of this study, we studied the complexity evolution of the ML pipelines over time in a list of 25 projects.
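One simple way to quantify pipeline complexity at a given commit is to count the stages and their declared dependencies; iterating such a metric over a project's history yields the complexity evolution. The metric below is an illustration, not necessarily the one used in the study:

```python
# Sketch (hypothetical metric): pipeline complexity as the number of stages
# and dependency edges, given parsed DVC stage dicts (each may hold "deps").

def pipeline_complexity(stages):
    """stages: list of parsed DVC stage dicts, e.g. loaded from stage files.
    Returns counts of stages and of declared stage dependencies."""
    n_deps = sum(len(stage.get("deps", [])) for stage in stages)
    return {"stages": len(stages), "deps": n_deps}
```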