This app was developed to perform tasks and visualize data during a master's thesis. It consists of a REST API and a web-based UI. Some tasks, especially the creation of a random dataset and the collection of source code metrics in GitHub repositories, can be reused easily.
To install this project, clone the repo and run `npm install`.
The main NodeJS application can be started with `npm start`. With `npm run debug`, the application is started in self-reloading development mode.
For correlation analysis, a small Python server is also included in this project. To start the Python server, use `npm run py-start` or `py-debug`.
You can open the UI of the main app in your browser at `localhost/ui/index`. From there, most tasks can be performed in the UI.
The app uses environment variables for authentication data for the GitHub and Libraries.io APIs. Create a file called `.env` in the root directory with the following entries:
    GITHUB_KEY=<github_api_key>
    LIO_KEY=<libraries_io_api_key>
The LIO key is optional and only needed when you want to use the LibrariesIO lib.
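If the application loads these variables with the common dotenv package, they become available on `process.env`. A minimal sketch, assuming dotenv is used (the project's actual loading code may differ):

```js
// Sketch assuming the dotenv package loads the .env file from the root directory;
// the project's actual setup may differ.
require('dotenv').config();

const githubKey = process.env.GITHUB_KEY; // required for GitHub API requests
const lioKey = process.env.LIO_KEY;       // optional, only needed for the Libraries.io lib
```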
All data that is created is saved in projects as JSON files. You can see some example projects with different datasets listed on the home page. You can also create a new project with a new dataset. When creating a project, you have the following options:
- Type: GitHub or NPM as the source of projects/repositories
- Name: Name of the project
- Size: Number of projects/repositories in the dataset
- Languages: Any languages that are available on GitHub/NPM
- Query additions: For GitHub projects, you can specify query options to restrict the sampling frame
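A query addition is most likely passed on to GitHub's repository search and therefore uses GitHub's search qualifier syntax. As an illustration (the exact qualifiers used for the thesis datasets may differ), a restriction to repositories with a minimum number of stars could look like this:

```
stars:>=100
```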
On the project page, you can start the necessary tasks for dataset creation.
- Create "counts": The tool calculates the size of the sampling frame by fetching the number of repositories for every language and year. This fine-grained process is necessary because the GitHub-API restricts query execution times.
- Fetch randoms: Random numbers are fetched from random.org for repository selection.
- Fetch repositories: The tool creates the dataset by using the random numbers as repository indices (repositories are sorted by creation date). This may take some time.
The resulting data is stored in the local project directory.
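The sketch below illustrates how this sampling can be pieced together: the per-language-and-year counts span the sampling frame, and each random number is mapped to one bucket plus an index inside it. The shape of the counts data and all names are assumptions for illustration, not the project's actual implementation.

```js
// Minimal sketch (not the project's actual code) of mapping random numbers to
// repositories in the sampling frame. The counts structure and values below are
// made-up placeholders for illustration.
const counts = {
  javascript: { 2014: 1200, 2015: 1800 },
  typescript: { 2014: 300, 2015: 700 },
};

// Flatten the counts into cumulative buckets covering the whole sampling frame.
function buildFrame(counts) {
  const frame = [];
  let offset = 0;
  for (const [language, years] of Object.entries(counts)) {
    for (const [year, count] of Object.entries(years)) {
      frame.push({ language, year, offset, count });
      offset += count;
    }
  }
  return { frame, size: offset };
}

// Map a random number (1..size), e.g. one fetched from random.org, to a concrete
// repository: the bucket determines language and creation year, the remainder is
// the index within that bucket's search results, sorted by creation date.
function resolveRandom(frame, random) {
  const bucket = frame.find((b) => random > b.offset && random <= b.offset + b.count);
  return { language: bucket.language, year: bucket.year, index: random - bucket.offset };
}

const { frame, size } = buildFrame(counts);
console.log(resolveRandom(frame, 2500)); // -> { language: 'javascript', year: '2015', index: 1300 }
```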
The tool implements many tasks that build on the created dataset. You can study the code or UI yourself, or contact me if you are interested in more information. Examples:
- Counting and fetching commits and pull requests
- Check if project has package.json
- Find project dependencies
- Download repos
- Check if repo contains tests
- Calc static metrics for repo
- Guide through manual repo data collection including dynamic metrics, categories, test framework etc.
- Manually establish connections between test and source code files
- Calc correlations between test and source code metrics
- Calc testability scores for files, projects, categories etc.
- Mine refactorings from JS repos (needs additional apps that are not included)
- Classify and add refactorings
- Display and list relevant data like repos, commits, refactorings etc.
The repository is structured as follows:
- /knowledge: data that is used globally
- /lib: backend libraries; they are structured for reuse and partly documented
- /projects: project-specific data
- /scripts: currently only holds `playground`, which was used for one-time tasks during the thesis
- /server: backend applications, i.e. the main NodeJS app and the Python app
- /templates: configuration templates for the test frameworks that were used while collecting dynamic metrics in the repositories
- /ui: HTML, CSS, JS and fonts that are used for the UI
- .env: Environment variables
All data that is collected in the course of the thesis is stored as JSON files in the /projects directory. While all projects in this folder were established and used at some point, two are of special relevance to the final results of the thesis. These projects are:
- `version_1`: The complete dataset with 384 repositories. It is used to classify repositories regarding the presence of tests and other metadata, as well as for the calculation of testability scores.
- `version_1_new`: Repositories from `version_1` that contain test cases. The project was created to make working with the data clearer and more comprehensible. It is used for metric collection and the search for testability refactorings.
The other projects had the following purposes:
- `active_projects`: Exploration; the dataset includes only projects that had activity within one year before the creation date of the dataset.
- `npm`: A random selection of projects on NPM, retrieved by randomly choosing entries from the NPM registry. A complete list of entries from the registry can be found in `knowledge/npmall.json`.
- `unrestricted`: No query restrictions used (no minimum number of stars).
- `version_1_new_performance`: Repositories from `version_1` that contain performance test cases. The project was created to make working with the data clearer and more comprehensible.
The following list explains where to find which data within the project folders:
- `commits.json`: A list of commit data for commits that modify pairs of test and source code files
- `counts.json`: Numbers of repositories per language and timeframe
- `dependencies.json`: Aggregated usage statistics of dependencies from all repositories
- `evaluation.json`: Correlation analysis results
- `level.json`: Testability scores for repositories, files, test frameworks, categories, file types and execution environments
- `metrics.json`: Collected metrics for files in repositories
- `project.json`: General information about the dataset
- `prs.json`: Pull requests associated with stored commits
- `randoms.txt`: Random numbers used for dataset creation
- `refactorings.json`: Data for all refactorings found automatically and manually in commits, with labeling for testability refactorings
- `repos.json`: Data for all repositories in the dataset
- `/evaluation` directory: Jupyter notebooks for visualization of data during evaluation