Primary language: C#. License: MIT.

RelationalGit

RelationalGit extracts information about commits, blame, changes, developers, and pull requests from git's data structures and imports it into a relational database such as Microsoft SQL Server. This data can serve as a basis for further source code mining analysis.

You can then query the database to answer many interesting questions. Because source code mining is an active topic in both academia and industry, RelationalGit aims to help researchers start their investigations more conveniently. For example, you can answer the following questions by running a simple SQL query over the extracted data.

  • Which files were recently changed by a given developer?
  • Who is the author of a specific line in a specific file? (Git Blame)
  • Which developer has the most commits?
  • Which files are usually changed together? This lets you detect and document hidden dependencies.
  • Which developer has the most knowledge about a file or project? This idea is based on Rigby's paper.
  • Which files change constantly? They may be bug-prone.
  • Who is the most appropriate developer to work on a given file?
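
As a rough illustration of the kind of query meant above, the sketch below answers "Which developer has the most commits?". It uses Python's built-in sqlite3 module with an in-memory stand-in for the Commits table so that it runs without SQL Server. The table name Commits and the column NormalizedAuthorName appear in the command descriptions of this README; the Sha column and the sample rows are assumptions made for the example, and the real schema may differ.

```python
import sqlite3


def build_demo_database():
    """Create an in-memory stand-in for the Commits table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Commits (Sha TEXT PRIMARY KEY, NormalizedAuthorName TEXT)"
    )
    rows = [
        ("a1", "alice"), ("a2", "alice"), ("a3", "bob"),
        ("a4", "alice"), ("a5", "carol"), ("a6", "bob"),
    ]
    conn.executemany("INSERT INTO Commits VALUES (?, ?)", rows)
    return conn


def top_committers(conn, limit=3):
    """Return (author, commit_count) pairs, most active author first."""
    query = """
        SELECT NormalizedAuthorName, COUNT(*) AS CommitCount
        FROM Commits
        GROUP BY NormalizedAuthorName
        ORDER BY CommitCount DESC, NormalizedAuthorName ASC
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()


if __name__ == "__main__":
    for author, count in top_committers(build_demo_database()):
        print(author, count)
```

On SQL Server the same idea would be written with TOP instead of LIMIT, since LIMIT is SQLite syntax.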

Dependencies

Before installing RelationalGit, you need to install the following dependencies.

1) .NET Core

You need the latest version of .NET Core.

2) SQL Server

Cross-Platform: Download the SQL Server Docker image to run SQL Server on your system. Then install Microsoft SQL Operations Studio or SQLCMD to query the database. On Windows, you can install SQL Server Management Studio instead.

Windows: You can download SQL Server (LocalDB, Express, or Developer edition) and SQL Server Management Studio free of charge.

3) PowerShell Core

You need to get the latest version of PowerShell Core. RelationalGit uses PowerShell for extracting blame information.

RelationalGit 💘 Open Source

RelationalGit is built on top of popular Git libraries. It uses LibGit2Sharp to extract data from git's data structures, and Octokit.net and Octokit.Extensions to extract data from GitHub.

Install (dotnet Global Tool)

RelationalGit is a .NET global tool. You can use it from your favorite command-line application.

dotnet tool install --global RelationalGit

Configuration File

You need to create a configuration file in the following format. At a minimum, the file must contain the ConnectionStrings:RelationalGit entry and a Mining section, which may be empty.

{
  "ConnectionStrings": {
    "RelationalGit": "Server=IP;User Id=user;Password=123;Database=db"
  },
  "Mining": {
    "Extensions": [ ".cs",".java",".scala", ".vb",".rs",".go",".s",".proto",".coffee", ".sql",".rb",".ruby",".ts", ".js", ".jsx", ".sh", ".tsx", ".py", ".c", ".h", ".cpp", ".il", ".make", ".cmake", ".ps1", ".r", ".cmd"],
    "GitBranch": "master",
    "RepositoryPath": "PATH_TO_REPO",
    "GitHubRepo": "REPO_NAME",
    "GitHubOwner": "REPO_OWNER",
    "GitHubToken": "GITHUB_TOKEN",
    "PeriodLength": 3,
    "PeriodType": "month",
    "MegaDevelopers": [ "dotnetbot","dotnetmaestro","dotnetmaestrobot","bors","dotnet bot","dotnetgitsyncbot","k8scirobot","k8smergerobot","dotnetautomergebot" ],
    "MegaCommitSize": 100,
    "ExtractBlames": true,
    "FilesAtRiksOwnersThreshold": 1,
    "FilesAtRiksOwnershipThreshold": 0.09999,
    "LeaversOfPeriodExtendedAbsence": 1,
    "MegaPullRequestSize": 100,
    "CoreDeveloperThreshold": 5000,
    "CoreDeveloperCalculationType": "ownership-lines",
    "KnowledgeSaveStrategyType": "reviewers-expertise-review",
    "KnowledgeSaveReviewerReplacementType": "one-of-actuals",
    "KnowledgeSaveReviewerFirstPeriod": "1",
    "SelectedReviewersType": "core",
    "LeaversType": "all",
    "BlamePeriods": [],
    "BlamePeriodsRange": [10,20],
    "ExcludedBlamePaths": ["*\\lib\\*"],
    "LgtmTerms": ["%lgtm%","%looks good%","%look good%","%seems good%","%seem good%","%sounds good%","%sound good%","%its good%","%its good%","%r+%","%good job%"],
    "MinimumActualReviewersLength": "0",
    "PullRequestReviewerSelectionStrategy": "0:nothing-nothing,-:replacerandom-1",
    "AddOnlyToUnsafePullrequests": true,
    "NumberOfPeriodsForCalculatingProbabilityOfStay": 4,
    "RecommenderOption": "alpha-1,beta-1,risk-3,hoarder_ratio-1",
    "ChangePast": true
  }
}

Tell RelationalGit where your configuration file is by passing the --conf-path argument. If you do not, it looks for a file named relationalgit.json in the user directory.
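
Before running any command it can help to confirm that the file has the two required sections. The sketch below is a small, unofficial check written for this README: the helper names are made up for the example, and it assumes that "user directory" means the current user's home directory.

```python
import json
from pathlib import Path


def default_config_path():
    # Assumption: "user directory" means the current user's home directory.
    return Path.home() / "relationalgit.json"


def missing_sections(config):
    """Return the names of required sections that are absent from the config."""
    problems = []
    connection_strings = config.get("ConnectionStrings")
    if not isinstance(connection_strings, dict) or "RelationalGit" not in connection_strings:
        problems.append("ConnectionStrings:RelationalGit")
    # The Mining section may be empty, but it has to be present.
    if not isinstance(config.get("Mining"), dict):
        problems.append("Mining")
    return problems


def check_file(path):
    """Load a JSON configuration file and report its missing sections."""
    with open(path, encoding="utf-8") as handle:
        return missing_sections(json.load(handle))


if __name__ == "__main__":
    path = default_config_path()
    if path.exists():
        print(check_file(path) or "required sections present")
    else:
        print(f"no configuration file at {path}")
```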

⭐ Commands

RelationalGit has several built-in commands for extracting git information and computing various knowledge loss scenarios. You can override the values you set in the configuration file by passing explicit arguments for each command.

For example, the following lines run the get-github-pullrequest-reviewer-comments command, which gathers all pull request comments from the GitHub repository identified by the GitHubOwner and GitHubRepo values in the settings file.

dotnet-rgit --conf-path "C:\Users\Ehsan Mirsaeedi\Documents\relationalgit.json" --cmd get-github-pullrequest-reviewer-comments
dotnet-rgit --cmd get-github-pullrequest-reviewer-comments # reads the settings file from the default location

Below is the complete list of commands.

get-git-commits

Extracts the commits from the repository referenced by the RepositoryPath parameter and fills the Commits table.

get-git-commits-changes

Extracts the changes introduced by each commit from the repository referenced by the RepositoryPath parameter and fills the CommittedChanges and CommitRelationship tables. It detects rename operations and assigns a canonical path to each file so that renames can be tracked.
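
To make the idea of a canonical path concrete, here is a minimal sketch. It is not RelationalGit's implementation, which this README does not describe. It assumes changes arrive in commit order as (old_path, new_path) pairs, with old_path set to None for a newly added file, and it uses the first name a file ever had as its canonical path.

```python
def assign_canonical_paths(changes):
    """Map every path seen in the history to the first name of the same file.

    changes: list of (old_path, new_path) pairs in commit order; old_path is
    None when the file is added rather than renamed (hypothetical input shape).
    """
    canonical = {}
    for old_path, new_path in changes:
        if old_path is None:
            # A new file is its own canonical path.
            canonical.setdefault(new_path, new_path)
        elif old_path in canonical:
            # A rename inherits the canonical path of the old name.
            canonical[new_path] = canonical[old_path]
        else:
            # Rename of a file we have not seen before: start a chain here.
            canonical[old_path] = old_path
            canonical[new_path] = old_path
    return canonical


if __name__ == "__main__":
    history = [
        (None, "src/a.cs"),
        ("src/a.cs", "src/b.cs"),
        ("src/b.cs", "lib/c.cs"),
        (None, "README.md"),
    ]
    for path, root in assign_canonical_paths(history).items():
        print(path, "->", root)
```

With a mapping like this, changes recorded under any of a file's names can be grouped under one canonical path.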

alias-git-names

Tries to resolve cases where one developer appears under several names by finding the unique users in the repository referenced by the RepositoryPath parameter. It fills the AliasedDeveloperNames table. You can manually edit this table's data to apply the final touches.
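
One simple way to think about alias resolution is sketched below. This is an illustrative heuristic only, not the algorithm used by alias-git-names: it assumes that identities sharing an email address (compared case-insensitively) belong to the same developer and picks the most frequently used name as the normalized one.

```python
from collections import Counter, defaultdict


def resolve_aliases(identities):
    """Map each (name, email) pair to a normalized developer name.

    identities: list of (name, email) tuples, one per commit (hypothetical
    input shape). Pairs with the same email are treated as one developer.
    """
    names_by_email = defaultdict(Counter)
    for name, email in identities:
        names_by_email[email.strip().lower()][name.strip()] += 1

    normalized = {}
    for name, email in identities:
        counts = names_by_email[email.strip().lower()]
        # Most frequent name wins; ties are broken alphabetically.
        best = min(counts.items(), key=lambda item: (-item[1], item[0]))[0]
        normalized[(name, email)] = best
    return normalized


if __name__ == "__main__":
    commits = [
        ("Jane Doe", "jane@example.com"),
        ("jdoe", "Jane@Example.com"),
        ("Jane Doe", "jane@example.com"),
        ("Bob", "bob@example.com"),
    ]
    for identity, name in resolve_aliases(commits).items():
        print(identity, "->", name)
```

A heuristic like this will miss developers who use several email addresses, which is one reason manual touch-ups of the resulting table remain useful.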

apply-git-aliased

Fills the NormalizedAuthorName column of the Commits table and the NormalizedDeveloperIdentity column of the CommittedChanges table with the normalized names computed by the alias-git-names command.

ignore-mega-commits

Turns on the ignore flag of mega commits and their associated blames. Also, commits and blames authored by mega developers are marked as ignored.
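
The rule described above can be sketched as follows. The MegaCommitSize and MegaDevelopers names come from the configuration file; everything else is an assumption made for the example, in particular that commit size is measured in changed files, that the comparison is strictly greater than, and that developer names are compared case-insensitively.

```python
def flag_ignored(commits, mega_commit_size, mega_developers):
    """Return the SHAs of commits that would be marked as ignored.

    commits: list of dicts with "sha", "author" and "changed_files" keys
    (hypothetical input shape).
    """
    mega_names = {name.lower() for name in mega_developers}
    ignored = []
    for commit in commits:
        too_large = commit["changed_files"] > mega_commit_size
        by_mega_developer = commit["author"].lower() in mega_names
        if too_large or by_mega_developer:
            ignored.append(commit["sha"])
    return ignored


if __name__ == "__main__":
    sample = [
        {"sha": "c1", "author": "alice", "changed_files": 3},
        {"sha": "c2", "author": "dotnetbot", "changed_files": 1},
        {"sha": "c3", "author": "bob", "changed_files": 250},
    ]
    print(flag_ignored(sample, 100, ["dotnetbot", "bors"]))
```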

periodize-git-commits

Breaks the project's history into periods.
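
As a sketch of what periodizing might look like with PeriodType set to month and PeriodLength set to 3, the function below assigns each commit date to a numbered period. The numbering from 1 and the choice to start the first period in the month of the first commit are assumptions for illustration; the command's actual boundaries may be computed differently.

```python
from datetime import date


def period_index(commit_date, first_commit_date, period_length_months=3):
    """Return the 1-based period number that a commit date falls into."""
    months_elapsed = (
        (commit_date.year - first_commit_date.year) * 12
        + (commit_date.month - first_commit_date.month)
    )
    return months_elapsed // period_length_months + 1


if __name__ == "__main__":
    first = date(2018, 1, 15)
    for day in (date(2018, 3, 31), date(2018, 4, 1), date(2019, 1, 1)):
        print(day.isoformat(), "-> period", period_index(day, first))
```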

extract-dev-info

Extracts the details of developers' contributions.

get-git-commit-blames-for-periods

Extracts files and blames of the last commit of each period.

get-github-pullrequests

Retrieves the list of all pull requests.

get-github-pullrequest-reviewers

Gets the list of reviewers assigned to pull requests.

get-github-pullrequest-reviewer-comments

Retrieves the list of inline comments made on pull requests.

get-github-pullrequests-files

Retrieves the files of pull requests.

get-pullrequest-issue-comments

Retrieves the list of comments made on the discussion thread of pull requests.

map-git-github-names

Links GitHub logins to the corresponding normalized unique author names.

compute-loss

Through historical simulations, we can evaluate the effectiveness of different approaches to reviewer recommendation. In these simulations, we replace the actual reviewers of pull requests with the recommendations generated by a given recommender. After a simulation, we can query the database to see how expertise, workload, and knowledge distribution change.
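
The toy replay below illustrates the idea of swapping reviewers and then comparing outcomes. It is a deliberately simplified sketch, not the tool's simulator: knowledge is modelled as the set of files a developer has authored or reviewed, the pull request dictionaries are a made-up input shape, and the two recommenders are only guesses at what strategy names such as Reality and NoReviews might mean.

```python
def simulate(pull_requests, recommend):
    """Replay pull requests in order, using recommended reviewers.

    Returns a dict mapping each developer to the set of files they have
    authored or reviewed by the end of the replay.
    """
    knowledge = {}
    for pr in pull_requests:
        reviewers = recommend(pr, knowledge)
        for developer in [pr["author"], *reviewers]:
            knowledge.setdefault(developer, set()).update(pr["files"])
    return knowledge


def reality(pr, knowledge):
    # Keep the reviewers who actually reviewed the pull request.
    return pr["actual_reviewers"]


def no_reviews(pr, knowledge):
    # Pretend that nobody reviewed anything.
    return []


if __name__ == "__main__":
    history = [
        {"author": "alice", "files": ["a.cs"], "actual_reviewers": ["bob"]},
        {"author": "bob", "files": ["b.cs"], "actual_reviewers": []},
    ]
    print(simulate(history, reality))
    print(simulate(history, no_reviews))
```

Comparing the two results shows how reviews spread knowledge: with the actual reviewers bob also knows a.cs, while without reviews each file is known only to its author.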

Complete Data Gathering Sample

For a complete data gathering run, you can execute the following script. It assumes that the settings file is at the default location (User Directory \ relationalgit.json) and that all required setting values are set.

dotnet-rgit --cmd get-git-commits
dotnet-rgit --cmd get-git-commits-changes
dotnet-rgit --cmd alias-git-names
dotnet-rgit --cmd apply-git-aliased
dotnet-rgit --cmd ignore-mega-commits
dotnet-rgit --cmd periodize-git-commits
dotnet-rgit --cmd get-git-commit-blames-for-periods
dotnet-rgit --cmd apply-git-aliased
dotnet-rgit --cmd ignore-mega-commits
dotnet-rgit --cmd get-github-pullrequests
dotnet-rgit --cmd get-github-pullrequest-reviewers
dotnet-rgit --cmd get-github-pullrequest-reviewer-comments
dotnet-rgit --cmd get-github-pullrequests-files
dotnet-rgit --cmd get-merge-events
dotnet-rgit --cmd get-pullrequest-issue-comments
dotnet-rgit --cmd map-git-github-names
dotnet-rgit --cmd extract-dev-info
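
If you prefer to drive the sequence above from a script, a small wrapper along these lines may help. It is a sketch, not part of RelationalGit: the function names are invented for the example, it only uses the --cmd and --conf-path arguments shown in this README, and it defaults to a dry run that prints the command lines without executing them.

```python
import subprocess

# The commands listed above, in the same order.
DATA_GATHERING_COMMANDS = [
    "get-git-commits",
    "get-git-commits-changes",
    "alias-git-names",
    "apply-git-aliased",
    "ignore-mega-commits",
    "periodize-git-commits",
    "get-git-commit-blames-for-periods",
    "apply-git-aliased",
    "ignore-mega-commits",
    "get-github-pullrequests",
    "get-github-pullrequest-reviewers",
    "get-github-pullrequest-reviewer-comments",
    "get-github-pullrequests-files",
    "get-merge-events",
    "get-pullrequest-issue-comments",
    "map-git-github-names",
    "extract-dev-info",
]


def build_command_lines(commands, conf_path=None):
    """Turn command names into argument lists for dotnet-rgit."""
    lines = []
    for command in commands:
        line = ["dotnet-rgit"]
        if conf_path:
            line += ["--conf-path", conf_path]
        line += ["--cmd", command]
        lines.append(line)
    return lines


def run_all(command_lines, dry_run=True):
    """Print each command line and, unless dry_run is set, execute it."""
    for line in command_lines:
        print(" ".join(line))
        if not dry_run:
            # check=True stops the sequence at the first failing command.
            subprocess.run(line, check=True)


if __name__ == "__main__":
    run_all(build_command_lines(DATA_GATHERING_COMMANDS), dry_run=True)
```

Stopping at the first failure matters here because later commands read the tables that earlier ones fill.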

Run the Simulations

The following commands run each recommendation strategy against the same configuration file. Here $corefx_conf is a shell variable that holds the path to that file.

dotnet-rgit --cmd simulate-recommender --recommendation-strategy NoReviews --conf-path $corefx_conf
dotnet-rgit --cmd simulate-recommender --recommendation-strategy Reality --conf-path $corefx_conf
dotnet-rgit --cmd simulate-recommender --recommendation-strategy cHRev --conf-path $corefx_conf
dotnet-rgit --cmd simulate-recommender --recommendation-strategy AuthorshipRec --conf-path $corefx_conf
dotnet-rgit --cmd simulate-recommender --recommendation-strategy RecOwnRec --conf-path $corefx_conf
dotnet-rgit --cmd simulate-recommender --recommendation-strategy RetentionRec --conf-path $corefx_conf
dotnet-rgit --cmd simulate-recommender --recommendation-strategy LearnRec --conf-path $corefx_conf
dotnet-rgit --cmd simulate-recommender --recommendation-strategy TurnoverRec --conf-path $corefx_conf
dotnet-rgit --cmd simulate-recommender --recommendation-strategy Sofia --conf-path $corefx_conf