G-Repo

Click above to see the tool demo on YouTube!

G-Repo is a tool written in Java and Python for Mining Software Repositories (MSR), i.e., collecting empirical evidence from the data available in software repositories. Examples of MSR studies include:

“When and Why Your Code Starts to Smell Bad (and Whether the Smells Go Away)”

“Do Developers Feel Emotions? An Exploratory Analysis of Emotions in Software Artifacts”

Many MSR studies use GitHub as a data source because:

It contains millions of open source repositories.
It provides a REST API to extract this data.

But which repositories should be chosen to conduct an MSR study?

A common approach is to select a number of top-starred repositories, i.e., the repositories that have received the most stars from GitHub users.

Problems ⛔

#1: Limitations of the GitHub API; the GitHub Search API, which can also be used to download information about repositories, returns at most 1000 results per query. If a query matches more than 1000 repositories, only the first 1000 (by default, the best matches) are returned.
#2: Repositories not containing files in the required programming language; the search often returns repositories that are not actually written in the requested programming language.
#3: Non-English repositories; not all repositories are written in English, so a search is very likely to return repositories whose README is written in one or more other languages, and these should be discarded.
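
One common way to live with the 1000-result cap from problem #1 is to split a broad query into narrower ones, for example by star range. The sketch below is only an illustration of that idea and is not taken from G-Repo's source: `split_star_ranges`, `to_qualifier` and the `count_fn` callback (which is assumed to return how many repositories match a given star range) are hypothetical names.

```python
from typing import Callable, List, Tuple

SEARCH_CAP = 1000  # maximum number of results the GitHub Search API returns per query


def split_star_ranges(lo: int, hi: int,
                      count_fn: Callable[[int, int], int]) -> List[Tuple[int, int]]:
    """Recursively halve the star range [lo, hi] until each piece fits under the cap.

    count_fn(lo, hi) is a caller-supplied (hypothetical) function that reports
    how many repositories have between lo and hi stars, inclusive.
    """
    # A single star value cannot be split further, even if it exceeds the cap.
    if lo == hi or count_fn(lo, hi) <= SEARCH_CAP:
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return (split_star_ranges(lo, mid, count_fn)
            + split_star_ranges(mid + 1, hi, count_fn))


def to_qualifier(lo: int, hi: int) -> str:
    """Render a range using the stars:n1..n2 qualifier syntax."""
    return f"stars:{lo}..{hi}"
```

Each returned range can then be searched separately and the results merged. A range that cannot be split further may still exceed the cap, so this reduces truncation rather than eliminating it.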

Launch Requirements

$ pip3 install six
$ pip3 install langdetect

Getting Started

To launch G-Repo.jar, move into the G-Repo-jar folder and run the command:

java -jar --module-path "absolute/path/to/javaFX-sdk/lib" --add-modules=javafx.controls,javafx.fxml G-Repo.jar

G-Repo lets you search for repositories using the native GitHub qualifiers listed below:

Type of Search	Qualifier to specify
By Repos Name	repo:owner/name
Within User’s or Organization’s Repos	user:USERNAME, org:ORGNAME
By Size in KB	size:n, size:(>;<=;>=;<)n, size:n1..n2
By Number of Followers	followers:n, followers:(>;<=;>=;<)n, followers:n1..n2
If Forked or not	fork:true(false)
By Number of Stars	stars:n, stars:(>;<=;>=;<)n, stars:n1..n2
By Language	language:LANGUAGE
By Topic	topic:TOPIC
By Number of Topics	topics:n, topics:(>;<=;>=;<)n, topics:n1..n2
By License	license:LICENSE
If Public or Private	is:public(private)
If a Mirror or not	mirror:true(false)
If Archived or not	archived:true(false)
By Number of Issues good-first	good-first-issues:>n
By Number of Issues help-wanted	help-wanted-issues:>n
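
To show how the qualifiers above combine into a single search string, here is a minimal sketch. `build_query` and `search_url` are hypothetical helpers written for this README, not part of G-Repo; the only external fact assumed is the public GitHub repository search endpoint, `https://api.github.com/search/repositories`.

```python
from urllib.parse import urlencode


def build_query(keywords: str = "", **qualifiers) -> str:
    """Join free-text keywords and qualifier:value pairs with spaces."""
    parts = [keywords] if keywords else []
    for name, value in qualifiers.items():
        # Python identifiers cannot contain '-', so good_first_issues
        # is rendered as good-first-issues.
        key = name.replace("_", "-")
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{key}:{value}")
    return " ".join(parts)


def search_url(query: str) -> str:
    """URL-encode the query for the GitHub repository search endpoint."""
    return "https://api.github.com/search/repositories?" + urlencode({"q": query})


if __name__ == "__main__":
    q = build_query(language="java", stars=">=100", fork=False)
    print(q)
    print(search_url(q))
```

Because `is` is a reserved word in Python, that qualifier has to be passed as `build_query(**{"is": "public"})`.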

In order for the search to be successful, you must have a valid token!

⚠️ The repositories will be cloned as part of the execution! ⚠️

The programming language detection feature identifies the programming or markup language most used within each repository; if a repository is empty, the result is "not classifiable".
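
The idea can be sketched as counting files per language in a cloned repository. This is an assumption about the general approach, not G-Repo's actual implementation: `most_used_language` is a hypothetical function, `EXTENSION_MAP` is a tiny stand-in for a real extension-to-language file, and counting files (rather than bytes or lines) is just one possible measure of "most used".

```python
import os
from collections import Counter

# Hypothetical, deliberately tiny extension-to-language map for illustration only.
EXTENSION_MAP = {
    ".java": "Java",
    ".py": "Python",
    ".js": "JavaScript",
    ".html": "HTML",
}


def most_used_language(repo_path: str, extension_map=EXTENSION_MAP) -> str:
    """Return the language with the most files under repo_path."""
    counts = Counter()
    for _root, dirs, files in os.walk(repo_path):
        # Skip Git metadata so it does not influence the count.
        dirs[:] = [d for d in dirs if d != ".git"]
        for name in files:
            ext = os.path.splitext(name)[1].lower()
            language = extension_map.get(ext)
            if language:
                counts[language] += 1
    # In this sketch, a repository with no recognized files (including an
    # empty one) is reported as not classifiable.
    if not counts:
        return "not classifiable"
    return counts.most_common(1)[0][0]
```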

G-Repo can also detect the natural language of a repository by analyzing its README.md file: the language detector script classifies each repository according to the language used in its README.md.

By default, the script used by G-Repo relies on a nondeterministic classification algorithm; this behavior comes from the design of the original Google project. To force a deterministic approach, set translation_type = 0. A repository is classified as unknown if its README file is missing, empty, too short, or contains only special characters, or if an exception is thrown while parsing it. Otherwise, it is classified as english, not-english or mixed.
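
The classification rules above can be sketched as follows. This is a hedged illustration, not G-Repo's actual script: `classify_readme`, `MIN_CHARS` and the two probability thresholds are hypothetical, and the exact rule G-Repo uses to decide between english, mixed and not-english may differ. The default detector uses langdetect's `detect_langs`, and setting `DetectorFactory.seed = 0` is the way langdetect documents to get reproducible results; how G-Repo's `translation_type` option relates to that seed is not shown here.

```python
from typing import Callable, List, Optional, Tuple

MIN_CHARS = 20  # hypothetical minimum amount of text needed to classify


def _langdetect_probs(text: str) -> List[Tuple[str, float]]:
    """Default detector: (language, probability) pairs from langdetect."""
    from langdetect import DetectorFactory, detect_langs
    DetectorFactory.seed = 0  # makes langdetect's results reproducible
    return [(r.lang, r.prob) for r in detect_langs(text)]


def classify_readme(text: Optional[str],
                    detect: Callable[[str], List[Tuple[str, float]]] = _langdetect_probs,
                    english_min: float = 0.9,
                    mixed_min: float = 0.1) -> str:
    """Classify README text as english, not-english, mixed or unknown."""
    stripped = (text or "").strip()
    # Missing, empty, too short, or only special characters -> unknown.
    if len(stripped) < MIN_CHARS or not any(c.isalpha() for c in stripped):
        return "unknown"
    try:
        probs = dict(detect(stripped))
    except Exception:
        # Any failure while analyzing the README also yields unknown.
        return "unknown"
    english = probs.get("en", 0.0)
    if english >= english_min:
        return "english"
    if english >= mixed_min:
        return "mixed"
    return "not-english"
```

Passing a custom `detect` callable makes it possible to try the rules without installing langdetect.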

References

The language-map repository was used to generate the file that drives the detection of the programming or markup language.
For natural language recognition, see Language Detection 🚀

MatHeartGaming/G-Repo
