/G-Repo

GUI for GHRepoSearcher. It lets you search online repositories on GitHub.

Primary Language: Java

G-Repo

Watch the video

Click above to see the tool demo on YouTube!

G-Repo is a tool, developed in Java and Python, that supports Mining Software Repositories (MSR) studies, i.e., studies that collect empirical evidence from the data available in software repositories. Examples of such studies include:

“When and Why Your Code Starts to Smell Bad (and Whether the Smells Go Away)”

“Do Developers Feel Emotions? An Exploratory Analysis of Emotions in Software Artifacts”

Many MSR studies use GitHub as a data source because:

  • It contains millions of open source repositories.
  • It provides a REST API to extract this data.

But which repositories should be chosen to conduct an MSR study?

  • A common approach is to select a number of top-starred repositories, i.e., the repositories that GitHub users have starred the most.

Problems ⛔

  • #1: Limitations of the GitHub API. The GitHub Search API, which also returns information about the repositories, provides at most 1000 results per query. If a query matches more than 1000 repositories, the results are truncated to the best matches.

  • #2: Repositories that do not contain files in the required programming language. The search often returns repositories that are not actually written in the requested programming language.

  • #3: Non-English repositories. Not all repositories are written in English, so a search is very likely to return repositories whose README is written in one or more other languages; these should be discarded.
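To make problem #1 concrete, the following minimal sketch computes which result pages of a query can actually be retrieved. It assumes the 1000-result cap described above and a page size of 100, the largest the Search API accepts; `plan_pages` is a hypothetical helper and not part of G-Repo.

```python
import math

MAX_SEARCH_RESULTS = 1000  # cap on results per query, as described above
PER_PAGE = 100             # largest page size accepted by the Search API


def plan_pages(total_count, per_page=PER_PAGE):
    """Return (pages_to_fetch, unreachable_results) for a query.

    total_count is the number of repositories the query matches.
    Hypothetical helper for illustration only.
    """
    reachable = min(total_count, MAX_SEARCH_RESULTS)
    pages = list(range(1, math.ceil(reachable / per_page) + 1))
    return pages, total_count - reachable


if __name__ == "__main__":
    pages, lost = plan_pages(4321)
    print(f"fetch pages {pages[0]}..{pages[-1]}, {lost} results are out of reach")
```

A query that matches 4321 repositories can therefore expose only its first 10 pages; the remaining results are never returned for that query.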


Launch Requirements

$ pip3 install six
$ pip3 install langdetect

Getting Started

To launch G-Repo.jar, move into the G-Repo-jar folder and run the following command:

java -jar --module-path "absolute/path/to/javaFX-sdk/lib" --add-modules=javafx.controls,javafx.fxml G-Repo.jar

G-Repo lets you search for repositories using the native GitHub search qualifiers listed below:

| Type of Search | Qualifier to specify |
| --- | --- |
| By Repository Name | repo:owner/name |
| Within a User’s or Organization’s Repositories | user:USERNAME, org:ORGNAME |
| By Size in KB | size:n, size:(>;<=;>=;<)n, size:n1..n2 |
| By Number of Followers | followers:n, followers:(>;<=;>=;<)n, followers:n1..n2 |
| If Forked or not | fork:true(false) |
| By Number of Stars | stars:n, stars:(>;<=;>=;<)n, stars:n1..n2 |
| By Language | language:LANGUAGE |
| By Topic | topic:TOPIC |
| By Number of Topics | topics:n, topics:(>;<=;>=;<)n, topics:n1..n2 |
| By License | license:LICENSE |
| If Public or Private | is:public(private) |
| If a Mirror or not | mirror:true(false) |
| If Archived or not | archived:true(false) |
| By Number of Good First Issues | good-first-issues:>n |
| By Number of Help Wanted Issues | help-wanted-issues:>n |
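As an illustration of how these qualifiers combine, here is a minimal sketch that joins several of them into one query string and encodes it for the GitHub repository search endpoint. The helpers `build_query` and `build_search_url` are hypothetical names and do not reflect how G-Repo builds its queries internally.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def build_query(**qualifiers):
    """Join qualifier:value pairs with spaces, as GitHub search syntax expects."""
    parts = []
    for name, value in qualifiers.items():
        # Python identifiers cannot contain '-', so good_first_issues
        # is written with underscores and converted here.
        key = name.replace("_", "-")
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{key}:{value}")
    return " ".join(parts)


def build_search_url(query, page=1, per_page=100):
    """Return the URL-encoded search URL for one page of results."""
    params = urlencode({"q": query, "page": page, "per_page": per_page})
    return f"{SEARCH_ENDPOINT}?{params}"


if __name__ == "__main__":
    query = build_query(language="java", stars=">=100", fork=False)
    print(query)
    print(build_search_url(query))
```

Range qualifiers from the table work the same way, for example `build_query(size="100..500")`.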

For the search to succeed, you must provide a valid GitHub token!
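For readers calling the API directly, the sketch below shows one way a token can be attached to a search request. It only builds the request and does not send it. The environment variable name `GITHUB_TOKEN` and the helper `make_request` are assumptions made for this example; how G-Repo itself stores and passes the token is not shown here.

```python
import os
import urllib.request


def make_request(url, token):
    """Build (but do not send) an authenticated GitHub API request."""
    if not token:
        raise ValueError("a valid GitHub token is required")
    return urllib.request.Request(
        url,
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


if __name__ == "__main__":
    # GITHUB_TOKEN is a hypothetical variable name chosen for this sketch.
    token = os.environ.get("GITHUB_TOKEN", "")
    url = "https://api.github.com/search/repositories?q=language%3Ajava"
    try:
        request = make_request(url, token)
        print("request ready for", request.full_url)
    except ValueError as error:
        print("cannot search:", error)
```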

⚠️ To complete the execution, G-Repo clones the repositories! ⚠️

The programming language detection feature identifies the programming or markup language most used within each repository; if a repository is empty, its result is "not classifiable".
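The idea can be sketched as counting files per language and keeping the most frequent one. This is only an illustration under stated assumptions: the tiny `EXTENSION_TO_LANGUAGE` map is hypothetical (G-Repo uses a file generated from the language-map repository), and counting files rather than bytes or lines is a choice made for this example.

```python
from collections import Counter
from pathlib import Path

# Hypothetical, deliberately tiny extension map used only for this sketch.
EXTENSION_TO_LANGUAGE = {
    ".java": "Java",
    ".py": "Python",
    ".js": "JavaScript",
    ".html": "HTML",
}


def most_used_language(file_paths):
    """Return the language with the most files, or 'not classifiable'."""
    counts = Counter()
    for path in file_paths:
        language = EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower())
        if language:
            counts[language] += 1
    if not counts:
        # Empty repository, or no file with a known extension.
        return "not classifiable"
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    files = ["src/Main.java", "src/Util.java", "scripts/detect.py", "README.md"]
    print(most_used_language(files))
```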

G-Repo can also detect the natural language of a repository by analyzing its README.md file: the language detector script classifies each repository according to the language used within its README.md.

By default, the script used by G-Repo relies on a nondeterministic classification algorithm; this behavior comes from the design of the original Google project. To force a deterministic approach, set translation_type = 0. If the repository lacks a README file, or if the README is empty, does not contain enough text, or contains only special characters, the repository is classified as unknown. The same applies if a repository raises an exception during parsing. Otherwise, the repository is classified as english, not-english, or mixed.
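The classification flow described above can be sketched with the langdetect package listed in the launch requirements. Several details are assumptions made for illustration: the thresholds `MIN_LETTERS` and `ENGLISH_THRESHOLD`, the rule used to decide mixed, and the function names are hypothetical. Setting `DetectorFactory.seed` is how langdetect itself is made deterministic; how `translation_type` maps to it inside G-Repo's script is not confirmed here.

```python
import re

MIN_LETTERS = 20         # hypothetical minimum amount of text
ENGLISH_THRESHOLD = 0.9  # hypothetical cut-off for "english"


def classify_probabilities(probabilities):
    """Map {language code: probability} to one of the four labels."""
    if not probabilities:
        return "unknown"
    english = probabilities.get("en", 0.0)
    if english >= ENGLISH_THRESHOLD:
        return "english"
    if english == 0.0:
        return "not-english"
    return "mixed"


def classify_readme(text, deterministic=True):
    """Classify README text as english, not-english, mixed, or unknown."""
    # Count letters only, so READMEs made of special characters are rejected.
    letters = re.findall(r"[^\W\d_]", text or "")
    if len(letters) < MIN_LETTERS:
        return "unknown"
    try:
        from langdetect import DetectorFactory, detect_langs
        if deterministic:
            DetectorFactory.seed = 0  # fixed seed gives repeatable results
        guesses = detect_langs(text)
    except Exception:
        # Any failure during detection is treated as unknown.
        return "unknown"
    return classify_probabilities({g.lang: g.prob for g in guesses})


if __name__ == "__main__":
    print(classify_readme("This tool searches GitHub repositories for MSR studies."))
```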

References

  • The language-map repository was used to generate the file that maps files to programming and markup languages during detection.
  • For the language recognition see Language Detection 🚀