RABBIT is a recursive acronym for "RABBIT is an Activity-Based Bot Identification Tool". It is based on BIMBAS (Bot Identification Model Based on Activity Sequences), a binary classification model that identifies bot contributors based on their recent activities on GitHub. RABBIT is efficient: it can classify thousands of accounts per hour without reaching GitHub's API rate limit of 5,000 queries per hour for authenticated users.
The tool has been developed by Natarajan Chidambaram, a researcher at the Software Engineering Lab of the University of Mons (Belgium), as part of his PhD research in the context of DigitalWallonia4.AI research project ARIAC (grant number 2010235) and TRAIL.
The tool was developed as part of the research article "A Bot Identification Model and Tool based on GitHub Activity Sequences", which has been submitted to the Journal of Systems and Software.
RABBIT accepts a GitHub contributor name (login name) and/or a text file of multiple login names (one name per line). It requires a GitHub API key if more than 15 queries need to be made per hour.

First, the tool checks whether the login name corresponds to a valid existing GitHub contributor and returns invalid otherwise. If the login name corresponds to a GitHub App (based on the login name ending with [bot] and the Bot type returned by a call to the GitHub Users API), the tool directly determines the type as bot without querying its events.

For the remaining login names, BIMBAS determines the type of contributor as bot, human or unknown in four steps. The first step extracts the latest public events performed by the contributor on GitHub, using one or more queries to the GitHub Events API. If the number of events retrieved is below the required threshold, the prediction is unknown due to a lack of data. If enough events are available, the second step converts the events into activities (belonging to 24 different activity types). The third step computes the contributor's behavioural features. The fourth step executes BIMBAS and returns the type of contributor (bot or human) along with a confidence score between 0 and 1 (inclusive).
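The decision flow described above can be sketched roughly as follows. This is an illustration only, not RABBIT's actual code: the function name `determine_type`, its parameters, and the `predict` callable standing in for BIMBAS are all hypothetical, and returning `None` as the confidence for invalid and unknown results is an assumption.

```python
from typing import Callable, Optional, Sequence, Tuple

Prediction = Tuple[str, Optional[float]]


def determine_type(
    login: str,
    user_type: Optional[str],
    events: Sequence[dict],
    predict: Callable[[Sequence[dict]], Tuple[str, float]],
    min_events: int = 5,
) -> Prediction:
    """Hypothetical helper mirroring the flow described in the text."""
    # No such account: the Users API lookup found nothing (assumed to be None here).
    if user_type is None:
        return ("invalid", None)
    # GitHub Apps are labelled as bots directly, without looking at their events.
    if login.endswith("[bot]") and user_type == "Bot":
        return ("bot", 1.0)
    # Too few events retrieved: not enough data for a prediction.
    if len(events) < min_events:
        return ("unknown", None)
    # Otherwise defer to the classifier (a stand-in for BIMBAS).
    return predict(events)
```

Passing the classifier in as a callable keeps the sketch self-contained; the real tool computes activities and behavioural features before calling the model.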
Note about misclassifications. RABBIT is based on a machine learning classification model (BIMBAS) that is trained and validated on a ground-truth dataset, and it cannot reach a precision and recall of 100%. When running it on a set of GitHub contributors of your choice, misclassifications are therefore possible (humans misclassified as bots, or vice versa). If you encounter such situations while running the tool, please inform us so that we can further improve the accuracy of the classification model. A known cause of misclassifications is an insufficient number of activities available for the contributor.
To avoid conflicts with packages already installed on your machine, it is recommended to install and run RABBIT in a Python virtual environment. You can use any virtual environment tool of your choice. Below are the steps to install virtualenv and create a virtual environment with it.
Use the following command to install virtualenv:
pip install virtualenv
Create a virtual environment in the folder where you want to place your files:
virtualenv <envname>
Start using the environment by:
source <envname>/bin/activate
After running this command your command line prompt will change to (<envname>) ...
and now you can install RABBIT with the pip command.
When you are finished running the tool, you can quit the environment by:
deactivate
To install RABBIT, execute the following command:
pip install git+https://github.com/natarajan-chidambaram/RABBIT
Alternatively, RABBIT is available via Nix.
To execute RABBIT for many contributors (if more than 15 API queries are required per hour), you need to provide a GitHub personal access token (API key). You can follow GitHub's instructions for creating a personal access token to obtain one.
You can execute the tool with all default parameters by running rabbit <LOGIN_NAME>.
Here is the list of parameters:
<LOGIN_NAME>
Any number of positional arguments specifying the login names of the contributors for which the type needs to be determined.
Example: $ rabbit natarajan-chidambaram tommens
--input-file <path/to/loginnames.txt>
A text input file with the login names (one name per line) of the contributors for which the type needs to be determined.
Example: $ rabbit --input-file logins.txt
Either the positional argument <LOGIN_NAME> or --input-file is mandatory. If both are given, the accounts given with --input-file will be processed after the accounts given as positional arguments.
--key <APIKEY>
GitHub personal access token (key) to extract events from the GitHub Events API.
Note: APIKEY (--key) is mandatory if more than 15 queries need to be made per hour.
Example: $ rabbit --input-file logins.txt --key token
You can obtain an access token as described earlier.
--min-events <MIN_EVENTS>
Minimum number of events that are required to determine the type of contributor.
Example: $ rabbit --input-file logins.txt --min-events 10
The default minimum number of events is 5.
--min-confidence <MIN_CONFIDENCE>
Minimum confidence on contributor type to stop further querying.
Example: $ rabbit --input-file logins.txt --min-confidence 0.5
The default minimum confidence is 1.0
--max-queries <NUM_QUERIES>
Maximum number of queries that will be made to the GitHub Events API for each contributor.
Example: $ rabbit --input-file logins.txt --max-queries 2
The default number of queries is 3; allowed values are 1, 2 or 3.
--verbose
Report the number of events, the number of activities and the values of the features that were used to determine the type of contributor.
Example: $ rabbit --input-file logins.txt --verbose
The default value is False.
--json <FILE_NAME.json>
Saves the result in JSON format.
Example: $ rabbit --input-file logins.txt --json output.json
--csv <FILE_NAME.csv>
Saves the result in comma-separated values (CSV) format.
Example: $ rabbit --input-file logins.txt --csv output.csv
--incremental
Method of reporting the results. If provided, the result for each contributor is reported as soon as its type is determined. If not provided, the results are reported after the types of all provided contributors have been determined.
Example: $ rabbit --input-file logins.txt --key token --incremental
The default value is False.
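One plausible way for --min-events, --min-confidence and --max-queries to interact is sketched below. This is an assumption about the querying loop, not RABBIT's actual implementation: `query_until_confident`, `fetch_page` and `predict` are hypothetical names, with `fetch_page` standing in for one call to the GitHub Events API and `predict` standing in for BIMBAS.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def query_until_confident(
    fetch_page: Callable[[int], Sequence[dict]],
    predict: Callable[[Sequence[dict]], Tuple[str, float]],
    min_events: int = 5,
    min_confidence: float = 1.0,
    max_queries: int = 3,
) -> Tuple[str, Optional[float]]:
    """Query page by page and stop once the prediction is confident enough."""
    events: List[dict] = []
    result: Tuple[str, Optional[float]] = ("unknown", None)
    for page in range(1, max_queries + 1):
        new_events = fetch_page(page)
        if not new_events:
            break  # nothing more to retrieve for this contributor
        events.extend(new_events)
        if len(events) < min_events:
            continue  # still too few events to attempt a prediction
        result = predict(events)
        if result[1] >= min_confidence:
            break  # confident enough: skip the remaining queries
    return result
```

Under this reading, the default minimum confidence of 1.0 means the loop rarely stops early, so up to --max-queries queries are made, while a lower threshold trades some confidence for fewer API calls.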
With positional arguments
$ rabbit natarajan-chidambaram tensorflow-jenkins
contributor type confidence
natarajan-chidambaram human 0.984
tensorflow-jenkins bot 0.878
With GitHub Apps (Note: App names end with '[bot]' and must be given within quotes)
$ rabbit natarajan-chidambaram tensorflow-jenkins "github-actions[bot]"
contributor type confidence
natarajan-chidambaram human 0.984
tensorflow-jenkins bot 0.878
github-actions[bot] bot 1.0
With --input-file
$ rabbit --input-file logins.txt
contributor type confidence
tensorflow-jenkins bot 0.878
johnpbloch-bot bot 0.996
github-actions[bot] bot 1.0
With --key
$ rabbit --input-file logins.txt --key token
contributor type confidence
tensorflow-jenkins bot 0.878
johnpbloch-bot bot 0.996
github-actions[bot] bot 1.0
With combined use of positional arguments and --input-file
$ rabbit natarajan-chidambaram --input-file logins.txt
contributor type confidence
natarajan-chidambaram human 0.984
tensorflow-jenkins bot 0.878
johnpbloch-bot bot 0.796
github-actions[bot] bot 1.0
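The processing order in the combined example above (positional arguments first, then the names from the input file) can be sketched as follows. The helper `collect_logins` is hypothetical, and skipping blank lines in the file is an assumption.

```python
from typing import Iterable, List


def collect_logins(positional: Iterable[str], file_lines: Iterable[str]) -> List[str]:
    """Return login names in processing order: positional first, then file."""
    logins = list(positional)
    for line in file_lines:
        name = line.strip()  # drop the trailing newline and stray spaces
        if name:  # assumption: empty lines are ignored
            logins.append(name)
    return logins


file_lines = ["tensorflow-jenkins\n", "johnpbloch-bot\n", "github-actions[bot]\n"]
print(collect_logins(["natarajan-chidambaram"], file_lines))
```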
With --min-events
$ rabbit --input-file logins.txt --min-events 10
contributor type confidence
tensorflow-jenkins bot 0.878
johnpbloch-bot bot 0.796
github-actions[bot] bot 1.0
With --min-confidence
$ rabbit --input-file logins.txt --min-confidence 0.5
contributor type confidence
tensorflow-jenkins bot 0.832
johnpbloch-bot bot 0.659
github-actions[bot] bot 1.0
With --max-queries
$ rabbit --input-file logins.txt --max-queries 1
contributor type confidence
tensorflow-jenkins bot 0.832
johnpbloch-bot bot 0.796
github-actions[bot] bot 1.0
natarajan-chidambaram human 0.984
With --verbose
$ rabbit --input-file logins.txt --verbose
contributor events activities NA NT NOR ... NAT_std NAT_gini NAT_IQR type confidence
tensorflow-jenkins 160 160 160 4 2 ... 17.093 0.541 15.503 bot 0.878
johnpbloch-bot 300 300 300 3 1 ... 23.452 0.724 21.451 bot 0.796
github-actions[bot] NaN - - - - ... - - - bot 1.0
natarajan-chidambaram 74 74 74 5 1 ... 14.834 0.924 12.113 human 0.984
With --csv or --json
$ rabbit --input-file logins.txt --csv types.csv
$ rabbit --input-file logins.txt --json types.json
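A saved CSV file can then be post-processed with Python's standard csv module. The sketch below assumes the file has a header row with the columns contributor, type and confidence, mirroring the console output shown in the examples; the actual layout written by --csv may differ, and `bots_from_csv` is a hypothetical helper.

```python
import csv
import io
from typing import Iterable, List


def bots_from_csv(lines: Iterable[str]) -> List[str]:
    """Return the contributors labelled as bot (column names are assumed)."""
    reader = csv.DictReader(lines)
    return [row["contributor"] for row in reader if row["type"] == "bot"]


# In-memory stand-in for a file such as types.csv, using values from the examples above.
sample = io.StringIO(
    "contributor,type,confidence\n"
    "natarajan-chidambaram,human,0.984\n"
    "tensorflow-jenkins,bot,0.878\n"
)
print(bots_from_csv(sample))
```

With a real file, pass an open file object instead, for example `open("types.csv", newline="")`.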
With --incremental
$ rabbit --input-file logins.txt --incremental
This tool is distributed under the Apache-2.0 license.