GitHub alias merging script

This script tries to identify different aliases, i.e., ([optional login], name, email) tuples, used by the same person in GitHub / GHTorrent data.

There are several reasons why multiple aliases occur. For example, since on GitHub the name and email address of committers and authors are set locally in each developer's git client, rather than globally at GitHub level, there is variation in these attributes across devices and time. Moreover, GHTorrent may introduce artificial user accounts when encountering contributions by "unknown" users while crawling data from GitHub's API.

Input

A CSV file or database table, such as the users table in GHTorrent. See the Alias class for possible fields.

Important: each alias must have a unique numeric id. The script will produce a map of alias ids to person ids.
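Since the whole output is keyed on alias ids, it may be worth checking the input before running the script. The following is a minimal sketch, not part of the script itself: the helper name `check_alias_ids` and the column names `id`, `login`, `name`, `email` are assumptions made for illustration, so adjust them to your own file.

```python
import csv
import io


def check_alias_ids(rows, id_field="id"):
    """Return the set of alias ids, raising ValueError on a
    missing, non-numeric, or duplicate id.

    `rows` is any iterable of dicts, e.g. a csv.DictReader.
    The field name "id" is an assumption, not the script's schema.
    """
    seen = set()
    for row_no, row in enumerate(rows, start=1):
        raw = (row.get(id_field) or "").strip()
        if not (raw.isascii() and raw.isdigit()):
            raise ValueError(f"row {row_no}: id {raw!r} is not numeric")
        alias_id = int(raw)
        if alias_id in seen:
            raise ValueError(f"row {row_no}: duplicate id {alias_id}")
        seen.add(alias_id)
    return seen


# Hypothetical sample input; the login column may be empty.
sample = io.StringIO(
    "id,login,name,email\n"
    "1,jdoe,John Doe,jdoe@example.com\n"
    "2,,John Doe,john.doe@example.com\n"
)
alias_ids = check_alias_ids(csv.DictReader(sample))
```

For a real file, replace the `io.StringIO` sample with `open(path, newline="")`.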

Internals

The script:

For every pair of aliases, collects clues that could indicate the aliases belong to the same person, e.g., the email address is the same, the name is the same, or the prefix of the email address matches the user's login.
Creates clusters of aliases that share clues, as candidates for merging.
Uses heuristics to decide whether each of these clusters is valid. For example, if all aliases in a cluster have the same email address, the cluster is considered valid and all its aliases are merged. Similarly, if all aliases in a cluster have the same full name and email domain (after clearer options have been exhausted), the cluster is considered valid and all its aliases are merged.
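The three steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the script's actual code: the `Row` record, the clue labels, and all function names are hypothetical, clustering is done here with a small union-find (the script may cluster differently), and only the two heuristics mentioned above are implemented.

```python
from collections import defaultdict
from itertools import combinations
from typing import NamedTuple, Optional


class Row(NamedTuple):
    """Hypothetical alias record; the script's own Alias class may differ."""
    id: int
    login: Optional[str]
    name: str
    email: str


def norm(value):
    return (value or "").strip().lower()


def email_prefix(email):
    return norm(email).partition("@")[0]


def email_domain(email):
    return norm(email).partition("@")[2]


def clues(a, b):
    """Step 1: collect the clues shared by two aliases."""
    found = set()
    if norm(a.email) and norm(a.email) == norm(b.email):
        found.add("EMAIL")
    if norm(a.name) and norm(a.name) == norm(b.name):
        found.add("NAME")
    for x, y in ((a, b), (b, a)):
        if norm(x.login) and norm(x.login) == email_prefix(y.email):
            found.add("LOGIN_PREFIX")
    return found


def candidate_clusters(rows):
    """Step 2: group aliases linked by at least one clue (union-find)."""
    parent = {r.id: r.id for r in rows}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(rows, 2):
        if clues(a, b):
            parent[find(a.id)] = find(b.id)

    groups = defaultdict(list)
    for r in rows:
        groups[find(r.id)].append(r)
    # Singletons have nothing to merge with, so they are dropped.
    return [g for g in groups.values() if len(g) > 1]


def is_valid(group):
    """Step 3: same email, or same full name and same email domain."""
    emails = {norm(r.email) for r in group}
    if len(emails) == 1 and "" not in emails:
        return True
    names = {norm(r.name) for r in group}
    domains = {email_domain(r.email) for r in group}
    return (len(names) == 1 and "" not in names
            and len(domains) == 1 and "" not in domains)


# Made-up sample data for illustration only.
rows = [
    Row(1, "jdoe", "John Doe", "jdoe@example.com"),
    Row(2, None, "John Doe", "john.doe@example.com"),
    Row(3, "asmith", "Ann Smith", "ann@example.org"),
    Row(4, None, "A. Smith", "asmith@mail.example"),
    Row(5, None, "Bob Lee", "bob@example.net"),
]
clusters = candidate_clusters(rows)
merged = [sorted(r.id for r in g) for g in clusters if is_valid(g)]
maybe = [sorted(r.id for r in g) for g in clusters if not is_valid(g)]
```

In this sample, aliases 1 and 2 share a full name and an email domain, so their cluster is validated. Aliases 3 and 4 are linked only by a login that matches an email prefix, so their cluster stays a candidate without being merged.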

Output

There are three main files generated by the script.

idm_map.csv is a map of alias user ids (first column) to the unique person id (second column).
idm_log.csv is a log file with information on what aliases have been merged and why, i.e., what clues were used to make that decision.
idm_maybe.csv is another log file, with the same structure as idm_log.csv, listing the clusters that were candidates for merging because their aliases share clues, but that the heuristics, as currently implemented, did not validate and therefore did not merge.
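To use the result downstream, the map can be loaded into a dictionary and inverted to list the aliases of each person. The sketch below assumes a comma-separated file whose first two columns are the alias id and the person id, as described above. The function name `load_alias_map` is hypothetical, and rows whose first column is not an integer (such as a header row, if present) are skipped.

```python
import csv
import io
from collections import defaultdict


def load_alias_map(handle):
    """Read (alias id, person id) rows into a dict.

    Alias ids are parsed as integers; person ids are kept as the
    raw strings found in the file, since their type is not assumed.
    """
    mapping = {}
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # blank or truncated line
        try:
            alias_id = int(row[0])
        except ValueError:
            continue  # header or malformed row
        mapping[alias_id] = row[1].strip()
    return mapping


# Hypothetical contents standing in for idm_map.csv.
sample = io.StringIO("1,1\n2,1\n3,3\n")
alias_to_person = load_alias_map(sample)

# Invert the map: person id -> list of alias ids.
person_to_aliases = defaultdict(list)
for alias_id, person_id in alias_to_person.items():
    person_to_aliases[person_id].append(alias_id)
```

For the real file, pass `open("idm_map.csv", newline="")` instead of the in-memory sample.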

Important: Carefully inspect these files manually. If you observe (many) false positives in idm_log.csv, it means the heuristics were too greedy and should be made more conservative. If instead you observe (many) false negatives in idm_maybe.csv, it means the heuristics were too conservative and can be made more greedy.

More information

For more details see section II.A.a from this MSR 2015 paper:

@inproceedings{vasilescu2015msrdata,
  author = {Vasilescu, Bogdan and Serebrenik, Alexander and Filkov, Vladimir},
  title = {A Data Set for Social Diversity Studies of {GitHub} Teams},
  booktitle = {12th Working Conference on Mining Software Repositories, Data Track},
  year = {2015},
  series = {MSR},
  pages = {514--517},
  publisher = {IEEE},
  doi = {10.1109/MSR.2015.77}
}

License

CC0 1.0
