gotec/git2net

Support for git mailmap files

Closed this issue · 10 comments

Breee commented

Problem:
The same author often commits under several names and email addresses.
This is problematic when analysing a git repository for author collaboration and contributions.

Solution:
Git mailmap files can be used to create a mapping of identities to the original identity.
It would be really useful if git2net could parse and respect mailmap files.

gotec commented

Hi Breee,

We currently have a paper under review addressing this problem using a rule-based algorithm similar to the one proposed by Bird et al. (https://dl.acm.org/doi/10.1145/1137983.1138016).
Once this is accepted, the functionality will be integrated into git2net.

From my understanding, mailmap files would need to be manually maintained by the user. Therefore, I would prefer to use such a rule-based disambiguation approach that solves the problem in an automated way and also works for large repositories.

Am I correct about the mailmap files, or is there a way to automatically create them, e.g., using the GitHub login? I have not seen them used in any repositories so far.

Best,
Christoph

Breee commented

Sounds interesting.

Yes, you gotta create them manually in each repository.
See here: https://git-scm.com/docs/git-check-mailmap#_mapping_authors

But I'm not sure you can achieve the same automatically, because:

Unless a commit hook rejects commits whose author does not match a real user on the git service (by default there is no such hook, and I can commit as whoever I wish):

  • I could easily commit as user1 <user1@example.com> and user2 <user2@example.com> by changing my git config.
  • But my real identity on GitHub is Breee <xxxx>.

Imo, there is no way you can guess that user1 and user2 are both me.
With a mailmap file, I can map user1 and user2 to Breee.
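
To illustrate the mapping described above, here is a minimal Python sketch that parses mailmap entries and resolves a commit identity to its canonical one. The function names parse_mailmap and resolve, the lowercased lookup, and the address breee@example.com are assumptions made for illustration (the real address is elided in this thread); this is not git2net's implementation.

```python
import re

# One identity chunk: an optional name followed by an email in angle brackets.
_IDENTITY = re.compile(r"\s*([^<>]*?)\s*<([^<>]*)>")


def parse_mailmap(text):
    """Build two lookup tables from mailmap text (hypothetical helper).

    by_name_email: (commit name, commit email) -> (proper name, proper email)
    by_email:      commit email                -> (proper name, proper email)
    A None entry in the target means "keep the original value".
    """
    by_name_email, by_email = {}, {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = _IDENTITY.findall(line)
        if not parts:
            continue
        proper_name, proper_email = parts[0]
        if len(parts) == 1:
            # "Proper Name <commit@email>": only the name is replaced.
            by_email[proper_email.lower()] = (proper_name or None, None)
            continue
        commit_name, commit_email = parts[1]
        target = (proper_name or None, proper_email or None)
        if commit_name:
            by_name_email[(commit_name.lower(), commit_email.lower())] = target
        else:
            by_email[commit_email.lower()] = target
    return by_name_email, by_email


def resolve(name, email, tables):
    """Return the canonical (name, email); unmapped identities pass through."""
    by_name_email, by_email = tables
    target = by_name_email.get((name.lower(), email.lower()))
    if target is None:
        target = by_email.get(email.lower())
    if target is None:
        return name, email
    new_name, new_email = target
    return new_name or name, new_email or email


# Hypothetical mailmap content; breee@example.com is a placeholder address.
MAILMAP = """\
# Proper Name <proper@email> Commit Name <commit@email>
Breee <breee@example.com> user1 <user1@example.com>
Breee <breee@example.com> user2 <user2@example.com>
"""

tables = parse_mailmap(MAILMAP)
print(resolve("user1", "user1@example.com", tables))
print(resolve("someone", "someone@example.com", tables))
```

Identities that do not appear in the mailmap are returned unchanged, so the sketch only merges what the file explicitly lists.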

gotec commented

I understand your use case and agree that no automated process could correctly match them if there is no pattern in the name and email (using only this information). Commonly, manually disambiguated datasets are used as benchmarks for disambiguation algorithms. Thus, if I cannot determine manually (without additional knowledge) that two author identities should be merged, I also do not expect the disambiguation algorithm to do so.

I have developed git2net primarily to facilitate the large-scale analysis of git repositories. In this scenario, I analyse repositories with many thousand contributors, so it is rarely possible to add information manually.

That said, it should be fairly straightforward to write a small function that generates user ids based on mailmap files. This would represent a "manual disambiguation" approach that could exist in parallel to the automated processes I propose above.
When performing disambiguation on the SQLite database resulting from a crawl, the user could then select which algorithm to use. In any case, the results would be stored in a column author_id that can subsequently be used for the network generation.

I will look into this when adding the other disambiguation algorithms. Unfortunately, this will likely only be in April.

Breee commented

If you want, you can point me to the files / functions and i can try to contribute.

gotec commented

Sure, that would be highly appreciated :)

As you have probably seen, git2net generates an SQLite database for each run that can then be used to generate networks.
I propose to add author disambiguation as an intermediate step in between.

Therefore we would need a standalone function that generates unique author IDs from the author names and emails in the database. Hence, the function would take an existing SQLite database + the path to the mailmap files as input and modify the commits table in the database by adding an author_id column containing the author's unique id for each commit.
So far I have used increasing integers (starting with 0) as the unique author ids, so I propose to use the same convention here.
I suggest writing this function in a separate python file author_disambiguation.py on the same level as extraction.py and visualisation.py.
I will eventually add the other disambiguation functions to this file too.
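
As a rough illustration of the function proposed above, the following Python sketch adds an author_id column to the commits table and fills it with increasing integers starting at 0. The function name add_author_ids, the column names author_name and author_email, and the resolve callable (which stands in for a mailmap lookup) are assumptions of this sketch and are not confirmed by the thread.

```python
import sqlite3


def add_author_ids(db_path, resolve=lambda name, email: (name, email)):
    """Add an integer author_id column to the commits table (sketch).

    Assumes the commits table has author_name and author_email columns.
    `resolve` maps a raw (name, email) pair to its canonical pair, for
    example via a mailmap lookup; the default treats every pair as unique.
    Returns the mapping from canonical identity to author id.
    """
    con = sqlite3.connect(db_path)
    try:
        cur = con.cursor()
        # Column names are in the second field of each table_info row.
        columns = [row[1] for row in cur.execute("PRAGMA table_info(commits)")]
        if "author_id" not in columns:
            cur.execute("ALTER TABLE commits ADD COLUMN author_id INTEGER")

        rows = cur.execute(
            "SELECT DISTINCT author_name, author_email FROM commits "
            "ORDER BY author_name, author_email"
        ).fetchall()

        ids = {}
        for name, email in rows:
            canonical = resolve(name, email)
            # New identities get the next integer: 0, 1, 2, ...
            author_id = ids.setdefault(canonical, len(ids))
            cur.execute(
                "UPDATE commits SET author_id = ? "
                "WHERE author_name IS ? AND author_email IS ?",
                (author_id, name, email),
            )
        con.commit()
        return ids
    finally:
        con.close()
```

Passing the lookup in as a callable keeps the database step independent of the disambiguation method, so a mailmap-based resolver and an automated one could share the same code path.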

In addition, the visualisation functions in visualisation.py need to be updated by adding an option to use the author_id instead of the author names. Ideally, this would even be the default, with a warning being provided to users if they attempt to create visualisations from non-disambiguated databases.
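
One way the default-with-warning behaviour could look is sketched below in Python. The helper choose_author_identifier and the fallback to an author_name column are hypothetical and only illustrate the idea; they are not part of git2net.

```python
import sqlite3
import warnings


def choose_author_identifier(db_path, author_identifier="author_id"):
    """Return the commits column to use as the author identity (sketch).

    Falls back to 'author_name' (an assumed column name) and warns if the
    requested column is missing, which suggests the database has not been
    disambiguated yet.
    """
    con = sqlite3.connect(db_path)
    try:
        columns = {row[1] for row in con.execute("PRAGMA table_info(commits)")}
    finally:
        con.close()

    if author_identifier in columns:
        return author_identifier

    warnings.warn(
        f"Column {author_identifier!r} not found; the database does not "
        "appear to be disambiguated. Falling back to 'author_name'.",
        stacklevel=2,
    )
    return "author_name"
```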

Author name ambiguities also arise when someone edits a document locally and also makes changes directly on the GitHub website, since the two kinds of commits can carry different author names.

Is it possible to aggregate the different names from the author's personal data or the history of their projects, if applicable? But this is then bound to the specific system - GitHub, GitLab - and would probably be a separate project?

gotec commented

I'm aware of ALFAA (https://link.springer.com/article/10.1007/s10664-019-09786-7) that uses developer behaviour to disambiguate author identities. The issue I see with that is that any data used for disambiguation should not be reused in subsequent analysis, to avoid biases in the results.

Given the many different options to solve this problem, I agree that this is certainly out of the scope of what git2net aims to accomplish. However, if you have an implementation of a disambiguation approach that allows for this, I would be happy to include it, e.g. as a submodule, and make it available here.

Breee commented

I'll look into it soon; for now we are concentrating on making fast progress with our analysis, and understanding the code might take me a few days.

Until then, if someone else faces this problem, I suggest as a temporary workaround rewriting history locally according to the mailmap file: https://stackoverflow.com/questions/27275187/rewrite-git-history-according-to-a-mailmap-file
But be careful not to push these changes by accident. History rewriting is a git feature that should only be used by experienced users and is really dangerous.

gotec commented

Hi Breee,

Thank you for your patience.

I have just released git2net version 1.5.0 which features author disambiguation using the package gambit-disambig.

The code behind gambit is available at https://github.com/gotec/gambit, and you can find a preprint of our MSR paper on arXiv (https://arxiv.org/abs/2103.05666).

Next, I will look into the suggested mailmap files.

Cheers,
Christoph

gotec commented

For once, adding support for mailmap files turned out to be much more straightforward than expected.
Starting with git2net 1.5.1, if there is a .mailmap file in a repository, this file will automatically be considered when mining the repository. If there is no mailmap file, git2net will yield exactly the same result as before. The same holds for authors or committers that do not appear in the mailmap file.

After using a mailmap file there is no need for additional disambiguation (as long as the mailmap covers all cases). Therefore, I suggest using the option author_identifier='author_email' with all visualisation functions.

With this, I will close this issue. Feel free to reopen in case you find any further issues with the integration :)

Cheers,
Christoph