vsoch/citelang

Lower than expected counts from citelang contrib

dtrudg opened this issue · 2 comments

Describe the bug

I've been using citelang to analyze contributions for https://github.com/sylabs/singularity between v3.9.0 and 3.10.0 and am seeing some unexpected counts. The 'count' for a contributor is lower than I'd expect given the number of lines changed by their commits.

If I understand correctly, citelang is intending to count authors' contributions on a git blame line by line basis through each commit in the specified range. I haven't had a chance to investigate the underlying cause yet, but I have been able to isolate a smaller example on the singularity repository looking at a small range of v3.9.1...v3.9.2 with a contribution from "Richard Hattersley"...

citelang contrib --start v3.9.1 --end v3.9.2 --filters ~/filters.yaml
Found 18 commits.
                                       ┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
                                       ┃ Name               ┃ Count ┃
                                       ┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
                                       │ David Trudgian     │ 264   │
                                       │ Dave Trudgian      │ 2     │
                                       │ Richard Hattersley │ 1     │
                                       └────────────────────┴───────┘

Richard's commit in this range is sylabs/singularity@b871d84 and there are 4 lines changed here in 3 files.

The --detail report shows citelang is only counting the change in CHANGELOG.md, and not the other files:

citelang contrib --start v3.9.1 --end v3.9.2 --filters ~/filters.yaml --detail
Found 18 commits.
Loading cached result /Users/dtrudg/Git_sylabs/singularity-analysis/.contrib/v3.9.1-v3.9.2.json
                             ┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
                             ┃ Name               ┃ Paths                    ┃
                             ┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
                             │ David Trudgian     │ debian/copyright: 63     │
                             │                    │ e2e/plugin/plugin.go: 19 │
                             │                    │ debian/s...              │
                             │ Dave Trudgian      │ CHANGELOG.md: 2          │
                             │ Richard Hattersley │ CHANGELOG.md: 1          │
                             └────────────────────┴──────────────────────────┘

To Reproduce

git clone git@github.com:sylabs/singularity singularity-analysis
cd singularity analysis
citelang contrib --start v3.9.1 --end v3.9.2 --filters ~/filters.yaml --detail

My ~/filters.yaml is:

ignore_files:
  - go.mod
  - go.sum

Expected behavior

The count for Richard Hattersley is 4... matching the 4 lines that git blame attributes in the files modified in the commit in question:

git blame CHANGELOG.md | grep Hattersley
b871d846ca (Richard Hattersley   2021-11-25 09:29:02 +0000   12) - Correct documentation for sign command r.e. source of key index.

git blame CONTRIBUTORS.md | grep Hattersley
b871d846ca CONTRIBUTORS.md (Richard Hattersley   2021-11-25 09:29:02 +0000  88) - Richard Hattersley 

git blame cmd/internal/cli/sign.go | grep Hattersley
b871d846ca cmd/internal/cli/sign.go           (Richard Hattersley 2021-11-25 09:29:02 +0000  20) 	privKey int // -k encryption key (index from 'key list --secret') specification
b871d846ca cmd/internal/cli/sign.go           (Richard Hattersley 2021-11-25 09:29:02 +0000  71) 	Usage:        "private key to use (index from 'key list --secret')"

Version that produced the bug

0.0.28

vsoch commented

Taking a look! Sorry about the bug, I'll get insight about this asap for you.

vsoch commented

okay it looks like I was using --first-parent which doesn't reveal the commit in the result:

git log --first-parent --all --format=%H | grep b871d84
# no result

And without it:

git log --all --format=%H | grep b871d84
b871d846caef3e9b7bbb1f9c7bf5e5f309f05b63

The reason is counting twice, e.g., this article explains it a bit: https://marcgg.com/blog/2015/08/04/git-first-parent-log/. So, what I think we need here is just an option to disable using --first-parent, meaning that the contribution might be represented twice, but you don't miss it. Once I added that, your commit above was added to the listing to parse, and I also confirmed I can see the files sign.go, CONTRIBUTING.md, and CHANGELOG.md in the listing of those we parse:

for x in listing:
    ...:     if "sign.go" in x or "CONTRIBUTING" in x or "CHANGE" in x:
    ...:         print(x)
    ...:
    ...:
CHANGELOG.md
CONTRIBUTING.md
cmd/internal/cli/sign.go
e2e/sign/sign.go
internal/app/singularity/sign.go

And then when we do the blame, we run:

$ git blame -w -M -C --line-porcelain b871d846caef3e9b7bbb1f9c7bf5e5f309f05b63 -- cmd/internal/cli/sign.go

And I think the second "remove redundancy" specification in the above is M and C,

-C[<score>]           find line copies within and across files
-M[<score>]           find line movements within and across files

because a line that is then used later (commit by you) I think would fall under one of those categories. So I extended the "deep" boolean to remove them, and I think we see the full history now:

image

E.g., your count is slightly up but we have the correct contribution for Hattersley too! Here is a PR with the fix #34. As soon as you give the thumbs up there I will merge and release!