Lower than expected counts from citelang contrib
dtrudg opened this issue · 2 comments
Describe the bug
I've been using citelang to analyze contributions for https://github.com/sylabs/singularity between v3.9.0 and 3.10.0 and am seeing some unexpected counts. The 'count' for a contributor is lower than I'd expect given the number of lines changed by their commits.
If I understand correctly, citelang is intending to count authors' contributions on a git blame
line by line basis through each commit in the specified range. I haven't had a chance to investigate the underlying cause yet, but I have been able to isolate a smaller example on the singularity repository looking at a small range of v3.9.1...v3.9.2 with a contribution from "Richard Hattersley"...
citelang contrib --start v3.9.1 --end v3.9.2 --filters ~/filters.yaml
Found 18 commits.
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Name ┃ Count ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ David Trudgian │ 264 │
│ Dave Trudgian │ 2 │
│ Richard Hattersley │ 1 │
└────────────────────┴───────┘
Richard's commit in this range is sylabs/singularity@b871d84 and there are 4 lines changed here in 3 files.
The --detail
report shows citelang is only counting the change in CHANGELOG.md, and not the other files:
citelang contrib --start v3.9.1 --end v3.9.2 --filters ~/filters.yaml --detail
Found 18 commits.
Loading cached result /Users/dtrudg/Git_sylabs/singularity-analysis/.contrib/v3.9.1-v3.9.2.json
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name ┃ Paths ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ David Trudgian │ debian/copyright: 63 │
│ │ e2e/plugin/plugin.go: 19 │
│ │ debian/s... │
│ Dave Trudgian │ CHANGELOG.md: 2 │
│ Richard Hattersley │ CHANGELOG.md: 1 │
└────────────────────┴──────────────────────────┘
To Reproduce
git clone git@github.com:sylabs/singularity singularity-analysis
cd singularity analysis
citelang contrib --start v3.9.1 --end v3.9.2 --filters ~/filters.yaml --detail
My ~/filters.yaml
is:
ignore_files:
- go.mod
- go.sum
Expected behavior
The count for Richard Hattersley
is 4... matching the 4 lines that git blame
attributes in the files modified in the commit in question:
git blame CHANGELOG.md | grep Hattersley
b871d846ca (Richard Hattersley 2021-11-25 09:29:02 +0000 12) - Correct documentation for sign command r.e. source of key index.
git blame CONTRIBUTORS.md | grep Hattersley
b871d846ca CONTRIBUTORS.md (Richard Hattersley 2021-11-25 09:29:02 +0000 88) - Richard Hattersley
git blame cmd/internal/cli/sign.go | grep Hattersley
b871d846ca cmd/internal/cli/sign.go (Richard Hattersley 2021-11-25 09:29:02 +0000 20) privKey int // -k encryption key (index from 'key list --secret') specification
b871d846ca cmd/internal/cli/sign.go (Richard Hattersley 2021-11-25 09:29:02 +0000 71) Usage: "private key to use (index from 'key list --secret')"
Version that produced the bug
0.0.28
Taking a look! Sorry about the bug, I'll get insight about this asap for you.
okay it looks like I was using --first-parent
which doesn't reveal the commit in the result:
git log --first-parent --all --format=%H | grep b871d84
# no result
And without it:
git log --all --format=%H | grep b871d84
b871d846caef3e9b7bbb1f9c7bf5e5f309f05b63
The reason is counting twice, e.g., this article explains it a bit: https://marcgg.com/blog/2015/08/04/git-first-parent-log/. So, what I think we need here is just an option to disable using --first-parent
, meaning that the contribution might be represented twice, but you don't miss it. Once I added that, your commit above was added to the listing to parse, and I also confirmed I can see the files sign.go, CONTRIBUTING.md, and CHANGELOG.md in the listing of those we parse:
for x in listing:
...: if "sign.go" in x or "CONTRIBUTING" in x or "CHANGE" in x:
...: print(x)
...:
...:
CHANGELOG.md
CONTRIBUTING.md
cmd/internal/cli/sign.go
e2e/sign/sign.go
internal/app/singularity/sign.go
And then when we do the blame, we run:
$ git blame -w -M -C --line-porcelain b871d846caef3e9b7bbb1f9c7bf5e5f309f05b63 -- cmd/internal/cli/sign.go
And I think the second "remove redundancy" specification in the above is M and C,
-C[<score>] find line copies within and across files
-M[<score>] find line movements within and across files
because a line that is then used later (commit by you) I think would fall under one of those categories. So I extended the "deep" boolean to remove them, and I think we see the full history now:
E.g., your count is slightly up but we have the correct contribution for Hattersley too! Here is a PR with the fix #34. As soon as you give the thumbs up there I will merge and release!