Applying clearcase labels to git commits?
tobocop2 opened this issue · 15 comments
I'm not very knowledgeable about clearcase so this may not make sense as an issue.
Currently, the tool does not generate tags that correspond one to one with the clearcase labels. I understand that clearcase labels are applied to individual files and directories, but I would like to apply a clearcase label to a git commit. Only one tag is generated after I run gitcc rebase, and that's "master_cc".
I'd like to extract all of the clearcase labels and make tags out of them. Is that doable? Prior to committing, would I just need to extract the label of that version and then create a git tag with that label name? It seems straightforward, but given my lack of clearcase knowledge, I just want to make sure that this is not completely infeasible.
Thanks.
@t-g-p I remember vaguely dismissing this idea years ago. The problem with Clearcase is that it has no notion of atomicity of commits. git-cc does a very naive thing and tries to gather up all the files with the same commit message at the same time. The same label is applied to multiple file/versions, and the question becomes how do you then correlate that with other files that make up the git commits.
However, on second thoughts, I guess you could also do a naive thing with the labels and as you see them in the history attach them to the latest commit at the time. I'm not sure if that makes sense or not.
I should warn you, I'm not actively maintaining git-cc (and haven't for 7-8 years), so I can't make this change. I'm happy to keep offering suggestions though if that helps.
On that, if you could take a look at the .git/lshistory.bak file, look for any mention of labels, and perhaps paste a few examples of them with their surrounding context (i.e. checkins), I might be able to see if my idea is at least feasible and we could go from there.
So currently, the lshistory command doesn't extract any label info. However, if you add %l to the format string in the rebase module, it extracts all the label information. The big problem is that each file has multiple labels. Some files will have something like the following in lshistory.bak once the format string is configured to extract label info:
some_stuff|more_stuff|(label1, label2, label3, label4)
gitcc makes commits as you described, using the commit messages. It basically makes commits out of each unique activity, which is the desired behaviour. Where I'm working currently, labels are typically made during a release. So in my case, I suppose I'd need to create a git tag for every release, and I'm not sure of the best way to do this.
@t-g-p Can you possibly paste a small(ish) snippet of the lshistory output with %l enabled? I don't have a clearcase instance to test with...
After that I'm happy to explain what could be done to make it work, depending on what I see.
Using the following format string (modifications at end)
"%o%m|%Nd|%u|%En|%Vn|%Nc|Labels: %l
I wind up having something that looks like the following:
checkoutversion|20180703.142908|<linux_user>|/path/to/file|/main/CHECKEDOUT|"Some comment string"|Labels:
checkoutversion|20180703.143908|<linux_user>|/path/to/file|/main/CHECKEDOUT|"Some comment string"|Labels:
checkoutversion|20180703.144908|<linux_user>|/path/to/file|/main/CHECKEDOUT|"Some comment string"|Labels:
checkinversion|20180703.145908|<linux_user>|/path/to/file|/main/CHECKEDOUT|"Some comment string"|Labels: (LabelA, LabelB, LabelC, LabelD)
checkinversion|20180703.145608|<linux_user>|/path/to/file|/main/CHECKEDOUT|"Some comment string"|Labels: (LabelA, LabelB, LabelC, LabelD, LabelE)
mkelemversion|20180629.162343|<linux_user>|/path/to_file|/main/0||Labels:
mkelembranch|20180629.162343|<linux_user>|path/to/file|/main/||Labels:
mkelemfile element|20180629.162343|<linux_user>|path/to/file|"Some comment"|Labels:
My initial strategy was to find the latest commit tied to a label and just set the tag there. However, the labels persist throughout the file history so if I just find the latest commit in which a label exists, then it looks like the last commit would just have all the tags. If I find the first commit that corresponds to a label and tag there, I will miss subsequent commits that are tied to the same label since the commits are not one to one with the labels.
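For anyone following along, a line in the shape shown above could be split into its fields and label list roughly like this. This is only a sketch: `parse_history_line` and the dictionary keys are made up for illustration and are not part of git-cc, and it assumes every line carries all six fields before the `Labels:` suffix.

```python
def parse_history_line(line):
    """Split one lshistory line (format string from this comment) into fields."""
    # Cut at the last "|Labels:" so a "|" inside the comment cannot confuse us.
    head, sep, label_part = line.rpartition("|Labels:")
    if not sep:
        raise ValueError("no Labels field in line: %r" % line)
    kind, date, user, path, version, comment = head.split("|", 5)
    # %l prints "(LabelA, LabelB)", or nothing when the version has no labels.
    inner = label_part.strip().strip("()")
    labels = [name.strip() for name in inner.split(",") if name.strip()]
    return {
        "type": kind,
        "date": date,
        "user": user,
        "path": path,
        "version": version,
        "comment": comment,
        "labels": labels,
    }
```

An unlabeled line simply yields an empty `labels` list, so callers do not need a special case for it.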
@t-g-p Sorry I can't test this myself. Just a question about what you have:
checkinversion|20180703.145908|<linux_user>|/path/to/file|/main/CHECKEDOUT|"Some comment string"|Labels: (LabelA, LabelB, LabelC, LabelD)
checkinversion|20180703.145608|<linux_user>|/path/to/file|/main/CHECKEDOUT|"Some comment string"|Labels: (LabelA, LabelB, LabelC, LabelD, LabelE)
I'm curious about this, and not sure how much is just the sanitisation you've done to the snippet. Does it really look like:
checkinversion|20180703.145908|<linux_user>|/path/to/file|/main/branch/1|"Some comment string"|Labels: (LabelA, LabelB, LabelC, LabelD)
checkinversion|20180703.145608|<linux_user>|/path/to/file|/main/branch/2|"Some comment string"|Labels: (LabelA, LabelB, LabelC, LabelD, LabelE)
I would be very surprised if LabelA is applied to two different versions of path/to/file. Is that right? Or is it really:
checkinversion|20180703.145908|<linux_user>|/path/to/file1|/main/branch/1|"Some comment string"|Labels: (LabelA, LabelB, LabelC, LabelD)
checkinversion|20180703.145608|<linux_user>|/path/to/file2|/main/branch/1|"Some comment string"|Labels: (LabelA, LabelB, LabelC, LabelD, LabelE)
In which case, I think whenever you see a label, for that changeset/commit git-cc should be able to create a matching tag in git as it rolls through the history.
Does that make sense?
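To make that concrete, here is a rough sketch of what tagging while rolling through the history could look like. The function name `tag_commit` and the `seen` set are assumptions for illustration and not existing git-cc code; the only real command used is `git tag <name> <commit>`. The `seen` set is there because the same label shows up on many files, and git refuses to create a tag that already exists.

```python
import subprocess


def tag_commit(commit, labels, seen, run=subprocess.check_call):
    """Create a lightweight git tag on `commit` for each label not tagged yet."""
    new_labels = [label for label in labels if label not in seen]
    for label in new_labels:
        # Equivalent to running: git tag <label> <commit>
        run(["git", "tag", label, commit])
        seen.add(label)
    return new_labels
```

Passing `run` as a parameter keeps the sketch testable without a real repository. Note that ClearCase label names may contain characters git does not allow in tag names, which this sketch does not handle.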
@charleso
So each change set group represents a commit. Our labels are created once a release is made, so only the topmost commit / group should have the tags. Also, in clearcase, labels are appended to a file, so a file will have all labels since it was first labeled. To get the labels to be one to one with tags, I came up with something like the following:
pseudo code
# iterate over the group list returned from mergeHistory() in rebase.py
# (assumes each group has a .labels set extracted from all of its change sets)
for i, group in enumerate(groups[:-1]):  # the last group has no successor
    next_labels = groups[i + 1].labels
    if next_labels == group.labels:
        # labels are identical between the two groups / commits, so drop them here
        group.labels.clear()
    else:
        # keep this group's labels as its tags, then cascade:
        # delete all shared labels from every group after this one
        group.tags = filter_labels(group.labels)  # to be implemented: filter out unnecessary labels
        for j in range(i + 1, len(groups)):
            groups[j].labels.difference_update(group.tags)  # remove all intersecting tags
# set the final commit's tags to be its labels (not sure if this is the wanted behavior)
groups[-1].tags = groups[-1].labels
@t-g-p I'm probably just slowing you down on this, my head isn't in the problem at the moment. Seems like you're already most of the way to having something conceptually working.
My only suggestion would be to have a snippet of the lshistory input and the expected git output that you can use for a unit test in git-cc. That is something I've long regretted not adding, and its absence has led to gitcc being so fragile. I also think in this case it might help explain to me or others what the logic is.
Sorry I can't do anything else to help
no worries @charleso
This tool has been extremely helpful and the fragility doesn't bother me. You did an excellent job. Unfortunately, I haven't been able to get deleted file history though and I'm not sure I am going to make the necessary modifications to get deleted file history.
I will close this issue. If you think you'd be interested in accepting pull requests, I can make the changes. I can leave it open or you can close it. I believe that the label behaviour I outlined is fairly standard. I would imagine most people would label a release recursively, but I could be wrong, which is why the behaviour I outlined may not be standard.
Given that a file will have every label ever attached to it, and that a commit is a group of files with the same labels, the tag that should correspond to that commit should be the min of all labels that have a date greater than or equal to the commit date (i.e. the label that was introduced with the change).
So say file_a.py only changed twice, but the package that contains file_a.py was released and labeled 50 times. file_a.py would then have 50 labels, but only two of them would correspond to actual changes in history.
Files that are unchanged just get labels appended to them when the directory is released.
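Read literally, that rule could be sketched as follows. This assumes each label's date is already known and uses the same numeric shape as the %Nd field in the lshistory output; the function name `tag_for_commit` and the label names and dates in the usage note are invented for illustration.

```python
from datetime import datetime

CC_DATE = "%Y%m%d.%H%M%S"  # shape of the %Nd field, e.g. 20180703.145908


def tag_for_commit(commit_date, label_dates):
    """Return the earliest label dated at or after the commit, or None."""
    commit_time = datetime.strptime(commit_date, CC_DATE)
    candidates = []
    for label, date in label_dates.items():
        label_time = datetime.strptime(date, CC_DATE)
        if label_time >= commit_time:
            candidates.append((label_time, label))
    # min() compares the timestamps first, so the oldest qualifying label wins.
    return min(candidates)[1] if candidates else None
```

With `label_dates = {"REL_1": "20180101.000000", "REL_2": "20180201.000000"}`, a commit dated `20180115.120000` maps to `REL_2`, and a commit made after every release maps to nothing. Several commits made before the same release would all map to the same label, so some de-duplication (for example, keeping only the latest of them) would still be needed on top of this.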
@t-g-p Apologies for the delay, it's been a hectic week at work.
Unfortunately, I haven't been able to get deleted file history though and I'm not sure I am going to make the necessary modifications to get deleted file history.
Yeah I remember it being extremely complicated in my mind at the time. :(
If you think you'd be interested in accepting pull requests, I can make the changes.
I am. Sort of. I have basically been merging PRs that look reasonable and not too risky, but I worry because I'm not using or testing this tool and it might break for others. That said, it's probably better that things go here than on random forks where people can't see or find them.
The tag that should correspond to that commit should be the min of all labels that have a date greater than or equal to the commit date
Relying on dates instead of file versions doesn't sound quite right to me. I don't mind if you want to use that logic; given that I don't have any lshistory output to play with, I can't really say for sure. As mentioned previously, I would strongly suggest (either way) using a (sanitised) lshistory file to write a python test that can document and verify the behaviour.
Good luck!!!
So say you had some_file.py
It was released 30 times
Its label information would look something like this
/main/1: some_file.py (label1, label2, label3, label4,label5,label6....label30)
But really, the only tag that should be there is label1 because it is the first label applied in this situation. This is a trivial case because the labels are already sorted. However, what if the labels were not sortable? In that case, you would need to launch a cleartool subprocess to get the dates of all the labels so you can make sense of them, right?
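A sketch of that lookup, under some loud assumptions: it assumes `cleartool describe -fmt "%Nd" lbtype:<label>` prints the creation time of the label type in the same numeric form as lshistory's %Nd (worth checking against the cleartool man pages, and a VOB selector may be needed after the label name), and it treats the label type's creation time as the label's date. The function names are made up.

```python
import subprocess
from datetime import datetime

CC_DATE = "%Y%m%d.%H%M%S"  # same shape as the %Nd field in lshistory output


def label_created_at(label, run=subprocess.check_output):
    """Ask cleartool when a label type was created (assumed command, see above)."""
    out = run(["cleartool", "describe", "-fmt", "%Nd", "lbtype:" + label])
    text = out.decode() if isinstance(out, bytes) else out
    return datetime.strptime(text.strip(), CC_DATE)


def sort_labels_by_date(labels, run=subprocess.check_output):
    """Return the labels ordered oldest first, one cleartool call per label."""
    return sorted(labels, key=lambda label: label_created_at(label, run))
```

The first element of `sort_labels_by_date(labels)` would then be the label to keep as the tag. Caching the dates would be worthwhile in practice, since the same labels repeat across many files.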
@t-g-p You're quite right. Sorry it took me so long to catch on. Your logic from before makes much more sense now. Let me know if you have any luck implementing something.
@charleso
Thanks, glad we are on the same page. I think I was able to successfully implement a solution that creates a one to one mapping to commits. I am still doing some testing. I don't think there is a standard solution for this, so it really depends on all of the unique label patterns that are tied to a change set. You can't simply just grab the oldest label greater than or equal to a commit because this would then leave out labels that may have been applied later for unknown reasons. I basically gather all unique label prefixes and then grab the oldest label tied to each prefix.
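Roughly, the prefix idea looks like the sketch below. It is not the actual code: what counts as a prefix depends entirely on the naming scheme, so here it is assumed to be everything before the first digit, and the label dates are assumed to be %Nd-style strings, which sort chronologically as plain text.

```python
import re


def label_prefix(label):
    # Assumption: the prefix is everything before the first digit,
    # e.g. "REL_" for "REL_1.2". Adjust to the real naming scheme.
    return re.match(r"\D*", label).group(0)


def oldest_per_prefix(label_dates):
    """Map each label prefix to the oldest label that carries it."""
    oldest = {}
    for label, date in label_dates.items():
        prefix = label_prefix(label)
        current = oldest.get(prefix)
        if current is None or date < label_dates[current]:
            oldest[prefix] = label
    return oldest
```

The values of the returned mapping are the labels that would become tags for a given change set, one per prefix.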
So I ported several codebases to git with a heavily modified version of rebase.py. It would not have been possible without you, so I am extremely grateful for your help.
I modified rebase to map all labels to git tags and I added functionality to the config portion to allow for specifying a start date as well as a label regex (in case you only want a subset of labels). Additionally, I made modifications so that the clearcase attributes are leveraged in the event of an empty comment. It turns out that the checkin utilities we have magically propagate the clearquest CR number as part of the clearcase attribute, so I was able to get away with having many useful comments in place of "".
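The start date and label regex filtering amounts to something like the following sketch. The parameter names `start_date` and `label_regex` are placeholders and not the real config keys, and the dates are assumed to be %Nd-style strings, which compare correctly as plain text.

```python
import re


def select_labels(label_dates, start_date=None, label_regex=None):
    """Return the labels to turn into tags, oldest first."""
    pattern = re.compile(label_regex) if label_regex else None
    selected = []
    for label, date in sorted(label_dates.items(), key=lambda item: item[1]):
        if start_date and date < start_date:
            continue  # label predates the configured start date
        if pattern and not pattern.search(label):
            continue  # label is not in the wanted subset
        selected.append(label)
    return selected
```

Leaving both options unset keeps every label, so the filter is opt-in.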
The labels are nearly one to one for the most part, with the exception of cases in which a file was deleted or there is an empty directory (a known limitation of this tool).
In addition to the updates to this package, I also needed to add functionality that would "fix" all empty files. This tool does not capture empty files either, and we both know that empty files might be critical (think python __init__.py). I "fixed" empty files by incrementing their versions and then migrating all of their version zero labels to version 1. I then inserted these newly checked in files at the right time in history by using their labels and finding the first commit whose oldest label matches the empty file's oldest label. In the case of empty files, it was very useful to have the label for proper history insertion.
After all of this, I then wrote a script to compare the git repos' tags against the labels of the corresponding vob subfolders by checking out each version in both git and clearcase. For the most part, the only differences I saw were A) empty directories and B) deleted files. In the end, this worked out very well and the company will be able to phase out clearcase while still having valuable history.
I'd offer to submit a PR, but this was implemented on a closed network without internet access. I don't have a local clearcase instance to test with if I were to recreate the changes, so I apologize for not being able to share source code.
Thanks again @charleso
@t-g-p Congratulations. Thank you so much for the update! I'm so happy that you managed to migrate to git from Clearcase. Sounds like you've made some critical improvements to capture as much as you could from your Clearcase history.
Does that mean you're using git now as the primary version control?
Regarding sharing the source code: is it possible to push what you have modified to your own fork? I might hesitate to merge such a large, bespoke change to git-cc. However, you could still link to your updated fork from this ticket, which might help someone who is in your position in future. I wasn't quite sure whether that wasn't technically possible, or whether it was due to some other reason.
Best of luck with everything! :)
Some of our components are now in git, yes. It'll be months to years before Clearcase is fully abandoned. Also, I came up with a system to tie all of the git work to ClearQuest via git hooks because we don't have access to a normal issue tracker at the moment.
All of my work is on a closed network, so I can't retrieve it. I will ask my employer about getting a copy. I would hesitate to merge it all in as well haha. The reason why I developed it all on the closed system is because I didn't have Clearcase available at all otherwise.
I personally have always used git for version control. I didn't have gitcc before as a bridge, so I just had scripts to bridge my git repos to clearcase vobs.