gitpython-developers/GitPython

Commit_ish is much broader than commit-ish

EliahKagan opened this issue · 4 comments

In git, if I understand correctly, a commit-ish is a git object from which a commit can be reached by dereferencing it zero or more times, which is to say that all commits are commit-ish, some tag objects are commit-ish--those that, through (possibly repeated) dereferencing, eventually reach a commit--and no other types of git objects are ever commit-ish.

As gitglossary(7) says:

commit-ish (also committish)

A commit object or an object that can be recursively dereferenced to a commit object. The following are all commit-ishes: a commit object, a tag object that points to a commit object, a tag object that points to a tag object that points to a commit object, etc.

Therefore, all instances of GitPython's Commit class, and some instances of GitPython's TagObject class, encapsulate git objects that are actually commit-ish.
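The glossary definition can be sketched as a small predicate. The classes below are hypothetical stand-ins, not GitPython's own classes, and the `target` attribute (the object a tag points to) is an assumed name used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Commit:
    sha: str


@dataclass
class Tree:
    sha: str


@dataclass
class Blob:
    sha: str


@dataclass
class TagObject:
    sha: str
    # Hypothetical attribute: the object this tag points to.
    target: Union["Commit", "Tree", "Blob", "TagObject"]


def peel_to_commit(obj: object) -> Optional[Commit]:
    """Dereference tag objects until a non-tag is reached; return it only if it is a commit."""
    while isinstance(obj, TagObject):
        obj = obj.target
    return obj if isinstance(obj, Commit) else None


def is_commit_ish(obj: object) -> bool:
    """True for commits and for tags that eventually reach a commit."""
    return peel_to_commit(obj) is not None
```

Under this sketch a tag pointing to a tag pointing to a commit is commit-ish, while a tree, a blob, or a tag pointing to a blob is not.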

But GitPython has a Commit_ish union type in the git.types module, and that Commit_ish type is considerably broader:

Commit_ish = Union["Commit", "TagObject", "Blob", "Tree"]

These four classes are the GitPython classes whose instances encapsulate any of the four types of git objects (of which blobs and trees are never actually commit-ish):

object type

One of the identifiers "commit", "tree", "tag" or "blob" describing the type of an object.

GitPython uses its Commit_ish type in accordance with this much broader concept, at least some of the time and possibly always. For example, Commit_ish is given as the return type of Object.new:

@classmethod
def new(cls, repo: "Repo", id: Union[str, "Reference"]) -> Commit_ish:

Commit_ish cannot simply be replaced by Object because GitPython's Object class is also, through IndexObject, a superclass of Submodule (and the RootModule subclass of Submodule):

class Submodule(IndexObject, TraversableIterableObj):

The submodule type does not have a string type associated with it, as it exists
solely as a marker in the tree and index.

type: Literal["submodule"] = "submodule" # type: ignore
"""This is a bogus type for base class compatibility."""

However, elsewhere in GitPython, Commit_ish is used where it seems only a commit is intended to be allowed, though it is unclear if this is unintentional, intentional but only to allow type checkers to allow some code that can only reasonably be checked at runtime, or intentional for some other reason. For example, the Repo.commit method, when called with one argument, looks up a commit in the repository it represents from a Commit_ish or string, and returns the commit it finds as a Commit:

def commit(self, rev: Union[str, Commit_ish, None] = None) -> Commit:

This leads to a situation where one can write code that type checkers allow and that may appear intended to work, but that always fails, and in a way that may be unclear to users less familiar with git concepts:

>>> import git
>>> repo = git.Repo()
>>> tree = repo.tree()
>>> repo.commit(tree)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\ek\source\repos\GitPython\git\repo\base.py", line 709, in commit
    return self.rev_parse(str(rev) + "^0")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ek\source\repos\GitPython\git\repo\fun.py", line 379, in rev_parse
    obj = to_commit(obj)
          ^^^^^^^^^^^^^^
  File "C:\Users\ek\source\repos\GitPython\git\repo\fun.py", line 221, in to_commit
    raise ValueError("Cannot convert object %r to type commit" % obj)
ValueError: Cannot convert object <git.Tree "d5538cc6cc8839ccb0168baf9f98aebcedfd9c2c"> to type commit

An argument that this specific situation with Repo.commit is not a typing bug is that this operation is fundamentally one that can only be checked at runtime in some cases. After all, an argument of type str is also allowed and it cannot be known until runtime what object a string happens to name. Even so, the method docstring should possibly be expanded to clarify this issue. Or perhaps if the situation with Commit_ish is improved, then the potential for confusion will go away.
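Because the check can only happen at runtime, a caller that accepts arbitrary objects may want to guard the call. This is a minimal sketch that assumes only what the traceback above shows, namely that the call raises ValueError when the object cannot be converted to a commit. The names `commit_or_none` and `FakeRepo` are hypothetical, and `FakeRepo` is a stand-in so that the example runs without a repository. Lookups by string may fail in other ways that are not handled here.

```python
from typing import Any, Optional


def commit_or_none(repo: Any, rev: Any) -> Optional[Any]:
    """Return repo.commit(rev), or None if rev cannot be converted to a commit."""
    try:
        return repo.commit(rev)
    except ValueError:
        return None


class FakeRepo:
    """Hypothetical stand-in that mimics only the failure mode shown above."""

    def commit(self, rev: Any) -> str:
        if rev == "tree":
            raise ValueError("Cannot convert object %r to type commit" % rev)
        return "commit-for-%s" % rev
```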

One way to improve this situation is to clearly document it in a docstring for the Commit_ish type. But if possible it seems to me that more should be done:

  • If known, the reason for the current situation should be stated there.
  • Its relationship to other types should be clarified where otherwise confusing. For example, Object may benefit from greater clarity about what it ideally represents (git objects) versus the entirety of what it represents (that an Object can also be a Submodule), and the way that Tree_ish is narrower than all tree-ish git objects while Commit_ish is broader than all commit-ish git objects can be noted in one of their docstrings.
  • Maybe Commit_ish should be deprecated and one or more new types introduced, replacing all uses of it in GitPython.

If I am making a fundamental mistake about git concepts here, and GitPython's Commit_ish has a closer and more intuitive relationship to commit-ish git objects than I think it does, then I apologize.

I have not been able to determine from GitPython's revision history why Commit_ish is defined as it currently is, or why this union of all four actual git object types was introduced under the narrower-seeming name Commit_ish. However, the Commit_ish type was introduced in 82b131c (#1282), where the annotations it was used to replace had listed all four types Commit, TagObject, Tree, and Blob as explicit alternatives.

Thanks for bringing this to my attention.

To me it seems that, no matter what, the Commit_ish type is too broad even though it is clearly defined. This seems like a bug that would better be fixed. A fix should only affect the type-checker as well, which I would think is not disruptive in most cases, particularly because failing to pass an actual commit-ish will always cause a runtime failure.

Along with that, I agree that it would be good to further clarify that git.Object is technically more than four possible Git object types, simply because it's something that can probably not be fixed without being potentially breaking.

Lastly, Tree_ish is described as narrower here, and I wonder if eventually this can be fixed beyond making this clear in the documentation initially. I realize though that this must very much depend on the site that accepts a tree-ish, as they would have to resolve it anyway.

In summary, I think Commit_ish can be fixed, while the documentation of git.Object and Tree_ish can be improved.

To me it seems that, no matter what, the Commit_ish type is too broad even though it is clearly defined. This seems like a bug that would better be fixed. A fix should only affect the type-checker as well [...]

I had at first feared that this might change the runtime behavior of reasonable code, but it looks like that may not be the case. In particular, due to the way it is written, Commit_ish is resolved as a Union of ForwardRefs. Unlike unions of "complete" types, which can be used as the second argument of isinstance or issubclass, this union cannot be used that way.

>>> import git
>>> isinstance(git.Repo().tree(), git.types.Commit_ish)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.752.0_x64__qbz5n2kfra8p0\Lib\typing.py", line 1564, in __instancecheck__
    return self.__subclasscheck__(type(obj))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.752.0_x64__qbz5n2kfra8p0\Lib\typing.py", line 1568, in __subclasscheck__
    if issubclass(cls, arg):
       ^^^^^^^^^^^^^^^^^^^^
TypeError: issubclass() arg 2 must be a class, a tuple of classes, or a union
>>> git.types.Commit_ish
typing.Union[ForwardRef('Commit'), ForwardRef('TagObject'), ForwardRef('Blob'), ForwardRef('Tree')]

Even using values obtained from inspect.get_annotations or inspect.signature does not appear to readily give anything that can be used in a runtime check.
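The same situation can be reproduced without GitPython. In this sketch, `Commit`, `TagObject`, `Blob`, and `Tree` are empty stand-in classes, and `CommitIshDemo` is a hypothetical alias built the same way as Commit_ish, from string forward references. Its members are not class objects, so a runtime check needs an explicit tuple of classes instead.

```python
from typing import Union, get_args


class Commit: ...
class TagObject: ...
class Blob: ...
class Tree: ...


# Built from strings, as Commit_ish is; the names are not resolved to classes here.
CommitIshDemo = Union["Commit", "TagObject", "Blob", "Tree"]

members = get_args(CommitIshDemo)
# Record, for each member of the union, whether it is an actual class object.
members_are_classes = [isinstance(m, type) for m in members]

# A runtime check has to name the classes explicitly.
RUNTIME_CLASSES = (Commit, TagObject, Blob, Tree)


def is_git_object(obj: object) -> bool:
    """Runtime counterpart of the static union, using real classes."""
    return isinstance(obj, RUNTIME_CLASSES)
```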

This is good news for removing the never-commit-ish alternatives (Blob and Tree) because it suggests that only static typing could break--and depending on how people are using Commit_ish, that might really be revealing bugs rather than creating a false positive.

I think I'm going to look a bit more into whether there are runtime implications of this change, even though a cursory examination suggests there are not. There is also the question of how GitPython uses it. There are many occurrences of it in GitPython's type annotations. Some appear to intend a union of all four actual git object types, while others appear to intend only those types that can actually be commit-ish. If it turns out that this impression is wrong and that GitPython uses it in a consistent way--closer examination will tell--then changing it may not be justified. But so long as that is not so--which it seems it may not be--then I think changing its definition may be justifiable.

A new union can be created for all four actual git object types. There is a question of how it should be named; I can let you know if I have trouble coming up with a good name, but if you have a particular name or names that should be preferred then you can let me know.

Assuming these changes can be made, I think there are two reasonable approaches. One is for me to expand and retitle #1859 to include these changes, assuming I am able to make them. The other approach is that I could weaken or remove some of the unjustified wording in the Commit_ish docstring there, and have the actual change to Commit_ish and creation of a new union for all four actual git object types be a subsequent PR.

Thanks for looking into and validating it!

A new union can be created for all four actual git object types. There is a question of how it should be named; I can let you know if I have trouble coming up with a good name, but if you have a particular name or names that should be preferred then you can let me know.

I thought that maybe git.Objects (with a plural 's') is a very sensible name, particularly when used in method signatures that expect any git object. As it might imply that multiple objects should be passed, maybe git.ObjectKind or git.ObjectType would be even better.

Thanks again for your help with this, I am sure you will find a good path forward.

I went with AnyGitObject, which seems to be a bit better than GitObject in that--though I struggle to articulate exactly why--it seems less likely to be confused with the Git or Object classes and more naturally to capture the concept. A possible disadvantage of "Any" in AnyGitObject is that it could be confused with other "Any"s in Python types such as typing.Any or the AnyStr type variable, but I think this risk is minimal (those are two different prominent uses of "Any", so it is not as though there is a single fixed use in type names that this is going against).
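A sketch of how the split might look, assuming the narrowed Commit_ish keeps only the two classes whose instances can be commit-ish. The classes here are empty stand-ins rather than GitPython's, `describe` is a hypothetical function, and the exact definitions that end up in GitPython may differ.

```python
from typing import Union, get_args


class Commit: ...
class TagObject: ...
class Blob: ...
class Tree: ...


# Union of the classes for all four actual git object types.
AnyGitObject = Union["Commit", "Tree", "TagObject", "Blob"]

# Narrowed: only classes whose instances can be commit-ish.
# A TagObject still needs a runtime check, since a tag may point to a tree or blob.
Commit_ish = Union["Commit", "TagObject"]


def describe(obj: AnyGitObject) -> str:
    """Accepts any of the four object classes and returns the class name."""
    return type(obj).__name__
```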

The reason I didn't go with Objects is that I agree that it has the problem of implying multiple objects. Both AnyGitObject and Objects have the problem that they are not natural to put for ... in "x is an ...". I think this is very slightly less severe with AnyGitObject than Objects, but still this may be a reason to prefer GitObject. Alternatives like TrueGitObject, ActualGitObject, and RealGitObject (or those without the Git part) seem unnatural and also prone to their own confusions (e.g., "True" could be thought of as having to do with evaluation as a boolean, and all of those could be thought of as referring to being on disk or otherwise in an actual git object database).

The reason I didn't go with ObjectKind or ObjectType is that the "is a" phrase does sound right... but is wrong. For example, "A Blob is an ObjectKind" expresses that Blob is an instance of ObjectKind (a falsehood), not that Blob is a subtype of ObjectKind. In addition, the other meaning of object types in git is the string literals that identify them (e.g., `"blob"`), so those would more be the object kinds.