Versioning: Commit + Repo Datastructures
jbenet opened this issue · 13 comments
Versioning has been a long time coming.
We need to construct the necessary data types before we start building any tooling around it.
The types
The SYNTAX of the "merkledag DSL" is still TBD (#22), so for now I'm using Go-like syntax.
First, some types we need:
// Any is any merkledag Node

type Identity struct {
    Key  SigningKey // link to a signing key
    Data struct {
        Name string // the "name" of the identity
    }
}

type Authorship struct {
    Author Identity
    Data   struct {
        Date string // ISO timestamp in UTC?
    }
}

type Signature struct {
    Object Any        // link to the signed object
    Key    SigningKey // link to the signing key
    Data   struct {
        Signature []byte // the signature bytes
    }
}

// generic type that terminates in a certain other leaf type
type Tree<LEAF_TYPE> struct {
    NAME Link<Tree | LEAF_TYPE>
    ...
}
The versioning data types:

type Commit struct {
    Parents   []Commit   // "parent0" ... "parentN"
    Author    Authorship // link to an Authorship
    Committer Authorship // link to an Authorship
    Object    Any        // what we version ("tree" in git)
    Data      struct {
        Comment string // describes the commit
    }
}

type VersionRepository struct {
    Refs Tree<Commit> // hierarchy of {branches, tags, heads, remotes, ... }
    Logs Tree<File>   // reflogs, etc... (maybe should be other than files...)
}
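To make the "link" comments above concrete, here is a minimal Go sketch of how a Commit could refer to its parents and its object by content address. The `Link` type, the `LinkTo` helper, and the use of JSON plus SHA-256 are assumptions made purely for illustration; none of them is the actual merkledag encoding, which is still TBD.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// Link is a hypothetical content address: the hex SHA-256 of a node's encoding.
type Link string

// Commit mirrors the Commit type above, with links in place of nested nodes.
type Commit struct {
	Parents   []Link `json:"parents"`
	Author    Link   `json:"author"`
	Committer Link   `json:"committer"`
	Object    Link   `json:"object"`
	Comment   string `json:"comment"`
}

// LinkTo returns the content address of any JSON-encodable node.
// JSON is used only for illustration; it is not the merkledag wire format.
func LinkTo(node any) Link {
	b, err := json.Marshal(node)
	if err != nil {
		panic(err)
	}
	sum := sha256.Sum256(b)
	return Link(hex.EncodeToString(sum[:]))
}

func main() {
	root := Commit{Object: LinkTo("tree v1"), Comment: "initial"}
	child := Commit{
		Parents: []Link{LinkTo(root)},
		Object:  LinkTo("tree v2"),
		Comment: "second",
	}
	fmt.Println("root: ", LinkTo(root))
	fmt.Println("child:", LinkTo(child))
}
```

Because the child embeds the link to its parent, changing anything in the parent changes the child's address as well, which is the property a versioning history relies on.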
It seems to me that in Git when you sign a commit the signature is part of the commit. So you cannot remove the signature without changing the commit sha1.
And commit trailers (like Signed-off-by) are very useful in Git and may deserve something special.
Also Tree and VersionRepository are defined but not used.
Should we have email as part of Identity too, like Git does? We probably don't want that, since the key is a stronger identifier.
In the example Commit struct there is no Signature. I guess this would be optional, depending on whether you want to sign the commit or not. Having the Signature in the commit would make the signature part of the commit, as @chriscool mentioned.
Please consider adding a variant of a merge commit whose meaning is history rewriting.
The first parent will point to the new history. User interfaces are supposed to act as if it was the only parent unless the user requests otherwise. Where possible, the rewrite commit could be rendered as something like an unobtrusive collapsed bar between commit messages.
The hidden secondary parent will point to the history that was “rewritten”.
This would alleviate much of the need Git has for forced pushes in development branches to keep the commit history clean. It would also let one view what changed in the “rewrite” unlike with Git rewrites.
@chriscool as a relevant note, I'd often argue against adding first-class tagging to a git-ish system.
(There's some relevant discussion on my own approach to avoiding that on top of Git itself, see ELLIOTTCABLE/.gitlabels.)
Git's lack of multiple-author support is an oft-cited limitation. I think a logical AND of authorships would be useful to include, rather than a post hoc way of embedding that in the identity, since that would require parsing, etc.
@ion1, @ELLIOTTCABLE I think the most appealing way to address that is to have more than just a "parent" relationship between commits, which ties this into the debate about first class tagging and potentially also various trailers in the comments.
Since there's nothing preventing the Object field from being a commit, parallel histories could be related by decorating both of them from the outside with a third one, for example, but that's far from the only approach.
Do the data structures imply that the native Git objects would need to be translated when crossing the ipfs boundary?
I understand that the ipfs hashtree structure is different from the Git blobid, so the two aren't directly compatible. Is it necessary to generate a new id to store a git object (or pack, if the tools could find out what any of them might be called) in the DHT?
My concern is that if the data structures don't provide an exact isomorphism with the Git objects used in any given repo, there will be a lossy translation. It has to be lossless, doesn't it?
(On objectid->packid, serving something like a 302 Found or an extra returned header might help efficiency, then you only need DHT entries for the commits and the rest can come from a pack. Or maybe I need to read more about ipfs.)
@mcast with CID and IPLD, we'll be able to just reference the unchanged git objects/packs/blobs/trees.
Hi, it's been a while since the last update. Is there any update on this topic? Thanks for all your hard work. We would like to try IPFS in our product, but we need the versioning feature to be ready. Where can I track the status of this feature?
Hi everyone. Same question as @kehao95 here :)
Is there any way to track the status of this issue?
Unfortunately, no. We don't have native versioning.
We do now have git object support in IPLD: https://github.com/ipfs/go-ipfs/blob/master/docs/plugins.md, https://github.com/ipfs/go-ipld-git/. However, that has some limitations (no sharding, for one).
It would be nice if we could add a diff file to each commit. This would enable us to remove the pinning for the sub-CID of the older version and just keep the diff pinned.
Creating patches/diffs of large binary files is known to be very resource-intensive, but zstd now supports creating diffs from two files of up to 2 GB, which is extremely space-efficient and fast.
The diffs can only be applied in one direction, so creating patches backward makes the most sense. This way IPFS can recreate older versions on the fly from the patches if necessary.