qri-io/dataset

datasetDiffer commit title and message a billion times cooler than anything a human will write

Closed this issue · 1 comments

So, @b5 & @osterbit

The commit title/message that is generated by datasetDiffer is cool af. Even if a human adds a commit title and message, that information seems super handy. Any thought to having a permanent field in commit that would just show the diffing info?
Commit.Diff string
Just a string that keeps the diffing message? Which can be made into the title/message if one is not provided?

b5 commented

lol right!? Not having to write commit messages feels pretty spiffy 😄 .

Whenever we're thinking about writing something down that is computable, it's important to stop and think about the cost of storage, and whether we want to ask our peers to pay that tax or not.

I think we can "have our cake and eat it too" when it comes to these diff strings. As long as we have an intact history of datasets, this diff string can be computed at any time. This means we don't have to store it, and gives us one less thing we have to hold on to.
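To make the "computable at any time" point concrete, here's a minimal sketch that derives a change summary from two adjacent versions instead of reading it from a stored field. Everything in it is an assumption for illustration: `Version`, `diffString`, and the row-counting logic are hypothetical stand-ins, not the actual datasetDiffer output or qri's dataset types.

```go
package main

import "fmt"

// Version is a hypothetical stand-in for one dataset snapshot in a history:
// just the rows of its body. It is not qri's actual dataset type.
type Version struct {
	Rows []string
}

// diffString derives a short change summary from two adjacent versions.
// Nothing is stored; the summary is recomputed from history on demand.
func diffString(prev, next Version) string {
	// Count how many times each row appears in the previous version.
	counts := make(map[string]int)
	for _, r := range prev.Rows {
		counts[r]++
	}
	added := 0
	for _, r := range next.Rows {
		if counts[r] > 0 {
			counts[r]-- // row survives unchanged
		} else {
			added++
		}
	}
	// Whatever is left in counts was present before and is gone now.
	removed := 0
	for _, n := range counts {
		removed += n
	}
	return fmt.Sprintf("added %d rows, removed %d rows", added, removed)
}

func main() {
	history := []Version{
		{Rows: []string{"a,1", "b,2"}},
		{Rows: []string{"a,1", "b,2", "c,3"}},
		{Rows: []string{"a,1", "c,3"}},
	}
	for i := 1; i < len(history); i++ {
		fmt.Printf("v%d: %s\n", i, diffString(history[i-1], history[i]))
	}
}
```

The catch is visible in the signature: `diffString` needs both versions, so it only works while the history is intact.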

That being said, once we're a "mature and stable project", people will start to put very large datasets into version control, at which point we'll need to start to think about shipping features that re-write dataset histories as diffs to save space. There's a situation where we may want this field.

Our first line of defense before resorting to rewriting history will be a semantic chunker (lining up IPFS blocks with dataset entries), which should bring a bunch of storage savings, especially in append-only style datasets.
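A rough sketch of why entry-aligned chunking helps append-only datasets: if every block holds exactly one entry, appending rows leaves all earlier blocks byte-identical, so only the new rows need storing. This assumes a newline-delimited body and uses SHA-256 hex as a stand-in for content addressing; `chunkByEntry` and `newBlocks` are hypothetical names, and a real chunker would presumably group many entries per block.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// chunkByEntry splits a newline-delimited body so that each block holds
// exactly one entry, and returns one content hash per block.
// SHA-256 hex is a stand-in for a real content-addressing scheme.
func chunkByEntry(body []byte) []string {
	var hashes []string
	for _, entry := range bytes.SplitAfter(body, []byte("\n")) {
		if len(entry) == 0 {
			continue // SplitAfter yields an empty tail after a trailing newline
		}
		sum := sha256.Sum256(entry)
		hashes = append(hashes, hex.EncodeToString(sum[:]))
	}
	return hashes
}

// newBlocks counts the blocks in next that are not already stored for prev.
func newBlocks(prev, next []string) int {
	seen := make(map[string]bool, len(prev))
	for _, h := range prev {
		seen[h] = true
	}
	n := 0
	for _, h := range next {
		if !seen[h] {
			n++
			seen[h] = true // identical new entries share one block
		}
	}
	return n
}

func main() {
	v1 := []byte("a,1\nb,2\n")
	v2 := []byte("a,1\nb,2\nc,3\n") // v1 plus one appended entry
	fmt.Println("blocks to store for v2:", newBlocks(chunkByEntry(v1), chunkByEntry(v2)))
}
```

With fixed-size chunking, by contrast, an insert near the top of the body shifts every later block boundary and changes every later hash.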

The second fallback will be a format much like git's packfiles, with a list of diffs instead of raw files. Even then, that may not be enough storage savings.
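As a loose illustration of "a list of diffs instead of raw files", the sketch below keeps one full base snapshot plus ordered row-level deltas, and rebuilds any version by replaying them. This is only an assumption about the shape such a format could take: `Pack`, `Delta`, and `Checkout` are hypothetical, and git's real packfiles use binary deltas rather than row lists.

```go
package main

import "fmt"

// Delta is a hypothetical row-level change set between two versions.
type Delta struct {
	Removed []string
	Added   []string
}

// Pack stores one full base snapshot plus an ordered list of deltas,
// instead of a full copy of every version.
type Pack struct {
	Base   []string
	Deltas []Delta
}

// Checkout rebuilds version n (0 is the base) by replaying deltas in order.
func (p Pack) Checkout(n int) ([]string, error) {
	if n < 0 || n > len(p.Deltas) {
		return nil, fmt.Errorf("version %d out of range", n)
	}
	rows := append([]string(nil), p.Base...) // copy so Base is never mutated
	for _, d := range p.Deltas[:n] {
		drop := make(map[string]int)
		for _, r := range d.Removed {
			drop[r]++
		}
		kept := rows[:0] // filter in place over our private copy
		for _, r := range rows {
			if drop[r] > 0 {
				drop[r]--
				continue
			}
			kept = append(kept, r)
		}
		rows = append(kept, d.Added...)
	}
	return rows, nil
}

func main() {
	p := Pack{
		Base: []string{"a,1", "b,2"},
		Deltas: []Delta{
			{Added: []string{"c,3"}},
			{Removed: []string{"b,2"}},
		},
	}
	rows, err := p.Checkout(2)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(rows)
}
```

The storage win is that each version costs only its delta; the price is replay time on checkout.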

At that point peers will want the option to destroy history data in the name of storage economy, keeping only the changelog. In this extreme case we may want to consider adding this Commit.Diff string field to have a clear annotation of what changed, because we'll no longer be able to generate it.

In the meantime tho, there's nothing stopping us from shipping a qri log --diff command that checks for "non-standard" commit messages and computes the diff string there, showing a log of only machine-generated diff messages.
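Here's a rough sketch of the logic such a `--diff` mode might run. The flag is only proposed above, so nothing here reflects an existing qri command or API: `Commit`, `diffTitle`, and `diffLog` are hypothetical, and the row-count message is a placeholder for whatever the differ really generates.

```go
package main

import "fmt"

// Commit is a pared-down, hypothetical commit record. Rows stands in for
// the dataset body at that commit.
type Commit struct {
	Title string
	Rows  int
}

// LogLine is one entry of the machine-generated log.
type LogLine struct {
	Message    string
	Recomputed bool // true when the stored title was human-written
}

// diffTitle is a placeholder for the message the differ would generate.
func diffTitle(prev, next Commit) string {
	return fmt.Sprintf("row count changed by %+d", next.Rows-prev.Rows)
}

// diffLog returns one machine-generated line per commit after the first.
// A stored title that differs from the generated one counts as
// "non-standard", and the diff string is shown in its place.
func diffLog(history []Commit) []LogLine {
	var lines []LogLine
	for i := 1; i < len(history); i++ {
		generated := diffTitle(history[i-1], history[i])
		lines = append(lines, LogLine{
			Message:    generated,
			Recomputed: history[i].Title != generated,
		})
	}
	return lines
}

func main() {
	history := []Commit{
		{Title: "created dataset", Rows: 2},
		{Title: "row count changed by +1", Rows: 3}, // machine-generated title
		{Title: "fix typo in city names", Rows: 3},  // human-written title
	}
	for _, l := range diffLog(history) {
		fmt.Printf("%s (recomputed: %v)\n", l.Message, l.Recomputed)
	}
}
```

Like any on-demand diff, this only works while both sides of each commit are still around to compare.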