attic-labs/noms

sync slow with large number of commits

Opened this issue · 12 comments

The nomfs repo that @ahl linked to in #2220 has 20K+ commits. Syncing each of these commits is currently serialized because we need to each commit chunk to discover the ref of the previous one. We may need to add a server endpoint to do some of this work on the low-latency side of the connection

Added format label because this probably implies changing the schema of Commit which may as well be a format change.

Here is a simple and concrete proposal for how to fix this:

// There are two fairly distinct use cases for a Commit struct
//
// 1. Working with the current value programmatically
// 2. Exploring history (e.g., for sync or noms log)
//
// Optimizing these two use cases requires conflicting strategies. For use case (1),
// we want to minimize serial fetches before we get to the actual user data. So we
// need to use as much of possible of the root 4k chunk for user data and get
// everything else (metadata, history) out-of-line.
//
// For use case (2), we want to get as much of the history graph as possible in each
// chunk. In particular, we want to get multiple levels of the graph in each chunk
// because typically the history graph is tall and narrow. So we need the user data
// and metadata out-of-line, and the parents (and their parents, and so on) inline.
//
// To address this conflict, this design splits Commit into two different structs. The
// new Commit struct is optimized for use case (1) and the HistoryEntry struct is
// optimized for use case (2).

struct Commit {
	// meta and history out-of-line so that as much of the user data as possible
	// ends up in root chunk.
	meta?: Ref<Struct>
	history: Ref<Set<HistoryEntry>>
	value: Value
}

struct HistoryEntry {
	// commit is out-of-line, but history is inline. Thus each chunk has only nodes
	// of history graph in it, therefore you get about 4k/20 ~= 200 nodes of the
	// graph in each chunk. commit details for each node can then be fetched in
	// parallel.
	commit: Ref<Commit>
	history: Set<HistoryEntry>
}

Is each HistoryEntry.history the set (i.e. usually just one) of immediate predecessors for the referenced commit? If not, how is the commit DAG navigated?

arv commented

Seems promising to me

Right now a collection with 1 item will always inline item (it effectively can't chunk), so the common case of a lineage of commits each with one parent wouldn't ever chunk.

We could change chunking behavior so that a collection of one element would chunk.

We could change chunking behavior so that a collection of one element would chunk.

seems reasonable to me

side note: I sorta feel like we should redefine "format change" to mean "if we make this change and you have an old noms codebase, it can't successfully talk to us anymore" - regardless of whether the format was actually changed.

Another possible solution:

struct Commit {
  meta?: Ref<Struct>
  parents: Set<Ref<Commit>>
  history: Ref<Map<Number, Set<Ref<Commit>>
}

The addition here is the reference to a new map which contains the full set of refs to commits prior to this one. The mapping is height(Ref<Commit>) -> Set<Ref<Commits with that height>>

This is useful for sync because the common thing to want is to quickly enumerate all commits between two points (a) -> (b). height(b) will always be > height(a), so just enumerating all commit refs with heights in that range will retrieve all of the commits you care about (it might return too many if there are branches which merge together, but it would frequently be exactly right.

Also, I think we could implement this without a noms version change. it's just an additional field on commit which sync, log or anything else can use if it's present to rapidly fetch a set of commits.

Lastly, for common cases of simple lineages of commits, the chunks in the map will dedup frequently because entries are effectively only ever added to the end of the map sequence.

This is a pretty good idea for a quick workaround that doesn't require a format change. Cool. I guess whether to do this depends on how complicated it is to maintain the fork in the code that uses this if it is available or falls back to not using it.

It's also worth remembering that your solution is asymptotically faster (and probably faster in practice) for the specific problem of sync, because it changes the problem from graph traversal to just an index scan.

New idea. Use a skip list. I.e. imagine using a prolly tree as an index. Observe that we only ever append to the end. If we use a prolly tree, we'll end up writing log n new chunks and reading log n chunks in order to mutate for each commit.

Using a skip list, means that we'll read (fewer than -on average- I think) log n chunks, but we write only the existing commit chunks.

We should be able to pick the level probability to have the same effective fan out as a prolly tree.