sync slow with large number of commits

Question

sync slow with large number of commits

Opened this issue 8 years ago · 12 comments

The nomfs repo that @ahl linked to in #2220 has 20K+ commits. Syncing each of these commits is currently serialized because we need to each commit chunk to discover the ref of the previous one. We may need to add a server endpoint to do some of this work on the low-latency side of the connection

Answer 1 · 2016-11-08T19:34:38.000Z

Added format label because this probably implies changing the schema of Commit which may as well be a format change.

Answer 2 · 2016-12-02T06:54:22.000Z

Here is a simple and concrete proposal for how to fix this:

// There are two fairly distinct use cases for a Commit struct
//
// 1. Working with the current value programmatically
// 2. Exploring history (e.g., for sync or noms log)
//
// Optimizing these two use cases requires conflicting strategies. For use case (1),
// we want to minimize serial fetches before we get to the actual user data. So we
// need to use as much of possible of the root 4k chunk for user data and get
// everything else (metadata, history) out-of-line.
//
// For use case (2), we want to get as much of the history graph as possible in each
// chunk. In particular, we want to get multiple levels of the graph in each chunk
// because typically the history graph is tall and narrow. So we need the user data
// and metadata out-of-line, and the parents (and their parents, and so on) inline.
//
// To address this conflict, this design splits Commit into two different structs. The
// new Commit struct is optimized for use case (1) and the HistoryEntry struct is
// optimized for use case (2).

struct Commit {
	// meta and history out-of-line so that as much of the user data as possible
	// ends up in root chunk.
	meta?: Ref<Struct>
	history: Ref<Set<HistoryEntry>>
	value: Value
}

struct HistoryEntry {
	// commit is out-of-line, but history is inline. Thus each chunk has only nodes
	// of history graph in it, therefore you get about 4k/20 ~= 200 nodes of the
	// graph in each chunk. commit details for each node can then be fetched in
	// parallel.
	commit: Ref<Commit>
	history: Set<HistoryEntry>
}

Answer 3 · 2016-12-02T15:25:37.000Z

Is each HistoryEntry.history the set (i.e. usually just one) of immediate predecessors for the referenced commit? If not, how is the commit DAG navigated?

Answer 4 · 2016-12-02T15:55:31.000Z

Yes. Because everything is inlined and the probability of any individual set item chunking still exists you should end up with tall skinny chunks of about 4K I think.

…

On Fri, Dec 2, 2016 at 7:25 AM Rafael Weinstein ***@***.***> wrote: Is each HistoryEntry.history the set (i.e. usually just one) of immediate predecessors for the referenced commit? If not, how is the commit DAG navigated? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#2233 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAE6BJ0LrDagcLLZbfias7HBQQ3vF0Iuks5rEDhygaJpZM4JaItc> .

Answer 5 · 2016-12-02T19:22:55.000Z

Seems promising to me

Answer 6 · 2016-12-02T19:28:54.000Z

Right now a collection with 1 item will always inline item (it effectively can't chunk), so the common case of a lineage of commits each with one parent wouldn't ever chunk.

We could change chunking behavior so that a collection of one element would chunk.

Answer 7 · 2016-12-02T19:34:12.000Z

We could change chunking behavior so that a collection of one element would chunk.

seems reasonable to me

Answer 8 · 2016-12-02T19:35:27.000Z

side note: I sorta feel like we should redefine "format change" to mean "if we make this change and you have an old noms codebase, it can't successfully talk to us anymore" - regardless of whether the format was actually changed.

Answer 9 · 2016-12-02T19:35:37.000Z

Another possible solution:

struct Commit {
  meta?: Ref<Struct>
  parents: Set<Ref<Commit>>
  history: Ref<Map<Number, Set<Ref<Commit>>
}

The addition here is the reference to a new map which contains the full set of refs to commits prior to this one. The mapping is height(Ref<Commit>) -> Set<Ref<Commits with that height>>

This is useful for sync because the common thing to want is to quickly enumerate all commits between two points (a) -> (b). height(b) will always be > height(a), so just enumerating all commit refs with heights in that range will retrieve all of the commits you care about (it might return too many if there are branches which merge together, but it would frequently be exactly right.

Also, I think we could implement this without a noms version change. it's just an additional field on commit which sync, log or anything else can use if it's present to rapidly fetch a set of commits.

Lastly, for common cases of simple lineages of commits, the chunks in the map will dedup frequently because entries are effectively only ever added to the end of the map sequence.

Answer 10 · 2016-12-02T20:01:16.000Z

This is a pretty good idea for a quick workaround that doesn't require a format change. Cool. I guess whether to do this depends on how complicated it is to maintain the fork in the code that uses this if it is available or falls back to not using it.

Answer 11 · 2016-12-02T20:11:30.000Z

It's also worth remembering that your solution is asymptotically faster (and probably faster in practice) for the specific problem of sync, because it changes the problem from graph traversal to just an index scan.

Answer 12 · 2017-10-10T22:23:07.000Z

New idea. Use a skip list. I.e. imagine using a prolly tree as an index. Observe that we only ever append to the end. If we use a prolly tree, we'll end up writing log n new chunks and reading log n chunks in order to mutate for each commit.

Using a skip list, means that we'll read (fewer than -on average- I think) log n chunks, but we write only the existing commit chunks.

We should be able to pick the level probability to have the same effective fan out as a prolly tree.