Level/levelup

Batch get option

NirmalVatsyayan opened this issue · 23 comments

Hi,
Is there any batch get option or trick to get values for multiple keys in a minimal number of passes? We have a use case where we need to get the values of 15k keys; firing that many individual get requests is probably not the right way.

Kindly assist.

Depending on what your keys look like, you could do a db.createReadStream() with a key interval that spans all of them and simply filter for the keys that you are interested in.

@juliangruber Any other ideas?

If the keys are adjacent, then db.createReadStream({ limit: 15000 }) will do the trick.

If there are small gaps (say you have keys a, b, c in the database and you want to get a and c), then @ralphtheninja's suggestion of filtering is fine.

If there are big gaps (you have keys a-z but only want to get a and z), you could consider using seek() on an iterator, but this is a lower-level mechanism and only supported by leveldown and not exposed in levelup. So only go there if you really need it.

If you can design your keys in such a way (making use of the lexicographical sort order) that they are indeed adjacent, that's the preferred way. You wouldn't need to skip anything.
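To make the filtering approach concrete, here's a minimal sketch. It assumes an object-mode stream of { key, value } entries, like the one returned by db.createReadStream({ gte, lte }); collectWanted is a hypothetical helper name, not part of the API:

```javascript
// Collect only the entries whose key is in `keys`, from any
// object-mode stream of { key, value } entries (such as the one
// returned by db.createReadStream({ gte: ..., lte: ... })).
function collectWanted (stream, keys) {
  const wanted = new Set(keys)
  const results = new Map()

  return new Promise((resolve, reject) => {
    stream
      .on('data', ({ key, value }) => {
        if (wanted.has(key)) results.set(key, value)
      })
      .on('error', reject)
      .on('end', () => resolve(results))
  })
}
```

Note that the whole range between the smallest and largest wanted key is read off disk, so this only pays off when the gaps are small.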

Retrieving a "bag of keys" is a pretty valid and common scenario which is not totally straightforward due to the async-ness of the API. The API already handles batches of noncontiguous writes, so it seems reasonable for it to handle batches of reads.

If you want to fetch noncontiguous values, at the moment you have to carefully read the whole README just to confirm that there isn't a builtin, and then you have to work out some boilerplate incantation to wire up a sequence of gets and package up the result (slightly tricky owing to the async nature of the API).

Changing the API would be a bit drastic at this stage in the 2.x.x release, but I think the lib would be a lot more accessible if there was at least a quick novice-friendly copy-paste example for "batch get" in the docs.
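For what it's worth, such a copy-paste "batch get" could look something like the sketch below, assuming the callback-style db.get() of levelup (batchGet is a hypothetical name, not part of the API):

```javascript
// A naive "batch get": fire one get() per key in parallel and
// resolve to an array of values, with undefined for missing keys.
function batchGet (db, keys) {
  return Promise.all(keys.map(key => new Promise((resolve, reject) => {
    db.get(key, (err, value) => {
      if (err && err.notFound) return resolve(undefined)
      if (err) return reject(err)
      resolve(value)
    })
  })))
}
```

Note that this gives no snapshot guarantees: each get() may see a different state of the database.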

@fergiemcdowall Can you share a use case? I'd try to shift the thinking to storing a bag of keys. Key design is important. If the keys are contiguous on disk, reading is too. In other words, if you need to read a lot of keys that "belong together", shouldn't they be stored together?

Indexes are another option.

A "batch get" is doable to implement BTW, with leveldown at least. Performance might depend on whether the input keys are sorted because forward seeks are faster. In any case it would beat multiple gets, Γ‘nd give you snapshot guarantees. But I'm not yet convinced levelup should have this interface.

@fergiemcdowall Can you share a use case? I'd try to shift the thinking to storing a bag of keys. Key design is important. If the keys are contiguous on disk, reading is too. In other words, if you need to read a lot of keys that "belong together", shouldn't they be stored together?

@vweevers I get why you are saying this: one of the great things about leveldb is that it handles ranges really well, and doing clever things with sorted indexes is where leveldb really shines.

That said, there are plenty of situations where you have to grab a set of keys that are spread out all over an index. For example: I use level mostly to build the search-index module. search-index might handle a search for "The quick brown fox jumps over the lazy dog", in which case it looks for keys related to 'quick', 'brown', 'fox', 'jumps', 'over', 'lazy' and 'dog'. There is no clever key structure that can make these arbitrary, unpredictable keys exist contiguously in the index.

Although I use the hell out of .createReadStream(), I also have quite a few "bag of keys" scenarios. A "batch get" would be mostly about reducing "code smell" for level in these instances, and if there was actually a performance benefit to boot, as you mention that there might be, then that would be even better.

Good point. I've had similar use cases. There's a difference though between needing a few keys (quick brown fox) and 15k. So I'm interested in the specifics of your use case too @NirmalVatsyayan.

I think, rather than a batch get, I prefer to officially expose iterators with a seek() method. This not only enables fetching multiple non-contiguous keys but also getting multiple ranges of keys. You seek, read some, seek again. Also you get backpressure and can do things like skip scans (for compound aka composite indexes).

For the simple use cases, "batch get" can be a userland module on top of iterators.

It's also doable in memdown (because there, iterators operate on an immutable tree; seeking is theoretically possible) and level.js (without backpressure though, because of how IndexedDB cursors work). If we choose to forego snapshot guarantees, then it's doable everywhere, but probably not with the same performance.

@vweevers I am using the cache for re-ranking of results from a support vector machine: I get the N most similar items for some input vector and need to re-rank them based on some business logic (fuzzy match + noun intersection etc.), so the keys could be anything in the data set. Basically I need to cache 2-3 million objects and want to access any random set of 15k of them.

@dominictarr this sounds like something more up your alley. Any advice for @NirmalVatsyayan?

@vweevers I prefer to officially expose iterators with a seek() method.

I was thinking about this too. For those *-down without iterator.seek(), we may provide a polyfill in levelup by swapping internal iterators:

function noop () {}

Iterator.prototype.seek = function (target) {

	// Native implementation
	if (typeof this._iterator.seek === 'function') {
		this._iterator.seek(target);
		return this;
	}

	// Polyfill

	// 1. End the internal iterator
	this._iterator.end(noop);

	// 2. New options
	var options = Object.assign({}, this._options);
	options[this._options.reverse === true ? 'lte' : 'gte'] = target;
	// ... what about 'limit'?

	// 3. Swap!
	this._iterator = this._db.iterator(options);

	return this;
};

@peakji what was your use case for seeking again? Just so we have a full picture (and another argument for exposing iterators 😉).

Also do you need snapshot guarantees? Because with this proposed polyfill, seek() operates on a new snapshot. If we can't provide consistent semantics, I suggest to just throw if the *down doesn't support seek().

@vweevers I'm using seek() to collect the intersection of two sorted lists (contiguous keys).
For example, if we have two sorted lists stored in two namespaces, sharing the same ID scheme:

A = [a:001, a:002, a:008, ... , a:070, a:089, a:091, a:111]
B = [b:072, b:089, b:091, b:200]

If we want to get the intersection (089 and 091), we only need two iterators (a:* and b:*) and several skips:

a:001 -> b:072 -> a:089 -> b:089 -> a:091 -> b:091 -> ...

This is much faster than naive scanning on large datasets, because we are able to skip a lot of keys (a:002 to a:070).
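The skipping pattern can be illustrated on plain sorted arrays (a sketch of the algorithm only; in the real thing, each inner loop below would be a single iterator.seek()):

```javascript
// Leapfrog intersection of two sorted arrays: each side jumps
// forward to the other side's current value instead of scanning
// one element at a time.
function intersect (a, b) {
  const out = []
  let i = 0
  let j = 0

  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      out.push(a[i])
      i++
      j++
    } else if (a[i] < b[j]) {
      // "seek" a forward to the first element >= b[j]
      while (i < a.length && a[i] < b[j]) i++
    } else {
      // "seek" b forward to the first element >= a[i]
      while (j < b.length && b[j] < a[i]) j++
    }
  }

  return out
}
```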

Also do you need snapshot guarantees?

Ahh... I forgot the snapshot thing... 👍 for throwing an error.

A little history:

  • @dominictarr proposed multiread (aka "batch get", "multiget") 4 years ago (Level/leveldown#13)
  • @juliangruber said "With a multiget the gain would increase with the numbers of keys given. I'm not sure yet about when this would pay off. Someone would have to do an implementation so we can see."
  • @rvagg warned "of potential size problems, perhaps we should put an upper limit on it or at least enforce the use of a limit property so at least they know what they are doing."
  • @dominictarr replied "I'm opposed to adding safety features to leveldown", referencing The Hole Hawg: "It is dangerous because it does exactly what you tell it to. [..] The danger lies not in the machine itself but in the user's failure to envision the full consequences of the instructions he gives to it."

  • @juliangruber proposed exposing snapshots (#138), which would allow multiple operations on a single snapshot. No compelling use cases were found.


  • @MeirionHughes proposed implementing Symbol.asyncIterator (#477) (userland for now, levelup or abstract-leveldown iterators later).

Hi, I'm currently trying to implement a multi-get with an iterator and seek(). My use case: we have a db with around 40m records in it, and we have to get a random set of up to 15k of them per lookup.

So for our use case (a random sparse list of keys), I have an implementation a bit like below:

const iterator = level.iterator();

// Wrap next() in a promise
const next = () => {
  return new Promise((resolve, reject) => {
    iterator.next((err, key, value) => {
      if (err) return reject(err);
      resolve({ key, value });
    });
  });
};

for (const key of sortedKeys) {
  iterator.seek(key);
  const { value } = await next();
  // do something
}

Or should I just use next() to iterate over every possible value and check it against the target keys?
My gut feeling is that using seek will be much faster because it's possible to skip a potentially big gap.

Also I have a question about seek behaviour. In the documentation it says:

By calling seek(key), subsequent calls to next(cb) will return key/values larger or smaller than key, based on your reverse setting in the iterator constructor.

Does this mean that after calling seek(key), next() will land on a key that is larger than but not equal to the given key? If that's the case, maybe I should offset the key by 1 character code before calling seek?

I guess for my use case, the ideal method would be the proposed multiGet feature.

By calling seek(key), subsequent calls to next(cb) will return key/values larger or smaller than key, based on your reverse setting in the iterator constructor.

Where is it documented like that? We should update that, because in abstract-leveldown we have a more precise explanation that should answer your question:

Seek the iterator to a given key or the closest key. Subsequent calls to iterator.next() will yield entries with keys equal to or larger than target, or equal to or smaller than target if the reverse option passed to db.iterator() was true.
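So seek(target) lands on target itself when it exists, and otherwise on the nearest key past it; there's no need to offset the key. On a plain sorted array of keys, a non-reverse seek resolves like this (a sketch of the semantics, not the actual implementation):

```javascript
// Which key a non-reverse seek(target) positions the iterator on:
// the first key equal to or larger than target, or no key at all
// (undefined here) when the iterator is exhausted.
function seekTarget (sortedKeys, target) {
  return sortedKeys.find(key => key >= target)
}
```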

Thanks for the quick response, the doc is in https://github.com/Level/leveldown

My gut feeling is that using seek will be much faster because it's possible to skip a potentially big gap.

Potentially, yes. With leveldown it's recommended to use db.iterator({ highWaterMark }) (Level/leveldown#468 (comment)).

I'm actually using rocksdb in my project, and was passing { gte, lte } based on the sorted keys when I created the iterator. I assume that will have the same effect as highWaterMark?

rocksdb also has the highWaterMark option. If you're expecting gaps between the keys (within your gte/lte range) then it can significantly boost perf.

Thanks, that'll be really useful. I've investigated a bit around highWaterMark, and it seems there is an internal batchNext function as well. Should I use batchNext together with highWaterMark, or will a normal seek benefit from it as well?

That doesn't ring a bell, can you link to batchNext in source code?

Nah, sorry to bother you. I found it in a third-party library and made a bad assumption that it was calling an internal method of leveldown. Thanks for the help again, that's really useful.