stephenmathieson/node-sophist

iterator works slow

gabrieldodan opened this issue · 9 comments

Hi, I have made some performance tests and I noticed that the iterator works very slowly. I think it's because of the callback for each iteration. On each iteration it calls the callback function, and that's expensive. I think a sync version of the next() function would be very useful, e.g. iterator.nextSync(), which should return the current key/value pair. What do you think?

Besides that, the library is really fast!
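
To make the suggestion concrete, here is a rough sketch of the two call patterns (db is assumed to be an already-opened sophist instance; the (err, key, value) callback shape and the end() call are assumptions, and nextSync() is of course only the proposal at this point):

// today: one JS -> C++ -> JS round trip plus a callback invocation per key
var it = db.iterator();
(function walk() {
  it.next(function (err, key, value) {
    if (err) throw err;
    if (key == null) return it.end(function () {}); // exhausted
    // ... use key/value ...
    walk();
  });
})();

// proposed: a plain loop, no callback per key
var it2 = db.iterator();
var pair;
while ((pair = it2.nextSync())) { // assumed to return { key, value }, or undefined when done
  // ... use pair.key / pair.value ...
}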

hmm yeah a sync version may be nice. I haven't hit performance issues with #next(), but we do jump from JS to c++ a lot, so I believe it may be slow.

the next addition I make to this lib should be node 0.11 support, but I may find time for sync iterators. if you opened a PR, I'd happily merge, btw ;)

Thanks :) Most probably it is slow because of the callbacks. I have also compared the performance of async get/set vs sync get/set; the async versions are much slower.

Also, a new suggestion: it would be very useful to have batch operations, for example reading multiple keys in one shot, functionality somewhat similar to Redis MULTI, e.g. getMulti(['k1', 'k2', 'k3'], cb), where cb is cb(err, values). This way you avoid so many jumps from JS to C++ and call only one callback ;)
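
Something like this in userland already gives the shape I mean, but it still makes one round trip per key, which is exactly what a native version would avoid (db is assumed to be an open sophist instance with an async get(key, cb); getMulti is just the name I'm suggesting):

function getMulti(db, keys, cb) {
  var values = [];
  var pending = keys.length;
  var failed = false;
  if (!pending) return cb(null, values);
  keys.forEach(function (key, i) {
    db.get(key, function (err, value) {
      if (failed) return;
      if (err) { failed = true; return cb(err); }
      values[i] = value;
      if (--pending === 0) cb(null, values);
    });
  });
}

// usage, matching the suggested signature
getMulti(db, ['k1', 'k2', 'k3'], function (err, values) {
  // values = [value of k1, value of k2, value of k3]
});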

yeah, the async stuff has to:

  1. cross from JS to c++
  2. be added to the libuv pool
  3. wait its turn for execution
  4. actually run
  5. callback js

whereas the sync versions:

  1. cross from JS to c++
  2. run

quite a few less things to do ;)
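
if you want to see that overhead directly, a crude timing like this works (db is assumed to be an open sophist instance already holding key0..key9999, and getSync() is assumed here because you mentioned comparing sync get/set):

var N = 10000;
var start = Date.now();
var remaining = N;

// fire N async gets, then time N sync gets once the last callback lands
for (var i = 0; i < N; i++) {
  db.get('key' + i, function (err) {
    if (err) throw err;
    if (--remaining) return;
    console.log('async get: %dms', Date.now() - start);

    start = Date.now();
    for (var j = 0; j < N; j++) db.getSync('key' + j);
    console.log('sync get:  %dms', Date.now() - start);
  });
}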

a multi-get makes sense too. it falls outside the scope of the iterator, unless we'd do something like:

var iterator = db.iterator()
iterator.next(5 /* get the next 5 keys */, function (err, values) {
  if (err) handleError(err)
  // values = [ ... ]
})
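
for now you could fake it in userland on top of #next(), roughly like this (the (err, key, value) callback shape is assumed, and nextN is just an illustrative name):

function nextN(iterator, n, cb) {
  var results = [];
  (function step() {
    iterator.next(function (err, key, value) {
      if (err) return cb(err);
      if (key == null) return cb(null, results); // iterator exhausted
      results.push([key, value]);
      if (results.length === n) return cb(null, results);
      step();
    });
  })();
}

// usage
nextN(iterator, 5, function (err, results) {
  // results = [ [key, value], ... ] (at most 5 pairs)
});

this only saves you the API shape though, not the JS <-> C++ crossings.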

also, if you run $ make benchmarks in the root of this repo, you'll be able to see iterator performance vs LevelDB. i think you'll be impressed ;)

edit:

                      Sophist#iterator
              70 op/s » iterate 1000 keys

                      LevelDOWN#iterator
              46 op/s » iterate 1000 keys

I am already impressed by this library :) It seems perfect for my needs. It is really fast, except for the async iterators. According to the Sophia benchmarks (http://sphia.org/benchmarks.html), reading data using iterators should be much, much faster than using get(key).

iterator.next(5, cb) would be useful too, but not as critical as a sync iterator. I have run many tests with thousands of databases and hundreds of keys per database, and it reads data very fast; I am curious how fast it reads using a sync iterator. Any idea how long it would take you to implement the sync iterator? If you are very busy I could take a look at the C++ source; maybe I can implement the sync iterator myself.

a sync #next() shouldn't be hard. the async method is here.

for sync, maybe something like (this has not been tested!):

NAN_METHOD(Iterator::NextSync) {
  NanScope();
  // unwrap the JS object back to the native Iterator wrapper
  Iterator *self = ObjectWrap::Unwrap<Iterator>(args.This());
  // advance the underlying Sophia cursor; NULL means the iterator is exhausted
  sophia::IteratorResult *result = self->it->Next();
  if (result) {
    // hand the pair back to JS as { key: ..., value: ... }
    v8::Local<v8::Object> obj = NanNew<v8::Object>();
    obj->Set(NanNew("key"), NanNew(result->key));
    obj->Set(NanNew("value"), NanNew(result->value));
    // NOTE: depending on who owns `result`, it may need to be freed here
    NanReturnValue(obj);
  } else {
    NanReturnUndefined();
  }
}

@gabrieldodan I've been working on a new release (1.0.0). I've added Iterator#nextSync() as well as Iterator#endSync(). If you've got time, check out the wip-node-11 branch ;)
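
usage should end up looking roughly like this (treat it as a sketch: the { key, value } return shape of nextSync() follows the untested C++ snippet above, and the constructor/open/close calls are assumed):

var Sophist = require('sophist');
var db = new Sophist('./testdb');

db.open(function (err) {
  if (err) throw err;
  var it = db.iterator();
  var pair;
  // walk every key/value pair without a callback per key
  while ((pair = it.nextSync())) {
    console.log('%s = %s', pair.key, pair.value);
  }
  it.endSync();
  db.close(function () {});
});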

Thanks, I will do some tests tomorrow :) Also, if you could implement multi-read, that would be great! The idea is to read a batch of data in only one call, to avoid too many jumps from JS to C++. An iterator like this:

iterate({start: 'k1', end:'k100', reverse: false}, function (err, values) {
});

iterate({start: 'k1', limit:20, reverse: false}, function (err, values) { // read only 20 records starting from k1
});

It doesn't make sense to return object by object; do the iteration in C++ and give me only the results! An option to return only the keys, only the values, or both would also be useful. Thanks.
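
In the meantime, something like this in userland captures what I mean, though it still crosses into C++ for every key, uses naive string comparison for the key range, and ignores reverse (db is assumed to be an open sophist instance, and the (err, key, value) callback shape is assumed):

function iterate(db, opts, cb) {
  var it = db.iterator();
  var values = [];
  (function step() {
    it.next(function (err, key, value) {
      if (err) return it.end(function () { cb(err); });
      var done = key == null
        || (opts.end && key > opts.end)
        || (opts.limit && values.length >= opts.limit);
      if (done) return it.end(function (endErr) { cb(endErr, values); });
      if (!opts.start || key >= opts.start) values.push(value);
      step();
    });
  })();
}

iterate(db, { start: 'k1', limit: 20 }, function (err, values) {
  // values = the first 20 values at or after 'k1'
});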

hmm.. iterating like that over a large repository would cause significant performance issues when dealing with any sort of concurrency. assuming we'd only push one task into uv's queue (read everything), everything else would be blocked (http stuff, other iterators, etc.).

now, if we queued a job for every read in uv's pool, we'd avoid blocking the main thread, but the iterator would be slow (again). all we'd do is reduce the number of times we go from js <-> c++.

also: I've opened a separate thread: #13.

version 1.0.0 now has Iterator#nextSync() implemented.

Will address Iterator#next(N, fn) later on.