hmalphettes/redisdown

Support for 'implicit snapshot iterator'

hmalphettes opened this issue · 22 comments

"Redisdown fails the iterator test 'iterator create snapshot correctly' where an iterator is created, a record removed and the iterator still successfully finds the removed record." (thanks @maxogden for the edit)

I don't know how to support this behavior with redis at the moment.

More info about abstract-leveldown (where the test suite comes from): https://github.com/rvagg/abstract-leveldown#abstract-leveldown-

and also the test in question: https://github.com/rvagg/abstract-leveldown/blob/master/abstract/iterator-test.js#L458

@hmalphettes I think your description is slightly wrong. It should be:

"Redisdown fails the iterator test 'iterator create snapshot correctly' where an iterator is created, a record removed and the iterator still successfully finds the removed record."

The idea of snapshots is that when you open an iterator it also takes a snapshot of the db state, so that the iterator will iterate against the rows as they were when the snapshot was taken. Meanwhile, outside of the iterator, other operations can change rows at the same time, but the iterator will keep returning the versions from the snapshot.

NHQ commented

is this done to a single record or the whole db?

@NHQ leveldb snapshots are for the entire state of the db at the time of the snapshot

NHQ commented

I believe redis has to be configured to make snapshots: http://redis.io/topics/persistence

@NHQ As far as I understand we can't take advantage of this: it only controls how often the redis content is dumped to the file system in case of a catastrophic failure.

NHQ commented

you can tell redis to snapshot on the fly http://redis.io/commands/bgsave
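
For example, from redis-cli:

127.0.0.1:6379> bgsave
Background saving started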

@NHQ, oh ok. Is there a way to make queries against such a snapshot?

NHQ commented

my guess is to start a new redis process configured to read the new .rdb file and serve clients on another port
like this http://serverfault.com/questions/393169/how-do-i-load-the-dump-rdb-file-into-redis

@NHQ that would not work against a standard rediscloud instance for example.
A variation: make a clone of the data when the iterator is started and query that dataset.

To put things in context, it is not clear that this feature is a core requirement of an abstract-level implementation: Level/abstract-leveldown#38 (comment)

It might remain a leveldown-only feature.
It is expected that only a subset of the abstract-leveldown implementations can support this.
It is not documented in the levelup API.

The agreement at this point is to document the difference in behavior with leveldown when it is not supported.

I am closing this issue for now. We can re-open if we find a fantastic way to support this.

If I understand it correctly, this library uses a zset as the backing store for the level data in Redis, correct?

If so, you could just use zunionstore to create a "snapshot"
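
As a rough sketch with hypothetical key names: ZUNIONSTORE with a single source key copies the sorted set, and later writes to the original no longer affect the copy:

127.0.0.1:6379> zadd zs 0 A 0 B 0 C
(integer) 3
127.0.0.1:6379> zunionstore zs:snapshot 1 zs
(integer) 3
127.0.0.1:6379> zrem zs B
(integer) 1
127.0.0.1:6379> zrangebylex zs:snapshot - +
1) "A"
2) "B"
3) "C"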

@brycebaril correct it is a zset. Great, I have never used a zunionstore before. Let me see if I can figure that out.

@brycebaril we are using a zset to store the keys and an hset to store the values.
Can you give a very high level overview of what you have in mind to support snapshots?

Ahh, so the index and data are split into two different locations, that complicates things. If it was only the ZSET then simply zunionstore could create a copy of the ZSET as the snapshot.

Given the split storage, this would mean that the snapshot index (zset) could point to data that is either changed or removed from the data backing (hash). You could do the "snapshot" locally to the node process (ship the entire zset and hash) or partially (zunionstore and ship only the hash) but that results in some memory overhead and result processing for data you may not need.

I'm assuming the O(log(N)) lookup for a single value (e.g. for GET) is why you didn't go with putting all of the data in the ZSET? Other than that I think it would work. For a read-stream-heavy use case, not having to go to the hash for each record would likely improve performance.

Thanks a lot for the review!

Regarding performance of read heavy streams, iterators are using lua to fetch batches of the values in a single redis call: https://github.com/hmalphettes/redisdown/tree/master/luascripts
Let me know if you think that would deserve some benchmarking.

I did not put all the data in the ZSET mostly because I did not think that it was possible.
Are you suggesting to encode the key and value together in the ZSET by concatenating them?

That implementation would be a lot simpler at the cost of O(log(N)) to look up values by key.
Writes would be more costly: first search for the key and delete it (ZRANGEBYLEX+ZREM); then write (ZADD).
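
A rough sketch of such an update, assuming a hypothetical key:value member encoding with ':' as the separator (a real encoding would need a separator byte that cannot appear in keys):

127.0.0.1:6379> zadd zs 0 "mykey:oldvalue"
(integer) 1
127.0.0.1:6379> zrangebylex zs "[mykey:" + limit 0 1
1) "mykey:oldvalue"
127.0.0.1:6379> zrem zs "mykey:oldvalue"
(integer) 1
127.0.0.1:6379> zadd zs 0 "mykey:newvalue"
(integer) 1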

If complexity and size of the collection are not the issue we could support a hybrid:
keys and values in a ZSET for iterations and a separate HSET maintained for instant key lookups.

I am not familiar with enough levelup use cases to really decide which implementation is best.

What do you think?

The other problem I just thought of with concatenating the key/value into the ZSET is that it incurs a removal cost for updates: in the current setup an update to an existing key is just an HSET and a noop ZADD, but in a ZSET-only world you'd need to remove the ZSET member that contains the old data.

The other general problem with "snapshotting" on Redis is that you're doubling (at least) your memory footprint in Redis whenever you perform it.

In terms of Level use-cases, it depends on how strongly you want to guarantee that readstreams are read from snapshot data. I will try and find some time to read through iterator.js in detail to see if this is being violated or not -- at first glance it looks like the stream may be virtual in that you're fetching all of the data at once and buffering it locally vs. using some sort of cursor and re-fetching per batch...

At this point, I don't think that snapshotting beyond generating consistent read streams is a commonly-used (or used at all) feature of levelup. If the streams here (even if they are non-streaming from Redis and locally buffered as streams) are consistent, I don't think you really need to support snapshotting.

To do this truly natively streaming you'd need a lexicographic version of ZSCAN in conjunction with ZUNIONSTORE for snapshotting. I don't know offhand if ZSCAN reliably replies in lexicographic order -- it seems to, but most likely it isn't guaranteed.

When the dataset becomes so large that you can't buffer locally, the best one could probably do is emulate it with ZUNIONSTORE and a cursor on the Node.js side using ZRANGEBYLEX with a LIMIT that re-queries the snapshot for each batch. E.g. see https://www.npmjs.org/package/redis-scanstreams for inspiration.
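
A rough sketch of that emulation with hypothetical key names (assuming a zset zs holding A, B, R, Y, Z): copy the zset once, page through the copy with ZRANGEBYLEX and LIMIT, and delete the copy when the iterator is done:

127.0.0.1:6379> zunionstore zs:snap 1 zs
(integer) 5
127.0.0.1:6379> zrangebylex zs:snap - + limit 0 2
1) "A"
2) "B"
127.0.0.1:6379> zrangebylex zs:snap - + limit 2 2
1) "R"
2) "Y"
127.0.0.1:6379> zrangebylex zs:snap - + limit 4 2
1) "Z"
127.0.0.1:6379> del zs:snap
(integer) 1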

Both of these would work if the backing store was the single ZSET, though you might be able to do basically the same thing if it was easy enough to write a HASH copy LUA command that was performant. Then in a single "snapshot" command, you could ZUNIONSTORE to copy the ZSET, then do the HASH copy to get an in-redis-memory snapshot of the data to stream from.
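
A rough sketch of such a hash-copy script run through EVAL, with hypothetical key names (Lua's unpack has a stack limit of a few thousand elements, so a real script would need to copy large hashes in chunks, e.g. with HSCAN):

127.0.0.1:6379> eval "local d = redis.call('HGETALL', KEYS[1]) if #d > 0 then redis.call('HMSET', KEYS[2], unpack(d)) end return #d / 2" 2 h h:snapshot
(integer) 3

The reply is the number of fields copied.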

OK @brycebaril thanks again for all the ideas.

iterator.js fetches the values from redis little by little: not everything at once.
The number of values fetched per batch is the value of the highWaterMark option: 256 by default.
Let me know if we could do better than that.

The ZSCAN documentation says there is no guarantee that elements won't be returned multiple times, and it is silent regarding the order.

So snapshot support in redis would require cloning the data: prohibitive in many situations.

Once we guarantee that even if we add or remove objects during an iteration we still see values only once and in the right order (#11), that seems very near to being a consistent read-stream.

Another alternative to snapshots would be a mutex lock on top of the redis API.

@hmalphettes ok, I gave it a good read-through and I see how you're batching.

Unfortunately without a snapshotting solution this is very problematic.

Unfortunately this doesn't provide a consistent read as from a snapshot (or as LevelDB does), which is the problem we've discussed above: since records can be removed or inserted between index reads, results can repeat, existing records can be skipped, and new records can be returned inconsistently depending on the current offset.

Example: (let's say the HWM is artificially set to 1)

127.0.0.1:6379> del zs
(integer) 1
127.0.0.1:6379> zadd zs 0 A 0 B 0 R 0 Y 0 Z
(integer) 5
127.0.0.1:6379> zrangebylex zs - + limit 0 1
1) "A"
127.0.0.1:6379> zrangebylex zs - + limit 1 1
1) "B"
127.0.0.1:6379> zrangebylex zs - + limit 2 1
1) "R"
127.0.0.1:6379> zadd zs 0 D
(integer) 1
127.0.0.1:6379> zadd zs 0 E
(integer) 1
127.0.0.1:6379> zrangebylex zs - + limit 3 1
1) "E"
127.0.0.1:6379> zrem zs A
(integer) 1
127.0.0.1:6379> zrem zs B
(integer) 1
127.0.0.1:6379> zrangebylex zs - + limit 4 1
1) "Z"
127.0.0.1:6379> zrangebylex zs - + limit 5 1
(empty list or set)

We never see D, but we do see E even though it was inserted after the iteration started, and Y is missed even though it existed prior to the first streamed value.

Back to alternative solutions, one other way is to harness Redis' replication for this -- while it would limit the number of snapshots you could have at any time, you could read the "snapshot" from a replica after disabling replication on it, and re-enable replication once you are done with the snapshot.
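
A rough sketch, assuming a replica already running on port 6380: detach it so that its dataset is frozen, read the "snapshot" from it, and re-attach it when done (which triggers a resync):

127.0.0.1:6380> slaveof no one
OK
... read the frozen "snapshot" from port 6380 ...
127.0.0.1:6380> slaveof 127.0.0.1 6379
OK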

@brycebaril yes understood, so the fix to avoid this very inconsistent read-stream consists of not using an offset between batches.
Instead the iterator will keep track of the last value fetched and make a new range query where the excluded starting key is the last fetched key: #11

This won't give us a snapshot but it will guarantee that we don't skip values or get them multiple times.

That is the same short-term fix I came up with. When it comes down to it, not all back-ends can provide the exact same feature-sets; I think this is something that the level ecosystem should not only accept but welcome. Pick the right tool for the job -- if all the back-ends were exactly the same feature-wise it wouldn't make sense to have more than one.

127.0.0.1:6379> del zs
(integer) 1
127.0.0.1:6379> zadd zs 0 A 0 B 0 R 0 Y 0 Z
(integer) 5
127.0.0.1:6379> zrangebylex zs - + limit 1
(error) ERR syntax error
127.0.0.1:6379> zrangebylex zs - + limit 0 1
1) "A"
127.0.0.1:6379> zrangebylex zs (A + limit 0 1
1) "B"
127.0.0.1:6379> zrangebylex zs (B + limit 0 1
1) "R"
127.0.0.1:6379> zadd zs 0 D 0 E
(integer) 2
127.0.0.1:6379> zrangebylex zs (R + limit 0 1
1) "Y"
127.0.0.1:6379> zrem zs A B
(integer) 2
127.0.0.1:6379> zrangebylex zs (Y + limit 0 1
1) "Z"
127.0.0.1:6379> zrangebylex zs (Z + limit 0 1
(empty list or set)

👍

Thanks a lot for the review; very much appreciated (sorry I keep repeating myself).
Great, I'll pick up #11 shortly.