Level/levelup

Bite the bullet and remove WriteStream completely?

rvagg opened this issue ยท 42 comments

WriteStream is only there for the sake of symmetry but that's not a great reason in itself if there are better reasons for it not to be there. I think we're starting to realise that there are different use-cases that require different implementations and we're already seeing some alternative implementations so why not set them free to coexist and compete in user-land?

Discuss

If we conclude that it should be removed then we can add documentation to the ReadStream section saying that if you want a WriteStream then go to a particular place on the wiki to see a list of implementations and what they are good for. We could even dedicate a whole wiki page to this if need be.

if we concluded that it should stay, then I'll try and get these competing pull requests for new WriteStreams reconciled and sort something out because having a streams2 ReadStream and a streams1 WriteStream is a mess and this needs to be resolved.

+1 for moving it into userland! levelup should only maintain for common cases. do like we want with node core, expose the critical primitives and let userland do the bike-shedding.

๐Ÿ‘ for user land as well.

My userland precursor to level.js (levelup on top of indexeddb https://github.com/Raynos/levelidb/blob/master/stream.js#L2) also had write-stream as a user land module rather then being implemented as a core primitive.

Hmm, I'm less enthusiastic about moving it to userland. I've used WriteStream enough to make me worry that a userland implementation would fracture support when combining level-* modules. For example, level-version wraps WriteStream with special behavior to handle the additional version metadata. Having WriteStream in the core (and thus wrapped by level-version) means that level-gc can assume that level-version handles anything version-related with WriteStream.

The ReadStream/WriteStream symmetry I think is important for modules given it is the most logical way to do things like replication, which to me feels like an important common use-case.

It feels levelup core to me.

@brycebaril I think you make good/fair points, except i'd say that streaming APIs aren't necessarily coupled with replication schemes.

@brycebaril what's stopping you from wrapping the actual primitives that any WriteStream implementation will have to use? You're still in a situation where you may have to deal with alternative WriteStream implementations as people may have data that's suited to a different mechanism, like @maxogden is finding out. If you can wrap & extend only the primitives then you have the most portable solution in any case.

Having fewer "primtives" to deal with ought to be a bonus for libraries extending levelup, WriteStream is a false primitive because it just builds on the write operations (put, del, batch).

No, that's fine -- any sufficient levelup wrapper should fully wrap the put/del/batch primitives, so I guess any WriteStream library that correctly wrapped the primitives should plug into anything else.

I'm still in favor of keeping it for the symmetry, but I guess I haven't fully kept on top of why the WriteStream implementations differ in ways that make them more/less suitable for different use-cases. Is there a summary somewhere I can catch up on?

I'm not sure that I can get behind the 'less work for an extending library' argument unless we can also do something like get rid of one of the two batch implementations. Each of those requires significantly more work than wrapping WriteStream.

So far it looks like I'm the lone supporter of WriteStream, and I can certainly live without it, though one of the things that I feel sells this database is the native Streams support. I can't help but feel removing streaming features will chip away at that veneer of magic.

One the one โœ‹ I agree with @brycebaril about the read/write API parity being convenient, on the other โœ‹ I don't think whatever writestream implementation gets included will ever be the fastest possible way to write data into leveldb for all use cases.

Basically, if you are using leveldown, and you wanna optimize write throughput, you wanna write data in batches no larger than the leveldb writeBufferSize. The logic for doing that is sufficiently complex to warrant it's module (byte-stream), so including byte-stream or something like it in levelup wouldn't really make sense as the approach there might not make any sense at all with other backends.

So in the case of using leveldown you will end up using a third party module for write streams anyway because it will be faster, which begs the question "why have a write stream in levelup at all?".

So it kind of boils down to keeping them in for convenience/API parity with read streams, which means that its that much more API surface area to support in levelup which means that people might ask questions like 'why am I only getting 2000 inserts per second when the leveldb benchmarks say I should be getting 100,000?' because they used the write stream that came with levelup but didn't realize they needed to install a third party module that fit their use case.

These kinds of API decisions are hard to make because you have to anticipate what complex effects they will have down the road. My gut feeling would be to err on the side of limited API, but I'm not really 100% confident either way on this issue.

And you can keep level module (bundle) as user- (consumer-) friendly alternative for regular peps, as plug-n-play solution and example of how modules work together.

I'm definitely ๐Ÿ‘ on this.

First, it reduces API area. Less stuff to maintain!
Second, you end up having a different implementation anyway.

I'm with @alexindigo that we should 'bless' an implementation and add it to the level module.

I think we can deprecate it in the next release, and then remove it in a month or so.

Sounds like there is unanimous agreement. Removing it means less surface area, fewer primitives to wrap and less to maintain in the current code. :)

pgte commented

I'm also +1 on this one. From the benchmarks we saw that optimizing for different use cases is nearly impossible, so the writestream needs to be pushed into userland since any default "optimization" will most probably be wrong.

Also, this opens up even more the debate and documentation of write performance, which is always good.

I think we should mantain some kind of high-level tests for a writestream to make them interchangeable.

We could maintain a reference implementation that does the simplest thing, like @substack's PR does now, it could live in a repo under the level org and have a set of reference tests that other WriteStream implementers could use as a foundation for their test suites, like we do with abstract-leveldown.

Hmm, so with replication modules, I've usually needed more exact schemantics,
so I've implemented my own write-stream that enforces things like updates a sequence number in a controlled way.

I think this is really about bulk loads. We need more research in this area, and test cases.
@maxogden you have the most experience here, can you explain, or link us to the techniques needed?

This is something that could benefit from test cases that could be reused across different writestream implementations.

I'm in favor of this in principle, but I think we should hold off until we know a bit more about the problems.

@dominictarr at the end of the day bulk loading will depend on the backend being used. what I have here https://gist.github.com/maxogden/6551333 would probably be fine for all levelup instances as long as they are writing to leveldown, but it might not make sense for other backends

well, something has to be done asap, if we're going to hold off on this then we need to get a streams2 WriteStream in to core for one of the next couple of releases, the disparity is causing some people pain. I'll work on reconciling the existing PRs or will come up with a recommendation for which one to proceed, perhaps opting for the simplest option for now, or I might even revisit my original PR for this!

๐Ÿ‘

For the following reasons:

  • leveldb does not really support streaming puts or batches therefore the streaming API is just a convenience
  • same is true for read stream, but it's more obvious on how to iterate over the range and most likely no alternative solutions are needed
  • writing in a performant manner seems to be hard to get right (see @maxogden's work and hyperlevel), while reading seems to be not an issue
  • therefore write streams will need to be tailored to the particular use case, so it is more likely that custom ones need to be used in order to achieve good performance

Therefore it makes sense to not include a writestream bundled with levelup, but instead allow them to be added as a wrapper and/or mixin (i.e. ala level-sublevel).

I believe however it would be a good idea to provide a writeStream already mixed in with the level package in order to make things easier for newcomers.

Additionally lots of documentation relating to writeStream implementations will help to prevent people from getting too confused about this.

I'm not really sold on the "put it in level" idea, does level become our monolith then? Its original purpose was to make a default levelup/down package easier to install and use, what's the boundary for "easier to install and use"? What else might go in there?

The "put into level" part is not that important at this point IMO. The main point however is that I believe it is the right choice to remove writeStream from core.

If we make level this newcomer friendly - "ready to go" pack or just leave it to provide a default *down implementation is another discussion.

I have no strong opinions about that part either way, but suffice it to say that if I use levelup I usually install levelup and leveldown separately and hook them together myself.
I have more control then and can easily swap in leveldown-hyper or memdown for testing and/or performance.

-1 "level becomes our monolith".

ok, so there are a couple of things going on here, it sounds like some of you guys are actually using WriteStream and others are arguing on the basis of symmetry and what we think would be worthwhile for newbies in terms of a "complete" API.

TBH, I really have little use for WriteStream, I prefer batch when I have multiple operations and don't tend to need streaming writes for anything. So, if you are using WriteStream and want to keep it in core then make your case now. If you don't use WriteStream then your non-use should be an agreement that we should remove it.

+1 to get rid of wireStream. Just bump the major version and move on. Somebody will publish a module to do var writeStream = levelWriteStream(db) if it's really needed.

I'd still prefer if it stayed, but if it goes away life will go on.

I actually was playing with writing my own read/write streams directly atop LevelDOWN this weekend and if writeStream support went away tomorrow I wouldn't be far from publishing a stand-alone library for it.

The symmetry case is nice for the concept of "levelup is a node database with stream support" streams out, streams in. Showcasing Streams is great for Node, and great for the prototypical nodebase. Sure that support could come from an external library, but It seems to fit the bill for a core feature. I may be biased here because I almost entirely use this library in a streaming manner. Given we iron out the pull requests does the marketing value outweigh the support cost?

One hypothetical reason to keep it that I just thought of is: current back-ends only wrap put/del/batch primitives for writeStream. Will there ever be a LevelDOWN implementation (on LevelDB or some other back-end) that could be designed to optimize for writeStreams? I don't know if this is likely, but there certainly are databases that are optimized heaver for writes than reads. It seems possible, and if that were to happen would we want to pull the writeStream interface back into levelup?

In terms of the use-case argument: For me, 85%+ of the time I'm using writeStream is the simple case where the current implementation is fine. E.g. readStream.pipe(transform).pipe(writeStream) where I'm only working with a small-moderate amount of records. The other 15% of the time falls in the data-loading category where the current model has problems. However with @maxogden 's work and my current LevelDOWN-direct pure-buffer writestream experimentation I'm happy sourcing external solutions for those problems.

+1 to get rid of writeStream, batch is great and we can abstract on top of that.

19h commented

We're using writeStream for in-place duplications of databases on our HadoopFS where we just spawn a readStream and pipe it to an arbitrary writeStream. This applies to the same context of @brycebaril's readStream.pipe(writeStream) โ€” I find it too simple as an API for it to be removed.

http is only in node core because it's used so frequently

(not an exact quote)

I have to agree with @KenanSulayman and @brycebaril here. The writeStream is too simple (IMO) to just be removed. While I do agree with the idea of retaining the smallest possible core, I think the writeStream/readStream combo is too powerful for on boarding newcomers into the level world. With the couple presentations I have given on leveldb and levelup, I have found the streaming examples to be the most powerful. It shows how we can read, transform and write to our database with a simple readStream.pipe(transform).pipe(writeStream). Its an excellent way to get people more familiar with node streams as well as node databases with level.

Is there anyway to give an estimate on a maintenance cost on keeping some sort of generic writeStream in levelup or in level?

It might be a little early to use on-boarding/noob-friendliness as an argument for a poorly performing api. I also think we agree on maximizing developer happiness while providing actual value.

Streams are a simple abstraction, is that an argument in itself? It could be. But there is a generic/primitive. Batch and Put can be used to create simple streaming abstractions in user-land; so there can be readStream | transform | require('writeStream')()

I'm still thinking Rule of Simplicity in this case.

If write stream is too heavy / complex and we should use batch or user land instead then isn't read stream also too heavy / complex and we should use the iterator or user land instead.

I think the iterator in level down is a nice simpler API.

@Raynos read stream doesn't suffer from the same problems, but yes the iterator api is nice ;)

@hij1nx that is fair. The value I see is that when performance is not yet in the equation (which usually does not come until later), it becomes a game of "which writeStream do I use?" While this could be "fixed" with documentation of writeStream implementations, it becomes another barrier to use. While I still do dislike the idea of having a poorly performing api in the core lib, the people who are concerned about performance will know to use the other implementations or will write one for their specific use case.

I believe the question comes down to how much "cruft" does the writeStream api endpoint really add? While I don't believe it adds enough to warrant a full removal, we are in a position where we are all giving our subjective opinions. How far will the Rule of Simplicity take us? ;)

The "rule of simplicity" will take us all the way; the lesson we're learning here is that the core needs to be small and only have the primitives needed to to build more complex features. The more we can encourage activity, innovation and even competition on top of levelup the more win for everyone.

I do agree that there is a certain beautiful simplicity in having a symmetrical API but it's still just sugar on top of primitives. Perhaps there is a case for keeping it in level but I'm strongly leaning towards removing it from levelup and it would seem that there is enough of a consensus here to go forward with that.

I'm not sure how I feel about the iterator/readstream issue though, that might be a separate question than the one we're dealing with here as it's more about what the exact nature of the primitives we're exposing is, not whether we want to call something a primitive.

@rvagg I think you've nailed exactly what is dividing the issue here -- primitives vs sugar.

The way I see it, LevelDOWN (et al.) provide the primitives, and LevelUp provides the node goodness on top.

Certain primitives get less sugar (e.g. put/get/del/batch mostly get smart content type conversion), but then the iterator is completely wrapped up into a ReadStream. Looking at it this way, WriteStream makes sense in the library -- it is adding a little node sugar on top of the primitives.

I don't think this prevents or discourages community innovation, to expose it like this.

However, I also don't want to come off to strongly against removing WriteStream. While I think it should stay, it has one inherent problem that honestly is going to be hard to help people be clear and informed about inside or outside of levelup.

A WriteStream for data loading or bulk writing is NOT the same WriteStream you want for smaller scale or low throughput writing (and visa versa). If you're doing bulk writing, there is no real magic bullet when it comes to loading the data and while options can be presented, it is going to be up to the user to decide which one will best suit them. A simple WriteStream may be suitable now, but will it still be once their database or data source grows past a certain point? When you start that's hard (to impossible) to know, and the tipping point is case-by-case and can't be automatically ascertained by LevelUP or any other library.

Since it looks like it is being removed, here's in my opinion, what needs to exist in the ecosystem:

  • level-bulk-load (some great work by @maxogden here)
  • level-bulk-writestream (I've done some unpublished pilot work here, about a 75% speed, 200+% memory footprint improvement from the core WriteStream for readstream.pipe(writestream))
  • level-writestream (Some simplified incarnation of the current WriteStream.)

I am moderately busy the rest of this week building a couple presentation slide decks, but I'm happy to implement/help with/provide input on any of these efforts.

Here's the real reason WriteStream has to go: it's a sink for endless argument, discussion and tweaking to suit many use-cases and none of that is to do with "core", or what's emerging as "core" in the Level* space.

I'm happy for us to continue to maintain a WriteStream as a group, even if just a simple implementation that serves the basic use-case but it has to get out of LevelUP so we don't pollute discussion about core with issues that are only about WriteStream. It's a distraction and needs to be partitioned.

So, now we have level-packager that is bundling the various level-* packages that combine LevelUP and the backend packages, we could very easily include a WriteStream with the shippable level-*s. It's not core but I think we may be agreeing that it's fairly important to the usability of LevelUP. So how about we pull it out, put it in to the Level org and bundle it with level-packager so you get one if you do something like npm install level, but discussion about what's best can be restricted to that repo and those who care enough about getting the general case right can contribute there and work on it, those who care about specific cases can maintain their own which provide drop-in replacements.

I am convinced. Mr. Vagg, tear down this wall.

say hello to level-ws, a basic writestream for levelup:

var level = require('level')
var levelws = require('level-ws')
var db = level('/path/to/db')

db = levelws(db)
db.createWriteStream() // ...

also should be able to be used like this if preferred:

var level = require('level')
var WriteStream = require('level-ws').WriteStream
var db = level('/path/to/db')

new WriteStream({ options }, db)....

It's streams2, I just took the implementation that @substack did in #177 because it's the simplest starting point.

@pgte has level-writestream, I think @maxogden and/or @brycebaril should publish their own too to match their particular use case(s) and we can document them in level-ws and also levelup somewhere.

I'm happy to bundle level-ws with level-packager so they get it when they use level, level-hyper, level-basho, level-lmdb and level-mem, unless others object to doing this?

Next step is a PR to remove it from LevelUP, I don't think we have any killer arguments against not removing it from core, I convinced myself yesterday with the argument I posted in my last comment above and I think the rough consensus here is to remove it.

nice!

19h commented

With a working drop-in in place there's no reason not to let go anymore. Go ahead!

+1 @rvagg I think its a good fit to bundle it with level-packager.

It seems concensus have been reached on removing write-stream. Closing.

The one missing thing here for someone who has done streaming with NodeJS but is still puzzled over the alternatives is a discussion what the strengths and weaknesses of some WriteStream implementations are and what to look out for. Any pointers?