dscape/clarinet

Implement pause()

Closed this issue · 5 comments

Hi,
I'm trying to parse a large JSON file without blowing up my RAM, and this seems like a good solution. Specifically, I want to read each object, restructure it, and write it to its own file, then move on to the next object.

However, the stream doesn't appear to support the native stream.pause() method. So it can't wait for the file to write before it keeps reading new objects. In your example code, you seem to be implementing a buffer/stack to capture the objects as they're parsed - but that kind of defeats the purpose of reading them one at a time and not using too much memory.

Is there an intrinsic reason Clarinet doesn't support pause(), or is it something that could be potentially added?

Thanks!

I'm open to a solid pause() implementation. Please do send a PR.

That said, this shouldn't be necessary to avoid "blowing up your RAM", since clarinet is a streaming parser and it forgets really fast :)

Do you have Skype? I'm nunojob, would love to see clarinet blowing up your memory!

I'm running one of the benchmark tests in a loop and memory seems as stable as it gets.

We need to work together on reproducing this, or you can send me a .js file that reproduces this case.

Haha, thanks for the super-quick reply.

Perhaps I don't understand how streams work. For each object read from my file, I have to spawn a process to do some outside logic, then write the results to a file, so it can take a little while. In that time the stream could have read the entire source file. If I'm buffering the results while waiting for the write process to finish, that buffer could easily be as big as the entire source file. In which case, what's the advantage of using a stream?

Thanks

(To be clear: spawning a process has nothing to do with streams, it's just what I need to do for each object. Hence it would be good to pause the stream, do that slow work, and resume the stream.)

pgte commented

It doesn't make sense to pause the parser for your case. The parser is a write stream, and you don't pause a write stream.

Let me see if I get this right:

You have this piping:

readStream -> clarinet parser -> writeStream

Your problem is that the parser will keep outputting events and you are somehow flushing these events to the writeStream and they get buffered to be written out.

This is problematic if the write stream is slower than the read stream, buffering up. Is this right?

If so, you need to pause the readStream while the writeStream is not flushed so you don't buffer up too much.
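A minimal sketch of that idea in Node.js, assuming the source is a readable stream with pause() and resume() and the destination is a writable stream whose write() returns false when its buffer is full and which emits 'drain' once it has flushed. The helper names forwardWithBackpressure and connect are made up for illustration and are not part of clarinet:

```javascript
// Forward one chunk from `source` to `sink`.
// If `sink` reports that its buffer is full (write() returns false),
// pause `source` and resume it only after `sink` emits 'drain'.
function forwardWithBackpressure(source, sink, chunk) {
  const ok = sink.write(chunk);
  if (!ok) {
    source.pause();
    sink.once("drain", () => source.resume());
  }
  return ok;
}

// Hypothetical wiring: every chunk read from `source` goes through
// the helper above, and `sink` is closed when `source` ends.
function connect(source, sink) {
  source.on("data", (chunk) => forwardWithBackpressure(source, sink, chunk));
  source.on("end", () => sink.end());
}

// Example usage (file names are placeholders):
//   const fs = require("fs");
//   connect(fs.createReadStream("in.json"), fs.createWriteStream("out.json"));
```

The same pattern should apply with a parser in the middle: the pause() and resume() calls go on the stream that reads the file, not on the parser. Events for a chunk that has already been handed to the parser will likely still fire, so memory use is bounded by the chunk size rather than dropping to zero.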

Makes sense?