brianc/node-pg-query-stream

Bad data paging / streaming

Closed this issue · 1 comment

According to the Node.js documentation for readable streams: https://nodejs.org/api/stream.html#stream_readable_push_chunk_encoding

We are supposed to call the push method on the stream with the entire chunk of data, so that data is delivered downstream at the same pace it arrives.
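For illustration, here is a minimal objectMode readable (toy data, no database involved) where each push delivers a whole array of rows as a single chunk:

const { Readable } = require('stream')

// Each push() queues one chunk; in objectMode a chunk can be any value,
// including an array of rows.
const source = new Readable({
  objectMode: true,
  read() {
    this.push([{ id: 1 }, { id: 2 }, { id: 3 }]) // one page pushed as one chunk
    this.push(null)                              // end of stream
  },
})

source.on('data', (page) => console.log('received a page of', page.length, 'rows'))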

But the current implementation, for some reason, chops each chunk into individual records here: https://github.com/brianc/node-pg-query-stream/blob/master/index.js#L55

for (var i = 0; i < rows.length; i++) {
  this.push(rows[i])
}

This creates the following problems:

  1. The data is delivered at a different pace than it arrives: a data event is fired for every single row rather than for each page, which makes it impossible to synchronize data processing and scaling with the rate at which the data arrives.

  2. The extra loop wastes a lot of CPU time, because every row gets its own separate processing, including firing a data event per row.

  3. It makes it impossible for the client to handle pages of data efficiently, forcing it to work row by row and counteracting the scaling benefits that streams otherwise provide.

Instead, the code should throw that loop away and just do:

this.push(rows);

This would also give meaning to the batchSize parameter we can pass, which in the current implementation doesn't affect the client side at all.

The purpose of this module is to provide a stream of rows from the database, where each row is an item on the stream. Doing what you suggest would fundamentally change this module to behave in a way it hasn't since its inception. If you want to push a chunk of rows at once onto a stream, your best bet is to implement your own module on top of pg-cursor. You could even fork this, change its behavior, and rename it - the power of open source!

  1. Node streams use back pressure with high and low water marks internally to balance the rate at which data arrives with the rate at which data is consumed from the stream. This stream is no different, and will not consume more from the cursor than required.

  2. Doing a for loop over 100 items stalls a lot of the CPU time? 😑 Again, the point of this module is to have each row be its own item in the stream, so pushing items one at a time is the way to go here.

  3. If you want a stream of pages or chunks, I suggest implementing a different stream on top of pg-cursor which doesn't push one row at a time - a rough sketch is below. Should be a pretty quick one to bang out!
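For anyone who does want pages, here is a rough, untested sketch of such a stream built on top of pg-cursor; the PageStream name and its batchSize argument are made up for illustration:

const { Readable } = require('stream')
const Cursor = require('pg-cursor')

// Wraps an already-submitted pg-cursor and pushes one array of rows per read.
class PageStream extends Readable {
  constructor(cursor, batchSize = 100) {
    super({ objectMode: true })
    this.cursor = cursor
    this.batchSize = batchSize
  }

  _read() {
    // Node will not call _read again until push() is called, so cursor reads
    // stay serialized and back pressure is respected.
    this.cursor.read(this.batchSize, (err, rows) => {
      if (err) return this.destroy(err)
      this.push(rows.length ? rows : null) // empty batch means the cursor is exhausted
    })
  }
}

// Usage, assuming a connected pg client:
//   const cursor = client.query(new Cursor('SELECT * FROM big_table'))
//   const pages = new PageStream(cursor, 500)
//   pages.on('data', (rows) => { /* handle one page of rows */ })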

Batch size affects how many rows are read at a time from the cursor. This has an impact on back pressure and on the amount of data kept in memory while the stream is flowing.
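For reference, a minimal usage sketch showing where batchSize is passed, assuming the QueryStream(text, values, options) constructor documented in this module's README:

const pg = require('pg')
const QueryStream = require('pg-query-stream')

const client = new pg.Client()
client.connect((err) => {
  if (err) throw err
  // batchSize controls how many rows are fetched from the cursor per read;
  // the stream still emits one row per 'data' event.
  const query = new QueryStream('SELECT * FROM generate_series(0, $1) num', [9999], { batchSize: 500 })
  const stream = client.query(query)
  stream.on('data', (row) => { /* one row at a time */ })
  stream.on('end', () => client.end())
})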