ZJONSSON/node-etl

Missing implementation of `maxBuffer` in Postgres

dennismphil opened this issue · 9 comments

@willfarrell @ZJONSSON

The documentation states:

# etl.postgres.script(pool, schema, table [,options])

Collects data and builds up a postgres statement to insert/update data until the buffer is more than maxBuffer (customizable in options). When the maxBuffer is reached, a full sql statement is pushed downstream. When the input stream has ended, any remaining sql statement buffer will be flushed as well.

However, the Postgres code is missing the maxBuffer implementation.
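For reference, here is a minimal sketch of the buffering behavior the docs describe, written as a plain Node stream transform. This is hypothetical, not the library's actual code, and the class name, default, and inline quoting are all illustrative (real code must escape values properly):

```js
const { Transform } = require('stream');

// Hypothetical sketch of the documented maxBuffer behavior: accumulate
// value tuples into one INSERT statement until its length exceeds
// maxBuffer, push it downstream, and flush any remainder at stream end.
// NOTE: naive quoting for illustration only; real code must escape values.
class SqlBuffer extends Transform {
  constructor(schema, table, options = {}) {
    super({ objectMode: true });
    this.prefix = `INSERT INTO ${schema}.${table}`;
    this.maxBuffer = options.maxBuffer || 1024 * 1024; // assumed default
    this.sql = null;
  }
  _transform(record, encoding, cb) {
    const cols = Object.keys(record);
    const tuple = `(${cols.map(c => `'${record[c]}'`).join(',')})`;
    this.sql = this.sql
      ? `${this.sql},${tuple}`
      : `${this.prefix}(${cols.join(',')}) VALUES${tuple}`;
    if (this.sql.length >= this.maxBuffer) {
      this.push(this.sql); // buffer full: send the statement downstream
      this.sql = null;
    }
    cb();
  }
  _flush(cb) {
    if (this.sql) this.push(this.sql); // input ended: flush the remainder
    cb();
  }
}
```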

Thanks @dennismphil, it would make sense to add this. We've had good experience with the mysql implementation, as we can better calibrate the ETL load with maxBuffer (instead of by number of records).

Do you want to take a stab at this in a PR?
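For context, usage would presumably mirror the mysql variant, something like the sketch below. The 1 MB figure and the execute step are assumptions for illustration, not taken from the Postgres docs:

```js
const etl = require('etl');

// Assumed usage: cap each generated statement at ~1 MB instead of
// batching by record count, mirroring the mysql maxBuffer behavior.
recordStream
  .pipe(etl.postgres.script(pool, 'public', 'tmp', { maxBuffer: 1024 * 1024 }))
  .pipe(etl.postgres.execute(pool));
```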

Yes, I will take a stab at this. Please assign it to me.

@ZJONSSON Taking a look at the current code, I'm wondering if we should bring in a dependency like Squel.js for more readable code. Thoughts?

I generally like to avoid adding dependencies unless there is a good reason. Do you have an example of how the code above would look with Squel.js?

I agree that we should limit adding dependencies. If it cannot be avoided, I would recommend knex over squel, mostly because it supports more database engines and has a larger community.

We could use knex. I'll rewrite the existing code using knex in one file and let's see if you like the structure better. I'll let @ZJONSSON / @willfarrell veto it if it does not look like it adds clarity.
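As a rough illustration, a knex-based builder might look like the following, with knex used purely as a query builder (no connection configured). This is a sketch under those assumptions, not the proposed code:

```js
// knex can compile an INSERT without a live connection when only the
// client dialect is configured.
const knex = require('knex')({ client: 'pg' });

const rows = [
  { col_a: 'a1', col_b: 'b1' },
  { col_a: 'a2', col_b: 'b2' },
  { col_a: 'a3', col_b: 'b3' }
];

// Compiles to a single multi-row statement, roughly:
// insert into "tmp" ("col_a", "col_b") values ('a1', 'b1'), ('a2', 'b2'), ('a3', 'b3')
const sql = knex('tmp').insert(rows).toString();
```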

After working with knex for a while, I am not liking it because of all the workarounds needed to use it as a plain query builder. I will drop the library and use plain string concatenation instead. Will see how to organize it better.

Hi, I was reading about this topic and would be interested to see this implemented for Postgres. How are things going?

As far as I understand it, can you confirm that the change would look like the following?

from:
INSERT INTO tmp(col_a,col_b) VALUES('a1','b1')
INSERT INTO tmp(col_a,col_b) VALUES('a2','b2')
INSERT INTO tmp(col_a,col_b) VALUES('a3','b3')

to (as long as maxBuffer is not exceeded):
INSERT INTO tmp(col_a,col_b) VALUES('a1','b1'),('a2','b2'),('a3','b3')

Pros: faster performance

Yes, that's basically what maxBuffer is for, i.e. to capture as many values as possible up to a certain maximum and then send off the query.
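To make the threshold concrete, here is a small worked example with a hypothetical maxBuffer of 60 characters (the real default and exact threshold check may differ):

```js
// Statement grows as tuples are appended; once its length reaches the
// hypothetical 60-character cap, it is pushed downstream and a new one starts.
let sql = "INSERT INTO tmp(col_a,col_b) VALUES('a1','b1')"; // 46 chars: keep buffering
sql += ",('a2','b2')";                                      // 58 chars: still under 60
sql += ",('a3','b3')";                                      // 70 chars: cap exceeded -> push
```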