singer-io/tap-postgres

Support for large transactions

Opened this issue · 2 comments

idris commented

Our integration (using Stitch, but anyone using tap-postgres would experience this) keeps hitting an out-of-memory error when it encounters a large transaction.

There are a few ways around this:

  1. For us, a specific table causes the problem, and we're not consuming anything from that table anyway, so provide a way for us to specify the filter-tables or add-tables parameter for wal2json. From Stitch, I'd imagine add-tables would be populated by the actual "tables to replicate" selected in the UI.
  2. Use write-in-chunks=true or format-version=2 to get around the issue; these are the solutions officially suggested by the wal2json maintainers. format-version=2 seems like the easier/preferred route; however, it's a fairly recent option and may not be widely supported yet, so write-in-chunks may be the way to go (see the sketch after this list).
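
For illustration, here's a rough sketch of how those wal2json options can be passed through to the output plugin when consuming a replication slot with psycopg2. This is not tap-postgres's actual code; the slot name, table names, and connection string are placeholders:

```python
import psycopg2
import psycopg2.extras

# Placeholder connection details; tap-postgres builds its own connection from config.
conn = psycopg2.connect(
    "dbname=mydb user=replicator",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()

# Options in this dict are handed straight to wal2json:
# - 'add-tables' restricts the stream to the tables actually selected for replication
# - 'write-in-chunks' makes wal2json emit large transactions in pieces instead of
#   one huge JSON document (alternatively, 'format-version': '2' on newer wal2json)
cur.start_replication(
    slot_name="stitch",  # placeholder slot name
    decode=True,
    options={
        "add-tables": "public.orders,public.customers",  # placeholder tables
        "write-in-chunks": "1",
    },
)

def consume(msg):
    # Each payload is now a manageable chunk rather than the entire transaction.
    print(msg.payload)
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(consume)
```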
idris commented

FWIW, filter-tables/add-tables would also provide some cost/bandwidth savings because you wouldn't be streaming a bunch of extra data that you're just going to throw out.

Just to add one more workaround (though it won't help with full-table syncs), I'm just about to implement the undocumented incremental_limit from tap-mysql (see https://github.com/singer-io/tap-mysql/blob/master/tap_mysql/__init__.py#L653) in our fork.

What it does is set a ceiling on the number of rows queried and emitted in a single run, so there's a fixed upper bound on query size. In our case this is the only way to deal with it, as Postgres on Google Cloud SQL doesn't have wal2json support, and the target doesn't play well with massive chunks either.
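
The idea is roughly the following (a simplified sketch, not the actual tap-mysql or tap-postgres code; the `incremental_limit` config key and the helper structure are assumptions based on the linked tap-mysql code):

```python
import psycopg2
import psycopg2.extras
import singer

def sync_incremental(conn, config, table, replication_key, bookmark):
    """Hypothetical incremental sync with a per-run row ceiling."""
    limit = config.get("incremental_limit")  # e.g. 100000; assumed config key

    sql = (
        f"SELECT * FROM {table} "
        f"WHERE {replication_key} >= %s "
        f"ORDER BY {replication_key}"
    )
    params = [bookmark]
    if limit:
        # The ceiling: a single run never queries or emits more than this many rows.
        sql += " LIMIT %s"
        params.append(int(limit))

    with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
        cur.execute(sql, params)
        for row in cur:
            singer.write_record(table, dict(row))
            bookmark = row[replication_key]

    # The bookmark is persisted so the next run resumes where this one stopped.
    singer.write_state({table: bookmark})
```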