spotify/dbeam

Provide more configuration for avro extraction

Closed this issue · 3 comments

It would be good to expose more configuration options for dbeam's Avro extraction:

@labianchin The "ResultSet" section of the MySQL JDBC documentation suggests that even the existing setFetchSize is ignored unless ?useCursorFetch=true is specified in the connection string.
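For reference, here is a minimal sketch of what that looks like at the plain JDBC level; the URL, credentials, and table name are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class CursorFetchExample {
  public static void main(String[] args) throws Exception {
    // useCursorFetch=true makes MySQL Connector/J honor setFetchSize via a
    // server-side cursor instead of buffering the whole result set in memory.
    String url = "jdbc:mysql://localhost:3306/mydb?useCursorFetch=true";
    try (Connection conn = DriverManager.getConnection(url, "user", "password");
         PreparedStatement stmt = conn.prepareStatement("SELECT * FROM my_table")) {
      stmt.setFetchSize(10_000); // rows fetched per round trip
      try (ResultSet rs = stmt.executeQuery()) {
        while (rs.next()) {
          // process one row at a time; memory stays bounded by the fetch size
        }
      }
    }
  }
}
```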

Can dbeam be used to dump an entire MySQL database? I'm interested in dumping tables with 20M+ rows (mysqldump --tab output isn't 100% compatible with BigQuery's CSV parser), and I'd like to know how to modify the SELECT * FROM <table> to partition by the auto-increment primary key and run in parallel, which would hopefully also avoid any OOM related to fetchSize.

@hienle-hps apologies for the late response.

> I'm interested in dumping tables with 20M+ rows

I don't think that would be a problem. We've been exporting a 60M+ row PostgreSQL view with joins and 400M+ row tables (with no joins).

> I'd like to know how to modify the SELECT * FROM <table> to partition by the auto-increment primary key and run in parallel

There isn't automatic support for that yet. Sqoop does support it by running SELECT MIN(id), MAX(id) and partitioning on that range. If you still want to run exports in parallel, I suggest you create N views of the form SELECT * FROM table WHERE id BETWEEN 1 AND 10000000 and run N dbeam exports (one for each view); see the sketch below.
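As an illustration, here is one way those range-partitioned views could be created over JDBC; the table name, id column, partition count, and max id are all assumptions you would adjust for your schema:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreatePartitionViews {
  public static void main(String[] args) throws Exception {
    String url = "jdbc:mysql://localhost:3306/mydb";
    int numPartitions = 4;     // one dbeam export per partition
    long maxId = 20_000_000L;  // roughly SELECT MAX(id) FROM my_table
    long chunk = (maxId + numPartitions - 1) / numPartitions;

    try (Connection conn = DriverManager.getConnection(url, "user", "password");
         Statement stmt = conn.createStatement()) {
      for (int i = 0; i < numPartitions; i++) {
        long lo = i * chunk + 1;
        long hi = Math.min((i + 1) * chunk, maxId);
        // One view per id range; point one dbeam export at each view.
        stmt.execute(String.format(
            "CREATE OR REPLACE VIEW my_table_part_%d AS "
                + "SELECT * FROM my_table WHERE id BETWEEN %d AND %d",
            i, lo, hi));
      }
    }
  }
}
```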

> which would hopefully also avoid any OOM related to fetchSize

DBeam should not run into OOM issues. Can you configure the database connection to use ?useCursorFetch=true as you described?

I ended up using the streaming fetch size Integer.MIN_VALUE to avoid the OOM, and kept the Cloud SQL instance and the small Dataflow workers in the same region/zone to minimize latency.
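For anyone who finds this later, a minimal sketch of that streaming mode at the JDBC level (URL, credentials, and table name are placeholders): without useCursorFetch, MySQL Connector/J only streams rows one at a time when the statement is forward-only, read-only, and the fetch size is exactly Integer.MIN_VALUE.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class StreamingFetchExample {
  public static void main(String[] args) throws Exception {
    String url = "jdbc:mysql://localhost:3306/mydb";
    try (Connection conn = DriverManager.getConnection(url, "user", "password");
         PreparedStatement stmt = conn.prepareStatement(
             "SELECT * FROM my_table",
             ResultSet.TYPE_FORWARD_ONLY,
             ResultSet.CONCUR_READ_ONLY)) {
      stmt.setFetchSize(Integer.MIN_VALUE); // Connector/J row-by-row streaming mode
      try (ResultSet rs = stmt.executeQuery()) {
        while (rs.next()) {
          // rows arrive one at a time; memory stays bounded
        }
      }
    }
  }
}
```

One caveat with this mode: the connection is tied up until the result set is fully read or closed, so keeping the client close to the database (same region/zone, as above) matters for throughput.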