Provide more configuration for avro extraction
Closed this issue · 3 comments
It would be good to expose more configuration to extract data for dbeam:

- `--fetchSize`: configure the `jdbcStatement` `fetchSize`
- `--avroCodec`/`--codecCompression`: configure the avro codec
@labianchin The "ResultSet" section of the MySQL JDBC documentation suggests even the existing `setFetchSize` is ignored unless `?useCursorFetch=true` was specified on the connection string.
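To illustrate that point, here is a minimal sketch of building such a connection string; `buildUrl` is a hypothetical helper invented for this example, not part of dbeam or Connector/J:

```java
// Sketch: MySQL Connector/J only honors a positive fetch size when the
// connection string carries useCursorFetch=true; otherwise setFetchSize
// values (except Integer.MIN_VALUE) are silently ignored.
public class CursorFetchUrl {
    // Hypothetical helper that appends the cursor-fetch flag.
    static String buildUrl(String host, String db) {
        return "jdbc:mysql://" + host + "/" + db + "?useCursorFetch=true";
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("localhost", "mydb"));
    }
}
```

With a URL like this, a subsequent `statement.setFetchSize(10000)` would stream rows in batches instead of buffering the whole result set.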
Can dbeam be used to dump an entire MySQL database? I'm interested in dumping 20M+ row tables (`mysqldump --tab` output isn't 100% compatible with the BigQuery CSV parser), and in how to modify the `SELECT * FROM <table>` to partition by the auto-increment primary key and run in parallel, which would hopefully also avoid any OOM related to `fetchSize`.
@hienle-hps apologies for the late response.
> I'm interested in dumping 20M+ row tables

I don't think that would be a problem. We've been exporting a 60M+ row PSQL view with joins and 400M+ row tables (with no joins).
> am interested in how to modify the `SELECT * FROM <table>` to partition by the auto-increment primary key and run in parallel

There isn't automatic support for that yet. Sqoop supports it by running `SELECT MAX(id), MIN(id)` and partitioning on the result. If you still want to run exports in parallel, I suggest you create N views like `SELECT * FROM table WHERE id BETWEEN 1 AND 10000000` and N dbeam exports (one for each view).
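The Sqoop-style splitting described above can be sketched as follows; `IdRanges.split` is an illustrative helper (not part of dbeam or Sqoop), which divides the `[min, max]` id interval into N contiguous ranges, one per parallel export or view:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of range partitioning on an auto-increment primary key:
// split [min, max] into n contiguous ranges of roughly equal size.
public class IdRanges {
    static List<long[]> split(long min, long max, int n) {
        List<long[]> ranges = new ArrayList<>();
        long span = (max - min + n) / n;  // ceiling of (max - min + 1) / n
        for (long lo = min; lo <= max; lo += span) {
            ranges.add(new long[] {lo, Math.min(lo + span - 1, max)});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // One view / dbeam export per range.
        for (long[] r : split(1, 10_000_000, 4)) {
            System.out.println(
                "SELECT * FROM table WHERE id BETWEEN " + r[0] + " AND " + r[1]);
        }
    }
}
```

Each emitted `SELECT` could back one of the N views, so the exports run independently without overlapping rows.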
> which would hopefully also avoid any OOM related to fetchSize

DBeam should not run into OOM issues. Can you configure the database connection with `?useCursorFetch=true` as you described?
I ended up using the streaming value `Integer.MIN_VALUE` to avoid the OOM, and kept the CloudSQL and small Dataflow instances in the same region/zone to minimize latency.
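For context, a small sketch of the two streaming options Connector/J offers; `chooseFetchSize` is a hypothetical helper for illustration, not dbeam API:

```java
// Sketch: with ?useCursorFetch=true, a positive fetch size streams rows
// in batches; without it, Connector/J streams row-by-row only when the
// fetch size is exactly Integer.MIN_VALUE (on a forward-only, read-only
// statement). Any other value buffers the full result set in memory.
public class StreamingFetch {
    // Hypothetical helper picking the fetch size for each mode.
    static int chooseFetchSize(boolean cursorFetch, int batch) {
        return cursorFetch ? batch : Integer.MIN_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(chooseFetchSize(true, 10000));   // batched streaming
        System.out.println(chooseFetchSize(false, 10000));  // row-by-row streaming
    }
}
```

The second mode is the `Integer.MIN_VALUE` workaround mentioned above: it avoids the OOM without changing the connection string, at the cost of one row per network round trip.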