pacman82/odbc2parquet

Issue with MySQL JSON columns

theOvercloud opened this issue · 8 comments

I am trying to export a table which has a json column type. However, it seems like the tool is unable to handle this currently.

The debug output of the column in question is as follows:

2023-02-02T10:37:12+01:00 - DEBUG - ODBC column description for column 5: ColumnDescription { name: [<removed>], data_type: LongVarchar { length: 4294967295 }, nullability: Nullable }
2023-02-02T10:37:12+01:00 - DEBUG - ODBC buffer description for column 5: Text { max_str_len: 4294967295 }

As you can see, it assumes a very large string length for the column. When it then calculates the memory usage per row, it gets a very large value. The memory limit option does not help, as a single row is already in the gigabyte range. When using a row limit, it tries to allocate huge buffers, which also fails. For example, with 1000 rows:

2023-02-02T10:37:12+01:00 - INFO - Memory usage per row is 4294970028 bytes. This excludes memory directly allocated by the ODBC driver.
2023-02-02T10:37:12+01:00 - INFO - Batch size set to 1000 rows.
memory allocation of 4294967296000 bytes failed

Is there a workaround for that, or is this simply a limitation of parquet itself?

Hello @theOvercloud ,

thanks for the excellently written issue. To my own surprise, odbc2parquet (unlike odbcsv) does not yet offer a flag to limit the size of an individual column. You are probably not the first person to run into this issue, but you are kind of the first to ask for it.

In terms of workarounds: could you try casting the JSON column to a VARCHAR(4096) column or the like, in order to limit the size? You would of course need to apply some domain knowledge about how big your JSON values actually get, in order to avoid truncation.
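For illustration, a sketch of what such a query could look like (table and column names are placeholders, the exact odbc2parquet invocation may differ from your setup, and the length should come from your own knowledge of the data):

# Cast the JSON column to a bounded CHAR so the ODBC buffer stays small.
query="SELECT id, CAST(json_col AS CHAR(4096)) AS json_col FROM my_table"
odbc2parquet query --connection-string "$connection_string" out.parquet "$query"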

This is more a limitation of ODBC than of parquet. For bulk (i.e. fast) fetching of rows, ODBC requires you to allocate the buffers beforehand. Yet at that point in time you do not know the lengths of the actual values (otherwise they would already have been fetched), so you have to assume the largest possible element.

Hope this answer is as helpful to you as your issue is to me. Otherwise feel free to ask again.

Best, Markus

Hallo Markus,

thanks a lot for your fast feedback! Good idea! I will certainly try that out. Hopefully the conversion does not introduce issues when later inserting the data into another database. I will report back :)

Cheers,
Yanick

I was able to get it running using your suggestion. Thanks!

However, I noticed that the driver caches all the batches before passing them to Rust. You can see that when comparing the network traffic to the log output of the script. So I would still run into memory issues. The fix was to pass the driver argument NO_CACHE=1; (see the MySQL docs) in the connection string. With that, the batches were passed directly to Rust and it worked as expected. Can you see the same behavior? Would it make sense to add this to the documentation, or even to add this argument by default? I am using the MySQL ODBC 8.0 Unicode Driver on Ubuntu.
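For reference, a minimal sketch of how NO_CACHE=1 fits into such a connection string (server, port and credentials are placeholders):

connection_string="Driver={MySQL ODBC 8.0 Unicode Driver};Server=$db_ip;Port=$db_port;UID=$username;PWD=$password;NO_CACHE=1;"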

Short update on the JSON front: as I said above, I was able to successfully export the table using cast(<column_name> as char(..)) as <column_name>. However, when importing, I got the following error:
2023-02-08T15:57:24+01:00 - WARN - State: HY000, Native error: 3144, Message: [MySQL][ODBC 8.0(w) Driver][mysqld-8.0.31-0ubuntu0.22.04.1]Cannot create a JSON value from a string with CHARACTER SET 'binary'.
I was able to fix that by setting the charset via initstmt when connecting to the database. My full connection string for inserts now looks like this:

connection_string="Driver={MySQL ODBC 8.0 Unicode Driver};initstmt={set charset utf8mb4;};charset=utf8mb4;Server=$db_ip;Port=$db_port;UID=$username;PWD=$password;NO_CACHE=1;FORWARD_CURSOR=1;COMPRESSED_PROTO=1;"

Hello Yanick,

thanks for your feedback. I am a bit underwater at the moment, otherwise I would have already implemented an option to limit the column size for querying. I'll leave this issue open at least until I can do that.

I am happy it works for you now, and there is some value in persisting that knowledge. However, a lot of your issues are specific to MySQL, and I am a bit hesitant to document driver-specific stuff, as I want to sit firmly on top of ODBC for the most part.

How about an odbc2parquet wiki here at GitHub with driver specific issues as a first step?

Thanks again for all your work and your feedback. I want to answer in greater detail, but I am pretty spent today.

Cheers, Markus

FYI: A new flag, column-length-limit, has just been added in the new version, odbc2parquet 0.15.2.
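A sketch of how it could be used, assuming the flag is passed as --column-length-limit with a maximum element length and the rest of the invocation matches the examples above:

# Limit variadic columns (such as the JSON/LongVarchar column) to 4096 characters per element.
odbc2parquet query --column-length-limit 4096 --connection-string "$connection_string" out.parquet "SELECT * FROM my_table"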

Hi Markus,

Same here;) Always too much on the plate...

FYI: A new flag, column-length-limit, has just been added in the new version, odbc2parquet 0.15.2.

I was not expecting that feature so soon, especially as you are doing this for free! Thanks a lot for the great tool and your work!

How about an odbc2parquet wiki here at GitHub with driver specific issues as a first step?

I agree. Would be great to have a knowledge base on that.

I think you could close the issue if you want.

Cheers,
Yanick

Hello Yanick,

thanks for the feedback. I have created a minimal homepage for the odbc2parquet wiki. Feel free to create a page there to share your knowledge about odbc2parquet and MySQL JSON. I have not used the GitHub wiki before, so tell me if I missed anything in case you want to give it a spin.

Closing this issue.

Cheers, Markus