pacman82/odbc2parquet

Enable compression options as argument

chrisfw opened this issue · 13 comments

Hello,

I was excited to use your utility to export some Microsoft Access tables to Parquet, but unfortunately it appears that no compression is applied, so the file size is much larger than I anticipated. I am not sure whether compression is something you are considering adding, but it would be great if it could be implemented and the command line accepted parameters for compression options.

Regards,
Chris Whelan

Hello Chris,

I am happy you are excited about this tool. Sorry it does not seem to do the job right now; I am trying to understand what is missing, though. The query subcommand does offer a --column-compression-default option, which defaults to gzip.
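For reference, an invocation could look roughly like the following. The connection string, file names and query are placeholders for your setup, and the exact argument order is best confirmed with odbc2parquet query --help:

odbc2parquet query \
    --connection-string "Driver={Microsoft Access Driver (*.mdb, *.accdb)};Dbq=C:\data\mydb.accdb;" \
    --column-compression-default gzip \
    out.par "SELECT * FROM mytable"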

Cheers, Markus

See issue #51

Thanks for your response @pacman82. Apologies, I didn't see that option initially. It turns out the initial batch size (1) was causing the large output file in spite of the gzip compression that, as you indicated, is applied by default. I tried a number of batch sizes that failed with memory allocation errors, but once I found a size that worked, the resulting file was much smaller and more manageable.
Some of my initial difficulties may stem from testing the tool with the MS Access ODBC driver.

FYI, below is an example of the errors I was seeing.

2021-07-11T10:17:52-04:00 - WARN - State: 01000, Native error: 0, Message: [Microsoft][ODBC Driver Manager] The driver doesn't support the version
of ODBC behavior that the application requested (see SQLSetEnvAttr).
2021-07-11T10:17:52-04:00 - INFO - Memory usage per row is 1073741931 bytes. This excludes memory directly allocated by the ODBC driver.
memory allocation of 80530636800 bytes failed

Thanks again for your assistance.
Chris

Hello Chris,

Yeah, the tool chooses the batch size so that it does not exceed 2 GB on a 64-bit platform. The --batch-size-mib option might also help you find the right size for your system, since you likely know how much memory you want to spend on each batch. I fully intend to add a row-by-row mode, which sacrifices performance for the ability to stream large values from the database. However, I would need to dedicate some time to this, and my life is pretty busy right now, so there is no timeline on that.
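To limit memory explicitly, a call along these lines should work. The value and connection string are just illustrations; see odbc2parquet query --help for the exact option spelling:

odbc2parquet query \
    --connection-string "Driver={Microsoft Access Driver (*.mdb, *.accdb)};Dbq=C:\data\mydb.accdb;" \
    --batch-size-mib 512 \
    out.par "SELECT * FROM mytable"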

Thanks for giving this tool a spin, and I am happy it worked for you after all.

Cheers, Markus

Thanks for following up, Markus, and for making this helpful utility. The 2 GB default limit makes sense. I do have a problem importing into DuckDB (into a table, or creating a view over) any file exported by odbc2parquet with a batch size > 1, but I am not sure where the problem lies at this point. DuckDB reports the following error:

>>> con.execute("CREATE TABLE chist as SELECT * from parquet_scan('/Users/cwhelan/Downloads/outcompressed.par')")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: INTERNAL Error: Invalid unicode detected in segment statistics update!

This may well be an issue on the DuckDB side, but I thought I'd make you aware of it in case you have any ideas or have seen this problem before.

Regards,
Chris

@pacman82, I installed parquet-tools and cat-ting the problematic file shows the output below. Could the device fields with the \u0000 values be causing the issue?

cwhelan@Christophers-MacBook-Pro Downloads % parquet-tools cat --json outcompressed.par  
{"application":3,"accountnumber":112293300,"readsequence":2,"device":"035247N02002"}
{"application":3,"accountnumber":112294300,"readsequence":1,"device":"035446N02002"}
{"application":1,"accountnumber":112291000,"readsequence":2,"device":"035446N02002"}
{"application":1,"accountnumber":112291000,"readsequence":3,"device":"030216H04001"}
{"application":1,"accountnumber":112291000,"readsequence":4,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}
{"application":3,"accountnumber":112292300,"readsequence":1,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}
{"application":3,"accountnumber":112292300,"readsequence":2,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}
{"application":1,"accountnumber":112292000,"readsequence":3,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}
{"application":1,"accountnumber":112292000,"readsequence":4,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}
{"application":1,"accountnumber":112292000,"readsequence":5,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}
{"application":1,"accountnumber":112292000,"readsequence":6,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}
{"application":3,"accountnumber":112293300,"readsequence":1,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}

When I cat the file produced with batch size = 1, all of the device values are populated with actual values.

Thanks for making me aware, but since it says INTERNAL error I am leaning towards an issue within DuckDB. However, you may want to inspect the schema you are exporting carefully; verbose mode might help. Check whether, for example, some columns have a ridiculously large column size. If so, try to cast them down to something smaller in the query. 2 GB per row sounds like a lot; you may get away with far less, and DuckDB is then more likely to be able to process the file.

Right now the tool just uses the schemas reported by ODBC, but they might not always be sensible.
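As a sketch of what such a cast could look like: the table name below is made up, the column names are taken from your parquet-tools output, the connection string is a placeholder, and whether CAST(... AS VARCHAR(n)) is accepted depends on the SQL dialect your Access driver supports:

odbc2parquet query \
    --connection-string "Driver={Microsoft Access Driver (*.mdb, *.accdb)};Dbq=C:\data\mydb.accdb;" \
    out.par "SELECT application, accountnumber, readsequence, CAST(device AS VARCHAR(50)) AS device FROM mytable"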

Sorry, I had not seen your latest post yet when I wrote my last response.

If the file already contains the NULs before catting, please open a bug.

@pacman82, thanks for the info, I will investigate the schema using verbose mode as you suggested. I suspect you are correct and the schema provided by the driver may be causing the problem. If that proves true, and I still have issues after casting to more reasonable column sizes and the NULs are still present in the output file, I'll create a bug report per your instruction.

FYI: NULs in the output are allowed, as long as they also appear in the input. That doesn't seem to be the case here, though. See: #36

Thanks for keeping me in the loop

@pacman82 You were correct. It was a schema issue, caused in this case by using the Long Text data type in MS Access. Once I changed all of the Long Text columns to Short Text columns, my problems, both on the generation side with batching and compression and on the consumption side with the DuckDB import failure, were resolved. Thanks for all of your assistance.

@chrisfw You are welcome.

Cheers, Markus