pacman82/odbc2parquet

Feature Request - Support column encryption in the generated parquet file

chrisfw opened this issue · 4 comments

Hi,

First, I'd like to thank you for creating and maintaining this great tool!

I'd like to ask if you would consider adding columnar encryption as an optional feature? It is supported by the Arrow libraries, and it would be really nice to be able to protect PII columns in the exported parquet file.

I haven't found many examples, but this one demonstrates the functionality in Python/Arrow with Vault as a KMS.

https://arrow.apache.org/docs/_downloads/2713f3cdaa3fc0dc691cd51bac09c6d4/sample_vault_kms_client.py
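To give a feel for it, here is a minimal sketch of the PyArrow column-encryption API that the sample builds on. The KMS client below is a toy stand-in (a real one, like the Vault client in the sample, would call out to an actual KMS), and the key identifiers, column names, and file name are all made up for illustration:

```python
import base64
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.parquet.encryption as pe

class ToyKmsClient(pe.KmsClient):
    """Toy KMS client for illustration only. A real deployment would
    delegate wrap/unwrap to an actual KMS such as Vault."""

    def __init__(self, config):
        super().__init__()
        # Master keys would normally live inside the KMS, never in code.
        self._keys = {"footer_key": b"0123456789012345",
                      "pii_key": b"1234567890123450"}

    def wrap_key(self, key_bytes, master_key_identifier):
        master = self._keys[master_key_identifier]
        # Not a secure cipher -- just XOR with the master key, for the demo.
        return base64.b64encode(bytes(a ^ b for a, b in zip(key_bytes, master)))

    def unwrap_key(self, wrapped_key, master_key_identifier):
        master = self._keys[master_key_identifier]
        raw = base64.b64decode(wrapped_key)
        return bytes(a ^ b for a, b in zip(raw, master))

crypto_factory = pe.CryptoFactory(lambda config: ToyKmsClient(config))
kms_config = pe.KmsConnectionConfig()  # connection details for a real KMS go here

# Encrypt the footer with one master key and the PII columns with another.
encryption_config = pe.EncryptionConfiguration(
    footer_key="footer_key",
    column_keys={"pii_key": ["ssn", "email"]},
)

table = pa.table({"id": [1, 2],
                  "ssn": ["111-22-3333", "444-55-6666"],
                  "email": ["a@example.com", "b@example.com"]})

file_encryption_properties = crypto_factory.file_encryption_properties(
    kms_config, encryption_config)
with pq.ParquetWriter("people.parquet", table.schema,
                      encryption_properties=file_encryption_properties) as writer:
    writer.write_table(table)
```

Reading the file back works the same way, with `crypto_factory.file_decryption_properties(kms_config)` passed as `decryption_properties` to the reader.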

Thanks,
Chris Whelan

Hello Chris,

thank you for the nice feedback.

I'd like to ask if you would consider adding columnar encryption as an optional feature? It is supported by the Arrow libraries [...]

First time I am hearing of this 😅. It may surprise you, but I am not using any actual Arrow arrays in this implementation. Rather, I pass the buffers I use for transfer from the ODBC data source as directly as I can to the parquet writer. For this I use the Rust parquet crate. In its documentation (https://docs.rs/parquet/33.0.0/parquet/) I see a lot about compression, but no options for encrypting columns. So, at the moment this feels out of scope for me.

However

I do actually maintain two artifacts which use Arrow arrays and connect them to ODBC.

In general, if you want to do some post-processing after fetching the data, you can use these to get it into Arrow arrays and then save them to parquet, as in the sketch below. Do these help?
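For example (a rough sketch, not a definitive recipe: the connection string and query are placeholders, and `file_encryption_properties` would be built with a `CryptoFactory` as in the sample you linked), the arrow-odbc Python package lets you stream batches straight into an encrypted parquet file:

```python
import pyarrow.parquet as pq
from arrow_odbc import read_arrow_batches_from_odbc

# Fetch the query result as a stream of Arrow record batches.
# Connection string and query are placeholders.
reader = read_arrow_batches_from_odbc(
    query="SELECT id, ssn, email FROM people",
    connection_string="Driver={...};Server=...;",
    batch_size=10_000,
)

# `file_encryption_properties` is assumed to come from a CryptoFactory,
# as in the earlier PyArrow encryption sketch.
with pq.ParquetWriter("people.parquet", reader.schema,
                      encryption_properties=file_encryption_properties) as writer:
    for batch in reader:
        writer.write_batch(batch)
```

Because the batches are streamed, this keeps memory bounded even for large result sets, while PyArrow handles the column encryption on write.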

Best, Markus

Hello @chrisfw ,

did arrow-odbc work for you? I'll be closing this issue, but feel free to reopen it if you still feel odbc2parquet is lacking a feature here. Currently, though, I see little I could do.

Best, Markus

Hi Markus,

I have not had a chance to experiment with arrow-odbc yet, but I agree it fits the use case nicely here, so I should be all set. Thanks for your assistance.

Regards,
Chris