pacman82/odbc2parquet

Feature Request - Support column encryption in the generated parquet file

chrisfw opened this issue · 4 comments

Hi,

First, I'd like to thank you for creating and maintaining this great tool!

I'd like to ask if you would consider adding columnar encryption as an optional feature? It is supported by the Arrow libraries, and it would be really nice to be able to protect PII columns in the exported parquet file.

I haven't found many examples, but this one demonstrates the functionality in Python/Arrow with Vault as a KMS.

https://arrow.apache.org/docs/_downloads/2713f3cdaa3fc0dc691cd51bac09c6d4/sample_vault_kms_client.py
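To give a feel for it, here is a minimal sketch of the PyArrow column-encryption API that the sample builds on. The KMS client below is a toy stand-in (a real one, like the Vault client in the sample, would call out to an actual KMS), and the key identifiers, column names, and file name are all made up for illustration:

```python
import base64
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.parquet.encryption as pe

class ToyKmsClient(pe.KmsClient):
    """Toy KMS client for illustration only. A real deployment would
    delegate wrap/unwrap to an actual KMS such as Vault."""

    def __init__(self, config):
        super().__init__()
        # Master keys would normally live inside the KMS, never in code.
        self._keys = {"footer_key": b"0123456789012345",
                      "pii_key": b"1234567890123450"}

    def wrap_key(self, key_bytes, master_key_identifier):
        master = self._keys[master_key_identifier]
        # Not a secure cipher -- just XOR with the master key, for the demo.
        return base64.b64encode(bytes(a ^ b for a, b in zip(key_bytes, master)))

    def unwrap_key(self, wrapped_key, master_key_identifier):
        master = self._keys[master_key_identifier]
        raw = base64.b64decode(wrapped_key)
        return bytes(a ^ b for a, b in zip(raw, master))

crypto_factory = pe.CryptoFactory(lambda config: ToyKmsClient(config))
kms_config = pe.KmsConnectionConfig()  # connection details for a real KMS go here

# Encrypt the footer with one master key and the PII columns with another.
encryption_config = pe.EncryptionConfiguration(
    footer_key="footer_key",
    column_keys={"pii_key": ["ssn", "email"]},
)

table = pa.table({"id": [1, 2],
                  "ssn": ["111-22-3333", "444-55-6666"],
                  "email": ["a@example.com", "b@example.com"]})

file_encryption_properties = crypto_factory.file_encryption_properties(
    kms_config, encryption_config)
with pq.ParquetWriter("people.parquet", table.schema,
                      encryption_properties=file_encryption_properties) as writer:
    writer.write_table(table)
```

Reading the file back works the same way, with `crypto_factory.file_decryption_properties(kms_config)` passed as `decryption_properties` to the reader.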

Thanks,
Chris Whelan

Hello Chris,

thank you for the nice feedback.

I'd like to ask if you would consider adding columnar encryption as an optional feature? It is supported by the Arrow libraries [...]

First time I am hearing of this 😅. It may surprise you, but I am not using any actual Arrow arrays in this implementation. Rather, I pass the buffers I use for transfer from the ODBC data source as directly as I can to the parquet writer. For this I use the Rust parquet crate. In its documentation (https://docs.rs/parquet/33.0.0/parquet/) I see a lot about compression, but no options for encrypting columns. So, at the moment this feels out of scope for me.

However

I do actually maintain two artifacts which use Arrow arrays and connect them to ODBC.

In general, if you want to do some post-processing after fetching the data, you can use these to get it into Arrow arrays and then save them to parquet, as in the sketch below. Do these help?
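For example (a rough sketch, not a definitive recipe: the connection string and query are placeholders, and `file_encryption_properties` would be built with a `CryptoFactory` as in the sample you linked), the arrow-odbc Python package lets you stream batches straight into an encrypted parquet file:

```python
import pyarrow.parquet as pq
from arrow_odbc import read_arrow_batches_from_odbc

# Fetch the query result as a stream of Arrow record batches.
# Connection string and query are placeholders.
reader = read_arrow_batches_from_odbc(
    query="SELECT id, ssn, email FROM people",
    connection_string="Driver={...};Server=...;",
    batch_size=10_000,
)

# `file_encryption_properties` is assumed to come from a CryptoFactory,
# as in the earlier PyArrow encryption sketch.
with pq.ParquetWriter("people.parquet", reader.schema,
                      encryption_properties=file_encryption_properties) as writer:
    for batch in reader:
        writer.write_batch(batch)
```

Because the batches are streamed, this keeps memory bounded even for large result sets, while PyArrow handles the column encryption on write.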

Best, Markus

Hello @chrisfw ,

did arrow-odbc work for you? I'll be closing this issue, but feel free to reopen it if you still feel odbc2parquet is lacking a feature here. Currently, though, I see little I could do.

Best, Markus

Hi Markus,

I have not had a chance to experiment with arrow-odbc yet, but I agree it fits the use case nicely here, so I should be all set. Thanks for your assistance.

Regards,
Chris