xitongsys/parquet-go

How to get bytearray from select columns of a row group?

hkpeaks opened this issue · 0 comments

Yesterday, I published the first pre-release version of my project. You can find it at https://github.com/hkpeaks/peaks-consolidation/releases. The project processes billion-row CSV files in streaming mode and runs all ETL processes in parallel. My in-memory dataset has no data schema. Instead, types are applied on demand, e.g. when filtering or aggregating real numbers (the user needs to specify float for a filter and sum for an aggregate).

My next step is to support Parquet format. I’m exploring which Go library is most suitable for me to maintain my current processing speed. Currently, I keep my in-memory dataset as a bytearray read from CSV. I do byte-to-byte conversion to support ETL functions such as Distinct, GroupBy, JoinTable and Filter.
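A minimal sketch of this schema-less, byte-level model, assuming a headerless comma-separated buffer with no quoted fields. The helper name `filterGreater` is hypothetical and is not taken from the project; it only shows a column being parsed as a float at filter time while the rows stay as raw bytes.

```go
package main

import (
	"bytes"
	"strconv"
)

// filterGreater returns the rows of data whose column col exceeds min.
// Hypothetical helper for illustration: rows are newline-separated, fields
// are comma-separated, and quoting or escaping is not handled.
func filterGreater(data []byte, col int, min float64) []byte {
	var out []byte
	for len(data) > 0 {
		// Cut one row off the front of the buffer (without its newline).
		row := data
		if i := bytes.IndexByte(data, '\n'); i >= 0 {
			row, data = data[:i], data[i+1:]
		} else {
			data = nil
		}
		row = bytes.TrimSuffix(row, []byte("\r"))

		fields := bytes.Split(row, []byte(","))
		if col >= len(fields) {
			continue // row is too short to contain the column
		}
		// The type is applied only here, on demand; no schema is stored.
		v, err := strconv.ParseFloat(string(fields[col]), 64)
		if err != nil {
			continue // not a number, so the row cannot match
		}
		if v > min {
			out = append(out, row...)
			out = append(out, '\n')
		}
	}
	return out
}
```

For example, `filterGreater([]byte("a,1.5\nb,3\n"), 1, 2)` keeps only the row `b,3`.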

To support the Parquet format, I want to keep the same processing model, which is possible if I can get a bytearray directly from the selected columns of a row group in a Parquet file. I could then use goroutines to read the row groups in parallel. For the first step of development, I plan to focus on reading. If reading Parquet files turns out to be no slower than reading CSV, I will move on to writing Parquet from a bytearray. Currently I use DuckDB and Polars to convert CSV files to Parquet.
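The parallel read could be sketched as below, using only the standard library. The `readFunc` type is a placeholder for whatever call returns the raw bytes of the selected columns of one row group; it is an assumption for illustration and not an existing parquet-go API, which is exactly the piece this issue asks about.

```go
package main

import "sync"

// readFunc is a hypothetical stand-in for a call that returns the raw bytes
// of the selected columns of one row group. It is not a parquet-go API.
type readFunc func(rowGroup int, columns []string) ([]byte, error)

// readRowGroups reads n row groups concurrently, one goroutine per group,
// and returns the byte slices in row-group order.
func readRowGroups(n int, columns []string, read readFunc) ([][]byte, error) {
	results := make([][]byte, n)
	errs := make([]error, n)

	var wg sync.WaitGroup
	for g := 0; g < n; g++ {
		wg.Add(1)
		go func(g int) {
			defer wg.Done()
			// Each goroutine writes only to its own index, so no lock is needed.
			results[g], errs[g] = read(g, columns)
		}(g)
	}
	wg.Wait()

	// Report the first error in row-group order, if any.
	for _, err := range errs {
		if err != nil {
			return nil, err
		}
	}
	return results, nil
}
```

Each returned slice could then be passed to the same byte-level ETL functions that already handle CSV data. For files with many row groups, a bounded worker pool would likely be preferable to one goroutine per group.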

I have downloaded your code examples and run some tests, and they seem to work. Before that, I tried Apache Parquet-Go, but I could not get anything to work properly.