Bears-R-Us/arkouda

Read Parquet byte sizes in batches, rather than individually


To improve performance, this PR switches from reading the byte size of each string in a Parquet file one at a time to reading them in batches, which yields a significant speedup.
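The idea can be sketched as follows. This is a minimal illustrative example, not Arkouda's actual Chapel/C++ implementation: `ParquetColumnReader`, `read_size`, and `read_sizes` are hypothetical names, and the "reader" is a toy stand-in for the Parquet layer. The point is that one batched call amortizes per-call overhead that the old per-string loop paid on every element.

```python
class ParquetColumnReader:
    """Toy stand-in for a Parquet string-column reader (hypothetical API)."""

    def __init__(self, values):
        self._encoded = [v.encode("utf-8") for v in values]

    def read_size(self, i):
        # Old approach: one call (and, in the real code, one trip into the
        # Parquet layer) per string.
        return len(self._encoded[i])

    def read_sizes(self, start, count):
        # New approach: a single call returns a whole batch of byte sizes,
        # spreading the per-call overhead across the batch.
        return [len(b) for b in self._encoded[start:start + count]]

reader = ParquetColumnReader(["alpha", "beta", "gamma", "delta"])

one_at_a_time = [reader.read_size(i) for i in range(4)]
batched = reader.read_sizes(0, 4)
assert one_at_a_time == batched == [5, 4, 5, 5]
```

Both paths return the same byte sizes; only the number of calls into the reader changes.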

Here are results from a Cray XC with Lustre filesystem:

New (batched byte-size calculation):

| test | sec |
| --- | --- |
| single-file | 4.981 |
| fixed-single | 4.037 |
| scaled-five | 2.135 |
| fixed-scaled-five | 1.744 |
| five | 10.525 |
| fixed-five | 9.068 |
| scaled-ten | 1.094 |
| fixed-scaled-ten | 0.971 |
| ten | 11.532 |
| fixed-ten | 9.747 |

Old (per-string reads):

| test | sec |
| --- | --- |
| single-file | 7.907 |
| fixed-single | 4.026 |
| scaled-five | 4.021 |
| fixed-scaled-five | 1.754 |
| five | 17.076 |
| fixed-five | 4.997 |
| scaled-ten | 1.782 |
| fixed-scaled-ten | 0.978 |
| ten | 17.802 |
| fixed-ten | 9.499 |

Note that none of the "fixed" rows are affected, since this change only touches the byte-size calculation, which is skipped when the fixed-length optimization is used.