Bears-R-Us/arkouda

Read Parquet byte sizes in batches, rather than individually


To improve performance, this PR switches from reading the byte size of each string in a Parquet file one at a time to reading them in batches, which yields a significant speedup.
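The idea can be sketched as follows. This is a minimal illustrative example, not Arkouda's actual Chapel/C++ implementation: `ParquetColumnReader`, `read_size`, and `read_sizes` are hypothetical names, and the "reader" is a toy stand-in for the Parquet layer. The point is that one batched call amortizes per-call overhead that the old per-string loop paid on every element.

```python
class ParquetColumnReader:
    """Toy stand-in for a Parquet string-column reader (hypothetical API)."""

    def __init__(self, values):
        self._encoded = [v.encode("utf-8") for v in values]

    def read_size(self, i):
        # Old approach: one call (and, in the real code, one trip into the
        # Parquet layer) per string.
        return len(self._encoded[i])

    def read_sizes(self, start, count):
        # New approach: a single call returns a whole batch of byte sizes,
        # spreading the per-call overhead across the batch.
        return [len(b) for b in self._encoded[start:start + count]]

reader = ParquetColumnReader(["alpha", "beta", "gamma", "delta"])

one_at_a_time = [reader.read_size(i) for i in range(4)]
batched = reader.read_sizes(0, 4)
assert one_at_a_time == batched == [5, 4, 5, 5]
```

Both paths return the same byte sizes; only the number of calls into the reader changes.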

Here are results from a Cray XC with Lustre filesystem:

New (batched byte-size calculation):

| test | sec |
| --- | --- |
| single-file | 4.981 |
| fixed-single | 4.037 |
| scaled-five | 2.135 |
| fixed-scaled-five | 1.744 |
| five | 10.525 |
| fixed-five | 9.068 |
| scaled-ten | 1.094 |
| fixed-scaled-ten | 0.971 |
| ten | 11.532 |
| fixed-ten | 9.747 |

Old (per-string reads):

| test | sec |
| --- | --- |
| single-file | 7.907 |
| fixed-single | 4.026 |
| scaled-five | 4.021 |
| fixed-scaled-five | 1.754 |
| five | 17.076 |
| fixed-five | 4.997 |
| scaled-ten | 1.782 |
| fixed-scaled-ten | 0.978 |
| ten | 17.802 |
| fixed-ten | 9.499 |

Note that none of the "fixed" rows are affected, since this change only touches the byte-size calculation, which is skipped when the fixed-length optimization is used.