Read Parquet byte sizes in batches, rather than individually
bmcdonald3 commented
This PR switches from reading the byte sizes of each string in a Parquet file one at a time to reading them in batches, which results in a significant performance improvement. A sketch of the idea is shown below.
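A minimal sketch of the batched approach using the low-level parquet-cpp `ByteArrayReader` API. This is illustrative, not the actual change; `totalStringBytes`, the batch size, and the +1 per-string terminator are assumptions:

```cpp
#include <parquet/api/reader.h>

#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Hypothetical helper: sum the byte sizes of every string in column `col`
// of a Parquet file, reading up to `batchSize` values per ReadBatch call
// instead of one value per call.
int64_t totalStringBytes(const std::string& path, int col,
                         int64_t batchSize = 8192) {
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile(path);

  int64_t total = 0;
  std::vector<parquet::ByteArray> values(batchSize);
  std::vector<int16_t> defLevels(batchSize);

  for (int rg = 0; rg < reader->metadata()->num_row_groups(); rg++) {
    std::shared_ptr<parquet::ColumnReader> colReader =
        reader->RowGroup(rg)->Column(col);
    auto* baReader = static_cast<parquet::ByteArrayReader*>(colReader.get());

    while (baReader->HasNext()) {
      int64_t valuesRead = 0;
      // One call fills up to batchSize values at once; previously this
      // would have been one ReadBatch call per string.
      baReader->ReadBatch(batchSize, defLevels.data(), /*rep_levels=*/nullptr,
                          values.data(), &valuesRead);
      for (int64_t i = 0; i < valuesRead; i++)
        total += values[i].len + 1;  // assumed +1 per-string null terminator
    }
  }
  return total;
}
```

The win comes from amortizing the per-call overhead (virtual dispatch, level decoding, page bookkeeping) over thousands of values rather than paying it once per string.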
Here are results from a Cray XC with a Lustre filesystem:
New (batch byte calculation):
test | time (sec) |
---|---|
single-file | 4.981 |
fixed-single | 4.037 |
scaled-five | 2.135 |
fixed-scaled-five | 1.744 |
five | 10.525 |
fixed-five | 9.068 |
scaled-ten | 1.094 |
fixed-scaled-ten | 0.971 |
ten | 11.532 |
fixed-ten | 9.747 |
Old (per-value byte calculation):
test | time (sec) |
---|---|
single-file | 7.907 |
fixed-single | 4.026 |
scaled-five | 4.021 |
fixed-scaled-five | 1.754 |
five | 17.076 |
fixed-five | 4.997 |
scaled-ten | 1.782 |
fixed-scaled-ten | 0.978 |
ten | 17.802 |
fixed-ten | 9.499 |
Note that none of the "fixed" rows are impacted, since this change only affects byte-size calculation, which is skipped entirely when using the fixed-length optimization (see the sketch below).
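For contrast, a hypothetical illustration of why the fixed-length path never touches the byte scan: with a known fixed string length, the total byte count is pure arithmetic (`fixedTotalBytes` and the +1 terminator are assumptions):

```cpp
// Hypothetical: with a fixed string length there is nothing to read,
// so the batched (or per-value) byte scan above is skipped entirely.
int64_t fixedTotalBytes(int64_t numStrings, int64_t fixedLen) {
  return numStrings * (fixedLen + 1);  // assumed +1 per-string terminator
}
```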