How does hive.exec.orc.default.buffer.size affect the file size?
I am writing ORC files with Spark SQL 3.3.
I noticed that many ORC files in our production environment had small stripes, so I changed hive.exec.orc.default.buffer.size from 256K to 1K. After the change, the stripe size increased significantly and the number of stripes per file dropped significantly. Unexpectedly, the same dataset produced files of different sizes under the two settings: the file written with hive.exec.orc.default.buffer.size set to 1K was twice the size of the file written with 256K.
Generally, I would expect a larger stripe size to give a better compression ratio, so it is surprising that reducing the buffer size made the final file larger.
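One factor that may be involved, stated here as an assumption rather than a confirmed diagnosis: ORC compresses each stream in independent chunks whose maximum size is the buffer size, and each chunk carries a small header. Very small chunks give the codec little context in which to find repetition, so the same data can compress noticeably worse. The sketch below illustrates only that general effect, using Python's `zlib` on synthetic rows. `compressed_size`, `make_sample`, and the `HEADER_BYTES` constant are illustrative names and values, and this is not the ORC writer.

```python
import random
import zlib

# Assumed per-chunk header size, used here only to model fixed per-chunk overhead.
HEADER_BYTES = 3


def compressed_size(data: bytes, chunk_size: int) -> int:
    """Total bytes after compressing `data` in independent chunks of `chunk_size`."""
    total = 0
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        # Each chunk is compressed on its own, so the codec cannot reuse
        # patterns it saw in earlier chunks.
        total += HEADER_BYTES + len(zlib.compress(chunk))
    return total


def make_sample(n_rows: int = 20000, seed: int = 0) -> bytes:
    """Build synthetic, repetitive CSV-like rows (hypothetical data)."""
    rng = random.Random(seed)
    cities = ["beijing", "shanghai", "shenzhen", "hangzhou"]
    rows = [f"{i},{rng.choice(cities)},{rng.randint(0, 99)}\n" for i in range(n_rows)]
    return "".join(rows).encode("utf-8")


data = make_sample()
small = compressed_size(data, 1024)        # 1K chunks
large = compressed_size(data, 256 * 1024)  # 256K chunks
print(f"raw={len(data)} bytes, 1K chunks={small} bytes, 256K chunks={large} bytes")
```

On this synthetic input the 1K chunking should yield the larger total. Real ORC files also differ in column encodings, indexes, and stripe layout, so the actual ratio on production data may look quite different.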
Could you share some sample data that reproduces this, @loukey-lj?