Use zero-prefixed digits in multi-file output to make it sort-by-filename friendly
aploium opened this issue · 2 comments
Currently, the --batch-size-row option produces output files like foo_1.par, foo_2.par, foo_10.par, foo_100.par.
However, this is not sort-by-filename friendly. In most string sorts, foo_9.par is larger than foo_10.par, because strings are compared character by character.
So if someone wants to load the parquet files back in order, they have to sort them by time or rename them somehow.
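To illustrate the problem, here is a minimal Python sketch using the file names from the example above. The helper numeric_suffix is hypothetical and only shows one possible workaround on the reading side. It assumes the names follow the foo_N.par pattern.

```python
import re

names = ["foo_1.par", "foo_2.par", "foo_9.par", "foo_10.par", "foo_100.par"]

# A plain string sort compares character by character, so foo_10.par and
# foo_100.par end up ahead of foo_2.par and foo_9.par.
lexicographic = sorted(names)


def numeric_suffix(name: str) -> int:
    """Extract the trailing number from a name like foo_10.par (hypothetical helper)."""
    match = re.search(r"_(\d+)\.par$", name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    return int(match.group(1))


# Workaround: sort by the parsed number instead of by the raw string.
numeric = sorted(names, key=numeric_suffix)

print(lexicographic)
print(numeric)
```

Sorting by a parsed number works, but every consumer has to know the naming pattern, which is why fixing the names at write time is attractive.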
A general solution is to prefix the number with zeros to a fixed width, for example foo_00009.par and foo_00090.par, so the names sort correctly as strings. Maybe a width of 7 is big enough?
The split command on Linux uses a suffix like foo.aaaaa, foo.aaaba:
https://man7.org/linux/man-pages/man1/split.1.html
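The zero-padding idea can be sketched in a few lines of Python. This is not odbc2parquet's implementation. The function numbered_name and its parameters are made up for illustration, and the default width of 5 is an assumption taken from the foo_00009.par example.

```python
def numbered_name(stem: str, index: int, width: int = 5, ext: str = ".par") -> str:
    """Build a file name whose number is zero-padded to a fixed width."""
    return f"{stem}_{index:0{width}d}{ext}"


# Generate names out of order, then rely on a plain string sort.
names = [numbered_name("foo", i) for i in (9, 90, 10, 100, 1)]
ordered = sorted(names)

print(ordered)
```

With a fixed width, string order and numeric order agree as long as no index needs more digits than the width. Once an index overflows the width, the name gets longer and the ordering breaks again, so the width should be chosen with the expected number of files in mind.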
Great feature suggestion! Personally I think a configurable suffix length with a default length of 2 is the way to go. I would stay with "normal" numbers for the suffix.
Cheers, Markus
odbc2parquet 0.12.0 has been released. Its query subcommand now offers a --suffix-length option, which defaults to two. See odbc2parquet help query for details.
Cheers, Markus