pacman82/odbc2parquet

Use zero-prefixed digits in multi-file output so it is sort-by-filename friendly

aploium opened this issue · 2 comments

Currently, --batch-size-row produces output files like foo_1.par foo_2.par foo_10.par foo_100.par
However, this is not sort-by-filename friendly.

Most string sorts compare character by character, so foo_9.par sorts after foo_10.par ('9' is greater than '1').
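
To illustrate the ordering problem, here is a minimal Python sketch. The file names are only the examples from this issue, not actual tool output.

```python
# File names without zero padding, as in the examples above.
names = ["foo_1.par", "foo_2.par", "foo_9.par", "foo_10.par", "foo_100.par"]

# A plain string sort compares character by character, so every name
# starting with "foo_1" lands ahead of foo_2.par and foo_9.par.
lexicographic = sorted(names)
print(lexicographic)
```

The sorted list no longer matches the order in which the files were written.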

So anyone who wants to load the parquet files back in order has to sort them by time, or rename them somehow.

A common solution is to prefix the number with zeros, up to a fixed width, for example:
foo_00009.par foo_00090.par
That way they sort correctly as strings. Maybe a width of 7 is big enough?
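
A rough sketch of the idea in Python, assuming a hypothetical helper named numbered_name (it is not part of odbc2parquet, and the default width of 5 simply matches the example above):

```python
def numbered_name(stem: str, index: int, width: int = 5) -> str:
    """Build a file name whose numeric suffix is zero-padded to `width` digits.

    Hypothetical helper for illustration only; not odbc2parquet code.
    """
    return f"{stem}_{index:0{width}d}.par"


# Generate names out of order, then rely on a plain string sort.
padded = [numbered_name("foo", i) for i in (100, 9, 90, 1)]
in_order = sorted(padded)
print(in_order)
```

Note that the padding only helps while the index fits in the chosen width. Once a number needs more digits than that, the names sort incorrectly again, which is why the width matters.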

For comparison, the Linux split command uses a suffix like foo.aaaaa foo.aaaba: https://man7.org/linux/man-pages/man1/split.1.html

Great feature suggestion! Personally, I think a configurable suffix length with a default of 2 is the way to go. I would stay with "normal" numbers for the suffix.

Cheers, Markus

odbc2parquet 0.12.0 has been released. Its query subcommand now offers a --suffix-length option. It defaults to two. See odbc2parquet help query for details.

Cheers, Markus