CSV 2 Parquet and CSV 2 ORC converter (a blend of individual tools with an aligned interface)
- csv to parquet conversion
- csv to orc conversion
- listing of meta information for orc/parquet (schema, statistics, encoding choices)
- control over some serialization settings (e.g. stripe size/block length, dictionary enable/disable)
- special features for generating test data
  - allows binary notation of input in the CSV to force specific values into the parquet/orc file for test purposes (e.g. various float/double NaN values, "out of range" int96 Julian dates, other dates/timestamps, broken or wrongly encoded Unicode characters, etc.)
  - allows writing int96 "Impala" timestamps
repository | parquet version | orc version | Comment |
---|---|---|---|
csv2parquet2orc | 1.9.0 | 1.4.x | parquet signed byte ordering for binary! |
csv2parquet2orc_p1_10 | 1.10.x | 1.5.x | |
% mvn clean compile assembly:single
% mvn test
The mvn build produces a single jar with all dependencies, packaged
as target/csv2parquet2orc-0.0.X-SNAPSHOT-jar-with-dependencies.jar
Note: the project can be built in a JDK 8 or JDK 7 environment, see https://travis-ci.org/jfseb/csv2parquet2orc
Note: a JDK with tools.jar present is required (e.g. Oracle JDK 8, SAPJVM SDK); you may have to set JAVA_HOME to point at that JDK, or install such a JDK (not OpenJDK) as the default, see https://stackoverflow.com/questions/5730815/unable-to-locate-tools-jar
tools.jar was removed with Oracle JDK 9. Feel free to enable a JDK 9 specific build variant and provide a pull request. Thanks.
% java -jar csv2parquet2orc-0.0.4-*.jar schema afile.orc [-x|--extended]
% java -jar csv2parquet2orc-0.0.4-*.jar schema afile.parquet [-x|--extended]
% java -jar csv2parquet2orc-0.0.4-*.jar meta abc.orc
% java -jar csv2parquet2orc-0.0.4-*.jar meta abc.parquet
% java -jar csv2parquet2orc-0.0.4-* convert -D parquet.compression=GZIP input.csv -s input.csv.schema -o out.parquet -S '|'
options:
- -D parquet.BLOCK_SIZE= (in bytes)
- -D parquet.PAGE_SIZE=
- -D parquet.compress=[GZIP|SNAPPY|NONE] default GZIP
- -D parquet.enabledictionary=[true|false]
- -S '|' csv column separator, default ','
- -H 1 skip 1 line in the csv (e.g. a header line)
java -jar csv2parquet2orc-0.0.2-* convert -D orc.compression=ZIP input.csv -s input.csv.schema -o out.orc
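For illustration, several of the options above can be combined in one invocation (flag spellings as listed above; file names are placeholders):

% java -jar csv2parquet2orc-0.0.4-*.jar convert -D parquet.enabledictionary=false input.csv -s input.csv.schema -o out.parquet -S '|' -H 1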
The following is a subset of the options:

option | option example | parquet | orc | Comment |
---|---|---|---|---|
explicit schema file spec | -s abc.schema.orc | yes | yes | |
Skip header lines | -H 1 | yes | yes | |
Separator | -S '\|' | yes | yes | |
Binary 0x12EFx0 (1) | -D csvformat=binary | yes | yes (2) | |

- (1) -D csvformat=binary must be specified before the command
- (2) Timestamps and decimals for ORC are written via the semi-typed classes of the ORC reader, which may limit the full byte range
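An illustrative invocation (file names are placeholders) with -D csvformat=binary placed before the convert command, as note (1) requires:

% java -jar csv2parquet2orc-0.0.4-*.jar -D csvformat=binary convert input.csv -s input.csv.schema -o out.parquet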
When setting -D csvformat=binary,
csv columns matching the pattern /0x([A-Fa-f0-9][A-Fa-f0-9])+x0/,
e.g. |0xFFFFFFFFx0|0xffefx0|0x41x0|
are interpreted as binary data.
The converter takes the binary data (left-padding it with 0x00 or truncating it from the left where required) as big-endian data and moves it into the respective column,
so 0x41x0 will yield 'A' in a varchar column, 65 in an integer/long column, etc.
In this format, every column starting with 0x, ending with x0 (e.g. 0xFFEFx0),
and containing an even number of contiguous hexadecimal characters is interpreted as
a binary representation.
The data is interpreted as a big-endian representation of the value,
e.g. 0x7FFFx0 is the value 32767.
Where needed, it is left-padded with 00 or truncated to the target width,
and the result is then written as the binary representation of the column value.
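As an illustration only (schema, separator and values are hypothetical), a pipe-separated input for, e.g., an ORC schema struct<id:int,c:string,d:double> could look like:

1|0x41x0|0x7FF8000000000001x0
2|B|1.5

Row 1 forces the character 'A' into the string column and the exact quiet-NaN bit pattern 0x7FF8000000000001 into the double column, while row 2 uses plain CSV values.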
java -jar csv2parquet2orc-0.0.2-* meta abc.parquet
java -jar csv2parquet2orc-0.0.2-* meta abc.orc
The project is built on parquet 1.9.0 and orc 1.4, using sources from https://github.com/Parquet/parquet-compatibility and https://github.com/apache/orc/tree/master/tools
Apache License 2.0
To run the tool on Windows, one may have to set the
environment variable HADOOP_HOME and install at least
winutils.exe from a hadoop installation or https://github.com/steveloughran/winutils
into %HADOOP_HOME%\bin.
example:
- HADOOP_HOME=c:\progs\hadoop
- PATH=%PATH%;%HADOOP_HOME%\bin
(A full hadoop installation is not required.)
int96 timestamps (aka Impala timestamps)
Timestamps are stored in 12 bytes:
- int64_t nanoseconds of the day (from midnight!)
- int32_t julian day (for the offset, see below)
The converter uses the messed-up "Hive" Julian calendar conversion shown below:
Hive converts "1970-01-01 00:00:00.0" to the Julian timestamp
(julianDay=2440588, timeOfDayNanos=0),
and "1970-01-01 12:00:00" to the Julian timestamp (julianDay=2440588, timeOfDayNanos=43200000000000), which still has julianDay 2440588 (!),
while "1970-01-02 00:00:00" (julianDay=2440589, timeOfDayNanos=0) has julianDay 2440589 (!).
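A minimal sketch (not part of the tool) of this day/nanos split, assuming julianDay = days since 1970-01-01 plus 2440588 and nanoseconds counted from midnight:

import java.time.LocalDateTime;

public class HiveJulianDemo {
  // Illustrative only: compute (julianDay, timeOfDayNanos) as described above.
  static void show(String ts) {
    LocalDateTime t = LocalDateTime.parse(ts);
    long julianDay = t.toLocalDate().toEpochDay() + 2440588; // 2440588 = julian day of 1970-01-01
    long nanos = t.toLocalTime().toNanoOfDay();              // nanoseconds since midnight
    System.out.println(ts + " -> julianDay=" + julianDay + ", timeOfDayNanos=" + nanos);
  }

  public static void main(String[] args) {
    show("1970-01-01T00:00:00"); // julianDay=2440588, timeOfDayNanos=0
    show("1970-01-01T12:00:00"); // julianDay=2440588, timeOfDayNanos=43200000000000
    show("1970-01-02T00:00:00"); // julianDay=2440589, timeOfDayNanos=0
  }
}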
Beware: in contrast to other parquet data, the int96 Impala timestamp is stored in little-endian order:
0x<nanos in little endian (int64_t, 8 bytes)><julian days in little endian (int32_t, 4 bytes)>x0
Example:
- 16 nanos after 1970-01-01 00:00:00.0000000:
- julian days: 2440588 = 0x253D8C; timeOfDayNanos: 16 = 0x10; are represented by the bytes:
|0x10000000000000008C3D2500x0|
    <nanos LE; 8 bytes  ><days LE 4b>
|0x 1000 0000 0000 0000 8C 3D 25 00 x0|
(Normal integers/dates/times/character strings etc. are to be written in big-endian order, so the number
2440588 would be 0x00253D8Cx0, "ABC" is 0x414243x0, etc.)
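A small sketch (again not part of the tool) that produces the 12 int96 bytes for the example above with java.nio, illustrating the little-endian layout:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Int96BytesDemo {
  public static void main(String[] args) {
    // 8 bytes timeOfDayNanos (little endian) followed by 4 bytes julianDay (little endian)
    ByteBuffer buf = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN);
    buf.putLong(16L);      // timeOfDayNanos = 16 = 0x10
    buf.putInt(2440588);   // julianDay = 0x253D8C
    StringBuilder sb = new StringBuilder("0x");
    for (byte b : buf.array()) sb.append(String.format("%02X", b));
    System.out.println(sb.append("x0")); // prints 0x10000000000000008C3D2500x0
  }
}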
Example parquet schema:
message m {
  OPTIONAL int32 d32 (DATE);
  OPTIONAL binary c3 (UTF8);
  OPTIONAL int64 d9_5 (DECIMAL(9,5));
  OPTIONAL binary d3_8 (DECIMAL(38,5));
  OPTIONAL int32 ui16 (UINT_32);
  OPTIONAL int32 i32;
  OPTIONAL int32 t_millis (TIME_MILLIS);
  OPTIONAL int64 t_micros (TIME_MICROS);
  OPTIONAL int64 ts_millis (TIMESTAMP_MILLIS);
  OPTIONAL int64 ts_micros (TIMESTAMP_MICROS);
  OPTIONAL int96 ts_i96 (TIMESTAMP);
  OPTIONAL double dbl;
  OPTIONAL float flt;
  OPTIONAL int64 plain;
  OPTIONAL binary c4 (UTF8);
}
Example ORC schema:
struct<cust_key:int,name:string,nation_keys:smallint,acctbal:double,adate:date,adec:decimal(9,5)>
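For illustration only (values, file names and the date format are assumptions, not taken from the tool's documentation), a comma-separated input matching this schema could look like:

1,Customer#000000001,15,711.56,1996-01-02,123.45678
2,Customer#000000002,13,121.65,1996-12-01,0.00001

and would be converted with a command of the form shown above, e.g. convert input.csv -s input.csv.schema -o out.orc.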
Run meta on the example files from https://github.com/apache/orc/tree/master/examples.