CSV 2 Parquet and CSV 2 ORC converter (a blend of individual tools with an aligned interface)
(Parquet 1.10.x, ORC 1.5.x)
- csv to parquet conversion
- csv to orc conversion
- listing of meta information for orc/parquet files (schema, statistics, encoding choices)
- control over some serialization options (e.g. stripe size/block length, dictionary enable/disable)
- special features for generating test data:
  - allows binary notation of input in CSV to force specific values into the parquet/orc file for test purposes (e.g. various float/double NaN values, "out of range" int96 Julian dates, other dates/timestamps, broken or wrongly encoded Unicode characters, etc.)
  - allows writing int96 "Impala" timestamps
-
repository | parquet version | orc version | Comment |
---|---|---|---|
csv2parquet2orc | 1.9.0 | 1.4.x | parquet signed byte ordering for binary ! |
csv2parquet2orc_p1_10 | 1.10.x | 1.5.x | min max output for parquet |
% mvn clean compile assembly:single
% mvn test
The mvn build produces a single jar with all dependencies packaged,
as target/csv2parquet2orc-0.0.X-SNAPSHOT-jar-with-dependencies.jar
Note: the project can be built in a jdk8 or jdk7 environment, see https://travis-ci.org/jfseb/csv2parquet2orc
Note: a JDK with tools.jar present is required (e.g. the Oracle 8 JDK or the SAP JVM SDK, not an OpenJDK JDK); you may have to set JAVA_HOME to point at the JDK or install one as the default, see https://stackoverflow.com/questions/5730815/unable-to-locate-tools-jar
tools.jar was removed with Oracle JDK 9. Feel free to enable a JDK9-specific build variant and provide a pull request. Thanks.
% java -jar csv2parquet2orc-0.0.4-*.jar schema afile.orc [-x|--extended]
% java -jar csv2parquet2orc-0.0.4-*.jar schema afile.parquet [-x|--extended]
% java -jar csv2parquet2orc-0.0.4-*.jar meta abc.orc
% java -jar csv2parquet2orc-0.0.4-*.jar meta abc.parquet
% java -jar csv2parquet2orc-0.0.4-* convert -D parquet.compression=GZIP input.csv -s input.csv.schema -o out.parquet -S '|'
options:
- -D parquet.BLOCK_SIZE= (in bytes)
- -D parquet.PAGE_SIZE=
- -D parquet.compression=[GZIP|SNAPPY|NONE], default GZIP
- -D parquet.enabledictionary=[true|false]
- -S '|' csv column separator, default ','
- -H 1 skip 1 line in the csv (e.g. a header line)
java -jar csv2parquet2orc-0.0.2-* convert -D orc.compression=ZLIB input.csv -s input.csv.schema -o out.orc
The following is a subset of the options:
option | option example | parquet | orc | Comment |
---|---|---|---|---|
explicit schema file spec | -s abc.schema.orc | yes | yes | |
Skip header lines | -H 1 | yes | yes | |
Null string | -n '' | yes | yes | |
Separator | -S '|' | yes | yes | |
Binary 0x12EFx0 (1) | -D csvformat=binary | yes | yes (2) | |
- (1) -D csvformat=binary must be specified before the command
- (2) timestamps and decimals for orc are written via semi-typed classes of the orc API, which may prevent exercising the full byte range
When setting -D csvformat=binary,
csv columns matching the pattern /0x([A-Fa-f0-9][A-Fa-f0-9])+x0/,
e.g. |0xFFFFFFFFx0|0xffefx0|0x41x0|,
are interpreted as binary data.
The converter takes the binary data (left-padding it with 0x00 or truncating it from the left where required) as big-endian data and moves it into the respective column:
so 0x41x0 will yield 'A' in a varchar column, 65 in an integer/long column, etc.,
and 0x7FFFx0 is the value 32767 = 0x7FFF in a long/int column.
This way, illegal/out-of-range values that are not representable by the Java platform types can be inserted into the parquet file for testing; the sketch below illustrates the decoding rule.
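For illustration, here is a minimal Java sketch of that decoding rule (a hypothetical standalone helper, not the converter's actual code; class and method names are made up):

```java
import java.math.BigInteger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BinaryCsvToken {
    // the pattern described above: 0x, one or more hex byte pairs, terminated by x0
    private static final Pattern BIN = Pattern.compile("0x((?:[A-Fa-f0-9]{2})+)x0");

    /** Decode a CSV token like "0x7FFFx0" into raw bytes, or null if it is not binary notation. */
    static byte[] decode(String token) {
        Matcher m = BIN.matcher(token);
        if (!m.matches()) return null;
        String hex = m.group(1);
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    /** Left-pad with 0x00, or truncate from the left, to 'width' bytes (big-endian semantics). */
    static byte[] fitBigEndian(byte[] raw, int width) {
        byte[] out = new byte[width];
        int copy = Math.min(raw.length, width);
        // keep the rightmost (least significant) bytes
        System.arraycopy(raw, raw.length - copy, out, width - copy, copy);
        return out;
    }

    public static void main(String[] args) {
        byte[] asInt32 = fitBigEndian(decode("0x7FFFx0"), 4); // 00 00 7F FF
        System.out.println(new BigInteger(asInt32));          // prints 32767
    }
}
```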
java -jar csv2parquet2orc-0.0.2-* meta abc.parquet
java -jar csv2parquet2orc-0.0.2-* meta abc.orc
The project is built on parquet 1.10.0 and orc 1.5, using sources from https://github.com/Parquet/parquet-compatibility and https://github.com/apache/orc/tree/master/tools
Apache License 2.0
To operate the functionality on Windows, one may have to set the
environment variable HADOOP_HOME and install at least
winutils.exe from a hadoop installation or https://github.com/steveloughran/winutils
into HADOOP_HOME.
example:
- HADOOP_HOME=c:\progs\hadoop
- PATH=%PATH%;%HADOOP_HOME%\bin
(A full hadoop installation is not required.)
int96 timestamps (aka impala timestamps) are stored in 12 bytes:
an int64_t holding the nanoseconds of the day (counted from midnight!), followed by an int32_t julian day (for the day offset, see below).
The converter uses the messed-up "Hive" julian calendar conversion calculation below:
Hive converts "1970-01-01 00:00:00.0" to the Julian timestamp
(julianDay=2440588, timeOfDayNanos=0),
and "1970-01-01 12:00:00" to the Julian timestamp (julianDay=2440588, timeOfDayNanos=12*60*60*1000*1000*1000), i.e. it still has julianDay 2440588 (!),
while "1970-01-02 00:00:00" becomes (julianDay=2440589, timeOfDayNanos=0), i.e. julianDay 2440589 (!). The sketch below spells out this convention.
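A minimal Java sketch of this convention, under the assumption stated above that 1970-01-01 maps to julian day 2440588 (class and method names are hypothetical):

```java
import java.time.Instant;

public class HiveJulian {
    // 1970-01-01 corresponds to julianDay 2440588 in this convention
    static final int JULIAN_EPOCH_DAY = 2440588;
    static final long NANOS_PER_DAY = 24L * 60 * 60 * 1_000_000_000L;

    /** Convert an epoch instant to { julianDay, timeOfDayNanos }. */
    static long[] toJulian(Instant t) {
        // note: the long nanos value overflows for dates beyond ~2262; fine for a sketch
        long epochNanos = t.getEpochSecond() * 1_000_000_000L + t.getNano();
        long days = Math.floorDiv(epochNanos, NANOS_PER_DAY);
        long nanosOfDay = Math.floorMod(epochNanos, NANOS_PER_DAY);
        return new long[] { JULIAN_EPOCH_DAY + days, nanosOfDay };
    }

    public static void main(String[] args) {
        long[] noon = toJulian(Instant.parse("1970-01-01T12:00:00Z"));
        System.out.println(noon[0] + " " + noon[1]);       // 2440588 43200000000000
        long[] nextDay = toJulian(Instant.parse("1970-01-02T00:00:00Z"));
        System.out.println(nextDay[0] + " " + nextDay[1]); // 2440589 0
    }
}
```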
Beware: in contrast to other parquet data, the int96 impala timestamp is stored in little-endian order:
0x<nanos in little endian (int64_t, 8 bytes)><julianday in little endian (int32_t, 4 bytes)>x0
Example:
- 16 nanos after 1970-01-01 00:00:00.0000000:
- juliandays: 2440588 = 0x253D8C; timeOfDayNanos: 16 = 0x10; are represented by the bytes:
|0x10000000000000008C3D2500x0|
|0x 1000 0000 0000 0000 8C 3D 25 00 x0|
    <nanos LE, 8 bytes >  <days LE, 4 bytes>
(Normal integers/dates/times/character strings etc. are to be written in big-endian order, so the number 2440588 would be 0x00253D8Cx0, "ABC" is 0x414243x0, etc.)
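As an illustration, a small Java sketch (a hypothetical helper, not the converter's code) that packs (timeOfDayNanos, julianDay) into the 12-byte little-endian layout and reproduces the byte string above:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Int96Pack {
    /** Pack (timeOfDayNanos, julianDay) into the 12-byte int96 layout. */
    static byte[] pack(long timeOfDayNanos, int julianDay) {
        ByteBuffer buf = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN);
        buf.putLong(timeOfDayNanos); // 8 bytes, little-endian
        buf.putInt(julianDay);       // 4 bytes, little-endian
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] b = pack(16L, 2440588);   // 16 nanos after 1970-01-01 00:00:00
        StringBuilder sb = new StringBuilder("0x");
        for (byte x : b) sb.append(String.format("%02X", x & 0xFF));
        System.out.println(sb.append("x0")); // 0x10000000000000008C3D2500x0
    }
}
```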
An example parquet schema file (as passed with -s):
message m {
OPTIONAL int32 d32 (DATE);
OPTIONAL binary c3 (UTF8);
OPTIONAL int64 d9_5 (DECIMAL(9,5));
OPTIONAL binary d3_8 (DECIMAL(38,5));
OPTIONAL int32 ui16 (UINT_32);
OPTIONAL int32 i32;
OPTIONAL int32 t_millis (TIME_MILLIS) ;
OPTIONAL int64 t_micros (TIME_MICROS);
OPTIONAL int64 ts_millis (TIMESTAMP_MILLIS) ;
OPTIONAL int64 ts_micros (TIMESTAMP_MICROS);
OPTIONAL int96 ts_i96 (TIMESTAMP);
OPTIONAL double dbl;
OPTIONAL float flt;
OPTIONAL int64 plain;
OPTIONAL binary c4 (UTF8);
}
An example orc schema file:
struct<cust_key:int,name:string,nation_keys:smallint,acctbal:double,adate:date,adec:decimal(9,5)>
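For illustration, a matching csv row for this schema (hypothetical data, assuming -S '|' as the separator) could look like:
1|Customer#000000001|15|711.56|1996-01-02|123.45678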
Run meta on the example files from https://github.com/apache/orc/tree/master/examples