CSV 2 Parquet and CSV 2 ORC converter (a blend of individual tools with an aligned interface)
- csv to parquet conversion
- csv to orc conversion
- listing of meta information for orc/parquet (schema, statistics, encoding choices)
- control over some serialization settings (e.g. stripe size/block length, dictionary enable/disable)
- special features for generating test data
  - allows binary notation of input in the CSV to force specific values into the parquet/orc file for test purposes (e.g. various float/double NaN values, "out of range" int96 Julian dates, other dates/timestamps, broken or wrongly encoded Unicode characters, etc.)
  - allows writing int96 "Impala" timestamps
repository | parquet version | orc version | Comment |
---|---|---|---|
csv2parquet2orc | 1.9.0 | 1.4.x | parquet signed byte ordering for binary! |
csv2parquet2orc_p1_10 | 1.10.x | 1.5.x | |
% mvn clean compile assembly:single
% mvn test
The mvn build produces a single jar with all dependencies, packaged
as target/csv2parquet2orc-0.0.X-SNAPSHOT-jar-with-dependencies.jar
Note: the project can be built in a JDK 8 or JDK 7 environment, see https://travis-ci.org/jfseb/csv2parquet2orc
Note: a JDK with tools.jar present is required (e.g. Oracle JDK 8, SAPJVM SDK); you may have to set JAVA_HOME to point at that JDK, or install such a JDK (not OpenJDK) as the default, see https://stackoverflow.com/questions/5730815/unable-to-locate-tools-jar
tools.jar was removed with Oracle JDK 9. Feel free to enable a JDK 9 specific build variant and provide a pull request. Thanks.
% java -jar csv2parquet2orc-0.0.4-*.jar schema afile.orc [-x|--extended]
% java -jar csv2parquet2orc-0.0.4-*.jar schema afile.parquet [-x|--extended]
% java -jar csv2parquet2orc-0.0.4-*.jar meta abc.orc
% java -jar csv2parquet2orc-0.0.4-*.jar meta abc.parquet
% java -jar csv2parquet2orc-0.0.4-* convert -D parquet.compression=GZIP input.csv -s input.csv.schema -o out.parquet -S '|'
options:
- -D parquet.BLOCK_SIZE= (in bytes)
- -D parquet.PAGE_SIZE=
- -D parquet.compress=[GZIP|SNAPPY|NONE] default GZIP
- -D parquet.enabledictionary=[true|false]
- -S '|' csv column separator, default ','
- -H 1 skip 1 line in the csv (e.g. a header line)
java -jar csv2parquet2orc-0.0.2-* convert -D orc.compression=ZIP input.csv -s input.csv.schema -o out.orc
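For illustration, several of the options above can be combined in one invocation (flag spellings as listed above; file names are placeholders):

% java -jar csv2parquet2orc-0.0.4-*.jar convert -D parquet.enabledictionary=false input.csv -s input.csv.schema -o out.parquet -S '|' -H 1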
The following is a subset of the options:

option | option example | parquet | orc | Comment |
---|---|---|---|---|
explicit schema file spec | -s abc.schema.orc | yes | yes | |
Skip header lines | -H 1 | yes | yes | |
Separator | -S '\|' | yes | yes | |
Binary 0x12EFx0 (1) | -D csvformat=binary | yes | yes (2) | |

- (1) -D csvformat=binary must be specified before the command
- (2) Timestamps and decimals for ORC are written via the semi-typed classes of the ORC reader, which may limit the full byte range
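An illustrative invocation (file names are placeholders) with -D csvformat=binary placed before the convert command, as note (1) requires:

% java -jar csv2parquet2orc-0.0.4-*.jar -D csvformat=binary convert input.csv -s input.csv.schema -o out.parquet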
When setting -D csvformat=binary,
csv columns matching the pattern /0x([A-Fa-f0-9][A-Fa-f0-9])+x0/,
e.g. |0xFFFFFFFFx0|0xffefx0|0x41x0|
are interpreted as binary data.
The converter takes the binary data (left-padding it with 0x00 or truncating it from the left where required) as big-endian data and moves it into the respective column,
so 0x41x0 will yield 'A' in a varchar column, 65 in an integer/long column, etc.
In this format, every column starting with 0x, ending with x0 (e.g. 0xFFEFx0),
and containing an even number of contiguous hexadecimal characters is interpreted as
a binary representation.
The data is interpreted as a big-endian representation of the value,
e.g. 0x7FFFx0 is the value 32767.
Where needed, it is left-padded with 00 or truncated to the target width,
and the result is then written as the binary representation of the column value.
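As an illustration only (schema, separator and values are hypothetical), a pipe-separated input for, e.g., an ORC schema struct<id:int,c:string,d:double> could look like:

1|0x41x0|0x7FF8000000000001x0
2|B|1.5

Row 1 forces the character 'A' into the string column and the exact quiet-NaN bit pattern 0x7FF8000000000001 into the double column, while row 2 uses plain CSV values.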
java -jar csv2parquet2orc-0.0.2-* meta abc.parquet
java -jar csv2parquet2orc-0.0.2-* meta abc.orc
The project is built on parquet 1.9.0 and orc 1.4, using sources from https://github.com/Parquet/parquet-compatibility and https://github.com/apache/orc/tree/master/tools
Apache License 2.0
To run the tool on Windows, one may have to set the
environment variable HADOOP_HOME and install at least
winutils.exe from a hadoop installation or https://github.com/steveloughran/winutils
into %HADOOP_HOME%\bin.
example:
- HADOOP_HOME=c:\progs\hadoop
- PATH=%PATH%;%HADOOP_HOME%\bin
(A full hadoop installation is not required.)
int96 timestamps (aka Impala timestamps)
Timestamps are stored in 12 bytes:
- int64_t nanoseconds of the day (from midnight!)
- int32_t julian day (for the offset, see below)
The converter uses the messed-up "Hive" Julian calendar conversion shown below:
Hive converts "1970-01-01 00:00:00.0" to the Julian timestamp
(julianDay=2440588, timeOfDayNanos=0),
and "1970-01-01 12:00:00" to the Julian timestamp (julianDay=2440588, timeOfDayNanos=43200000000000), which still has julianDay 2440588 (!),
while "1970-01-02 00:00:00" (julianDay=2440589, timeOfDayNanos=0) has julianDay 2440589 (!).
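A minimal sketch (not part of the tool) of this day/nanos split, assuming julianDay = days since 1970-01-01 plus 2440588 and nanoseconds counted from midnight:

import java.time.LocalDateTime;

public class HiveJulianDemo {
  // Illustrative only: compute (julianDay, timeOfDayNanos) as described above.
  static void show(String ts) {
    LocalDateTime t = LocalDateTime.parse(ts);
    long julianDay = t.toLocalDate().toEpochDay() + 2440588; // 2440588 = julian day of 1970-01-01
    long nanos = t.toLocalTime().toNanoOfDay();              // nanoseconds since midnight
    System.out.println(ts + " -> julianDay=" + julianDay + ", timeOfDayNanos=" + nanos);
  }

  public static void main(String[] args) {
    show("1970-01-01T00:00:00"); // julianDay=2440588, timeOfDayNanos=0
    show("1970-01-01T12:00:00"); // julianDay=2440588, timeOfDayNanos=43200000000000
    show("1970-01-02T00:00:00"); // julianDay=2440589, timeOfDayNanos=0
  }
}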
Beware: in contrast to other parquet data, the int96 Impala timestamp is stored in little-endian order:
0x<nanos in little endian (int64_t, 8 bytes)><julian days in little endian (int32_t, 4 bytes)>x0
Example:
- 16 nanos after 1970-01-01 00:00:00.0000000:
- julian days: 2440588 = 0x253D8C; timeOfDayNanos: 16 = 0x10; are represented by the bytes:
|0x10000000000000008C3D2500x0|
    <nanos LE; 8 bytes  ><days LE 4b>
|0x 1000 0000 0000 0000 8C 3D 25 00 x0|
(Normal integers/dates/times/character strings etc. are to be written in big-endian order, so the number
2440588 would be 0x00253D8Cx0, "ABC" is 0x414243x0, etc.)
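A small sketch (again not part of the tool) that produces the 12 int96 bytes for the example above with java.nio, illustrating the little-endian layout:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Int96BytesDemo {
  public static void main(String[] args) {
    // 8 bytes timeOfDayNanos (little endian) followed by 4 bytes julianDay (little endian)
    ByteBuffer buf = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN);
    buf.putLong(16L);      // timeOfDayNanos = 16 = 0x10
    buf.putInt(2440588);   // julianDay = 0x253D8C
    StringBuilder sb = new StringBuilder("0x");
    for (byte b : buf.array()) sb.append(String.format("%02X", b));
    System.out.println(sb.append("x0")); // prints 0x10000000000000008C3D2500x0
  }
}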
Example parquet schema:
message m {
  OPTIONAL int32 d32 (DATE);
  OPTIONAL binary c3 (UTF8);
  OPTIONAL int64 d9_5 (DECIMAL(9,5));
  OPTIONAL binary d3_8 (DECIMAL(38,5));
  OPTIONAL int32 ui16 (UINT_32);
  OPTIONAL int32 i32;
  OPTIONAL int32 t_millis (TIME_MILLIS);
  OPTIONAL int64 t_micros (TIME_MICROS);
  OPTIONAL int64 ts_millis (TIMESTAMP_MILLIS);
  OPTIONAL int64 ts_micros (TIMESTAMP_MICROS);
  OPTIONAL int96 ts_i96 (TIMESTAMP);
  OPTIONAL double dbl;
  OPTIONAL float flt;
  OPTIONAL int64 plain;
  OPTIONAL binary c4 (UTF8);
}
Example ORC schema:
struct<cust_key:int,name:string,nation_keys:smallint,acctbal:double,adate:date,adec:decimal(9,5)>
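For illustration only (values, file names and the date format are assumptions, not taken from the tool's documentation), a comma-separated input matching this schema could look like:

1,Customer#000000001,15,711.56,1996-01-02,123.45678
2,Customer#000000002,13,121.65,1996-12-01,0.00001

and would be converted with a command of the form shown above, e.g. convert input.csv -s input.csv.schema -o out.orc.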
Run meta on the example files from https://github.com/apache/orc/tree/master/examples.