branky/cascading.hive

Support to accept TezConfiguration in ORCFile

Opened this issue · 3 comments

Hi

We were testing PartitionTap on Tez (our inputs and outputs are ORC files) using the Cascading 3.0.0-wip-63 libs, Tez 0.5.3, and the cascading.hive 0.0.4 snapshot jar, and encountered the following ClassCastException:

Caused by: java.lang.ClassCastException: org.apache.tez.dag.api.TezConfiguration cannot be cast to org.apache.hadoop.mapred.JobConf
at cascading.hive.ORCFile.sinkConfInit(ORCFile.java:72)
at cascading.tap.Tap.sinkConfInit(Tap.java:206)
at cascading.tap.hadoop.Hfs.sinkConfInit(Hfs.java:399)
at cascading.tap.hadoop.Hfs.sinkConfInit(Hfs.java:106)
at cascading.tap.hadoop.io.TapOutputCollector.initialize(TapOutputCollector.java:96)
at cascading.tap.hadoop.io.TapOutputCollector.<init>(TapOutputCollector.java:91)
at cascading.tap.hadoop.PartitionTap.createTupleEntrySchemeCollector(PartitionTap.java:159)
at cascading.tap.partition.BasePartitionTap$PartitionCollector.getCollector(BasePartitionTap.java:130)
at cascading.tap.partition.BasePartitionTap$PartitionCollector.collect(BasePartitionTap.java:228)
at cascading.tuple.TupleEntryCollector.safeCollect(TupleEntryCollector.java:145)
at cascading.tuple.TupleEntryCollector.add(TupleEntryCollector.java:95)
at cascading.flow.stream.element.SinkStage.receive(SinkStage.java:98)

The exception is thrown in this method of ORCFile in cascading.hive:
public void sinkConfInit(FlowProcess flowProcess, Tap<JobConf, RecordReader, OutputCollector> tap, JobConf conf)

It seems that ORCFile doesn't support receiving a TezConfiguration. Could you please check this?

Thanks.
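
For readers wondering why the cast in the trace fails: JobConf and TezConfiguration both extend org.apache.hadoop.conf.Configuration, so they are siblings and neither can be cast to the other. Below is a minimal sketch of that situation. The nested classes are stand-ins that only mirror the inheritance shape (they are not the real Hadoop or Tez classes), and `CastDemo` and `castToJobConfFails` are hypothetical names used for illustration.

```java
// Stand-in classes that only mirror the inheritance shape of the real ones.
// They are NOT the Hadoop or Tez classes themselves.
class CastDemo {
    static class Configuration {}
    static class JobConf extends Configuration {}
    static class TezConfiguration extends Configuration {}

    // Mimics a method body that blindly downcasts its argument to JobConf.
    // Returns true when the downcast throws ClassCastException.
    static boolean castToJobConfFails(Configuration conf) {
        try {
            JobConf jobConf = (JobConf) conf; // fails for any sibling subclass
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(castToJobConfFails(new JobConf()));          // prints false
        System.out.println(castToJobConfFails(new TezConfiguration())); // prints true
    }
}
```

Under this assumption, any code path that hands a Tez-side configuration object to a method that downcasts to JobConf will fail at runtime, even though it compiles.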

This library has only been tested with Cascading 2.x, so I expect there are issues running it on Tez right now. Would you be interested in adding support for Cascading 3/Tez? Your contribution would benefit the whole community. Thank you!

fs111 commented

It should be fairly straightforward to support Cascading 3.x. If you run into any trouble, please let me/us know.

I have changed all references to JobConf to org.apache.hadoop.conf.Configuration. The code compiles, but I hit https://issues.apache.org/jira/browse/HIVE-6163 again: OrcOutputFormat doesn't write files with the parent path, which causes the sink to fail. The original workaround no longer works, so we need to find another solution or push the Hive committers to fix HIVE-6163.
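
As a rough illustration of the JobConf-to-Configuration change described above (not the actual cascading.hive patch), a method that only needs to set properties can accept the common base type and then works with either subclass. The classes here are stand-ins rather than the real Hadoop or Tez types, and `WidenedSinkDemo`, `OUTPUT_FORMAT_KEY`, and its value are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class WidenedSinkDemo {
    // Stand-ins that mirror only the shape of the real classes (assumption).
    static class Configuration {
        private final Map<String, String> props = new HashMap<>();
        void set(String key, String value) { props.put(key, value); }
        String get(String key) { return props.get(key); }
    }
    static class JobConf extends Configuration {}
    static class TezConfiguration extends Configuration {}

    // Hypothetical property key, used only for this sketch.
    static final String OUTPUT_FORMAT_KEY = "example.output.format";

    // Accepts the base type, so no downcast to JobConf is needed.
    static void sinkConfInit(Configuration conf) {
        conf.set(OUTPUT_FORMAT_KEY, "orc");
    }

    public static void main(String[] args) {
        Configuration mr = new JobConf();
        Configuration tez = new TezConfiguration();
        sinkConfInit(mr);
        sinkConfInit(tez);
        System.out.println(mr.get(OUTPUT_FORMAT_KEY));  // prints orc
        System.out.println(tez.get(OUTPUT_FORMAT_KEY)); // prints orc
    }
}
```

This only addresses the ClassCastException; it does not touch the HIVE-6163 path problem mentioned above.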