A simple example creating an Apache Spark RDD from an Apache HAWQ table using the HAWQInputFormat class and the newAPIHadoopRDD API.
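
For reference, a minimal sketch of what such a SparkHawqApp might look like is shown below. The connection URL, credentials, table name and column index are placeholders, and the HAWQInputFormat details (the com.pivotal.hawq.mapreduce package, the setInput(conf, dbUrl, user, password, table) signature and the Void/HAWQRecord key/value types) are assumptions that should be verified against the hawq-hadoop sources cloned in the build steps below.

// SparkHawqApp.scala -- minimal sketch; URLs, credentials, table and column are placeholders.
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SparkConf, SparkContext}
import com.pivotal.hawq.mapreduce.{HAWQInputFormat, HAWQRecord}

object SparkHawqApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SparkHawqApp"))

    // Tell HAWQInputFormat which HAWQ master and table to read.
    val conf = new Configuration()
    HAWQInputFormat.setInput(conf,
      "hawq-master:5432/postgres",  // <host>:<port>/<database> -- placeholder
      "gpadmin",                    // database user
      null,                         // password (none for trust authentication)
      "my_table")                   // table to read -- placeholder

    // Each row comes back as a HAWQRecord keyed by Void.
    val rdd = sc.newAPIHadoopRDD(conf,
      classOf[HAWQInputFormat], classOf[Void], classOf[HAWQRecord])

    // Field indexes on HAWQRecord are assumed to be 1-based.
    val firstColumn = rdd.map { case (_, record) => record.getString(1) }
    println("Rows read from HAWQ: " + firstColumn.count())

    sc.stop()
  }
}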
Note: the job must be submitted as the gpadmin user, or spark-submit must be run with HADOOP_USER_NAME=gpadmin, AND the following proxy-user properties must be added in Ambari's custom core-site.xml section:
<property>
  <name>hadoop.proxyuser.gpadmin.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.gpadmin.hosts</name>
  <value>*</value>
</property>
$ git clone https://github.com/apache/incubator-hawq.git
$ cd incubator-hawq/contrib/hawq-hadoop/
$ mvn package install -DskipTests
$ sbt assembly   # run from the SparkHawqApp project directory; see the build.sbt sketch below
$ spark-submit --master yarn --class "SparkHawqApp" target/scala-2.11/SparkHawqApp-assembly-1.0.jar
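
The sbt assembly step is run from the Spark application's own project, which needs a build.sbt along these lines. This is only a sketch: the Spark and Scala versions and the HAWQ Maven coordinates (groupId/artifactId/version) are assumptions and should be matched to your cluster and to whatever `mvn package install` placed in the local Maven repository.

// build.sbt -- minimal sketch with assumed versions and coordinates.
name := "SparkHawqApp"

version := "1.0"

scalaVersion := "2.11.12"

// Pick up the hawq-mapreduce artifacts installed into ~/.m2 by the Maven step above.
resolvers += Resolver.mavenLocal

libraryDependencies ++= Seq(
  // Spark is provided by the cluster at runtime via spark-submit.
  "org.apache.spark" %% "spark-core" % "2.3.0" % "provided",
  // Assumed coordinates for the HAWQ InputFormat artifact; verify against the hawq-hadoop pom.
  "com.pivotal.hawq" % "hawq-mapreduce-tool" % "1.1.0"
)

// The `assembly` task comes from the sbt-assembly plugin (declared in project/plugins.sbt)
// and produces target/scala-2.11/SparkHawqApp-assembly-1.0.jar.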