YotpoLtd/metorikku

Problem reading Hive data

Opened this issue · 12 comments

I use Metorikku to read data from Hive and then write it to HDFS in Parquet format, but the output is always empty. I cannot figure out what is wrong; can someone give me some advice? Thanks.
The job conf:
metrics:
  - test_metric.yml
output:
    file:
        dir: /tmp

test_metric conf:
steps:
- dataFrameName: df1
  sql:
    SELECT * FROM employee

output:
- dataFrameName: df1
  outputType: parquet
  outputOptions:
    saveMode: overwrite
    path: df1.parquet

Can you share your spark-submit command as well?

Thank you!
My Spark and Hive are in the same cluster, and my other Spark programs can read Hive tables directly, so I submit Metorikku without any Hive metastore connection config.
I tried two spark-submit commands; the results are the same.

  1. spark-submit --class com.yotpo.metorikku.Metorikku metorikku.jar -c test_job.yaml
  2. spark-submit --conf spark.sql.catalogImplementation=hive --class com.yotpo.metorikku.Metorikku metorikku.jar -c test_job.yaml

If you run spark-sql -e "select * from employee", do you see data?

Also, are you seeing an empty parquet file, or is nothing being written at all?

spark-sql -e "select * from employee" prints some data, when submitting metorikku.jar, there is no output file.

I'm wondering if maybe it's writing to the local FS instead of HDFS. Can you add the following to your job config:
showPreviewLines: 10
Can you see the employee table output in the STDOUT?
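As a side note, here is a minimal sketch of the job config with the output directory given as a fully qualified HDFS URI, which would rule out accidental writes to the local filesystem; the hdfs:///tmp path is only an illustrative assumption, so substitute your cluster's actual path or namenode URI:

metrics:
  - test_metric.yml
output:
    file:
        dir: hdfs:///tmp    # fully qualified so the writer does not fall back to the local FS
showPreviewLines: 10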

I added showPreviewLines: 42 and showQuery: true, but stdout does not print the SQL query or the SELECT output.
stdout:
19/05/31 11:24:45 INFO Client: Application report for application_1559031778312_14158 (state: RUNNING)
19/05/31 11:24:45 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.202.116.71
ApplicationMaster RPC port: 0
queue: root.default
start time: 1559273077996
final status: UNDEFINED
tracking URL: http://10.202.77.200:54315/proxy/application_1559031778312_14158/
user: hive
19/05/31 11:24:45 INFO YarnClientSchedulerBackend: Application application_1559031778312_14158 has started running.
19/05/31 11:24:45 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 57244.
19/05/31 11:24:45 INFO NettyBlockTransferService: Server created on 10.202.77.200:57244
19/05/31 11:24:45 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
19/05/31 11:24:45 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.202.77.200, 57244, None)
19/05/31 11:24:45 INFO BlockManagerMasterEndpoint: Registering block manager 10.202.77.200:57244 with 366.3 MB RAM, BlockManagerId(driver, 10.202.77.200, 57244, None)
19/05/31 11:24:45 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.202.77.200, 57244, None)
19/05/31 11:24:45 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.202.77.200, 57244, None)
19/05/31 11:24:45 INFO EventLoggingListener: Logging events to hdfs://test-cluster-log/sparkHistory/application_1559031778312_14158
19/05/31 11:24:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.202.116.78:51674) with ID 1
19/05/31 11:24:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.202.116.73:60676) with ID 2
19/05/31 11:24:50 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
19/05/31 11:24:50 INFO BlockManagerMasterEndpoint: Registering block manager CNSZ22PL0529:34413 with 2004.6 MB RAM, BlockManagerId(2, CNSZ22PL0529, 34413, None)
19/05/31 11:24:50 INFO SharedState: loading hive config file: file:/app/spark/conf/hive-site.xml
19/05/31 11:24:50 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/DATA1/home/hive/01379241/spark-warehouse/').
19/05/31 11:24:50 INFO SharedState: Warehouse path is 'file:/DATA1/home/hive/01379241/spark-warehouse/'.
19/05/31 11:24:51 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
19/05/31 11:24:51 INFO StreamingQueryMetricsListener$: Initialize stream listener

Is this the entire output from the spark-submit?
If so, it looks like it's not running any steps... malformed YAML, perhaps? Can you paste the job/metric YAML here with backticks so I can see if it has incorrect formatting?

job and metrics config:

test_job.yml
metrics:
  - test_metric.yml
output:
    file:
        dir: /tmp

explain: true
showPreviewLines: 42
showQuery: true

test_metric.yml
steps:
- dataFrameName: df1
  sql:
    SELECT * FROM employee

output:
- dataFrameName: df1
  outputType: parquet
  outputOptions:
    saveMode: overwrite
    path: df1.parquet

Sorry for the late reply... I think outputType: parquet should be outputType: Parquet
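For clarity, this is the output section of test_metric.yml with only that change applied (everything else left as posted above):

output:
- dataFrameName: df1
  outputType: Parquet    # capitalized, per the suggestion above
  outputOptions:
    saveMode: overwrite
    path: df1.parquet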

Please check if the files are created in the directory df1.parquet.
For me it generated files inside this directory; previously I thought it was a single file.

@hongtaox did you ever figure out the solution? I'm facing the same issue; Spark and Hive are in the same cluster. I think the issue is not having an inputs section in the job configuration file, but like you said, "authentication" shouldn't be required if the program is run on "localhost".
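For reference, a rough sketch of a job config that also declares an inputs section, illustrating the hypothesis above; the exact inputs schema is an assumption here and varies between Metorikku versions, and a Hive table should normally be resolvable from the SQL step without it:

metrics:
  - test_metric.yml
inputs:
    employee:                        # hypothetical file input; name and path are assumptions
        file:
            path: /tmp/employee.parquet
output:
    file:
        dir: /tmp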