YotpoLtd/metorikku

Problem reading Hive data

Opened this issue · 12 comments

I use Metorikku to read data from Hive and then write it to HDFS in Parquet format, but the output is always empty. I cannot figure out what is wrong; can someone give me some advice? Thanks.
The job conf:
metrics:
  - test_metric.yml
output:
    file:
        dir: /tmp

test_metric conf:
steps:
- dataFrameName: df1
  sql:
    SELECT * FROM employee

output:
- dataFrameName: df1
  outputType: parquet
  outputOptions:
    saveMode: overwrite
    path: df1.parquet

Can you share your spark-submit command as well?

Thank you!
My Spark and Hive are in the same cluster, and my other Spark programs can read Hive tables directly, so I submit Metorikku without any Hive metastore connection config.
I tried two spark-submit commands; the results are the same.

  1. spark-submit --class com.yotpo.metorikku.Metorikku metorikku.jar -c test_job.yaml
  2. spark-submit --conf spark.sql.catalogImplementation=hive --class com.yotpo.metorikku.Metorikku metorikku.jar -c test_job.yaml

If you run spark-sql -e "select * from employee", do you see data?

Also, are you seeing an empty parquet file, or is nothing being written at all?

spark-sql -e "select * from employee" prints some data, when submitting metorikku.jar, there is no output file.

I'm wondering if maybe it's writing to the local FS instead of HDFS. Can you add the following to your job config:
showPreviewLines: 10
Can you see the employee table output in the STDOUT?
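As a side note, here is a minimal sketch of the job config with the output directory given as a fully qualified HDFS URI, which would rule out accidental writes to the local filesystem; the hdfs:///tmp path is only an illustrative assumption, so substitute your cluster's actual path or namenode URI:

metrics:
  - test_metric.yml
output:
    file:
        dir: hdfs:///tmp    # fully qualified so the writer does not fall back to the local FS
showPreviewLines: 10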

I added showPreviewLines: 42 and showQuery: true, but stdout does not print the SQL query or the SELECT output.
stdout:
19/05/31 11:24:45 INFO Client: Application report for application_1559031778312_14158 (state: RUNNING)
19/05/31 11:24:45 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.202.116.71
ApplicationMaster RPC port: 0
queue: root.default
start time: 1559273077996
final status: UNDEFINED
tracking URL: http://10.202.77.200:54315/proxy/application_1559031778312_14158/
user: hive
19/05/31 11:24:45 INFO YarnClientSchedulerBackend: Application application_1559031778312_14158 has started running.
19/05/31 11:24:45 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 57244.
19/05/31 11:24:45 INFO NettyBlockTransferService: Server created on 10.202.77.200:57244
19/05/31 11:24:45 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
19/05/31 11:24:45 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.202.77.200, 57244, None)
19/05/31 11:24:45 INFO BlockManagerMasterEndpoint: Registering block manager 10.202.77.200:57244 with 366.3 MB RAM, BlockManagerId(driver, 10.202.77.200, 57244, None)
19/05/31 11:24:45 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.202.77.200, 57244, None)
19/05/31 11:24:45 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.202.77.200, 57244, None)
19/05/31 11:24:45 INFO EventLoggingListener: Logging events to hdfs://test-cluster-log/sparkHistory/application_1559031778312_14158
19/05/31 11:24:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.202.116.78:51674) with ID 1
19/05/31 11:24:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.202.116.73:60676) with ID 2
19/05/31 11:24:50 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
19/05/31 11:24:50 INFO BlockManagerMasterEndpoint: Registering block manager CNSZ22PL0529:34413 with 2004.6 MB RAM, BlockManagerId(2, CNSZ22PL0529, 34413, None)
19/05/31 11:24:50 INFO SharedState: loading hive config file: file:/app/spark/conf/hive-site.xml
19/05/31 11:24:50 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/DATA1/home/hive/01379241/spark-warehouse/').
19/05/31 11:24:50 INFO SharedState: Warehouse path is 'file:/DATA1/home/hive/01379241/spark-warehouse/'.
19/05/31 11:24:51 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
19/05/31 11:24:51 INFO StreamingQueryMetricsListener$: Initialize stream listener

Is this the entire output from the spark-submit?
If so, it looks like it's not running any steps... malformed YAML, perhaps? Can you paste the job/metric YAML here with backticks so I can see if it has incorrect formatting?

job and metrics config:

test_job.yml
metrics:
  - test_metric.yml
output:
    file:
        dir: /tmp

explain: true
showPreviewLines: 42
showQuery: true

test_metric.yml
steps:
- dataFrameName: df1
  sql:
    SELECT * FROM employee

output:
- dataFrameName: df1
  outputType: parquet
  outputOptions:
    saveMode: overwrite
    path: df1.parquet

Sorry for the late reply... I think outputType: parquet should be outputType: Parquet
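For clarity, this is the output section of test_metric.yml with only that change applied (everything else left as posted above):

output:
- dataFrameName: df1
  outputType: Parquet    # capitalized, per the suggestion above
  outputOptions:
    saveMode: overwrite
    path: df1.parquet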

Please check if the files are created in the directory df1.parquet.
For me it generated files inside this directory; previously I thought it was a single file.

@hongtaox did you ever figure out the solution? I'm facing the same issue; Spark and Hive are in the same cluster. I think the issue is not having an inputs section in the job configuration file, but like you said, "authentication" shouldn't be required if the program is run on "localhost".
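For reference, a rough sketch of a job config that also declares an inputs section, illustrating the hypothesis above; the exact inputs schema is an assumption here and varies between Metorikku versions, and a Hive table should normally be resolvable from the SQL step without it:

metrics:
  - test_metric.yml
inputs:
    employee:                        # hypothetical file input; name and path are assumptions
        file:
            path: /tmp/employee.parquet
output:
    file:
        dir: /tmp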