Can this be used with GCP?
normalscene opened this issue · 13 comments
Hi, I was wondering if this could be used with Google Cloud Platform?
Hi. There shouldn't be any problems - the tool is supposed to work with any Spark distribution. Please try it and report back if you hit any issues :)
Hi,
I had to replace the stock 'spark-submit' with your 'spark-submit-flamegraph'. Currently it doesn't work and seems to have a couple of issues:
- It complained about $HOME not being set on line 233. I fixed it by defining HOME inside the script (see the snippet after this list).
- After that it gets stuck on line 31 (trying to find a free port) and just keeps printing the output shown below.
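For reference, the HOME workaround was roughly this, added near the top of the script (the getent fallback is just what worked on my node, so treat the exact line as an assumption):

# The job environment doesn't set HOME, so derive it before line 233 needs it
export HOME="${HOME:-$(getent passwd "$(id -u)" | cut -d: -f6)}"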
Do you have any suggestions?
gaurav_arya_figmd_com@deltest-m:~$ ls -lrth /usr/bin/spark-submit
lrwxrwxrwx 1 root root 51 Feb 7 11:43 /usr/bin/spark-submit -> /home/gaurav_arya_figmd_com/spark-submit-flamegraph
gaurav_arya_figmd_com@deltest-m:~$
gaurav_arya_figmd_com@deltest-m:~$ gcloud dataproc jobs submit spark --project bda-sandbox --cluster deltest --region us-central1 --properties spark.submit.deployMode=cluster,spark.dynamicAllocation.enabled=false,spark.yarn.maxAppAttempts=1,spark.driver.memory=4G,spark.driver.memoryOverhead=1024m,spark.executor.instances=3,spark.executor.memoryOverhead=1024m,spark.executor.memory=4G,spark.executor.cores=2,spark.driver.cores=1,spark.driver.maxResultSize=2g,spark.extraListeners=com.qubole.sparklens.QuboleJobListener --class com.figmd.janus.deletion.dataCleanerMain --jars=gs://cdrmigration/jars/newDataCleaner.jar,gs://spark-lib/bigquery/spark-bigquery-latest.jar,gs://cdrmigration/jars/jdbc-postgresql.jar,gs://cdrmigration/jars/postgresql-42.2.5.jar,gs://cdrmigration/jars/sparklens_2.11-0.3.1.jar -- cdr 289 PatientEthnicity,PatientRace bda-sandbox CDRDELTEST 20200121 0001
Job [b28c81b219b54ebbafaf2d15ff7e8549] submitted.
Waiting for job output...
/usr/bin/spark-submit: line 31: echo: write error: Broken pipe
/usr/bin/spark-submit: line 31: echo: write error: Broken pipe
/usr/bin/spark-submit: line 31: echo: write error: Broken pipe
/usr/bin/spark-submit: line 31: echo: write error: Broken pipe
/usr/bin/spark-submit: line 31: echo: write error: Broken pipe
/usr/bin/spark-submit: line 31: echo: write error: Broken pipe
/usr/bin/spark-submit: line 31: echo: write error: Broken pipe
.
.
.
Looks like a bug. I don't have a way to re-create this now. If you want to help debug this issue, please:
- Add the -x parameter to the script (#!/bin/bash -eux).
- Find the gcloud dataproc logs that contain output from the script.
- Attach the logs to this page.
Thanks!
Alright - so I finally figured out the issue: I had to install a couple of things, namely telnet and pip, and I was not aware that the system didn't have them. I got a warning for pip but not for telnet. Maybe you could add a check for required binaries so that if any are missing, the user gets a proper indication. Just a suggestion.
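Something like this near the top of the script would have flagged it for me (just a rough sketch; telnet and pip are simply the two binaries I happened to be missing):

# Fail early with a clear message if a required binary is missing
for cmd in telnet pip; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
        echo "ERROR: required command '$cmd' not found, please install it first" >&2
        exit 1
    fi
done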
So after fixing all the minor issues, it errors out with "Couldn't start InfluxDB!".
Question: Is there any additional logging, apart from ~/.spark-flamegraph, that could help tackle the issue below?
[2020-02-07T12:18:20.1581077900] Installing dependencies
[2020-02-07T12:18:22.1581077902] Starting InfluxDB
[2020-02-07T12:18:22.1581077902] InfluxDB starting at :48137
ERROR: Couldn't start InfluxDB!
[2020-02-07T12:18:32.1581077912] Spark has exited with bad exit code (1)
[2020-02-07T12:18:32.1581077912] Collecting profiling metrics
[2020-02-07T12:18:32.1581077912] No profiling metrics were recorded!
[2020-02-07T12:18:32.1581077912] Spark has exited with bad exit code (1)
There's a log file called influxdb.log, can you look there please?
Also, if you've replaced the original spark-submit command with this script, make sure to set SPARK_CMD to the original version, because it's still needed:
mv /usr/bin/spark-submit /usr/bin/spark-submit-orig
cp spark-submit-flamegraph /usr/bin/spark-submit
SPARK_CMD=spark-submit-orig spark-submit ...
Unfortunately, there are no logs inside that directory. I have checked thoroughly. :(
gaurav_arya_figmd_com@deltest-m:~/.spark-flamegraph/influxdb$ pwd
/home/gaurav_arya_figmd_com/.spark-flamegraph/influxdb
gaurav_arya_figmd_com@deltest-m:~/.spark-flamegraph/influxdb$ find -name "influxdb.log"
gaurav_arya_figmd_com@deltest-m:~/.spark-flamegraph/influxdb$
Also, if you've replaced the original spark-submit command with this script, make sure to set SPARK_CMD to the original version, because it's still needed:
mv /usr/bin/spark-submit /usr/bin/spark-submit-orig
cp spark-submit-flamegraph /usr/bin/spark-submit
SPARK_CMD=spark-submit-orig spark-submit ...
Let me try this. Could you please confirm the third step, i.e. the SPARK_CMD one? It is not that clear to me. I will give it a try right now. Do I need to make the change inside the spark-submit-flamegraph script?
influxdb.log is created in the current directory, sorry for misleading you.
SPARK_CMD is a variable that points to the original spark-submit script. By default it's set to spark-submit, but it could be spark-shell or spark-submit-orig if you moved it away.
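In bash terms the idea is roughly this (a simplified sketch, not the actual lines from the script):

# Fall back to the stock command unless SPARK_CMD is set in the environment
SPARK_CMD=${SPARK_CMD:-spark-submit}
exec "$SPARK_CMD" "$@"

So you can either set the variable when invoking the wrapper, as in the example above, or change that default inside the script.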
Alright I have gone ahead with making a change inside your script, as shown below:
SPARK_CMD=${SPARK_CMD:-spark-submit-orig}
But the job has failed. Here are some logs.
Hadoop logs
Log Type: prelaunch.err
Log Upload Time: Fri Feb 07 12:38:54 +0000 2020
Log Length: 0
Log Type: prelaunch.out
Log Upload Time: Fri Feb 07 12:38:54 +0000 2020
Log Length: 70
Setting up env variables
Setting up job resources
Launching container
Log Type: stderr
Log Upload Time: Fri Feb 07 12:38:54 +0000 2020
Log Length: 119
Error opening zip file or JAR manifest missing : /home/gaurav_arya_figmd_com/.spark-flamegraph/statsd-jvm-profiler.jar
Log Type: stdout
Log Upload Time: Fri Feb 07 12:38:54 +0000 2020
Log Length: 84
Error occurred during initialization of VM
agent library failed to init: instrument
Command line logs
gaurav_arya_figmd_com@deltest-m:~/.spark-flamegraph/influxdb$ time { gcloud dataproc jobs submit spark --project bda-sandbox --cluster deltest --region us-central1 --properties spark.submit.deployMode=cluster,spark.dynamicAllocation.enabled=false,spark.yarn.maxAppAttempts=1,spark.driver.memory=4G,spark.driver.memoryOverhead=1024m,spark.executor.instances=3,spark.executor.memoryOverhead=1024m,spark.executor.memory=4G,spark.executor.cores=2,spark.driver.cores=1,spark.driver.maxResultSize=2g,spark.extraListeners=com.qubole.sparklens.QuboleJobListener --class com.figmd.janus.deletion.dataCleanerMain --jars=gs://cdrmigration/jars/newDataCleaner.jar,gs://spark-lib/bigquery/spark-bigquery-latest.jar,gs://cdrmigration/jars/jdbc-postgresql.jar,gs://cdrmigration/jars/postgresql-42.2.5.jar,gs://cdrmigration/jars/sparklens_2.11-0.3.1.jar -- cdr 289 PatientEthnicity,PatientRace bda-sandbox CDRDELTEST 20200121 0001 2>&1 | tee log ; }
tee: log: Permission denied
Job [47a6046ef73940ee9560d2b56b0a404c] submitted.
Waiting for job output...
[2020-02-07T12:38:42.1581079122] Installing dependencies
[2020-02-07T12:38:44.1581079124] Starting InfluxDB
[2020-02-07T12:38:44.1581079124] InfluxDB starting at :48081
[2020-02-07T12:38:46.1581079126] Executing: spark-submit-orig --jars /home/gaurav_arya_figmd_com/.spark-flamegraph/statsd-jvm-profiler.jar,/tmp/47a6046ef73940ee9560d2b56b0a404c/newDataCleaner.jar,/tmp/47a6046ef73940ee9560d2b56b0a404c/spark-bigquery-latest.jar,/tmp/47a6046ef73940ee9560d2b56b0a404c/jdbc-postgresql.jar,/tmp/47a6046ef73940ee9560d2b56b0a404c/postgresql-42.2.5.jar,/tmp/47a6046ef73940ee9560d2b56b0a404c/sparklens_2.11-0.3.1.jar --driver-java-options -javaagent:/home/gaurav_arya_figmd_com/.spark-flamegraph/statsd-jvm-profiler.jar=server=10.128.0.31,port=48081,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler,prefix=sparkapp,tagMapping=spark --conf spark.executor.extraJavaOptions=-javaagent:statsd-jvm-profiler.jar=server=10.128.0.31,port=48081,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler,prefix=sparkapp,tagMapping=spark --conf spark.driver.cores=1 --conf spark.driver.maxResultSize=2g --conf spark.driver.memory=4G --conf spark.driver.memoryOverhead=1024m --conf spark.dynamicAllocation.enabled=false --conf spark.executor.cores=2 --conf spark.executor.instances=3 --conf spark.executor.memory=4G --conf spark.executor.memoryOverhead=1024m --conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener --conf spark.submit.deployMode=cluster --conf spark.yarn.maxAppAttempts=1 --conf spark.yarn.tags=dataproc_hash_55904610-b3ad-3c58-9ab3-638a84e7c4db,dataproc_job_47a6046ef73940ee9560d2b56b0a404c,dataproc_master_index_0,dataproc_uuid_bb5702d6-bbab-36d1-8fc4-c4aa06211b89 --class com.figmd.janus.deletion.dataCleanerMain /tmp/47a6046ef73940ee9560d2b56b0a404c/dataproc-empty-jar-1581079121265.jar cdr 289 PatientEthnicity,PatientRace bda-sandbox CDRDELTEST 20200121 0001
20/02/07 12:38:49 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at deltest-m/10.128.0.31:8032
20/02/07 12:38:49 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at deltest-m/10.128.0.31:10200
20/02/07 12:38:52 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1581075454418_0006
Exception in thread "main" org.apache.spark.SparkException: Application application_1581075454418_0006 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1166)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1521)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:890)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:217)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[2020-02-07T12:38:54.1581079134] Spark has exited with bad exit code (1)
[2020-02-07T12:38:54.1581079134] Collecting profiling metrics
[2020-02-07T12:38:54.1581079134] No profiling metrics were recorded!
ERROR: (gcloud.dataproc.jobs.submit.spark) Job [47a6046ef73940ee9560d2b56b0a404c] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at 'https://some-gs-bucket-location-for-logs?project=some-project&region=some-region' and in 'gs://some-gs-bucket-location'.
real 0m17.140s
user 0m0.535s
sys 0m0.071s
gaurav_arya_figmd_com@deltest-m:~/.spark-flamegraph/influxdb$
gcloud logs
gaurav_arya_figmd_com@deltest-m:~$ cat ./.config/gcloud/logs/2020.02.07/12.38.39.592453.log
2020-02-07 12:38:39,593 DEBUG root Loaded Command Group: [u'gcloud', u'dataproc']
2020-02-07 12:38:39,594 DEBUG root Loaded Command Group: [u'gcloud', u'dataproc', u'jobs']
2020-02-07 12:38:39,657 DEBUG root Loaded Command Group: [u'gcloud', u'dataproc', u'jobs', u'submit']
2020-02-07 12:38:39,660 DEBUG root Loaded Command Group: [u'gcloud', u'dataproc', u'jobs', u'submit', u'spark']
2020-02-07 12:38:39,663 DEBUG root Running [gcloud.dataproc.jobs.submit.spark] with arguments: [--class: "com.figmd.janus.deletion.dataCleanerMain", --cluster: "deltest", --jars: "[u'gs://cdrmigration/jars/newDataCleaner.jar', u'gs://spark-lib/bigquery/spark-bigquery-latest.jar', u'gs://cdrmigration/jars/jdbc-postgresql.jar', u'gs://cdrmigration/jars/postgresql-42.2.5.jar', u'gs://cdrmigration/jars/sparklens_2.11-0.3.1.jar']", --project: "bda-sandbox", --properties: "OrderedDict([(u'spark.submit.deployMode', u'cluster'), (u'spark.dynamicAllocation.enabled', u'false'), (u'spark.yarn.maxAppAttempts', u'1'), (u'spark.driver.memory', u'4G'), (u'spark.driver.memoryOverhead', u'1024m'), (u'spark.executor.instances', u'3'), (u'spark.executor.memoryOverhead', u'1024m'), (u'spark.executor.memory', u'4G'), (u'spark.executor.cores', u'2'), (u'spark.driver.cores', u'1'), (u'spark.driver.maxResultSize', u'2g'), (u'spark.extraListeners', u'com.qubole.sparklens.QuboleJobListener')])", --region: "us-central1"]
2020-02-07 12:38:39,929 INFO ___FILE_ONLY___ Job [47a6046ef73940ee9560d2b56b0a404c] submitted.
2020-02-07 12:38:39,929 INFO ___FILE_ONLY___ Waiting for job output...
2020-02-07 12:38:44,317 INFO ___FILE_ONLY___ [2020-02-07T12:38:42.1581079122] Installing dependencies
2020-02-07 12:38:45,501 INFO ___FILE_ONLY___ [2020-02-07T12:38:44.1581079124] Starting InfluxDB
[2020-02-07T12:38:44.1581079124] InfluxDB starting at :48081
2020-02-07 12:38:46,618 INFO ___FILE_ONLY___ [2020-02-07T12:38:46.1581079126] Executing: spark-submit-orig --jars /home/gaurav_arya_figmd_com/.spark-flamegraph/statsd-jvm-profiler.jar,/tmp/47a6046ef73940ee9560d2b56b0a404c/newDataCleaner.jar,/tmp/47a6046ef73940ee9560d2b56b0a404c/spark-bigquery-latest.jar,/tmp/47a6046ef73940ee9560d2b56b0a404c/jdbc-postgresql.jar,/tmp/47a6046ef73940ee9560d2b56b0a404c/postgresql-42.2.5.jar,/tmp/47a6046ef73940ee9560d2b56b0a404c/sparklens_2.11-0.3.1.jar --driver-java-options -javaagent:/home/gaurav_arya_figmd_com/.spark-flamegraph/statsd-jvm-profiler.jar=server=10.128.0.31,port=48081,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler,prefix=sparkapp,tagMapping=spark --conf spark.executor.extraJavaOptions=-javaagent:statsd-jvm-profiler.jar=server=10.128.0.31,port=48081,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler,prefix=sparkapp,tagMapping=spark --conf spark.driver.cores=1 --conf spark.driver.maxResultSize=2g --conf spark.driver.memory=4G --conf spark.driver.memoryOverhead=1024m --conf spark.dynamicAllocation.enabled=false --conf spark.executor.cores=2 --conf spark.executor.instances=3 --conf spark.executor.memory=4G --conf spark.executor.memoryOverhead=1024m --conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener --conf spark.submit.deployMode=cluster --conf spark.yarn.maxAppAttempts=1 --conf spark.yarn.tags=dataproc_hash_55904610-b3ad-3c58-9ab3-638a84e7c4db,dataproc_job_47a6046ef73940ee9560d2b56b0a404c,dataproc_master_index_0,dataproc_uuid_bb5702d6-bbab-36d1-8fc4-c4aa06211b89 --class com.figmd.janus.deletion.dataCleanerMain /tmp/47a6046ef73940ee9560d2b56b0a404c/dataproc-empty-jar-1581079121265.jar cdr 289 PatientEthnicity,PatientRace bda-sandbox CDRDELTEST 20200121 0001
2020-02-07 12:38:49,876 INFO ___FILE_ONLY___ 20/02/07 12:38:49 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at deltest-m/10.128.0.31:8032
2020-02-07 12:38:50,982 INFO ___FILE_ONLY___ 20/02/07 12:38:49 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at deltest-m/10.128.0.31:10200
2020-02-07 12:38:54,249 INFO ___FILE_ONLY___ 20/02/07 12:38:52 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1581075454418_0006
2020-02-07 12:38:55,360 INFO ___FILE_ONLY___ Exception in thread "main" org.apache.spark.SparkException: Application application_1581075454418_0006 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1166)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1521)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:890)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:217)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[2020-02-07T12:38:54.1581079134] Spark has exited with bad exit code (1)
[2020-02-07T12:38:54.1581079134] Collecting profiling metrics
[2020-02-07T12:38:54.1581079134] No profiling metrics were recorded!
2020-02-07 12:38:56,441 DEBUG root (gcloud.dataproc.jobs.submit.spark) Job [47a6046ef73940ee9560d2b56b0a404c] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at 'https://console.cloud.google.com/dataproc/jobs/47a6046ef73940ee9560d2b56b0a404c?project=bda-sandbox&region=us-central1' and in 'gs://dataproc-ded4155e-8ecc-4627-aab5-15befb5c5e37-us-central1/google-cloud-dataproc-metainfo/dec63309-39e1-4c03-84a4-ccecd8b6a54b/jobs/47a6046ef73940ee9560d2b56b0a404c/driveroutput'.
Traceback (most recent call last):
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/calliope/cli.py", line 981, in Execute
resources = calliope_command.Run(cli=self, args=args)
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/calliope/backend.py", line 807, in Run
resources = command_instance.Run(args)
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/command_lib/dataproc/jobs/submitter.py", line 102, in Run
stream_driver_log=True)
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/api_lib/dataproc/util.py", line 441, in WaitForJobTermination
job_ref.jobId, job.status.details))
JobError: Job [47a6046ef73940ee9560d2b56b0a404c] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at 'https://console.cloud.google.com/dataproc/jobs/47a6046ef73940ee9560d2b56b0a404c?project=bda-sandbox&region=us-central1' and in 'gs://dataproc-ded4155e-8ecc-4627-aab5-15befb5c5e37-us-central1/google-cloud-dataproc-metainfo/dec63309-39e1-4c03-84a4-ccecd8b6a54b/jobs/47a6046ef73940ee9560d2b56b0a404c/driveroutput'.
2020-02-07 12:38:56,442 ERROR root (gcloud.dataproc.jobs.submit.spark) Job [47a6046ef73940ee9560d2b56b0a404c] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at 'https://console.cloud.google.com/dataproc/jobs/47a6046ef73940ee9560d2b56b0a404c?project=bda-sandbox&region=us-central1' and in 'gs://dataproc-ded4155e-8ecc-4627-aab5-15befb5c5e37-us-central1/google-cloud-dataproc-metainfo/dec63309-39e1-4c03-84a4-ccecd8b6a54b/jobs/47a6046ef73940ee9560d2b56b0a404c/driveroutput'.
gaurav_arya_figmd_com@deltest-m:~$
That is alright, Michael. No issues. :)
Unfortunately there is no log with that name. I have pasted additional logs (whatever I could find and have access to at the moment). If something comes up, please let me know. If something is missing, please also let me know and I will try to get it as soon as possible.
I am willing to help debug this issue, as I really want to get that flamegraph.
Hello Michael. I am just following up with you on this. Do you have any suggestions to troubleshoot this further? Thank you in advance.
Cheers,
Gaurav