
Can this be used with GCP? #8

Open
normalscene opened this issue Feb 7, 2020 · 13 comments

Comments

@normalscene

Hi, I was wondering if this could be used with google cloud platform?

@spektom
Owner

spektom commented Feb 7, 2020

Hi. There shouldn't be any problems: the tool is supposed to work with any Spark distribution. Please try it and report back if you run into any issues :)

@spektom spektom closed this as completed Feb 7, 2020
@normalscene
Author

Hi,

I had to replace the stock 'spark-submit' with your 'spark-submit-flamegraph'. Currently it doesn't work and seems to have a couple of issues.

  1. It complained about $HOME not being set on line 233. I fixed it by defining HOME inside the script.
  2. After that it gets stuck on line 31 (trying to find a free port) and just keeps printing the output shown below.

Do you have any suggestions?

gaurav_arya_figmd_com@deltest-m:~$ ls -lrth /usr/bin/spark-submit
lrwxrwxrwx 1 root root 51 Feb  7 11:43 /usr/bin/spark-submit -> /home/gaurav_arya_figmd_com/spark-submit-flamegraph
gaurav_arya_figmd_com@deltest-m:~$ 


gaurav_arya_figmd_com@deltest-m:~$ gcloud dataproc jobs submit spark --project bda-sandbox --cluster deltest --region us-central1  --properties spark.submit.deployMode=cluster,spark.dynamicAllocation.enabled=false,spark.yarn.maxAppAttempts=1,spark.driver.memory=4G,spark.driver.memoryOverhead=1024m,spark.executor.instances=3,spark.executor.memoryOverhead=1024m,spark.executor.memory=4G,spark.executor.cores=2,spark.driver.cores=1,spark.driver.maxResultSize=2g,spark.extraListeners=com.qubole.sparklens.QuboleJobListener --class com.figmd.janus.deletion.dataCleanerMain --jars=gs://cdrmigration/jars/newDataCleaner.jar,gs://spark-lib/bigquery/spark-bigquery-latest.jar,gs://cdrmigration/jars/jdbc-postgresql.jar,gs://cdrmigration/jars/postgresql-42.2.5.jar,gs://cdrmigration/jars/sparklens_2.11-0.3.1.jar  -- cdr 289 PatientEthnicity,PatientRace bda-sandbox CDRDELTEST 20200121 0001
Job [b28c81b219b54ebbafaf2d15ff7e8549] submitted.
Waiting for job output...
/usr/bin/spark-submit: line 31: echo: write error: Broken pipe
/usr/bin/spark-submit: line 31: echo: write error: Broken pipe
/usr/bin/spark-submit: line 31: echo: write error: Broken pipe
/usr/bin/spark-submit: line 31: echo: write error: Broken pipe
/usr/bin/spark-submit: line 31: echo: write error: Broken pipe
/usr/bin/spark-submit: line 31: echo: write error: Broken pipe
/usr/bin/spark-submit: line 31: echo: write error: Broken pipe
.
.
.
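As an aside, a port-probing loop that pipes `echo` output into a command that exits early can produce exactly this stream of "write error: Broken pipe" messages. A minimal sketch of a probe that avoids the pipe entirely, using bash's `/dev/tcp` redirection (the function name is illustrative, not taken from the script):

```shell
# Hypothetical sketch of a free-port probe that avoids piping echo output.
# A connect attempt via /dev/tcp fails when nothing is listening on the
# port, which means the port is free to use.
find_free_port() {
  local port
  for port in $(seq 40000 40100); do
    if ! (exec 3<>"/dev/tcp/127.0.0.1/${port}") 2>/dev/null; then
      echo "${port}"
      return 0
    fi
  done
  return 1
}
```

Because the probe writes nothing into a pipeline, a consumer exiting early cannot trigger SIGPIPE the way the loop on line 31 apparently does.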

@spektom
Owner

spektom commented Feb 7, 2020

Looks like a bug. I don't have a way to reproduce this right now. If you want to help debug this issue, please:

  • Add the -x flag to the script's shebang (#!/bin/bash -eux)
  • Find the gcloud dataproc logs that contain output from the script.
  • Attach the logs to this issue.

Thanks!

@spektom spektom reopened this Feb 7, 2020
@normalscene
Author

Alright, I finally figured out the issue: I had to install a couple of things (telnet and pip), and I wasn't aware the system didn't have them. I got a warning for pip but not for telnet. Maybe you could add a check for required binaries, so that the user gets a proper indication when one is not found. Just a suggestion.
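The preflight check suggested here could look something like the following sketch (`require_binaries` is a hypothetical name, not part of the script):

```shell
# Hypothetical preflight check: report every missing required binary and
# fail fast instead of breaking later in the run.
require_binaries() {
  local bin missing=0
  for bin in "$@"; do
    if ! command -v "$bin" >/dev/null 2>&1; then
      echo "ERROR: required binary '$bin' not found in PATH" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `require_binaries telnet pip perl || exit 1` near the top of the script would surface every missing tool at once, with a clear message.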

So after fixing all the minor issues, it errors out with "Couldn't start InfluxDB!".

Question: is there any additional logging, apart from ~/.spark-flamegraph, that could help tackle the issue below?

[2020-02-07T12:18:20.1581077900] Installing dependencies
[2020-02-07T12:18:22.1581077902] Starting InfluxDB
[2020-02-07T12:18:22.1581077902] InfluxDB starting at :48137
ERROR: Couldn't start InfluxDB!
[2020-02-07T12:18:32.1581077912] Spark has exited with bad exit code (1)
[2020-02-07T12:18:32.1581077912] Collecting profiling metrics
[2020-02-07T12:18:32.1581077912] No profiling metrics were recorded!
[2020-02-07T12:18:32.1581077912] Spark has exited with bad exit code (1)

@spektom
Owner

spektom commented Feb 7, 2020

There's a log file called influxdb.log; can you take a look at it, please?

@spektom
Owner

spektom commented Feb 7, 2020

Also, if you've replaced the original spark-submit command with this script, make sure to set SPARK_CMD to the original version, because it's still needed:

mv /usr/bin/spark-submit /usr/bin/spark-submit-orig
cp spark-submit-flamegraph /usr/bin/spark-submit
SPARK_CMD=spark-submit-orig spark-submit ...

@normalscene
Author

influxdb.log

Unfortunately, there are no logs inside that directory. I have checked thoroughly. :(

gaurav_arya_figmd_com@deltest-m:~/.spark-flamegraph/influxdb$ pwd
/home/gaurav_arya_figmd_com/.spark-flamegraph/influxdb
gaurav_arya_figmd_com@deltest-m:~/.spark-flamegraph/influxdb$ find -name "influxdb.log"
gaurav_arya_figmd_com@deltest-m:~/.spark-flamegraph/influxdb$ 

@normalscene
Author

normalscene commented Feb 7, 2020

Also, if you've replaced original spark-submit command with this script, make sure to set SPARK_CMD to the original version, because it's still needed:

mv /usr/bin/spark-submit /usr/bin/spark-submit-orig
cp spark-submit-flamegraph /usr/bin/spark-submit
SPARK_CMD=spark-submit-orig spark-submit ...

Let me try this. Could you please confirm the third step, i.e. the SPARK_CMD one? It's not entirely clear to me. I'll give it a try right now. Do I need to make the change inside the spark-submit-flamegraph script?

@spektom
Owner

spektom commented Feb 7, 2020

influxdb.log is created in the current directory; sorry for misleading you.
SPARK_CMD is a variable that points to the original spark-submit script. By default it's set to spark-submit, but it could be spark-shell, or spark-submit-orig if you've moved the original away.

@normalscene
Author

normalscene commented Feb 7, 2020

Alright, I've gone ahead and made a change inside your script, as shown below:

SPARK_CMD=${SPARK_CMD:-spark-submit-orig}
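For anyone reading along: the `${VAR:-default}` expansion used in that line substitutes the default only when the variable is unset or empty, so an exported SPARK_CMD still takes precedence:

```shell
# ${VAR:-default} falls back only when VAR is unset or empty.
unset SPARK_CMD
echo "${SPARK_CMD:-spark-submit-orig}"   # prints: spark-submit-orig
SPARK_CMD=/usr/bin/spark-submit
echo "${SPARK_CMD:-spark-submit-orig}"   # prints: /usr/bin/spark-submit
```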

But the job has failed. Here are some logs.

Hadoop logs

Log Type: prelaunch.err

Log Upload Time: Fri Feb 07 12:38:54 +0000 2020

Log Length: 0


Log Type: prelaunch.out

Log Upload Time: Fri Feb 07 12:38:54 +0000 2020

Log Length: 70

Setting up env variables
Setting up job resources
Launching container

Log Type: stderr

Log Upload Time: Fri Feb 07 12:38:54 +0000 2020

Log Length: 119

Error opening zip file or JAR manifest missing : /home/gaurav_arya_figmd_com/.spark-flamegraph/statsd-jvm-profiler.jar

Log Type: stdout

Log Upload Time: Fri Feb 07 12:38:54 +0000 2020

Log Length: 84

Error occurred during initialization of VM
agent library failed to init: instrument
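This stderr/stdout pair is the JVM failing to load the -javaagent jar. With spark.submit.deployMode=cluster the driver JVM starts on a YARN node, where a path local to the submit host (here under /home/gaurav_arya_figmd_com/.spark-flamegraph/) may simply not exist. A quick sanity check you can run on any node (a sketch; the function name is made up):

```shell
# Hypothetical sanity check: confirm a path exists, is readable, and
# starts with the zip magic bytes ('PK') before handing it to -javaagent.
check_agent_jar() {
  local jar="$1"
  if [ ! -r "$jar" ]; then
    echo "ERROR: $jar is missing or not readable" >&2
    return 1
  fi
  if [ "$(head -c 2 "$jar" 2>/dev/null)" != "PK" ]; then
    echo "ERROR: $jar does not look like a zip/jar archive" >&2
    return 1
  fi
}
```

Running this against the statsd-jvm-profiler.jar path on the node where the driver container starts would distinguish "file absent on that node" from "file present but corrupt".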

Command line logs

gaurav_arya_figmd_com@deltest-m:~/.spark-flamegraph/influxdb$ time { gcloud dataproc jobs submit spark --project bda-sandbox --cluster deltest --region us-central1  --properties spark.submit.deployMode=cluster,spark.dynamicAllocation.enabled=false,spark.yarn.maxAppAttempts=1,spark.driver.memory=4G,spark.driver.memoryOverhead=1024m,spark.executor.instances=3,spark.executor.memoryOverhead=1024m,spark.executor.memory=4G,spark.executor.cores=2,spark.driver.cores=1,spark.driver.maxResultSize=2g,spark.extraListeners=com.qubole.sparklens.QuboleJobListener --class com.figmd.janus.deletion.dataCleanerMain --jars=gs://cdrmigration/jars/newDataCleaner.jar,gs://spark-lib/bigquery/spark-bigquery-latest.jar,gs://cdrmigration/jars/jdbc-postgresql.jar,gs://cdrmigration/jars/postgresql-42.2.5.jar,gs://cdrmigration/jars/sparklens_2.11-0.3.1.jar  -- cdr 289 PatientEthnicity,PatientRace bda-sandbox CDRDELTEST 20200121 0001 2>&1 | tee log ; }
tee: log: Permission denied
Job [47a6046ef73940ee9560d2b56b0a404c] submitted.
Waiting for job output...
[2020-02-07T12:38:42.1581079122] Installing dependencies
[2020-02-07T12:38:44.1581079124] Starting InfluxDB
[2020-02-07T12:38:44.1581079124] InfluxDB starting at :48081
[2020-02-07T12:38:46.1581079126] Executing: spark-submit-orig --jars /home/gaurav_arya_figmd_com/.spark-flamegraph/statsd-jvm-profiler.jar,/tmp/47a6046ef73940ee9560d2b56b0a404c/newDataCleaner.jar,/tmp/47a6046ef73940ee9560d2b56b0a404c/spark-bigquery-latest.jar,/tmp/47a6046ef73940ee9560d2b56b0a404c/jdbc-postgresql.jar,/tmp/47a6046ef73940ee9560d2b56b0a404c/postgresql-42.2.5.jar,/tmp/47a6046ef73940ee9560d2b56b0a404c/sparklens_2.11-0.3.1.jar --driver-java-options -javaagent:/home/gaurav_arya_figmd_com/.spark-flamegraph/statsd-jvm-profiler.jar=server=10.128.0.31,port=48081,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler,prefix=sparkapp,tagMapping=spark --conf spark.executor.extraJavaOptions=-javaagent:statsd-jvm-profiler.jar=server=10.128.0.31,port=48081,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler,prefix=sparkapp,tagMapping=spark --conf spark.driver.cores=1 --conf spark.driver.maxResultSize=2g --conf spark.driver.memory=4G --conf spark.driver.memoryOverhead=1024m --conf spark.dynamicAllocation.enabled=false --conf spark.executor.cores=2 --conf spark.executor.instances=3 --conf spark.executor.memory=4G --conf spark.executor.memoryOverhead=1024m --conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener --conf spark.submit.deployMode=cluster --conf spark.yarn.maxAppAttempts=1 --conf spark.yarn.tags=dataproc_hash_55904610-b3ad-3c58-9ab3-638a84e7c4db,dataproc_job_47a6046ef73940ee9560d2b56b0a404c,dataproc_master_index_0,dataproc_uuid_bb5702d6-bbab-36d1-8fc4-c4aa06211b89 --class com.figmd.janus.deletion.dataCleanerMain /tmp/47a6046ef73940ee9560d2b56b0a404c/dataproc-empty-jar-1581079121265.jar cdr 289 PatientEthnicity,PatientRace bda-sandbox CDRDELTEST 20200121 0001
20/02/07 12:38:49 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at deltest-m/10.128.0.31:8032
20/02/07 12:38:49 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at deltest-m/10.128.0.31:10200
20/02/07 12:38:52 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1581075454418_0006
Exception in thread "main" org.apache.spark.SparkException: Application application_1581075454418_0006 finished with failed status
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1166)
	at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1521)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:890)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:192)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:217)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[2020-02-07T12:38:54.1581079134] Spark has exited with bad exit code (1)
[2020-02-07T12:38:54.1581079134] Collecting profiling metrics
[2020-02-07T12:38:54.1581079134] No profiling metrics were recorded!
ERROR: (gcloud.dataproc.jobs.submit.spark) Job [47a6046ef73940ee9560d2b56b0a404c] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at 'https://some-gs-bucket-location-for-logs?project=some-project&region=some-region' and in 'gs://some-gs-bucket-location'.

real	0m17.140s
user	0m0.535s
sys	0m0.071s
gaurav_arya_figmd_com@deltest-m:~/.spark-flamegraph/influxdb$ 

gcloud logs

gaurav_arya_figmd_com@deltest-m:~$ cat ./.config/gcloud/logs/2020.02.07/12.38.39.592453.log
2020-02-07 12:38:39,593 DEBUG    root            Loaded Command Group: [u'gcloud', u'dataproc']
2020-02-07 12:38:39,594 DEBUG    root            Loaded Command Group: [u'gcloud', u'dataproc', u'jobs']
2020-02-07 12:38:39,657 DEBUG    root            Loaded Command Group: [u'gcloud', u'dataproc', u'jobs', u'submit']
2020-02-07 12:38:39,660 DEBUG    root            Loaded Command Group: [u'gcloud', u'dataproc', u'jobs', u'submit', u'spark']
2020-02-07 12:38:39,663 DEBUG    root            Running [gcloud.dataproc.jobs.submit.spark] with arguments: [--class: "com.figmd.janus.deletion.dataCleanerMain", --cluster: "deltest", --jars: "[u'gs://cdrmigration/jars/newDataCleaner.jar', u'gs://spark-lib/bigquery/spark-bigquery-latest.jar', u'gs://cdrmigration/jars/jdbc-postgresql.jar', u'gs://cdrmigration/jars/postgresql-42.2.5.jar', u'gs://cdrmigration/jars/sparklens_2.11-0.3.1.jar']", --project: "bda-sandbox", --properties: "OrderedDict([(u'spark.submit.deployMode', u'cluster'), (u'spark.dynamicAllocation.enabled', u'false'), (u'spark.yarn.maxAppAttempts', u'1'), (u'spark.driver.memory', u'4G'), (u'spark.driver.memoryOverhead', u'1024m'), (u'spark.executor.instances', u'3'), (u'spark.executor.memoryOverhead', u'1024m'), (u'spark.executor.memory', u'4G'), (u'spark.executor.cores', u'2'), (u'spark.driver.cores', u'1'), (u'spark.driver.maxResultSize', u'2g'), (u'spark.extraListeners', u'com.qubole.sparklens.QuboleJobListener')])", --region: "us-central1"]
2020-02-07 12:38:39,929 INFO     ___FILE_ONLY___ Job [47a6046ef73940ee9560d2b56b0a404c] submitted.

2020-02-07 12:38:39,929 INFO     ___FILE_ONLY___ Waiting for job output...

2020-02-07 12:38:44,317 INFO     ___FILE_ONLY___ [2020-02-07T12:38:42.1581079122] Installing dependencies

2020-02-07 12:38:45,501 INFO     ___FILE_ONLY___ [2020-02-07T12:38:44.1581079124] Starting InfluxDB
[2020-02-07T12:38:44.1581079124] InfluxDB starting at :48081

2020-02-07 12:38:46,618 INFO     ___FILE_ONLY___ [2020-02-07T12:38:46.1581079126] Executing: spark-submit-orig --jars /home/gaurav_arya_figmd_com/.spark-flamegraph/statsd-jvm-profiler.jar,/tmp/47a6046ef73940ee9560d2b56b0a404c/newDataCleaner.jar,/tmp/47a6046ef73940ee9560d2b56b0a404c/spark-bigquery-latest.jar,/tmp/47a6046ef73940ee9560d2b56b0a404c/jdbc-postgresql.jar,/tmp/47a6046ef73940ee9560d2b56b0a404c/postgresql-42.2.5.jar,/tmp/47a6046ef73940ee9560d2b56b0a404c/sparklens_2.11-0.3.1.jar --driver-java-options -javaagent:/home/gaurav_arya_figmd_com/.spark-flamegraph/statsd-jvm-profiler.jar=server=10.128.0.31,port=48081,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler,prefix=sparkapp,tagMapping=spark --conf spark.executor.extraJavaOptions=-javaagent:statsd-jvm-profiler.jar=server=10.128.0.31,port=48081,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler,prefix=sparkapp,tagMapping=spark --conf spark.driver.cores=1 --conf spark.driver.maxResultSize=2g --conf spark.driver.memory=4G --conf spark.driver.memoryOverhead=1024m --conf spark.dynamicAllocation.enabled=false --conf spark.executor.cores=2 --conf spark.executor.instances=3 --conf spark.executor.memory=4G --conf spark.executor.memoryOverhead=1024m --conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener --conf spark.submit.deployMode=cluster --conf spark.yarn.maxAppAttempts=1 --conf spark.yarn.tags=dataproc_hash_55904610-b3ad-3c58-9ab3-638a84e7c4db,dataproc_job_47a6046ef73940ee9560d2b56b0a404c,dataproc_master_index_0,dataproc_uuid_bb5702d6-bbab-36d1-8fc4-c4aa06211b89 --class com.figmd.janus.deletion.dataCleanerMain /tmp/47a6046ef73940ee9560d2b56b0a404c/dataproc-empty-jar-1581079121265.jar cdr 289 PatientEthnicity,PatientRace bda-sandbox CDRDELTEST 20200121 0001

2020-02-07 12:38:49,876 INFO     ___FILE_ONLY___ 20/02/07 12:38:49 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at deltest-m/10.128.0.31:8032

2020-02-07 12:38:50,982 INFO     ___FILE_ONLY___ 20/02/07 12:38:49 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at deltest-m/10.128.0.31:10200

2020-02-07 12:38:54,249 INFO     ___FILE_ONLY___ 20/02/07 12:38:52 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1581075454418_0006

2020-02-07 12:38:55,360 INFO     ___FILE_ONLY___ Exception in thread "main" org.apache.spark.SparkException: Application application_1581075454418_0006 finished with failed status
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1166)
	at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1521)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:890)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:192)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:217)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[2020-02-07T12:38:54.1581079134] Spark has exited with bad exit code (1)
[2020-02-07T12:38:54.1581079134] Collecting profiling metrics
[2020-02-07T12:38:54.1581079134] No profiling metrics were recorded!

2020-02-07 12:38:56,441 DEBUG    root            (gcloud.dataproc.jobs.submit.spark) Job [47a6046ef73940ee9560d2b56b0a404c] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at 'https://console.cloud.google.com/dataproc/jobs/47a6046ef73940ee9560d2b56b0a404c?project=bda-sandbox&region=us-central1' and in 'gs://dataproc-ded4155e-8ecc-4627-aab5-15befb5c5e37-us-central1/google-cloud-dataproc-metainfo/dec63309-39e1-4c03-84a4-ccecd8b6a54b/jobs/47a6046ef73940ee9560d2b56b0a404c/driveroutput'.
Traceback (most recent call last):
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/calliope/cli.py", line 981, in Execute
    resources = calliope_command.Run(cli=self, args=args)
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/calliope/backend.py", line 807, in Run
    resources = command_instance.Run(args)
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/command_lib/dataproc/jobs/submitter.py", line 102, in Run
    stream_driver_log=True)
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/api_lib/dataproc/util.py", line 441, in WaitForJobTermination
    job_ref.jobId, job.status.details))
JobError: Job [47a6046ef73940ee9560d2b56b0a404c] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at 'https://console.cloud.google.com/dataproc/jobs/47a6046ef73940ee9560d2b56b0a404c?project=bda-sandbox&region=us-central1' and in 'gs://dataproc-ded4155e-8ecc-4627-aab5-15befb5c5e37-us-central1/google-cloud-dataproc-metainfo/dec63309-39e1-4c03-84a4-ccecd8b6a54b/jobs/47a6046ef73940ee9560d2b56b0a404c/driveroutput'.
2020-02-07 12:38:56,442 ERROR    root            (gcloud.dataproc.jobs.submit.spark) Job [47a6046ef73940ee9560d2b56b0a404c] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at 'https://console.cloud.google.com/dataproc/jobs/47a6046ef73940ee9560d2b56b0a404c?project=bda-sandbox&region=us-central1' and in 'gs://dataproc-ded4155e-8ecc-4627-aab5-15befb5c5e37-us-central1/google-cloud-dataproc-metainfo/dec63309-39e1-4c03-84a4-ccecd8b6a54b/jobs/47a6046ef73940ee9560d2b56b0a404c/driveroutput'.
gaurav_arya_figmd_com@deltest-m:~$  

@normalscene
Author

influxdb.log

That's alright, Michael. No issues. :)

Unfortunately, there is no log with that name. I have pasted additional logs (whatever I could find and have access to at the moment). If something comes up, please let me know. If something is missing, also let me know and I will try to get it as soon as possible.

I am willing to help debug this issue, as I really want that flame graph.

@normalscene
Author

@spektom

Hello Michael. I'm just following up with you on this. Do you have any suggestions for troubleshooting this further? Thank you in advance.

Cheers,
Gaurav

@spektom
Owner

spektom commented Feb 10, 2020 via email
