ibm-research-ireland/sparkoscope

HDFS sink problem?

Closed this issue · 23 comments

When I try to run a Spark job I get the following problem. I am using Spark 1.6.2.
ERROR metrics.MetricsSystem: Source class org.apache.spark.metrics.source.SigarSource cannot be instantiated
java.lang.ClassNotFoundException: org.apache.spark.metrics.source.SigarSource
Do I need to start Sigar or something? I only extracted it, put the directory in place and updated spark-env.sh.

When I removed sigar it started complaining about the HDFS sink with the same kind of message:
16/10/19 21:53:39 ERROR metrics.MetricsSystem: Sink class org.apache.spark.metrics.sink.HDFSSink cannot be instantiated

I have a pre-built installation of 1.6.2. Am I missing something?
Thanks again

Hi @filmonhg ,

You just need to extract the sigar library and point to that directory from the spark-env.sh file.
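For example, something roughly like this in spark-env.sh (the extraction path here is just a placeholder, adjust it to wherever you unpacked sigar):

LD_LIBRARY_PATH=/path/to/hyperic-sigar-1.6.4/sigar-bin/lib:$LD_LIBRARY_PATH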
Have you done this procedure for all the worker nodes in your cluster?

Update: Sorry, I just realised that you are using a prebuilt installation. Sparkoscope is not part of the official build, so you need to clone this repository and build it as described here: http://spark.apache.org/docs/1.6.2/building-spark.html
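For reference, a build roughly along these lines should work from the cloned directory; the profile and Hadoop version here are assumptions, so adjust them to your cluster as per that page (make-distribution.sh can produce a packaged tgz afterwards):

build/mvn -Phadoop-2.6 -Dhadoop.version=2.7.1 -DskipTests clean package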

Thanks!

thanks for the quick response: Yes, I have already pointed it to the sigar library from my spark-env.sh.

And when you say clone this repository, do you mean clone the Spark 1.6.1 that you have here? I want Spark 1.6.2, and your documentation says "You do not need to do this if you downloaded a pre-built package." I am a bit lost here. Can I use my current Spark (pre-built installation) and just add Sigar and the configuration to get the metrics? Thanks again

Sorry if the documentation is a bit misleading. I just added instructions for using Sparkoscope and kept the instructions from the official Spark README.
Because Sparkoscope is a patch to (currently) version 1.6.1 and modifies the web UI, there is no other way to use it besides cloning this repository and building it as per the instructions in http://spark.apache.org/docs/1.6.2/building-spark.html
What version of Hadoop are you using? Maybe I can provide a prebuilt version for you.

Thanks for your quick response Yiannis. My use case is that I need CPU usage, memory, IO and disk metrics from Spark and HDFS. I am not really interested in the UI, I just need the metrics. Like you mentioned in your wiki, I am parsing the CSV sink metrics and it's cumbersome (and the CSV sink alone doesn't have CPU usage unless you have Sigar). Is there a way to use just this part without having to modify the UI (at least for now)? I will keep upgrading to Spark 2.0, Spark 2.1 and so on, and I also want to put the metrics in HDFS, so the HDFSSink is preferable. Your help, as usual, is much appreciated on how I should approach this.
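(For reference, what I am doing today is roughly the CsvSink setup below in metrics.properties; the period and directory are just examples, not my exact values.)

executor.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
executor.sink.csv.period=20
executor.sink.csv.unit=seconds
executor.sink.csv.directory=/tmp/csv-metrics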

And to answer your question I am doing Spark 1.6.2 with Hadoop 2.7.1 as of now.

Thanks

I didn't imagine that this tool would be useful on its own (without the UI part).
I will provide the prebuilt version for you today/tomorrow.
As a more long-term solution I will create a new repository which builds only the jars needed for HDFSSink and SigarSource, without the UI.
Will keep you posted on both :)

Thanks a lot Yiannis!

Sorry for the late reply, I tested it. The Spark from your release works, but it doesn't write anything to the HDFS sink, and there is no UI (the one you have in your wiki) either. I can only see data in /spark-logs.

Just to give you my perspective: since I need to use the metrics with a vendor-provided Spark going forward (so I can't build Spark myself), I am really interested in having a jar that works with an existing Spark and writes the metrics to HDFS (i.e. the HDFS sink), but I am testing this out as well to see what I can achieve.

Thanks so much

Yeap, will try to have the standalone jar version soon. In the meantime, can you create the /custom-metrics directory on HDFS and see if the logs are populated? Sparkoscope assumes this directory already exists in HDFS.
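Something like this should do it (assuming the hdfs client is on your path):

hdfs dfs -mkdir -p /custom-metrics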

I have already created the required directories; /custom-metrics was just empty.


Ok, one final thing to try out: maybe the jobs you run end really fast? Can you try reducing the polling period and running a bigger job?
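For example, in metrics.properties drop the period to something like:

executor.sink.hdfs.pollPeriod = 5
executor.sink.hdfs.unit = seconds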

Sorry, I thought I had replied. I tried it then with a shorter poll period. I keep interrupting the dev environment rather than leaving it running for long, but I will verify again. I can also help with creating the independent jar without the UI, just to poll the Sigar-generated metrics and sink them to HDFS. Thanks

Hey @filmonhg, please have a look at the new repository https://github.com/ibm-research-ireland/sparkoscope-headless
It builds the two required jars; let me know if you have issues using it.
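Roughly, once they are built you add the two jars to the executor classpath in spark-defaults.conf, something like this (the paths are placeholders for wherever the jars end up on your machines):

spark.executor.extraClassPath /path/to/sparkoscope-sigarsource-1.6.2.jar:/path/to/sparkoscope-hdfssink-1.6.2.jar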
Thanks!

Thanks, I will check it out and update you.


@YiannisGkoufas I tried the headless version and it didn't work for me. So I went back and installed your version from the main branch (whatever is latest), just to see if I can get the original package to work in the first place. I followed the instructions (packaged it with Maven, downloaded and unzipped the sigar library, and pointed the LD path to it in spark-env.sh, i.e.

spark-env.sh

LD_LIBRARY_PATH=/home/hyperic-sigar-1.6.4/sigar-bin/lib/:$LD_LIBRARY_PATH
HADOOP_CONF_DIR=/opt/hadoop-2.7.1/etc/hadoop
) and updated metrics.properties, spark-env.sh and spark-defaults.conf, and created the custom-metrics and spark-logs directories in my HDFS (version 2.7.1).
metrics.properties
------------------- // the IP below is the IP of the local machine (master or executor)
executor.sink.hdfs.class=org.apache.spark.metrics.sink.HDFSSink
executor.sink.hdfs.pollPeriod = 20
executor.sink.hdfs.dir = hdfs://172.xx.xx.xx:9000/custom-metrics
executor.sink.hdfs.unit = seconds

spark-defaults.conf

spark.eventLog.enabled true
spark.eventLog.dir hdfs://172.xx.xx.xx:9000/spark-logs
spark.hdfs.metrics.dir hdfs://172.xx.xx.xx:9000/custom-metrics
and I ran a long enough job (more than ~3 mins) given that the poll period is 20 sec. The only different thing I might have done is replacing localhost or 127.0.0.1 with the IP of each local machine that I put Spark on, because my job was complaining about localhost.

The machines I am running on are a 3-node cluster with CentOS release 6.7 (Final).

With the above details, I can only see files logged to /spark-logs on HDFS but nothing in the /custom-metrics sigar HDFS sink, and no UI in my Spark (which I don't need, but I hoped I would see it). What am I missing? Thanks again. I followed the Euro Spark 2016 conference by the way, and I noticed you mentioned my comments, thanks :)

Hi @filmonhg ,

A few things I have noticed:

  • This path /home/hyperic-sigar-1.6.4/sigar-bin/lib/ doesn't look right to me, are you sure the sigar lib is there?
  • In metrics.properties, executor.sink.hdfs.dir should be the address of the namenode of the Hadoop cluster and not the IP of every executor; see the example after this list. Basically it's the URL that shows up on the page when you access http://namenode_ip:50070
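So the sink directory should look roughly like this (the host and port are placeholders for your namenode):

executor.sink.hdfs.dir = hdfs://namenode_ip:9000/custom-metrics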

But let's try to troubleshoot the pieces one by one. Ignore the instructions in the README and follow these instead. Try this on your workstation (without adding slaves etc.):

  • Edit the metrics.properties file to contain:
executor.sink.hdfs.class=org.apache.spark.metrics.sink.HDFSSink
executor.sink.hdfs.pollPeriod = 1
executor.sink.hdfs.dir = file:///tmp/custom-metrics
executor.sink.hdfs.unit = seconds

  • Edit spark-defaults.conf file to contain:

spark.eventLog.enabled true
spark.eventLog.dir /tmp/spark-logs
spark.hdfs.metrics.dir /tmp/custom-metrics

Let me know how it goes

Hello @YiannisGkoufas, thanks again,

  1. Regarding the path /home/hyperic-sigar-1.6.4/sigar-bin/lib/ I can see the following libs under this directory
    libsigar-amd64-freebsd-6.so libsigar-pa-hpux-11.sl libsigar-s390x-linux.so libsigar-x86-freebsd-5.so sigar-amd64-winnt.dll
    libsigar-amd64-linux.so libsigar-ppc64-aix-5.so libsigar-sparc64-solaris.so libsigar-x86-freebsd-6.so sigar.jar
    libsigar-amd64-solaris.so libsigar-ppc64-linux.so libsigar-sparc-solaris.so libsigar-x86-linux.so sigar-x86-winnt.dll
    libsigar-ia64-hpux-11.sl libsigar-ppc-aix-5.so libsigar-universal64-macosx.dylib libsigar-x86-solaris.so sigar-x86-winnt.lib
    libsigar-ia64-linux.so libsigar-ppc-linux.so libsigar-universal-macosx.dylib log4j.jar
  2. Following the latest steps:
  • downloaded and unzipped https://github.com/ibm-research-ireland/sparkoscope/releases/download/v1.6.2/spark-1.6.2-bin-sparkoscope.tgz
  • created /tmp/spark-logs and /tmp/custom-metrics
  • updated metrics.properties and spark-defaults.conf as above
  • started Spark on the local machine (localhost:8080 confirmed)
  • ran a small job (example job): SparkPi 1000
    Results:
  • /tmp/custom-metrics/ is empty, but /tmp/spark-logs/ has the events for each time I run a job
    N.B. For the latter instructions you didn't use Sigar at all??
  • I even tried it on Ubuntu (my personal machine) and I see the same problem

There is no Sigar at all in these instructions, in order to remove an additional dependency.
Try to run SparkPi with 10000, maybe 1000 finishes too fast.
I really have no idea why it wouldn't work. Are you not seeing any errors on the executors?
If you want, create an archive with the contents of the "conf" and "work" directories of Spark and the /tmp/spark-logs folder and share it with me so I can have a look.
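Something along these lines from the Spark directory would be enough (the archive name is just an example):

tar czf sparkoscope-debug.tar.gz conf work /tmp/spark-logs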

Thanks again. I am using a single node (localhost) to try the steps and I see no errors while I run /home/user1/spark-1.6.2-bin-sparkoscope/bin/run-example SparkPi 10000. Still, the issue is that the job runs successfully and gives me a result:
16/11/16 18:48:59 INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:36, took 522.235259 s
Pi is roughly 3.14162272

Just to go into detail (to make it easier to communicate), here are the exact steps of what I did
(on a single machine, not a cluster, to keep it simple):
- downloaded and unzipped https://github.com/ibm-research-ireland/sparkoscope/releases/download/v1.6.2/spark-1.6.2-bin-sparkoscope.tgz
- went to conf and updated metrics.properties and spark-defaults.conf (see attached)
- created /tmp/spark-logs and /tmp/custom-metrics
- ran sbin/start-master.sh and verified it at localhost:8080
- ran the example SparkPi with 10000 and got the result for Pi as shown

While running the job: unless I go to localhost:4040, the current job is not visible in localhost:8080 as an app-XXX,
and /tmp/spark-logs has a log file (like local-1479321615843) but /tmp/custom-metrics is still empty.
Attached is the conf directory; there is no work directory:
conf.tar.gz

Then I added one worker

  • Because the work directory is only created when an executor runs a job, I added a worker with the same configuration as the master (the previous localhost) using
    ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT, saw this worker added to the master at master-ip:8080 and ran SparkPi again. Now the work directory is created but empty, although it calculated Pi perfectly.

Sorry for the long message and lots of back and forth. I have no issues running and deploying a Spark cluster and I work with it on a daily basis. Maybe I am missing something which you might have assumed to be there. I asked someone from my team to go through this as well.

Thanks again,

Hi there,

Absolutely, it's really constructive for me, because I need to make the instructions a bit clearer.
I have spotted some points in your workflow that could be the reason why it's not working.
I will update the instructions I posted in a previous message with some additional steps.
Again, let's try it on a single node:

  • Edit the metrics.properties file to contain:
executor.sink.hdfs.class=org.apache.spark.metrics.sink.HDFSSink
executor.sink.hdfs.pollPeriod = 1
executor.sink.hdfs.dir = file:///tmp/custom-metrics
executor.sink.hdfs.unit = seconds

  • Edit spark-defaults.conf file to contain:

spark.eventLog.enabled true
spark.eventLog.dir /tmp/spark-logs
spark.hdfs.metrics.dir /tmp/custom-metrics

  • Change to the directory of the extracted Sparkoscope and run:

sbin/start-all.sh

  • In your browser, navigate to localhost:8080.

Get the value of the URL shown in bold letters; it should be something like:
spark://your_local_ip:7077

  • In the directory you are currently in, run:

bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://your_local_ip:7077 lib/spark-examples-1.6.2-hadoop2.7.2.jar 1000

Let me know if this works out.
I think it might also be useful to create a Dockerfile for you to check it out; hopefully it would help other people to quickly test it out as well.

Thanks for your patience!

@YiannisGkoufas OK, finally it works now. I think the issue was either with how I was starting Spark (I usually do sbin/start-master.sh and, after updating the slaves file in conf, start the slaves with sbin/start-slaves.sh, vs your sbin/start-all.sh, though start-all does the same thing looking at the script), or with the way I run the job, because now I pass --master and that's how I get to see the job in the localhost:8080 UI.

I will continue with doing it distributed and then with Sigar. Let me know if you have any comments in the meantime.

Thanks again for your continuous support and your patience.

That's great to know! I hope you can see the plots on the UI as well, right?
What's important is that in the UI of the master you can see the workers registered.
Beware that all the workers should be running our version so that they record the metrics.
Before trying Sigar in distributed mode, maybe you should try the HDFSSink writing to HDFS now instead of the local filesystem, for example:
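(In metrics.properties and spark-defaults.conf respectively; the host and port are placeholders for your namenode, as in your earlier snippets.)

executor.sink.hdfs.dir = hdfs://namenode_ip:9000/custom-metrics
spark.hdfs.metrics.dir hdfs://namenode_ip:9000/custom-metrics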
Thanks!

Hello @YiannisGkoufas, thanks for reaching out as always 👍. I have made the following progress: the UI-based Sparkoscope works, with Sigar and everything. Now I moved on to the headless version and I have the following results:
First I compiled and built the Sigar and HDFS jars of the headless Sparkoscope as described in your link (https://github.com/ibm-research-ireland/sparkoscope-headless) and pointed to them in spark-defaults.conf as follows:
spark.executor.extraClassPath /home/user/sparkoscope-headless/sparkoscope-sigarsource/target/sparkoscope-sigarsource-1.6.2.jar:/home/dep/sparkoscope-headless/sparkoscope-hdfssink/user/sparkoscope-hdfssink-1.6.2.jar
The rest of the configuration is the same as for the UI-based Sparkoscope (i.e. in spark-env.sh, spark-defaults.conf, metrics.properties) and I am doing this on a single node to keep it simple.
The job runs fine, though I see the following WARNING in the log:
16/12/05 17:01:42 WARN util.Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
16/12/05 17:01:42 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/12/05 17:01:42 WARN component.AbstractLifeCycle: FAILED SelectChannelConnector@0.0.0.0:4042: java.net.BindException: Address already in use
java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)

16/12/05 17:01:42 WARN component.AbstractLifeCycle: FAILED org.spark-project.jetty.server.Server@654b72c0: java.net.BindException: Address already in use
java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:433)
at sun.nio.ch.Net.bind(Net.java:425)

And as a result, I can only see job-related files in HDFS under /spark-logs but /custom-metrics is empty. It looks like the headless version is competing with Spark for the port. Any ideas?