yahoo/storm-yarn

launch Zookeeper as a part of storm-yarn?

Opened this issue · 3 comments

Currently, we assume that Zookeeper has been deployed before launching a Storm cluster. It might be better if we could launch Zookeeper with storm-yarn as well.

Any thoughts?

Originally I thought it would be great to launch ZK under YARN with Storm, but as I have gained more experience running large Storm clusters, I think we may not want to do it in all situations. It would be great to have, especially for running quick tests, but there are a number of issues that probably need to be overcome first.

ZK is often the scalability bottleneck for Storm. Specifically, it is limited by the number of disk I/O operations with which the slowest ZK node can write to its edit logs. YARN does not currently have any resource relating to disk IOPS, so we would get inconsistent and probably bad performance. It can also very negatively impact other things running on the same node that happen to need access to data stored on that disk.

ZK is not designed to replace one node with another unless the IP address of that node is migrated or DNS is updated to point to the replacement node. This makes it very difficult to have a long-running ZK ensemble on YARN.

I think we want to support both options and recommend that for production environments an external ZK instance is used.

I'm currently looking at Storm on YARN in an AWS deployment.

If I am using Whirr, I guess I would want to create the ZK quorum on its own nodes. Would you use the same quorum for both HBase and Storm?

Since I am relatively new to YARN and Storm, please take my comment with a grain of salt.

I would imagine that Storm-YARN would mean spinning up a YARN cluster (CDH4), adding the storm-yarn jars to the nodes, and then adding a node for ZK and Nimbus outside of the YARN cluster.

So I am not sure whether the question is about putting ZK on the same nodes as the YARN data nodes or on different nodes in the cluster. Why would the cluster design look different when running YARN rather than MR1?

Just thinking out loud (since I'm trying to get my head around this very thing)...

I think it should be configurable. Most likely ZK has been deployed before launching Storm via storm-yarn, but it would be great if storm-yarn could start up ZK if a host and port for Zookeeper haven't been set. This would give the user a great "out-of-the-box" experience, being able to deploy Storm on YARN right away, but would also give others (with existing ZK infrastructure) the ability to connect to an existing instance.

(I also agree w/ @revans2 that in production, you would most likely deploy zk separately)
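
To make the "configurable" idea concrete, here is a rough sketch of the decision only, not anything storm-yarn actually implements. `storm.zookeeper.servers` and `storm.zookeeper.port` are the standard Storm config keys; the function name, the "external"/"embedded" labels, and the rule that both values must be set explicitly are my own assumptions for illustration.

```python
from typing import Any, Mapping

# Standard Storm configuration keys for the ZooKeeper connection.
ZK_SERVERS_KEY = "storm.zookeeper.servers"
ZK_PORT_KEY = "storm.zookeeper.port"


def choose_zookeeper_mode(conf: Mapping[str, Any]) -> str:
    """Decide whether to use an existing ZK ensemble or launch one.

    Hypothetical helper: returns "external" when the user has set both
    a non-empty server list and a port, and "embedded" otherwise
    (meaning storm-yarn would have to start ZK itself).
    """
    servers = conf.get(ZK_SERVERS_KEY) or []
    port = conf.get(ZK_PORT_KEY)
    if servers and port:
        return "external"
    return "embedded"


if __name__ == "__main__":
    # The hostname below is a placeholder, not a real deployment.
    user_conf = {ZK_SERVERS_KEY: ["zk1.example.com"], ZK_PORT_KEY: 2181}
    print(choose_zookeeper_mode(user_conf))
    print(choose_zookeeper_mode({}))
```

With something like this, a production setup that already sets the two keys would keep pointing at its external ensemble, and only an unconfigured quick test would fall through to the launch-ZK path.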