soabase/exhibitor

Could not find log4j

brooksgarrett opened this issue · 2 comments

On startup exhibitor fails to start zookeeper with the following error:

INFO: Initiating Jersey application, version 'Jersey: 1.18.3 12/01/2014 08:23 AM'
INFO  org.mortbay.log  Started SocketConnector@0.0.0.0:8080 [main]
INFO  com.netflix.exhibitor.core.activity.ActivityLog  State: down [ActivityQueue-0]
INFO  com.netflix.exhibitor.core.activity.ActivityLog  Restart of ZooKeeper skipped due to control panel setting [ActivityQueue-0]
INFO  com.netflix.exhibitor.core.activity.ActivityLog  Attempting to start instance [ActivityQueue-0]
ERROR com.netflix.exhibitor.core.activity.ActivityLog  Trying to kill start instance [ActivityQueue-0]
java.io.IOException: Could not find (.*log4j.*)|(.*slf4j.*) jar
	at com.netflix.exhibitor.core.processes.Details.findJar(Details.java:145)
	at com.netflix.exhibitor.core.processes.Details.<init>(Details.java:57)
	at com.netflix.exhibitor.core.processes.StandardProcessOperations.startInstance(StandardProcessOperations.java:105)
	at com.netflix.exhibitor.core.state.StartInstance.call(StartInstance.java:46)
	at com.netflix.exhibitor.core.state.StartInstance.call(StartInstance.java:23)
	at com.netflix.exhibitor.core.activity.ActivityQueue$1.run(ActivityQueue.java:126)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
INFO  com.netflix.exhibitor.core.activity.ActivityLog  ZooKeeper down/not-serving waiting 30161 of 40000 ms before restarting [ActivityQueue-0]

I'm using shared configuration backed by S3, and my defaults look like this:

defaults.conf 
zookeeper-install-directory=/opt/zookeeper 
zookeeper-data-directory=/var/lib/zookeeper/data 
zookeeper-log-directory=/var/lib/zookeeper/datalog
log-index-directory=/var/lib/zookeeper/datalog
client-port=2181
connect-port=2888
election-port=3888
zoo-cfg-extra=tickTime\=2000&initLimit\=10&syncLimit\=5&quorumListenOnAllIPs\=true
auto-manage-instances=true

I've checked 100 times: the log4j jars are in the zookeeper lib directory and are also symlinked into the exhibitor install. I've tried 1.5.5 as well as 1.6.0. The build is via maven against master (as well as the last stable release tag). I'm truly at my wits' end here. Has anyone seen this behavior, and does anyone have an idea of where to start digging?

Hi, from looking at the code, Exhibitor assumes that the log4j jar files are located in a lib folder under the Zookeeper install path. In your case that would be /opt/zookeeper/lib.
Can you also make sure that the user which launches Exhibitor has read permission on that directory and its contents?
For example, this is what I see for a CDH distribution of Zookeeper:

~$ ll /usr/lib/zookeeper/lib/
total 2016
-rw-r--r-- 1 root root  208781 Aug 24 16:35 jline-2.11.jar
-rw-r--r-- 1 root root  481535 Aug 24 16:35 log4j-1.2.16.jar
-rw-r--r-- 1 root root 1330394 Aug 24 16:35 netty-3.10.5.Final.jar
-rw-r--r-- 1 root root   26084 Aug 24 16:35 slf4j-api-1.7.5.jar
-rw-r--r-- 1 root root    8869 Aug 24 16:35 slf4j-log4j12-1.7.5.jar
lrwxrwxrwx 1 root root      23 Oct  5 17:21 slf4j-log4j12.jar -> slf4j-log4j12-1.7.5.jar
~$ ll /usr/lib/zookeeper/
total 1408
drwxr-xr-x 2 root root    4096 Oct  5 17:21 bin
drwxr-xr-x 2 root root    4096 Oct  5 17:21 cloudera
lrwxrwxrwx 1 root root      19 Oct  5 17:21 conf -> /etc/zookeeper/conf
drwxr-xr-x 2 root root    4096 Oct  5 17:21 lib
-rw-r--r-- 1 root root   11358 Aug 24 16:35 LICENSE.txt
-rw-r--r-- 1 root root     170 Aug 24 16:35 NOTICE.txt
-rw-r--r-- 1 root root 1410862 Aug 24 16:35 zookeeper-3.4.5-cdh5.12.1.jar
lrwxrwxrwx 1 root root      29 Oct  5 17:21 zookeeper.jar -> zookeeper-3.4.5-cdh5.12.1.jar

In version 1.7.0, I also made it print the absolute path for that exception (dir.getAbsolutePath()). It may help if you try it with 1.7.0 as well.
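
To reproduce the lookup outside Exhibitor, a minimal standalone sketch may help. It is not the Exhibitor source: the class name JarLookupSketch and the method body are assumptions for illustration, and only the pattern string is taken from the exception message above. It scans a lib directory for a jar whose name matches the pattern and, like the 1.7.0 change, includes the absolute path in the error.

```java
import java.io.File;
import java.io.IOException;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Exhibitor.
public class JarLookupSketch {
    // Pattern string copied from the exception message in this issue.
    static final Pattern LOGGING_JAR = Pattern.compile("(.*log4j.*)|(.*slf4j.*)");

    // Returns the first .jar in dir whose file name matches the pattern.
    static File findJar(File dir, Pattern pattern) throws IOException {
        // listFiles() returns null if dir is missing, not a directory, or unreadable.
        File[] entries = dir.listFiles();
        if (entries != null) {
            for (File entry : entries) {
                String name = entry.getName();
                if (name.endsWith(".jar") && pattern.matcher(name).matches()) {
                    return entry;
                }
            }
        }
        throw new IOException("Could not find " + pattern.pattern()
                + " jar in " + dir.getAbsolutePath());
    }

    public static void main(String[] args) {
        // Defaults to the install directory from this issue; pass another as the first argument.
        File lib = new File(args.length > 0 ? args[0] : "/opt/zookeeper", "lib");
        try {
            System.out.println("Found: " + findJar(lib, LOGGING_JAR));
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Running it as the same user that launches Exhibitor also exercises the read-permission question, since an unreadable directory takes the same failure path as a missing one.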

I actually traced this down to an issue where the shared state on S3 was corrupt and the defaults weren't being used so the path was null. The extra logging would have spotted the issue much faster.
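
For anyone hitting the same thing, a rough pre-flight check along these lines could flag a missing path before startup. It is only a sketch under assumptions: the class name DefaultsCheckSketch and the choice of required keys are mine, it reads a plain properties text rather than going through Exhibitor's own shared-config loading, and it does not detect S3 corruption itself.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical check, not part of Exhibitor.
public class DefaultsCheckSketch {
    // Keys taken from the defaults.conf in this issue; treating them as required is an assumption.
    static final String[] REQUIRED = {
        "zookeeper-install-directory",
        "zookeeper-data-directory"
    };

    // Returns the required keys that are absent or blank in the given properties text.
    static List<String> missingKeys(Reader source) throws IOException {
        Properties props = new Properties();
        props.load(source);
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // Sample input with the install directory left out, as if the defaults were not applied.
        String sample = "zookeeper-data-directory=/var/lib/zookeeper/data\n";
        System.out.println("Missing: " + missingKeys(new StringReader(sample)));
    }
}
```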

I actually tried using 1.7.0 to get the logging you pointed out, and oddly my version string still reads 1.6.0. Since I resolved the error, I haven't looked into the issue with 1.7.0 further. I'll close this issue.