/tusk

JBoss Middleware + Big Data

--------------------------------------------------
Overview
--------------------------------------------------
Tusk is a project that integrates various JBoss Middleware technologies
with various Big Data technologies. Its purpose is to:
	* house bindings between JBoss tools and Big Data tools
	* provide demo environments that show what can be done when you integrate
	enterprise-class middleware with industry-leading Big Data tools
	* give ideas for ways that you can use JBoss to augment Big Data and 
	Big Data to augment JBoss

To run Tusk in a "real" environment, you need a running SOA-P server (with
EDSP installed) and either a running Hadoop cluster or a running Cassandra cluster.
This currently applies to the unit tests as well, although one of the first tasks
will be to provide embedded Hadoop (HBase, ZooKeeper) and Cassandra servers for
testing. Once the embedded versions are used to validate functionality, the
standalone versions of these servers are adequate until the move to production or
until performance testing is done.

Due to classloading conflicts that have not yet been resolved, the current version
of Tusk cannot deploy the ispn-integration jar file within SOA-P. That code therefore
runs inside a RESTful web service war file deployed to Tomcat. When the SOA-P service
needs to write the message index to the Infinispan cache, it makes a REST call to the
Tomcat-hosted service. This will hopefully be resolved fairly soon.
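
For example, the write from the ESB to the Tomcat-hosted service is a plain HTTP call
that you can exercise by hand. The endpoint path and payload below are illustrative
assumptions only (check the TuskUI code for the actual mapping and format); the field
names come from the sample log output later in this README:
	# illustrative only - the real path and payload format live in the TuskUI war
	curl -X POST -H "Content-Type: application/json" \
		-d '{"id":"0","addressLine1":"210 N. Church Street"}' \
		http://localhost:8888/TuskUI/rest/index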


--------------------------------------------------
Installed Software Versions
--------------------------------------------------
JBoss SOA-P									5.2
JBoss Enterprise Data Services Platform 	5.2
JBoss Business Rules Management System		5.2
Apache Ant									1.8.x
Apache Maven								3.x
Apache Tomcat								6.x
JDK											1.6.x

Hadoop (Cloudera Distribution for Hadoop)	0.20
HBase (Cloudera Distribution for Hadoop)	0.90.x
Hive (Cloudera Distribution for Hadoop)		0.7.x

Apache Cassandra							1.0

* Note: if you only want to run against Hadoop or Cassandra, you don't have to
install the other one.


--------------------------------------------------
Building
--------------------------------------------------
Tusk uses Maven for builds and dependency management. To install the artifacts into the
local repository without executing the unit tests:
	mvn -Dmaven.test.skip.exec=true clean install

To install the artifacts into the local repository:
	mvn clean install 

Once the artifacts are built, deploy them as follows (example commands below):
	1. Copy the esb-integration-*.esb file into the SOA-P deploy directory
	2. Copy the TuskUI.war file into the Tomcat webapps directory
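
For example, assuming SOA-P is installed at $JBOSS_HOME with the default server profile
and Tomcat at $TOMCAT_HOME (the module output paths below are assumptions; adjust them
to wherever Maven placed the artifacts):
	# sketch only; paths depend on your installation and module layout
	cp esb-integration/target/esb-integration-*.esb $JBOSS_HOME/jboss-as/server/default/deploy/
	cp ui/target/TuskUI.war $TOMCAT_HOME/webapps/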

You can verify that the artifacts were deployed properly by doing the following:
	1. Go to the JMX console (http://localhost:8080/jmx-console), find the
	BigDataMessengerManagementBean, and invoke the stubMessages operation to add a
	message to the data intake pipeline.
	2. View the SOA-P log file to verify that the message was handled properly.
	You'll see something like:
		14:17:08,881 INFO  [BigDataExtractor] Extracted index: id[0]
		14:17:08,897 INFO  [BigDataExtractor] Extracted index: addressLine1[210 N. Church Street]
	3. Go to the Tusk UI (http://localhost:8888/TuskUI/search.html) and search for
	the message that you just injected.


--------------------------------------------------
Running Tusk
--------------------------------------------------
There are convenience scripts in the tusk/bin directory that start/stop/restart (some of) the
Tusk services in the correct order. Currently these only cover the Hadoop services. After running
the start or restart script, give the services some time to initialize before attempting to
use them; about a minute should do.
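
If you need to start the Hadoop services by hand (or want to see roughly what the start
script does), the order mirrors the service-start commands in the Hadoop setup section
below; a minimal sketch:
	# sketch only; the scripts in tusk/bin are authoritative
	sudo service hadoop-0.20-namenode start
	sudo service hadoop-0.20-secondarynamenode start
	sudo service hadoop-0.20-datanode start
	sudo service hadoop-0.20-tasktracker start
	sudo service hadoop-0.20-jobtracker start
	sudo service hadoop-zookeeper-server start
	sudo service hadoop-hbase-master start
	sudo service hadoop-hbase-regionserver start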

These scripts can be updated to manage SOA-P, Tomcat, and Cassandra if necessary.

The Tusk application can be run against either HBase or Cassandra. To change which data store
it runs against, update the TuskConfiguration.java class in the common module. The dataStore
field contains the data store to run against. TODO update this to read from a config file or run.conf.
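
In other words, switching data stores is currently a one-line source edit plus a rebuild.
A rough sketch of that edit (the field's exact type and legal values are assumptions; check
the actual class in the common module):
	// TuskConfiguration.java (sketch only - the real field/type may differ)
	private String dataStore = "hbase";   // change to "cassandra" to run against Cassandra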


--------------------------------------------------
Dev Env
--------------------------------------------------
Eclipse Helios was used to develop Tusk, and it (or a later version) is recommended
for development. JBoss Developer Studio will work fine as well. In either case,
ensure that the M2Eclipse plugin is installed. You can also install Maven on the
command line if you prefer command-line builds.

You can find a sample Maven settings.xml file in the conf directory. It must define a
jboss.soa.path property that points to the jboss-as directory of your SOA-P installation.
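
The relevant portion of settings.xml might look like the following (the profile name and
activation shown here are illustrative assumptions; the sample file in conf is authoritative,
and the path should match your own installation):
	<profiles>
	  <profile>
	    <id>tusk</id>
	    <activation>
	      <activeByDefault>true</activeByDefault>
	    </activation>
	    <properties>
	      <jboss.soa.path>/usr/local/jboss/jboss-soa-p-5/jboss-as</jboss.soa.path>
	    </properties>
	  </profile>
	</profiles>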

Tusk uses git for version control. You should install and configure the git command line
client. Check out http://help.github.com/linux-set-up-git/. You can use yum for installing
git and not bother with Synaptic. The important part is the SSH setup and your local config
settings for name and email.
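
Once git and SSH are configured, you can optionally sanity-check the setup by cloning the
repository from the command line (the Eclipse import below will also clone it for you):
	git clone git@github.com:jboss-tusk/tusk.git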

You will also need to install the EGit plugin for Eclipse, as follows:
* Install the EGit (Team Provider) plugin from the Eclipse Marketplace
	* Help->Eclipse Marketplace
	* Change the "All categories" dropdown to "SCM"
	* Click on the "Browse for more solutions" link
	* Search on "egit"
	* Install the "EGit - Git Team Provider" plugin
* Install the m2e git SCM connector
	* Go to File->New->Other
	* Choose Maven->Checkout Maven Projects from SCM
	* Click on the link to find more SCM connectors from the m2e marketplace
	* Type "egit" in the filter
	* Install the m2e-egit connector

To get the repository into Eclipse so you can work on it, do the following:
1. Open eclipse
2. File->Import
3. Git->Projects from Git
4. Clone button
5. Paste "git@github.com:jboss-tusk/tusk.git" into URI field
6. Choose http for protocol
7. Next, Next, Next, Finish (repo downloads)
8. Select the repository you just cloned
9. Next
10. Choose "Import as general project"
11. Finish
12. Right-click project root directory and choose "Configure->Convert to Maven Project"
* You can do steps 8-12 for all of the modules except conf and bin to treat each as its
own project (i.e. so you can do Maven builds on just one module instead of the entire repository).
This is not required though.

In order to push to master (if you have permissions) you need to set up SSH. During the git
setup (see above) you created a key pair; you need to tell Eclipse to use that private key:
* Preferences menu
* General->Network connections->SSH2
* Add the private key you created during GitHub setup


--------------------------------------------------
JBoss Setup
--------------------------------------------------
* JDK
	-Download the RPM installer for the latest JDK 6 from http://www.oracle.com/technetwork/java/javase/downloads/index.html.
	-Execute the following command: "sudo chmod 755 jdk-6u26-linux-x64-rpm.bin"
	-Execute the following command: "sudo ./jdk-6u26-linux-x64-rpm.bin"

* Apache Ant
	-Download the latest ant from http://ant.apache.org/bindownload.cgi.
	-Unzip the file into /opt.
	-Create an ANT_HOME environment variable by adding the following lines to /etc/profile
		export ANT_HOME=/opt/apache-ant-1.8.2
		export PATH=$PATH:$ANT_HOME/bin
		
* EDSP 5.2 and SOA-P 5.2
	-Download SOA-P, EDSP and BRMS (manager) zip files from Red Hat customer access portal.
		-TODO get the exact names of the files to download since there are different packages
	-Unzip soa-5.2.0.GA.zip into the jboss install directory, referred to as $JBOSS_HOME
		-For example, unzip into /usr/local/jboss/ to have $JBOSS_HOME=/usr/local/jboss/jboss-soa-p-5
	-Unzip jboss-brms.war into the $JBOSS_HOME/jboss-as/server/default/deploy directory.
	-Unzip eds-5.2.0.GA.zip into $JBOSS_HOME
	-Change directory to $JBOSS_HOME/eds.
	-Run the following command: "ant"
		-Choose the "default" server profile.
		-Install Apache CXF.
	-Change directory to $JBOSS_HOME/jboss-as/server/default/conf/props.
	-Edit soa-users.properties.
		admin=admin
		user1=password
	-Edit soa-roles.properties.
		admin=JBossAdmin,HttpInvoker,user,admin
		user1=admin,JBossAdmin,READWRITE
	-Edit teiid-security-users.properties.
		user1=password
	-Edit teiid-security-roles.properties.
		admin=admin
		user1=audit,log
	-Change directory to $JBOSS_HOME/jboss-as/bin
	-Start EDSP via the following command: "./run.sh -c <profile>" (e.g. "./run.sh -c default" if you chose the default profile above)


--------------------------------------------------
Tomcat Setup
--------------------------------------------------
Tomcat can be installed anywhere. Just unzip the distribution and run it on demand, or
install it as a service - whatever you want.

The most important thing is to change the configuration to use port 8888 for HTTP. To do
this, edit the conf/server.xml file and change the HTTP connector.
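
With the stock Tomcat 6 server.xml, this amounts to changing the port attribute on the
existing HTTP connector, e.g.:
	<Connector port="8888" protocol="HTTP/1.1"
	           connectionTimeout="20000"
	           redirectPort="8443" />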


--------------------------------------------------
Hadoop (HDFS, HBase, MapReduce, Hive) Setup
--------------------------------------------------
* Once Hadoop is installed, the main config files for the Hadoop services are at:
	/etc/hadoop/conf
	/etc/hbase/conf
	/etc/zookeeper
	/etc/hive/conf
	
* Install the services:
	wget http://archive.cloudera.com/redhat/cdh/cdh3-repository-1.0-1.noarch.rpm
	sudo yum --nogpgcheck localinstall cdh3-repository-1.0-1.noarch.rpm
	sudo rpm --import http://archive.cloudera.com/redhat/cdh/RPM-GPG-KEY-cloudera
	sudo yum install hadoop-0.20-conf-pseudo
	sudo yum install hadoop-hbase
	sudo yum install hadoop-hive
	sudo vim /etc/security/limits.conf
		hdfs  -       nofile  32768
		hbase  -       nofile  32768
	sudo vim /etc/alternatives/hadoop-etc/conf/hdfs-site.xml
		<property>
		  <name>dfs.datanode.max.xcievers</name>
		  <value>4096</value>
		</property>
	sudo service hadoop-0.20-namenode start
	sudo service hadoop-0.20-secondarynamenode start
	sudo service hadoop-0.20-datanode start
	sudo service hadoop-0.20-tasktracker start
	sudo service hadoop-0.20-jobtracker start
	sudo yum install hadoop-hbase-master
	
	sudo vim /etc/hbase/conf/hbase-site.xml
		<configuration>
		  <property>
		    <name>hbase.cluster.distributed</name>
		    <value>true</value>
		  </property>
		  <property>
		    <name>hbase.rootdir</name>
		    <value>hdfs://localhost/hbase</value>
		    <description>The directory shared by RegionServers.</description>
		  </property>
		  <property>
		    <name>dfs.replication</name>
		    <value>1</value>
		    <description>The replication count for HLog and HFile storage. Should not be greater than HDFS datanode count.</description>
		  </property>
		</configuration>
	sudo yum install hadoop-zookeeper-server
	sudo service hadoop-zookeeper-server start
	sudo vim /etc/zookeeper/zoo.cfg
	
	sudo service hadoop-hbase-master start
	sudo yum install hadoop-hbase-regionserver
	sudo service hadoop-hbase-regionserver start

	sudo -u hdfs hadoop fs -mkdir /tmp
	sudo -u hdfs hadoop fs -chmod g+w /tmp
	sudo -u hdfs hadoop fs -mkdir /user/hive
	sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse
	sudo -u hdfs hadoop fs -chmod g+w /user/hive/warehouse
	sudo vim /etc/hive/conf/hive-site.xml; add the following, updating paths as necessary:
		<property>
		  <name>hive.aux.jars.path</name>
		  <value>file:///usr/lib/hive/lib/hive-hbase-handler-0.7.1-cdh3u1.jar,file:///usr/lib/zookeeper/zookeeper-3.3.3-cdh3u1.jar,file:///usr/lib/hbase/hbase-0.90.3-cdh3u1.jar,file:///usr/lib/hbase/lib/guava-r06.jar</value>
		</property>

* HBase DDL
Run the following commands to create the HBase structures
	$ hbase shell
	> create 'messages', 'data', 'metadata'
	> create 'message-index', 'fields'

* Hive DDL (TODO need to validate these; they are probably wrong)
Run the following commands to create the Hive structures
	$ hive
	> CREATE EXTERNAL TABLE hbase_message_index(key string, diseases string, groupId string, patientId string, planId string, state string)
		STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
		WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,fields:diseases,fields:groupId,fields:patientId,fields:planId,fields:state")
		TBLPROPERTIES("hbase.table.name" = "message-index");

* Notes and Troubleshooting
-Sometimes the hadoop-hbase-master and/or hadoop-hbase-regionserver and/or hadoop-zookeeper-server daemons die and have
to be started again. If Hive/HBase are not working, the first thing to do is make sure these daemons are running via:
	$ sudo service hadoop-zookeeper-server restart
	$ sudo service hadoop-hbase-master restart
	$ sudo service hadoop-hbase-regionserver restart

-If you start the hbase shell and run a command and you see the following error message:
	FATAL zookeeper.ZKConfig: The server in zoo.cfg cannot be set to localhost in a fully-distributed setup because it won't be reachable.
then do the following:
	In /etc/zookeeper/zoo.cfg, change the server name in server.0 from 'localhost' to the actual server name.

-If you see errors about zookeeper connection limits, do the following:
	In /etc/hbase/conf/hbase-site.xml, add the following property:
		<property>
		  <name>hbase.zookeeper.property.maxClientCnxns</name>
		  <value>200</value>
		  <final>true</final>
		</property>
	In /etc/zookeeper/zoo.cfg, add the following property:
		maxClientCnxns=200 
	Restart the following daemons: hadoop-hbase-master, hadoop-hbase-regionserver, hadoop-zookeeper-server


--------------------------------------------------
Cassandra Setup
--------------------------------------------------
You can install Cassandra anywhere and run it on-demand or as a service.

If running Tusk against Cassandra, you must start the Cassandra cluster before running the application.
Do this on a single node by running the "cassandra" command in the cassandra/bin directory.
The keyspaces for the Infinispan cache store are created automatically if they do not exist,
so there is no need to create them manually. You must create the keyspace for the main message
data store, which is "TuskData" by default (see the KeyspaceDefinition lines of code in
BigDataExtractor.java in the esb-integration project). The ispn-integration/src/main/resources/cassandra-schema.txt
file contains the commands to create the Cassandra schema.

The current implementation requires a running Cassandra server. There is an embedded Cassandra
server that can be used for testing; see the Cassandra cache store module in the Infinispan
codebase for the code. If we use it, we would have to pre-create the keyspace and column
family programmatically each time the embedded server is started. The Infinispan Cassandra
cache store automatically creates a keyspace, so we can use that as a guide.


--------------------------------------------------
Random Notes
--------------------------------------------------
* As of 10/8/2011, there is an exception when the service handles the message.
This is because Infinispan requires a newer version of JBoss Logging (3.0) than
what comes with EDSP 5.1. It shouldn't be too hard to fix the deployment so that
the ESB uses the deployed version of JBoss Logging (e.g. jboss-logging-3.0.0.GA.jar)
instead of the one packaged in EDSP, which is version 2.1 I think.