BIDMach_Spark

Code to allow running BIDMach on Spark including HDFS integration and lightweight sparse model updates (Kylix).

Dependencies

This repo depends on BIDMat, and also on lz4 and Hadoop. Assuming you have Hadoop installed and working, and that you've built a working BIDMat jar, copy these files into the lib directory of this repo:

cp BIDMat/BIDMat.jar BIDMach_Spark/lib
cp BIDMat/lib/lz4-*.*.jar BIDMach_Spark/lib

You'll also need the Hadoop common library from your Hadoop installation:

cp $HADOOP_HOME/share/hadoop/common/hadoop-common-*.*.jar BIDMach_Spark/lib

and then

cd BIDMach_Spark
./sbt package

will build BIDMatHDFS.jar. Copy this back to the BIDMat lib directory:

cp BIDMatHDFS.jar ../BIDMat/lib

Make sure $HADOOP_HOME is set to the Hadoop home directory (usually /usr/local/hadoop), and make sure HDFS is running:

$HADOOP_HOME/sbin/start-dfs.sh
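
If the daemons came up, jps (the JDK's Java process lister) should show NameNode, DataNode and SecondaryNameNode entries:

jps

If any are missing, check the logs under $HADOOP_HOME/logs.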

Then you should have HDFS access with BIDMat. Start the interpreter with

BIDMat/bidmath

and save a matrix (here a is the FMat to save) with

saveFMat("hdfs://localhost:9000/filename.fmat", a)

or

saveFMat("hdfs://filename.fmat", a)

Hadoop Config

The Hadoop quickstart guides don't mention this, but you need to point the HDFS config at a persistent set of directories to hold the HDFS data. Here's a typical hdfs-site.xml (usually under $HADOOP_HOME/etc/hadoop):

<configuration>
     <property>
         <name>dfs.replication</name>
         <value>1</value>
     </property>
     <property>
         <name>dfs.name.dir</name>
         <value>/data/hdfs/name</value>
     </property>
     <property>
         <name>dfs.data.dir</name>
         <value>/data/hdfs/data</value>
     </property>
</configuration>
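
One first-time step the file above doesn't cover: after creating these directories, format the namenode once before starting HDFS (note that this erases any existing HDFS data):

$HADOOP_HOME/bin/hdfs namenode -format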