Code to allow running BIDMach on Spark including HDFS integration and lightweight sparse model updates (Kylix).
This repo depends on BIDMat, and also on lz4 and hadoop. Assuming you have hadoop installed and working, and that you've built a working BIDMat jar, copy these files into the lib directory of this repo, i.e.:

```
cp BIDMat/BIDMat.jar BIDMach_Spark/lib
cp BIDMat/lib/lz4-*.*.jar BIDMach_Spark/lib
```
You'll also need the hadoop common library from your hadoop installation:

```
cp $HADOOP_HOME/share/hadoop/common/hadoop-common-*.*.jar BIDMach_Spark/lib
```
and then

```
cd BIDMach_Spark
./sbt package
```

will build BIDMatHDFS.jar. Copy it back to the BIDMat lib directory:

```
cp BIDMatHDFS.jar ../BIDMat/lib
```
Make sure $HADOOP_HOME is set to the hadoop home directory (usually /usr/local/hadoop), and make sure HDFS is running:

```
$HADOOP_HOME/sbin/start-dfs.sh
```
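If you want to check that HDFS actually came up, the standard Hadoop admin report is one way to do it (this is a generic Hadoop command, not specific to this repo):

```
$HADOOP_HOME/bin/hdfs dfsadmin -report
```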
Then you should have HDFS access with BIDMat. Invoke

```
BIDMat/bidmath
```

and from the prompt you can save a matrix with

```scala
saveFMat("hdfs://localhost:9000/filename.fmat", a)
```

or

```scala
saveFMat("hdfs://filename.fmat", a)
```

where `a` is the FMat to write.
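As a quick sanity check from the bidmath prompt, a save/load round trip along these lines should work. This is a sketch: it assumes BIDMat's usual `rand`, `saveFMat` and `loadFMat` calls are in scope, and uses the localhost:9000 namenode address from the setup above:

```scala
// Minimal round trip: make a random matrix, write it to HDFS, read it back.
val a = rand(100, 100)                               // 100x100 random FMat
saveFMat("hdfs://localhost:9000/test.fmat", a)       // write to HDFS
val b = loadFMat("hdfs://localhost:9000/test.fmat")  // read it back
println(a(0, 0) == b(0, 0))                          // spot-check one entry
```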
The hadoop quickstart guides don't mention this, but you need to set the HDFS config to point at a persistent set of directories to hold the HDFS data. Here's a typical hdfs-site.xml:
```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/data/hdfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data/hdfs/data</value>
  </property>
</configuration>
```
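If you point HDFS at fresh directories like the ones above, you will also need to create them and format the namenode before starting HDFS for the first time. A sketch, with paths matching the hdfs-site.xml above:

```
# Create the name and data directories from hdfs-site.xml (assumed paths)
mkdir -p /data/hdfs/name /data/hdfs/data
# Format the namenode once, before the first start
$HADOOP_HOME/bin/hdfs namenode -format
# Then start HDFS as above
$HADOOP_HOME/sbin/start-dfs.sh
```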