This repository contains the source code implementation of the paper "Redundancy Elimination in Distributed Matrix Computation", which appeared at SIGMOD 2022.
ReMac is a distributed matrix computation library developed based on SystemDS 2.0.0, which automatically and adaptively eliminates redundancy in execution plans to improve performance.
- Prerequisite: Java 8.
- Download, set up, and start Hadoop 3.2.2 on your cluster. In particular, ReMac only needs HDFS, so you do not have to configure and deploy YARN. (`config/hadoop/` lists the configuration used in the paper.)
- Download, set up, and start Spark 3.0.1 on your cluster. (`config/spark/` lists the configuration used in the paper.) In `spark-defaults.conf`, you need to specify `spark.driver.memory`, `spark.executor.cores`, and `spark.executor.instances`, which are essential to the optimizer of ReMac. In addition, we recommend adding `spark.serializer org.apache.spark.serializer.KryoSerializer` to employ the Kryo serialization library, as well as `spark.driver.extraJavaOptions "-Xss256M"` and `spark.executor.extraJavaOptions "-Xss256M"` in case of `java.lang.StackOverflowError`.
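Putting the settings above together, a `spark-defaults.conf` might look like the following sketch. The memory, core, and instance values are illustrative assumptions only, not the settings from the paper; see `config/spark/` for the actual configuration used there.

```
spark.driver.memory      20g
spark.executor.cores     8
spark.executor.instances 4
spark.serializer         org.apache.spark.serializer.KryoSerializer
spark.driver.extraJavaOptions   -Xss256M
spark.executor.extraJavaOptions -Xss256M
```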
- Download the source code of SystemDS 2.0.0 as a zip file (do not `git clone`).
- Replace the original `src` folder of SystemDS with the `remac/src` folder in this repository.
- Follow the installation guide of SystemDS to build the project. The build artifact `SystemDS.jar` is placed in the `target` folder.
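Assuming the SystemDS 2.0.0 zip was unpacked into `systemds-2.0.0/` and this repository was cloned alongside it as `ReMac/` (both directory names are illustrative assumptions), the replace-and-build steps sketch out as follows; SystemDS builds with Maven:

```shell
# Illustrative paths; adjust to where you unpacked SystemDS and cloned this repository.
cd systemds-2.0.0
rm -rf src
cp -r ../ReMac/remac/src ./src   # swap in the ReMac sources
mvn clean package                # build artifact appears under target/
```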
The scripts implementing the algorithms used in our experiments are in the `scripts` folder.
The datasets used in our experiments are described in Section 6.1 of the paper. For a quick start, you can use `scripts/data.dml` to generate random datasets with the same metadata as mentioned in the paper.
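Running the generator follows the same pattern as any other script here (this assumes `SystemDS.jar` has already been built and `spark-submit` is on your `PATH`; whether `data.dml` expects named arguments is not specified, so none are shown):

```shell
spark-submit SystemDS.jar -f ./scripts/data.dml
```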
The command to run ReMac is the same as that of SystemDS. In addition, ReMac provides the following options.
- The default matrix sparsity estimator is MNC. To use the metadata-based estimator, add `-metadata` on the command line.
- ReMac uses the dynamic programming-based method for adaptive elimination by default. To use the enumeration method, add `-optimizer force` on the command line.

For example, the command to run the DFP algorithm on the criteo1 dataset with the metadata-based estimator and the enumeration method is

```
spark-submit SystemDS.jar -metadata -optimizer force -stats -f ./scripts/dfp.dml -nvargs name=criteo1
```

(Note: you need to add `spark-submit` to the environment variable `PATH`, and run this command from the `target` directory.)
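One way to put `spark-submit` on your `PATH` is shown below; the installation prefix `/opt/spark` is an assumption, so adjust it to wherever Spark 3.0.1 is installed on your machine.

```shell
# Assumption: Spark is installed at /opt/spark; change as needed.
export SPARK_HOME=/opt/spark
export PATH="$PATH:$SPARK_HOME/bin"
```

Adding these two lines to your shell profile (e.g. `~/.bashrc`) makes the setting persistent across sessions.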
- By default, ReMac employs the block-wise search for redundancy. To employ the tree-wise search, you need to use the folder `remac-tree_search/src` to override `src` and rebuild the project.
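Under the same illustrative directory layout as the build step above (a SystemDS tree in `systemds-2.0.0/` next to a clone named `ReMac/`, both names being assumptions), the swap-and-rebuild for the tree-wise search looks roughly like:

```shell
cd systemds-2.0.0
rm -rf src
cp -r ../ReMac/remac-tree_search/src ./src   # tree-wise search sources
mvn clean package
```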