

This repository contains the source code for the paper "Redundancy Elimination in Distributed Matrix Computation", which appeared at SIGMOD 2022.

ReMac

ReMac is a distributed matrix computation library built on SystemDS 2.0.0. It automatically and adaptively eliminates redundancy in execution plans to improve performance.

Installation

  • Prerequisite: Java 8

  • Download, set up, and start Hadoop 3.2.2 on your cluster. In particular, only HDFS is needed, so you do not have to configure or deploy YARN. (config/hadoop/ lists the configuration used in the paper.)

  • Download, set up, and start Spark 3.0.1 on your cluster. (config/spark/ lists the configuration used in the paper.)

    In spark-defaults.conf, you must set spark.driver.memory, spark.executor.cores, and spark.executor.instances; ReMac's optimizer relies on these values.

    In addition, we recommend adding:

    • spark.serializer org.apache.spark.serializer.KryoSerializer to employ the Kryo serialization library.
    • spark.driver.extraJavaOptions "-Xss256M" and spark.executor.extraJavaOptions "-Xss256M" in case of java.lang.StackOverflowError.
  • Download the source code of SystemDS 2.0.0 as a zip archive (do not git clone).

  • Replace the original src folder of SystemDS with the remac/src folder from this repository.

  • Follow the SystemDS installation guide to build the project. The build artifact SystemDS.jar is placed in the target folder.
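Collecting the Spark settings recommended above, a minimal spark-defaults.conf might look like the following sketch. The memory, core, and instance values are illustrative assumptions; tune them to your cluster.

```
spark.driver.memory              20g
spark.executor.cores             4
spark.executor.instances         8
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.extraJavaOptions    "-Xss256M"
spark.executor.extraJavaOptions  "-Xss256M"
```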

Algorithms and Datasets

The scripts implementing the algorithms used in our experiments are in the folder scripts.

The datasets used in our experiments are described in Section 6.1 of the paper. For a quick start, you can use scripts/data.dml to generate random datasets that have the same metadata as those in the paper.
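As a sketch, data.dml can be launched like any other SystemDS script; the -nvargs value here is an assumption modeled on the dataset names used in the running example below, not a documented interface:

```shell
# hypothetical invocation; assumes data.dml takes a dataset name via -nvargs
spark-submit SystemDS.jar -f ./scripts/data.dml -nvargs name=criteo1
```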

Running ReMac

The command to run ReMac is the same as for SystemDS.
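For instance, a plain invocation with no ReMac-specific flags mirrors standard SystemDS usage (the script and dataset names are illustrative, and spark-submit must be on your PATH):

```shell
# run a DML script on Spark, exactly as with stock SystemDS
spark-submit SystemDS.jar -f ./scripts/dfp.dml -nvargs name=criteo1
```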

In addition, ReMac accepts the following options.

  • The default matrix sparsity estimator is MNC. To use the metadata-based estimator instead, add -metadata to the command line.

  • ReMac uses the dynamic programming-based method for adaptive elimination by default. To use the enumeration method instead, add -optimizer force to the command line.

    For example, the command to run the DFP algorithm on the criteo1 dataset with the metadata-based estimator and the enumeration method is

    spark-submit SystemDS.jar -metadata -optimizer force -stats -f ./scripts/dfp.dml -nvargs name=criteo1 

    (Note: spark-submit must be on your PATH, and this command should be run from the target directory.)

  • By default, ReMac searches for redundancy block-wise. To use the tree-wise search instead, replace src with the folder remac-tree_search/src and rebuild the project.
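The switch to tree-wise search can be sketched as follows, assuming this repository is checked out next to the SystemDS 2.0.0 source tree (the relative path is an assumption) and the standard Maven build from the SystemDS guide:

```shell
# from the SystemDS 2.0.0 source root
rm -rf src
cp -r ../systemds-remac/remac-tree_search/src src
# rebuild per the SystemDS installation guide; produces target/SystemDS.jar
mvn package
```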