Sgx-Spark

This is Apache Spark with modifications to run security-sensitive code inside Intel SGX enclaves. The implementation leverages sgx-lkl, a library OS that allows Java-based applications to run inside SGX enclaves.

Docker quick start

This guide shows how to run Sgx-Spark in a few simple steps using Docker. Most of the setup and deployment is wrapped in Docker containers, so compilation and deployment should be smooth.

Preparing the Sgx-Spark Docker environment

  • Clone this Sgx-Spark repository

  • Build the Sgx-Spark base image. The name of the resulting Docker image is sgxspark. This process might take a while (30-60 mins):

      sgx-spark/dockerfiles$ docker build -t sgxspark .
    
  • Prepare the disk image required by sgx-lkl. Due to restrictions of Docker, this step cannot currently be performed as part of the Docker build above; it is therefore platform-dependent. The process has been tested successfully on Ubuntu 16.04 and Arch Linux:

      sgx-spark/lkl$ make prepare-image
    
  • Create a Docker network that the containers will use to communicate. Note that by creating a user-defined network, Docker provides an embedded DNS server so that workers can find the Spark master by name (a quick verification sketch follows this list):

      sgx-spark$ docker network create sgxsparknet
    
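Before moving on, you can verify that the base image and the network were created correctly. The commands below use only standard Docker tooling; the image and network names are the ones chosen above:

$ docker images sgxspark
$ docker network inspect sgxsparknet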

Running Sgx-Spark jobs using Docker

From within the directory sgx-spark/dockerfiles, run the Sgx-Spark master node, the Sgx-Spark worker node, and the actual Sgx-Spark job as follows.

  • Run the Sgx-Spark master node:

      sgx-spark/dockerfiles$ docker run \
      --user user \
      --env-file $(pwd)/docker-env \
      --net sgxsparknet \
      --name sgxspark-docker-master \
      -p 7077:7077 \
      -p 8082:8082 \
      -ti sgxspark /sgx-spark/master.sh
    
  • Run the Sgx-Spark worker node:

      sgx-spark/dockerfiles$ docker run \
      --user user \
      --memory="4g" \
      --shm-size="8g" \
      --env-file $(pwd)/docker-env \
      --net sgxsparknet \
      --privileged \
      -v $(pwd)/../lkl:/spark-image:ro \
      -ti sgxspark /sgx-spark/worker-and-enclave.sh
    
  • Run the Sgx-Spark job as follows (a sketch for checking the master UI and following the driver output comes after this list).

    At the time of writing, the three jobs below are known to be fully supported:

    • WordCount

        sgx-spark/dockerfiles$ docker run \
        --user user \
        --memory="4g" \
        --shm-size="8g" \
        --env-file $(pwd)/docker-env \
        --net sgxsparknet \
        --privileged \
        -v $(pwd)/../lkl:/spark-image:ro \
        -e SPARK_JOB_CLASS=org.apache.spark.examples.MyWordCount \
        -e SPARK_JOB_NAME=WordCount \
        -e SPARK_JOB_ARG0=README.md \
        -e SPARK_JOB_ARG1=output \
        -ti sgxspark /sgx-spark/driver-and-enclave.sh
      
    • KMeans

        sgx-spark/dockerfiles$ docker run \
        --user user \
        --memory="4g" \
        --shm-size="8g" \
        --env-file $(pwd)/docker-env \
        --net sgxsparknet \
        --privileged \
        -v $(pwd)/../lkl:/spark-image:ro \
        -e SPARK_JOB_CLASS=org.apache.spark.examples.mllib.KMeansExample \
        -e SPARK_JOB_NAME=KMeans \
        -e SPARK_JOB_ARG0=data/mllib/kmeans_data.txt \
        -ti sgxspark /sgx-spark/driver-and-enclave.sh
      
    • LineCount

        sgx-spark/dockerfiles$ docker run \
        --user user \
        --memory="4g" \
        --shm-size="8g" \
        --env-file $(pwd)/docker-env \
        --net sgxsparknet \
        --privileged \
        -v $(pwd)/../lkl:/spark-image:ro \
        -e SPARK_JOB_CLASS=org.apache.spark.examples.LineCount \
        -e SPARK_JOB_NAME=LineCount \
        -e SPARK_JOB_ARG0=SgxREADME.md \
        -ti sgxspark /sgx-spark/driver-and-enclave.sh
      
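To keep an eye on a running deployment, the following sketch may help. It assumes that the port mapping above exposes the Spark master web UI on port 8082 of the Docker host and that the driver container is still running; substitute the container name or ID reported by docker ps:

$ curl -s http://localhost:8082 | head -n 20   # master web UI, assuming it listens on port 8082
$ docker ps                                    # find the container running driver-and-enclave.sh
$ docker logs -f <container-id>                # the job results are printed by the driver/enclave process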

Native compilation, installation and deployment

To run Sgx-Spark natively, proceed as follows.

Install package dependencies

Install all required dependencies. For Ubuntu 16.04, these can be installed as follows:

$ sudo apt-get update
$ sudo apt-get install -y --no-install-recommends scala libtool autoconf curl xutils-dev git build-essential maven openjdk-8-jdk ssh bc python autogen wget autotools-dev sudo automake
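Afterwards, it may be worth checking that the expected toolchain is picked up, in particular that Java 8 is the default JDK:

$ java -version    # should report a 1.8.x OpenJDK
$ mvn -version
$ scala -version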

Compile and install Google Protocol Buffer 2.5.0

Hadoop (and thus Spark) depends on Google Protocol Buffers (GPB) version 2.5.0:

  • Make sure to uninstall any other versions of GPB

  • Install GPB v2.5.0. Instructions for Ubuntu 16.04 are as follows:

      $ cd /tmp
      /tmp$ wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
      /tmp$ tar xvf protobuf-2.5.0.tar.gz
      /tmp$ cd protobuf-2.5.0
      /tmp/protobuf-2.5.0$ ./autogen.sh && ./configure && make && sudo make install
      /tmp/protobuf-2.5.0$ sudo apt-get install -y --no-install-recommends libprotoc-dev
    

    Instructions for Arch Linux are available at https://stackoverflow.com/a/29799354/2273470.
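After installation, protoc should report exactly version 2.5.0. If the library is not found at runtime, refreshing the linker cache usually helps, since make install places it under /usr/local:

$ protoc --version     # expected output: libprotoc 2.5.0
$ sudo ldconfig        # refresh the shared library cache if libprotoc cannot be found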

Compile sgx-lkl

As Sgx-Spark uses sgx-lkl, the latter must have been downloaded and compiled successfully. As of writing (June 14, 2018), sgx-lkl should be compiled using branch cleanup-musl. Please follow the documentation of sgx-lkl and ensure that your installation of sgx-lkl executes simple Java applications successfully.

Compile Sgx-Spark

  • Compile the Sgx-Spark code:

      sgx-spark$ build/mvn -DskipTests package

  • As part of this compilation process, a modified Hadoop library is built as well. Copy the resulting Hadoop JAR file into the Sgx-Spark jars directory:

      sgx-spark$ cp hadoop-2.6.5-src/hadoop-common-project/hadoop-common/target/hadoop-common-2.6.5.jar assembly/target/scala-2.11/jars/
    
  • Sgx-Spark ships with a native C library (libringbuff.so) that enables shared-memory-based communication between two JVMs. Compile as follows:

      sgx-spark/C$ make install
    
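Once both steps are done, a quick sanity check is to confirm that the patched Hadoop JAR and the ring-buffer library are in place. The find invocation is only a convenience; the actual install location of libringbuff.so depends on the Makefile in C/:

sgx-spark$ ls assembly/target/scala-2.11/jars/ | grep hadoop-common
sgx-spark$ find . -name 'libringbuff.so'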

Prepare the Sgx-Spark disk images that will be run using sgx-lkl

  • Adjust the file sgx-spark/lkl/Makefile for your environment:

    Variable SGX_LKL must point to your sgx-lkl directory (see Compile sgx-lkl above). Alternatively, see the sketch after this list for overriding it on the make command line.

  • Build the Sgx-Spark disk image required for sgx-lkl:

      sgx-spark/lkl$ make clean all
    
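If you prefer not to edit the Makefile, the variable can usually be overridden on the make command line instead. This is only a sketch and assumes the Makefile does not force its own value of SGX_LKL:

sgx-spark/lkl$ make clean all SGX_LKL=/path/to/your/sgx-lkl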

Run Sgx-Spark using sgx-lkl

Finally, we are ready to run (i) the Sgx-Spark master node, (ii) the Sgx-Spark worker node, (iii) the worker's enclave, (iv) the driver's enclave, and (v) the Sgx-Spark client that submits the job. In the following commands, replace <hostname> with the master node's actual hostname and <sgx-lkl> with the path to your sgx-lkl installation.

Note: After running each example, make sure to (i) restart all processes and (ii) delete all shared memory files in /dev/shm.

  • If you run all the nodes locally, you need to add the following line to variables.sh:

      export SPARK_LOCAL_IP=127.0.0.1
    
  • Run the Master node

      sgx-spark$ ./master.sh
    
  • Run the Worker node

      sgx-spark$ ./worker.sh
    
  • Run the enclave for the Worker node

      sgx-spark$ ./worker-enclave.sh
    
  • Run the enclave for the driver program. This is the process that will output the job results!

      sgx-spark$ ./driver-enclave.sh
    
  • Finally, submit a Spark job. The result will be output in the process we started just before.

    • WordCount

        sgx-spark$ ./submitwordcount.sh
      
    • KMeans

        sgx-spark$ ./submitkmeans.sh
      
    • LineCount

        sgx-spark$ ./submitlinecount.sh
      
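For a single-machine test run, the five processes can also be started from one shell, each redirected to a log file. This is only a convenience sketch: the script names are the ones listed above, the sleep is a crude way of waiting for the daemons to come up, and SPARK_LOCAL_IP still has to be set in variables.sh as described earlier. Remember to restart all processes and clean /dev/shm before the next run:

sgx-spark$ mkdir -p logs
sgx-spark$ ./master.sh          > logs/master.log          2>&1 &
sgx-spark$ ./worker.sh          > logs/worker.log           2>&1 &
sgx-spark$ ./worker-enclave.sh  > logs/worker-enclave.log   2>&1 &
sgx-spark$ ./driver-enclave.sh  > logs/driver-enclave.log   2>&1 &   # job results end up in this log
sgx-spark$ sleep 30             # give the daemons time to start
sgx-spark$ ./submitwordcount.sh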

Native execution of the same Spark installation

To run the above installation without SGX, start your environment as follows:

  • Start the Master node as above

  • Start the Worker node as above, but change environment variable SGX_ENABLED=true to SGX_ENABLED=false

  • Do not start the enclaves

  • Submit the Spark job as above, but change environment variable SGX_ENABLED=true to SGX_ENABLED=false
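How SGX_ENABLED is consumed is not spelled out here; assuming the scripts pick it up from the environment (if it is hard-coded in variables.sh, change it there instead), a native run could look like this sketch:

sgx-spark$ ./master.sh &
sgx-spark$ SGX_ENABLED=false ./worker.sh &
sgx-spark$ SGX_ENABLED=false ./submitwordcount.sh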

Important developer notes

Code changes and recompilation

There are a few important things to keep in mind when developing Sgx-Spark:

  • Whenever you change parts of the code, you must recompile the Spark code:

      sgx-spark$ mvn package -DskipTests
    

    There have been occasional situations, with no clearly identifiable cause, in which the above command did not recompile all of the changed files. In this case, issue:

      sgx-spark$ mvn clean package -DskipTests
    
  • After making changes to the Sgx-Spark code and after compiling the Java/Scala code (see above), you always need to rebuild the lkl image that will be used by sgx-lkl:

      sgx-spark/lkl$ make clean all
    
  • If you changed parts of the Hadoop code (in directory hadoop-2.6.5-src), you also need to copy the resulting JAR file again:

      sgx-spark$ cp hadoop-2.6.5-src/hadoop-common-project/hadoop-common/target/hadoop-common-2.6.5.jar assembly/target/scala-2.11/jars/
    
  • Lastly, do not forget to remove all related shared memory files in /dev/shm/ before running your next experiment!
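The steps above can be chained into a small helper script. The following is a hypothetical convenience wrapper around the commands already listed in this section, to be run from the repository root:

      #!/usr/bin/env bash
      # rebuild.sh (hypothetical): rebuild Sgx-Spark after code changes.
      set -e
      build/mvn -DskipTests package        # use `build/mvn clean package -DskipTests` if stale classes persist
      cp hadoop-2.6.5-src/hadoop-common-project/hadoop-common/target/hadoop-common-2.6.5.jar \
         assembly/target/scala-2.11/jars/  # only required after Hadoop changes
      make -C lkl clean all                # rebuild the disk image used by sgx-lkl
      ls /dev/shm/                         # inspect leftover shared memory files and remove them by hand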

Running without sgx-lkl

Development with sgx-lkl can be tedious. For development purposes, a special flag allows the enclave side of Sgx-Spark to run in a regular JVM rather than on top of sgx-lkl. To make use of this feature, run the enclave JVMs using the scripts worker-enclave-nosgx.sh and driver-enclave-nosgx.sh.

Under the hood, these scripts set environment variable DEBUG_IS_ENCLAVE_REAL=false (defaults to true) and provide the JVM with a value for environment variable SGXLKL_SHMEM_FILE. Note that the value of SGXLKL_SHMEM_FILE must be the same as the one provided for the corresponding Worker (worker.sh) and Driver (driver.sh).
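A quick way to check this invariant is to look at the values the scripts assign, assuming SGXLKL_SHMEM_FILE appears literally in their text:

sgx-spark$ grep -H 'SGXLKL_SHMEM_FILE' worker.sh worker-enclave-nosgx.sh driver.sh driver-enclave-nosgx.sh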