
Time series prediction with Spark ML and MMLSpark

Time series prediction (Scala)

Predict the sales of different items over a period of 28 days. The model is applied to data from the M5 Forecasting Kaggle competition.


Setup Jupyter environment (Jupyter Docker Stacks image)

  • For model prototyping we use a Jupyter notebook from Jupyter Docker Stacks. The Dockerfile is built on top of this image and only enables the Jupyter notebook extensions.

    # build the image
    docker build -t ai-all-spark:0.1 .
    # start a container (after setting DATA_DIR to the host directory that holds the data)
    docker run -it --name sharky \
        -p 8889:8888 \
        -v "$(pwd)":/home/jovyan/work \
        -v "$DATA_DIR":/home/jovyan/work/data \
        ai-all-spark:0.1

  • Import .jar files into a Spark session in a Jupyter notebook.
    When an Apache Toree - Scala notebook is opened, a Spark session is initialized automatically. To import a JAR file, you have to modify the kernel startup script:

    # Access the container
    docker exec -it sharky /bin/bash
    # Show the folders where the essential information about the Jupyter kernels is stored
    jupyter kernelspec list
    # Go to the directory where the apache_toree_scala kernel is located
    cd /opt/conda/share/jupyter/kernels/apache_toree_scala
    # The "argv" entry in kernel.json points to the kernel startup script, here
    # /opt/conda/share/jupyter/kernels/apache_toree_scala/bin/run.sh
    cat kernel.json

    Now you can modify the shell script by adding the JARs of interest. In particular, we change

    eval exec \
         "${SPARK_HOME}/bin/spark-submit" \
         --name "'Apache Toree'" \
         "${SPARK_OPTS}" \
         --class org.apache.toree.Main \
         "${TOREE_ASSEMBLY}" \
         "${TOREE_OPTS}" \
         "$@"

    to

    eval exec \
         "${SPARK_HOME}/bin/spark-submit" \
         --name "'Apache Toree'" \
         "${SPARK_OPTS}" \
         --class org.apache.toree.Main \
         --jars < location of the jar file > \
         "${TOREE_ASSEMBLY}" \
         "${TOREE_OPTS}" \
         "$@"

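    Leave the container shell with exit or Ctrl-D. (To detach from a container without stopping it, Docker's default key sequence is Ctrl-P followed by Ctrl-Q.)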

    After starting a new kernel, you can check whether the new JARs are present in the Spark session by executing spark.sparkContext.listJars.foreach(println) in a Jupyter notebook.
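
The manual edit of the kernel startup script described above can also be scripted. The following is a minimal sketch, assuming the script contains the line --class org.apache.toree.Main exactly as shown; add_jars_option is a hypothetical helper name and not part of Apache Toree.

```shell
#!/usr/bin/env bash
# Sketch: add a --jars option to a Toree kernel startup script.
# add_jars_option is a hypothetical helper, not part of Apache Toree.
add_jars_option() {
    local script="$1" jar="$2"

    # Leave the script untouched if it already passes --jars.
    if grep -q -- '--jars' "$script"; then
        return 0
    fi

    # Keep a backup of the original startup script.
    cp "$script" "$script.bak"

    # Copy every line and emit the --jars line right after the --class line.
    awk -v jar="$jar" '
        { print }
        /--class org\.apache\.toree\.Main/ { print "     --jars \"" jar "\" \\" }
    ' "$script.bak" > "$script"
}
```

Inside the container this could be called as add_jars_option /opt/conda/share/jupyter/kernels/apache_toree_scala/bin/run.sh followed by the path of your JAR. The original script is kept as run.sh.bak, and a second call does nothing because a --jars option is already present.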

Setup Jupyter environment (Microsoft MMLSpark image)

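  • Follow the instructions in the official GitHub repository and start the container with the working and data directories mounted:

    docker run -it --rm \
       -p < host port >:< container port > \
       -v "$(pwd)":/notebooks/myfiles \
       -v "$DATA_DIR":/notebooks/data \
       < MMLSpark image >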


  • Create .jar
    sbt assembly
    At the moment, we assume that the mmlspark library is not available in the runtime environment, so it is bundled into the JAR. This makes the JAR file large (400-500 MB).
  • Test
    At the moment, only a few functions are covered by tests.
    sbt test
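
To check the JAR size mentioned under "Create .jar" after running sbt assembly, a small helper along these lines may be handy. It is only a sketch: jar_size_mb is a hypothetical function name, and the path in the usage note assumes the default sbt-assembly output location under target/scala-*/.

```shell
#!/usr/bin/env bash
# Sketch: print the size of a JAR file in whole megabytes (rounded down).
# jar_size_mb is a hypothetical helper, not part of sbt or this project.
jar_size_mb() {
    local jar="$1" bytes

    # Fail early if the file does not exist.
    [ -f "$jar" ] || { echo "no such file: $jar" >&2; return 1; }

    # wc -c counts bytes; arithmetic expansion tolerates padded output.
    bytes=$(wc -c < "$jar")
    echo $(( bytes / 1024 / 1024 ))
}
```

For example, jar_size_mb target/scala-*/*-assembly-*.jar prints the size of the assembled JAR, provided the glob matches exactly one file.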