
MISON Parser

  • A sequential MISON parser as a proof of concept. MISON is a query-based parser that parses only the queried columns to improve performance. The implementation is based on the research paper Mison: A Fast JSON Parser for Data Analytics.
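To make the query-based idea concrete, the sketch below shows the first step of MISON's structural index: packing the positions of a structural character (here ':') into a bitmap of 64-bit words. This is an illustrative stand-in, not the API of the repo's Bitmaps.scala.

```scala
object BitmapSketch {
  // Build a bitmap (packed into Longs, 64 positions per word) marking
  // every occurrence of `ch` in the input. MISON builds such bitmaps
  // for structural characters (':', '{', '}', '"') and then navigates
  // them instead of re-scanning the raw text.
  def charBitmap(json: String, ch: Char): Array[Long] = {
    val words = new Array[Long]((json.length + 63) / 64)
    var i = 0
    while (i < json.length) {
      if (json(i) == ch) words(i / 64) |= 1L << (i % 64)
      i += 1
    }
    words
  }
}
```

For example, in `{"a":1}` the colon sits at index 4, so the first word of `charBitmap(json, ':')` is `1L << 4`, i.e. 16.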

Features

  • Extracts JSON data for the specified columns only
  • Includes a file (SparkContext.scala) that integrates MISON into Spark (version 2.1.1)
  • Converts the parsed JSON data into a Spark DataFrame (enabling comparison against Spark's own JSON extraction)
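As a rough illustration of column-based extraction, the hypothetical helper below pulls a single top-level field out of a flat JSON object. The real Parser.scala operates on bitmaps rather than scanning characters, but the query-driven effect is the same: fields that are not queried are never materialized.

```scala
object ExtractSketch {
  // Hypothetical stand-in for column extraction: find the top-level
  // field `key` in a flat, one-level JSON object and return its raw
  // value text. Not the repo's Parser.scala API; a linear scan used
  // only to illustrate skipping unqueried fields.
  def extractField(json: String, key: String): Option[String] = {
    val pat = "\"" + key + "\":"
    val i = json.indexOf(pat)
    if (i < 0) None
    else {
      val start = i + pat.length
      var end = start
      // Stop at the next comma or closing brace (flat objects only).
      while (end < json.length && json(end) != ',' && json(end) != '}') end += 1
      Some(json.substring(start, end).trim)
    }
  }
}
```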

Getting Started

  • Go to Spark's website and download the Spark 2.1.1 source code (May 02, 2017), the version this project was developed against.
  • After extracting spark-2.1.1.tgz, install Java, Python, and Scala so that Spark's shell can be run.
  • In the Spark folder (spark-2.1.1), type the following into the terminal. Note: more detailed build instructions are available on Spark's website.
$ build/sbt clean package

Testing

  • Note: The following files mentioned are in the Spark folder.
  • Move fileHandler.scala, Bitmaps.scala, Bits.scala, Parser.scala, and SparkContext.scala to spark-2.1.1/core/src/main/scala/org/apache/spark
    • fileHandler.scala: loads the text files (containing JSON) into a readable String
    • Bitmaps.scala: converts the String into structural bitmaps
    • Bits.scala: represents bits and supports bit operations
    • Parser.scala: parses the bitmaps and extracts the queried fields to build a DataFrame
    • SparkContext.scala: calls the loading, bitmap-conversion, and parse functions, then creates a DataFrame
  • Create a folder in spark-2.1.1/bin and put MData.txt, MQuery.txt, Mtweet.txt, tweet_10.json, and testCorrectness.scala in it. As testCorrectness.scala and MData.txt show, we named the folder test.
    • MData.txt: contains the paths of the data files (multiple file paths are supported)
    • MQuery.txt: the list of columns to extract
    • Mtweet.txt: the JSON data (fileHandler only supports text files)
    • tweet_10.json: the JSON data for Spark, since Spark supports JSON files directly
    • testCorrectness.scala: compares MISON's DataFrame against Spark's DataFrame
  • In the terminal, type the following in the spark-2.1.1 directory. Note: more detailed instructions are available on Spark's website.
$ export SPARK_PREPEND_CLASSES=true
$ build/sbt compile
  • In the terminal, run ./spark-shell from the spark-2.1.1/bin directory to start Spark.
  • When the Spark shell prompt appears, type the following to run the test.
:load test/testCorrectness.scala
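The kind of check testCorrectness.scala performs can be sketched without Spark: treat each record as a map from column name to value and compare the two result sets order-insensitively, since DataFrame row order is not guaranteed. The names below are hypothetical, not the test file's actual API.

```scala
object CompareSketch {
  // Hypothetical sketch of a DataFrame correctness check: two result
  // sets are equal iff each distinct row appears the same number of
  // times on both sides (a multiset comparison, ignoring order).
  def sameRows(mison: Seq[Map[String, String]],
               spark: Seq[Map[String, String]]): Boolean = {
    def counts(rows: Seq[Map[String, String]]): Map[Map[String, String], Int] =
      rows.groupBy(identity).map { case (row, dups) => (row, dups.size) }
    counts(mison) == counts(spark)
  }
}
```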

Known Bugs/Potential Improvements

  • Spark handles the case where a column doesn't exist (it still creates the record and fills the missing columns with null), while this parser doesn't (it skips the record entirely).
  • Spark has its own JSON parsing system, while this parser rolls its own, which may have issues (e.g., it doesn't detect Unicode characters, so a string-manipulation function fixString was implemented in SparkContext). Using Spark's built-in classes/functions may boost performance.
  • The parser's performance could be dramatically improved with parallel computing and speculative parsing.
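One way the parallelism suggestion could look: records in a JSON-lines input are independent, so each can be parsed in its own task. The sketch below uses standard-library Futures with a trivial stand-in parser (parseA, which reads one integer field); it is not the repo's code, only an illustration of the structure.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelSketch {
  // Trivial stand-in for the per-record parser: read the integer
  // after the first colon, e.g. {"a":12} => 12.
  def parseA(record: String): Int =
    record.dropWhile(_ != ':').drop(1).takeWhile(_.isDigit).toInt

  // Parse records concurrently; Future.sequence preserves input order.
  def parseAll(records: Seq[String]): Seq[Int] = {
    val tasks = records.map(r => Future(parseA(r)))
    Await.result(Future.sequence(tasks), 30.seconds)
  }
}
```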

Authors