
MISON Parser

  • A sequential MISON parser as a proof of concept. MISON is a query-based parser that parses only the queried columns to improve performance. The implementation is based on the research paper Mison: A Fast JSON Parser for Data Analytics.
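To make the query-based idea concrete, the sketch below shows the first step of MISON's structural index: packing the positions of a structural character (here ':') into a bitmap of 64-bit words. This is an illustrative stand-in, not the API of the repo's Bitmaps.scala.

```scala
object BitmapSketch {
  // Build a bitmap (packed into Longs, 64 positions per word) marking
  // every occurrence of `ch` in the input. MISON builds such bitmaps
  // for structural characters (':', '{', '}', '"') and then navigates
  // them instead of re-scanning the raw text.
  def charBitmap(json: String, ch: Char): Array[Long] = {
    val words = new Array[Long]((json.length + 63) / 64)
    var i = 0
    while (i < json.length) {
      if (json(i) == ch) words(i / 64) |= 1L << (i % 64)
      i += 1
    }
    words
  }
}
```

For example, in `{"a":1}` the colon sits at index 4, so the first word of `charBitmap(json, ':')` is `1L << 4`, i.e. 16.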

Features

  • Extracts JSON data for the specified columns only
  • Includes a file (SparkContext.scala) that integrates MISON into Spark (version 2.1.1)
  • Converts the parsed JSON data into a Spark DataFrame (enabling comparison against Spark's own JSON extraction)
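As a rough illustration of column-based extraction, the hypothetical helper below pulls a single top-level field out of a flat JSON object. The real Parser.scala operates on bitmaps rather than scanning characters, but the query-driven effect is the same: fields that are not queried are never materialized.

```scala
object ExtractSketch {
  // Hypothetical stand-in for column extraction: find the top-level
  // field `key` in a flat, one-level JSON object and return its raw
  // value text. Not the repo's Parser.scala API; a linear scan used
  // only to illustrate skipping unqueried fields.
  def extractField(json: String, key: String): Option[String] = {
    val pat = "\"" + key + "\":"
    val i = json.indexOf(pat)
    if (i < 0) None
    else {
      val start = i + pat.length
      var end = start
      // Stop at the next comma or closing brace (flat objects only).
      while (end < json.length && json(end) != ',' && json(end) != '}') end += 1
      Some(json.substring(start, end).trim)
    }
  }
}
```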

Getting Started

  • Go to Spark's website and download the Spark 2.1.1 source code (May 02, 2017), the version this project was developed against.
  • After extracting spark-2.1.1.tgz, install Java, Python, and Scala so that Spark's shell can be run.
  • In the Spark folder (spark-2.1.1), type the following into the terminal. Note: more detailed build instructions are available on Spark's website.
$ build/sbt clean package

Testing

  • Note: The following files mentioned are in the Spark folder.
  • Move fileHandler.scala, Bitmaps.scala, Bits.scala, Parser.scala, and SparkContext.scala to spark-2.1.1/core/src/main/scala/org/apache/spark
    • fileHandler.scala: loads the text files (containing JSON) into a readable String
    • Bitmaps.scala: converts the String into structural bitmaps
    • Bits.scala: represents bits and supports bit operations
    • Parser.scala: parses the bitmaps and extracts the queried fields to build a DataFrame
    • SparkContext.scala: calls the loading, bitmap-conversion, and parse functions, then creates a DataFrame
  • Create a folder in spark-2.1.1/bin and put MData.txt, MQuery.txt, Mtweet.txt, tweet_10.json, and testCorrectness.scala in it. As testCorrectness.scala and MData.txt show, we named the folder test.
    • MData.txt: contains the paths of the data files (multiple file paths are supported)
    • MQuery.txt: the list of columns to extract
    • Mtweet.txt: the JSON data (fileHandler only supports text files)
    • tweet_10.json: the JSON data for Spark, since Spark supports JSON files directly
    • testCorrectness.scala: compares MISON's DataFrame against Spark's DataFrame
  • In the terminal, type the following in the spark-2.1.1 directory. Note: more detailed instructions are available on Spark's website.
$ export SPARK_PREPEND_CLASSES=true
$ build/sbt compile
  • In the terminal, run ./spark-shell from the spark-2.1.1/bin directory to start Spark.
  • When the Spark shell prompt appears, type the following to run the test.
:load test/testCorrectness.scala
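The kind of check testCorrectness.scala performs can be sketched without Spark: treat each record as a map from column name to value and compare the two result sets order-insensitively, since DataFrame row order is not guaranteed. The names below are hypothetical, not the test file's actual API.

```scala
object CompareSketch {
  // Hypothetical sketch of a DataFrame correctness check: two result
  // sets are equal iff each distinct row appears the same number of
  // times on both sides (a multiset comparison, ignoring order).
  def sameRows(mison: Seq[Map[String, String]],
               spark: Seq[Map[String, String]]): Boolean = {
    def counts(rows: Seq[Map[String, String]]): Map[Map[String, String], Int] =
      rows.groupBy(identity).map { case (row, dups) => (row, dups.size) }
    counts(mison) == counts(spark)
  }
}
```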

Known Bugs/Potential Improvements

  • Spark handles the case where a column doesn't exist (it still creates the record and fills the missing columns with null), while this parser doesn't (it skips the record entirely).
  • Spark has its own JSON parsing system, while this parser rolls its own, which may have issues (e.g., it doesn't detect Unicode characters, so a string-manipulation function fixString was implemented in SparkContext). Using Spark's built-in classes/functions may boost performance.
  • The parser's performance could be dramatically improved with parallel computing and speculative parsing.
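One way the parallelism suggestion could look: records in a JSON-lines input are independent, so each can be parsed in its own task. The sketch below uses standard-library Futures with a trivial stand-in parser (parseA, which reads one integer field); it is not the repo's code, only an illustration of the structure.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelSketch {
  // Trivial stand-in for the per-record parser: read the integer
  // after the first colon, e.g. {"a":12} => 12.
  def parseA(record: String): Int =
    record.dropWhile(_ != ':').drop(1).takeWhile(_.isDigit).toInt

  // Parse records concurrently; Future.sequence preserves input order.
  def parseAll(records: Seq[String]): Seq[Int] = {
    val tasks = records.map(r => Future(parseA(r)))
    Await.result(Future.sequence(tasks), 30.seconds)
  }
}
```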

Authors