Spark Movie Analyzer


You will need to have the following dependencies:

  • Python 3
  • Kafka
  • Spark Notebook version 0.7.0
  • scikit-learn for Python
  • kafka-python for Python
  • Scala 2.11


The project contains several files and folders:

  • Bash script launching Kafka and ZooKeeper on the local host machine, on port.
  • python-processing/ Python script that writes data to Kafka, either from the themoviedb API or from a single file.
  • python-processing/ analyzes JSON from the themoviedb API, reading its input from stdin and writing the result to stdout.
  • spark-consumer: fetches data from the topic written by the and script, and calls on each Spark Node.
  • hdfs-writer: receives processed movies from spark-consumer and writes the output to a given folder. This process makes the data persistent.
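
The stdin-to-stdout contract of the analyzer described above can be illustrated with a minimal sketch. It assumes one JSON movie per input line and a text field named "overview"; both are assumptions, not taken from the project. The project's actual analysis relies on scikit-learn, so the word-count scorer below is only a hypothetical placeholder, and the names score_text and analyze_stream are illustrative.

```python
import json
import sys

# Hypothetical word lists standing in for a trained scikit-learn model.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "boring", "awful", "hate"}


def score_text(text):
    """Placeholder scorer: positive word count minus negative word count."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def analyze_stream(source, sink):
    """Read one JSON movie per line from source, add a 'sentiment' field,
    and write the enriched record to sink as one JSON line."""
    for line in source:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        movie = json.loads(line)
        # "overview" is an assumed field name for the movie description.
        movie["sentiment"] = score_text(movie.get("overview", ""))
        sink.write(json.dumps(movie) + "\n")


# To use it as a filter, call: analyze_stream(sys.stdin, sys.stdout)
```

Passing the streams as arguments keeps the function usable both as a command-line filter and from a Spark worker that pipes records through it.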



Run the custom Kafka script on the machine hosting Kafka and Zookeeper:

$ ./ -i <path_to_kafka_folder>

This script launches the Kafka instance and the ZooKeeper instance, and creates the following two topics (if they do not already exist):

  • movie-topic: contains raw movie data from the themoviedb API.
  • movie-analyzed: contains the movies after sentiment analysis has been applied.
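
As a rough illustration of the "create if they do not exist" step, the sketch below builds the kafka-topics.sh command lines for the two topics. It is not the project's script: the helper name topic_commands, the broker address, and the partition and replication values are assumptions. Older Kafka releases expect --zookeeper with the ZooKeeper address instead of --bootstrap-server.

```python
import os

# The two topics named above.
TOPICS = ("movie-topic", "movie-analyzed")


def topic_commands(kafka_dir, broker="localhost:9092", partitions=1, replication=1):
    """Return one kafka-topics.sh argument list per topic.

    --if-not-exists makes the creation a no-op when the topic is already there.
    Partition and replication values are assumed single-node defaults.
    """
    tool = os.path.join(kafka_dir, "bin", "kafka-topics.sh")
    return [
        [tool, "--create", "--if-not-exists",
         "--bootstrap-server", broker,
         "--topic", topic,
         "--partitions", str(partitions),
         "--replication-factor", str(replication)]
        for topic in TOPICS
    ]
```

Each returned list can be passed to subprocess.run once the broker is up.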

Start the analyzer process:

cd spark-consumer
sbt "run -b localhost:9092 -gid test -c movie-topic -p movie-analyzed"

Start the fetcher that continuously fetches from the API:

cd python-processing

Start the fetcher from a file:

cd python-processing
./ -i <json_database>
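
For orientation, here is a minimal sketch of what publishing a file to Kafka could look like. It assumes the JSON database holds one movie per line, which may not match the project's format, and the function name publish_file is hypothetical. The producer argument is any object with a send(topic, value) method, such as a kafka-python KafkaProducer.

```python
import json


def publish_file(path, producer, topic="movie-topic"):
    """Send each JSON line of the file at path to topic and return the count.

    Assumes one JSON document per line (an assumption about the file format).
    """
    sent = 0
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            movie = json.loads(line)  # fail early on malformed records
            producer.send(topic, json.dumps(movie).encode("utf-8"))
            sent += 1
    return sent


# With kafka-python installed and a broker running, usage might look like:
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   publish_file("<json_database>", producer)
#   producer.flush()
```

Taking the producer as a parameter keeps the function testable without a running broker.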


You can launch the notebook on another machine or on the server. To do so, go to the notebook directory and execute:


Do not forget to set the broker value to localhost if the notebook runs on the server, or to the server's IP address otherwise.