You will need the following dependencies:
- Python 3
- Kafka
- Spark Notebook version 0.7.0
- scikit-learn (Python)
- kafka-python (Python)
- Scala 2.11
The project contains the following files and folders:
- start-kafka.sh: bash script that launches Kafka and ZooKeeper on the local machine; the broker listens on port 9092.
- python-processing/fetcher.py: Python script that writes data to Kafka, either from the themoviedb API or from a single file.
- python-processing/analysis.py: analyzes JSON from the themoviedb API, reading its input from stdin and writing the result to stdout.
- spark-consumer: fetches data from the topic written by the fetcher.py script and calls analysis.py on each Spark node.
- hdfs-writer: gets processed movies from spark-consumer and writes the output to a given folder, making the data persistent.
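As an illustration of the first stage of this pipeline, here is a minimal sketch of how a record could be published to movie-topic with kafka-python. This is an assumption about how fetcher.py might work internally, not its actual code; the sample movie record is also made up:

```python
import json

# Serialize a movie record the way the producer would send it
# (UTF-8 encoded JSON). Kept as a pure function so it can also be
# reused when configuring the producer below.
def serialize_movie(movie: dict) -> bytes:
    return json.dumps(movie, sort_keys=True).encode("utf-8")

if __name__ == "__main__":
    # Hypothetical sketch: publish one record to movie-topic.
    # Requires kafka-python and a broker running on localhost:9092.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=serialize_movie,
    )
    producer.send("movie-topic", {"id": 603, "title": "The Matrix"})
    producer.flush()
```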
Run the custom Kafka script on the machine hosting Kafka and Zookeeper:
$ ./start-kafka.sh -i <path_to_kafka_folder>
This script launches the Kafka instance and the ZooKeeper instance, and creates the two following topics (if they do not already exist):
- movie-topic: contains raw movies from the themoviedb API.
- movie-analyzed: contains movies on which sentiment analysis has been applied.
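To check that the pipeline is producing results, you can consume movie-analyzed directly with kafka-python. This is a hedged sketch, not part of the project; it assumes the broker runs on localhost:9092 and that messages are UTF-8 encoded JSON:

```python
import json

# Decode a message payload from the movie-analyzed topic back into a dict
# (assumes the payload is UTF-8 encoded JSON).
def deserialize_movie(payload: bytes) -> dict:
    return json.loads(payload.decode("utf-8"))

if __name__ == "__main__":
    # Hypothetical sketch: print every analyzed movie as it arrives.
    # Requires kafka-python and a broker running on localhost:9092.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "movie-analyzed",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        value_deserializer=deserialize_movie,
    )
    for message in consumer:
        print(message.value)
```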
Start the analyzer process:
cd spark-consumer
sbt "run -b localhost:9092 -gid test -c movie-topic -p movie-analyzed"
Start the fetcher, fetching continuously from the API:
cd python-processing
./fetcher.py
Start the fetcher from a file:
cd python-processing
./fetcher.py -i <json_database>
You can launch the notebook on another machine or on the server itself. To do so, go to the notebook directory and execute:
./bin/spark-notebook
Do not forget to set the broker value to localhost if the notebook runs on the server, or to the server's IP address otherwise.