In this project, you will work with a real-world dataset on San Francisco crime incidents, extracted from Kaggle, and provide statistical analyses of the data using Apache Spark Structured Streaming. You will draw on the skills and knowledge you've learned in this course to create a Kafka server to produce data and to ingest that data through Spark Structured Streaming.
- Java 1.8.x
- Scala 2.11.x
- Spark 2.4.x
- Kafka
- Python 3.6 or above
- Zookeeper:
/usr/bin/zookeeper-server-start config/zookeeper.properties
- Kafka server:
/usr/bin/kafka-server-start config/server.properties
- Insert data into the topic (a minimal sketch of this producer appears after this list):
python kafka_server.py
- Kafka consumer (substitute the topic name used by your producer):
kafka-console-consumer --topic <your-topic> --from-beginning --bootstrap-server localhost:9092
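A minimal sketch of what kafka_server.py could look like, assuming the kafka-python client, a hypothetical topic name `sf.crime.calls`, and the Kaggle JSON extract as the input file (all of these are placeholders, not values taken from the actual script):

```python
# Illustrative sketch only: topic name, broker address, and input file are
# assumptions, not the project's actual kafka_server.py.
import json
import time

from kafka import KafkaProducer  # kafka-python package

TOPIC = "sf.crime.calls"                                  # hypothetical topic name
INPUT_FILE = "police-department-calls-for-service.json"   # assumed input file
BOOTSTRAP_SERVERS = "localhost:9092"


def run_producer():
    producer = KafkaProducer(
        bootstrap_servers=BOOTSTRAP_SERVERS,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    # Read the Kaggle extract and publish each record as one Kafka message.
    with open(INPUT_FILE) as f:
        records = json.load(f)
    for record in records:
        producer.send(TOPIC, value=record)
        time.sleep(0.01)  # throttle so the stream is easy to observe
    producer.flush()


if __name__ == "__main__":
    run_producer()
```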
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4 --master local[*] data_stream.py
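The spark-submit command above launches data_stream.py with the Kafka integration package. A rough sketch of what that job might look like follows; the topic name and schema fields here are assumptions, not the project's exact definitions:

```python
# data_stream.py -- minimal sketch; topic name and schema are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

schema = T.StructType([
    T.StructField("crime_id", T.StringType(), True),
    T.StructField("original_crime_type_name", T.StringType(), True),
    T.StructField("call_date_time", T.StringType(), True),
    T.StructField("disposition", T.StringType(), True),
])

spark = (SparkSession.builder
         .master("local[*]")
         .appName("SFCrimeStatistics")
         .getOrCreate())

# Subscribe to the Kafka topic produced by kafka_server.py.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "sf.crime.calls")       # assumed topic name
      .option("startingOffsets", "earliest")
      .option("maxOffsetsPerTrigger", 200)
      .load())

# Kafka delivers the payload as bytes; cast to string and parse the JSON.
calls = (df.selectExpr("CAST(value AS STRING) AS value")
         .select(F.from_json("value", schema).alias("data"))
         .select("data.*"))

# Example aggregation: number of calls per original crime type.
counts = calls.groupBy("original_crime_type_name").count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```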
1. How did changing values on the SparkSession property parameters affect the throughput and latency of the data?
The key metric to monitor is processedRowsPerSecond, reported in each micro-batch's progress output: the higher this number, the better the throughput. For a fixed input rate, a higher processedRowsPerSecond also means each micro-batch finishes sooner, which lowers processing latency.
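One way to compare settings is to vary the properties on the SparkSession builder and read processedRowsPerSecond from the query's progress reports. The sketch below uses Spark's built-in rate source so it runs without Kafka; the property values are placeholders to experiment with, not recommended settings:

```python
# Throughput probe: vary the config values and compare processedRowsPerSecond.
import time
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .master("local[*]")
         .appName("ThroughputProbe")
         .config("spark.default.parallelism", 12)     # example value to vary
         .config("spark.sql.shuffle.partitions", 10)  # example value to vary
         .getOrCreate())

# A synthetic stream stands in for the Kafka source here.
stream = spark.readStream.format("rate").option("rowsPerSecond", 500).load()

# A simple aggregation so the shuffle settings actually matter.
counts = stream.groupBy((F.col("value") % 10).alias("bucket")).count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())

# Let a few micro-batches complete, then read the throughput reports.
time.sleep(30)
for progress in query.recentProgress:
    print(progress["processedRowsPerSecond"])
query.stop()
```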
2. What were the 2-3 most efficient SparkSession property key/value pairs? Through testing multiple variations on values, how can you tell these were the most optimal?
To improve performance we want to maximize processedRowsPerSecond, as noted above. The most effective key/value pairs were:
- spark.default.parallelism
- spark.sql.shuffle.partitions

For spark.default.parallelism, a common heuristic is roughly 3 × the number of CPU cores. For spark.sql.shuffle.partitions, assuming the data lives on HDFS with a typical block size of 64 MB, a reasonable partition count is total data size / 64 MB. You can tell these settings were the most effective by re-running the job with several different values and comparing the resulting processedRowsPerSecond figures.
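A back-of-the-envelope sketch of these heuristics; the core count and dataset size below are hypothetical placeholders, not measurements from this project:

```python
import math
from pyspark.sql import SparkSession

cpu_cores = 4              # e.g. a 4-core machine running with local[*]
total_data_size_mb = 700   # assumed size of the input dataset
hdfs_block_size_mb = 64    # typical HDFS block size mentioned above

parallelism = 3 * cpu_cores                                               # -> 12
shuffle_partitions = math.ceil(total_data_size_mb / hdfs_block_size_mb)  # -> 11

spark = (SparkSession.builder
         .master("local[*]")
         .appName("SFCrimeStatistics")
         .config("spark.default.parallelism", parallelism)
         .config("spark.sql.shuffle.partitions", shuffle_partitions)
         .getOrCreate())
```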