- Python 3.7+
- Apache Spark 3.0.1
- Apache Kafka (standalone)
1- Producer Python
python3 kafka_server.py
2- Consumer Python
python3 consumer_server.py
3- Consumer Spark Streaming
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1 --master local[*] data_stream.py
1- How did changing values on the SparkSession property parameters affect the throughput and latency of the data?
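Changing the values either increased or decreased processedRowsPerSecond, the throughput metric reported in each micro-batch progress report. A higher value means more rows are processed per second; the job keeps up with the stream when it is at least as large as inputRowsPerSecond.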
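One way to compare runs is to average the progress reports rather than read them one by one. The sketch below is a minimal example and not part of the project code. It assumes the reports were collected as plain dicts shaped like Spark's streaming progress output (fields numInputRows, processedRowsPerSecond and durationMs.triggerExecution). The helper name summarize_progress and the sample values are hypothetical.

```python
from statistics import mean


def summarize_progress(reports):
    """Average throughput and batch duration over a list of progress dicts."""
    # Skip empty micro-batches, which would otherwise drag the averages down.
    active = [r for r in reports if r.get("numInputRows", 0) > 0]
    if not active:
        return {"batches": 0, "rows_per_second": 0.0, "trigger_ms": 0.0}
    return {
        "batches": len(active),
        "rows_per_second": mean(r["processedRowsPerSecond"] for r in active),
        "trigger_ms": mean(r["durationMs"]["triggerExecution"] for r in active),
    }


# Made-up sample values for illustration only (not measurements from this project).
sample_reports = [
    {"numInputRows": 200, "processedRowsPerSecond": 100.0,
     "durationMs": {"triggerExecution": 2000}},
    {"numInputRows": 200, "processedRowsPerSecond": 200.0,
     "durationMs": {"triggerExecution": 1000}},
    {"numInputRows": 0, "processedRowsPerSecond": 0.0,
     "durationMs": {"triggerExecution": 10}},
]

summary = summarize_progress(sample_reports)
print(summary)
```

Here rows_per_second stands in for throughput and trigger_ms for per-batch latency, so one summary per configuration gives a like-for-like comparison.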
2- What were the 2-3 most efficient SparkSession property key/value pairs? Through testing multiple variations on values, how can you tell these were the most optimal?
The properties whose values produced a noticeable change in processedRowsPerSecond were:
.config("maxOffsetsPerTrigger", 200) \
.config("spark.default.parallelism", 50) \
.config("spark.sql.shuffle.partitions", 100) \
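To tell which combination is the most efficient, each variation can be run for several micro-batches and the trials ranked by their mean processedRowsPerSecond. The following is a rough sketch of that comparison, not the method used in this project. The function rank_trials, the trial labels and every rate in it are placeholders invented for illustration.

```python
from statistics import mean


def rank_trials(trials):
    """Return (label, mean rate) pairs sorted from highest to lowest mean rate."""
    # Ignore trials with no recorded batches so mean() never sees an empty list.
    scored = [(label, mean(rates)) for label, rates in trials.items() if rates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Each label stands for one combination of maxOffsetsPerTrigger,
# spark.default.parallelism and spark.sql.shuffle.partitions.
# The rates are placeholder processedRowsPerSecond values, not measured results.
trials = {
    "trial_a": [120.0, 130.0, 125.0],
    "trial_b": [60.0, 70.0],
    "trial_c": [90.0, 110.0],
    "trial_d": [],
}

ranking = rank_trials(trials)
best_label, best_rate = ranking[0]
print(best_label, best_rate)
```

Averaging over several batches smooths out the warm-up batch and other noise. One caveat worth checking: Spark documents maxOffsetsPerTrigger as an option of the Kafka source, set with .option(...) on the stream reader, so its effect may depend on where it is set.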