
My implementation of Project 2 in the Udacity Data Streaming Nanodegree


Udacity Data Streaming Project: SF Crime Statistics with Spark Streaming

How did changing values on the SparkSession property parameters affect the throughput and latency of the data?

I use the Kafka source option maxOffsetsPerTrigger, which caps the number of offsets read per trigger, to control the throughput and latency of the data.

The higher this value is, the more rows are processed per second.
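
As a rough illustration of where this setting lives, the sketch below builds the option map for a Kafka streaming source and passes it to the reader. The broker address, the topic name and the helper function names are hypothetical placeholders, not values taken from this project; only maxOffsetsPerTrigger and the two values 200 and 1000 come from the experiment described here.

```python
def kafka_source_options(max_offsets_per_trigger,
                         bootstrap_servers="localhost:9092",  # hypothetical broker
                         topic="sf.crime.calls"):             # hypothetical topic
    """Build the option map for a Kafka structured-streaming source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "earliest",
        # Upper bound on the number of offsets (rows) read per micro-batch.
        "maxOffsetsPerTrigger": str(max_offsets_per_trigger),
    }


def read_crime_stream(spark, max_offsets_per_trigger):
    """Attach the options to a streaming reader.

    `spark` is assumed to be an existing SparkSession with the Kafka
    connector package available.
    """
    options = kafka_source_options(max_offsets_per_trigger)
    return spark.readStream.format("kafka").options(**options).load()


# The two settings compared in this write-up.
small_batches = kafka_source_options(200)
large_batches = kafka_source_options(1000)
print(small_batches["maxOffsetsPerTrigger"], large_batches["maxOffsetsPerTrigger"])
```

Keeping the options in a plain dictionary makes it easy to rerun the same query with different batch caps and compare the resulting progress reports.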

I monitor performance using the progress report fields inputRowsPerSecond, processedRowsPerSecond and durationMs, specifically the triggerExecution value inside durationMs.

For example,

For a maxOffsetsPerTrigger value of 200, I get: "inputRowsPerSecond" : 11.667250029168125, "processedRowsPerSecond" : 24.786218862312552, "durationMs" : { "addBatch" : 7964, "getBatch" : 6, "getOffset" : 5, "queryPlanning" : 42, "triggerExecution" : 8069, "walCommit" : 48 }

For a maxOffsetsPerTrigger value of 1000, I get: "inputRowsPerSecond" : 57.23115664167573, "processedRowsPerSecond" : 128.0737704918033, "durationMs" : { "addBatch" : 7689, "getBatch" : 8, "getOffset" : 10, "queryPlanning" : 41, "triggerExecution" : 7808, "walCommit" : 52 }

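In the above comparison, both batches take about the same time to execute (triggerExecution ~ 8 s), but one contained 200 rows and the other 1000. Throughput therefore rose roughly fivefold (processedRowsPerSecond went from about 25 to about 128) with no increase in per-batch latency.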
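
The relationship between these fields can be checked with a little arithmetic: multiplying processedRowsPerSecond by the trigger duration in seconds should give back the batch size. The sketch below does this for the two reports quoted above; the dictionary layout mirrors the progress report fields, and the function name rows_in_batch is a hypothetical helper introduced only for this illustration.

```python
# Values copied from the two progress reports quoted above,
# keyed by the maxOffsetsPerTrigger setting that produced them.
reports = {
    200: {"processedRowsPerSecond": 24.786218862312552,
          "durationMs": {"triggerExecution": 8069}},
    1000: {"processedRowsPerSecond": 128.0737704918033,
           "durationMs": {"triggerExecution": 7808}},
}


def rows_in_batch(progress):
    """Rows handled in one trigger = rows per second * trigger duration in seconds."""
    seconds = progress["durationMs"]["triggerExecution"] / 1000.0
    return progress["processedRowsPerSecond"] * seconds


for max_offsets, progress in reports.items():
    # Each result should match the configured batch cap.
    print(max_offsets, round(rows_in_batch(progress)))

# Ratio of the two throughputs: how much faster the larger batch cap is.
speedup = (reports[1000]["processedRowsPerSecond"]
           / reports[200]["processedRowsPerSecond"])
print(round(speedup, 2))
```

Both batches come out at their configured cap, which suggests that in these runs each trigger was limited by maxOffsetsPerTrigger rather than by the amount of data available.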

What were the 2-3 most efficient SparkSession property key/value pairs? Through testing multiple variations on values, how can you tell these were the most optimal?

I found that setting spark.sql.shuffle.partitions to a small value (such as 25) dramatically increased processedRowsPerSecond, to 126.

By setting spark.default.parallelism to 100, I was able to push processedRowsPerSecond up to 131.
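
A minimal sketch of how these two properties could be applied when the session is created is shown below. The property names and values are the ones reported above; the application name, the local master URL and the build_session helper are assumptions made for the example and may differ from the actual project code.

```python
# Property values reported above.
TUNED_CONF = {
    "spark.sql.shuffle.partitions": "25",
    "spark.default.parallelism": "100",
}


def build_session(app_name="SFCrimeStatistics", conf=TUNED_CONF):
    """Create a SparkSession with the tuned properties applied.

    The app name and the local[*] master are hypothetical placeholders.
    """
    from pyspark.sql import SparkSession  # imported lazily; requires pyspark

    builder = SparkSession.builder.master("local[*]").appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Print the configuration that build_session() would apply.
for key, value in TUNED_CONF.items():
    print(f"{key}={value}")
```

Holding the properties in one dictionary makes it straightforward to try several variations and compare processedRowsPerSecond across runs.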