databricks/Spark-The-Definitive-Guide

Chapter_3_sort_code_missing

adigiosaffatte opened this issue · 2 comments

When introducing the streaming DataFrame, the following code:

from pyspark.sql.functions import window, column, desc, col
staticDataFrame\
.selectExpr( # variant of select that accepts SQL expressions
    "CustomerId",
    "(UnitPrice * Quantity) as total_cost",
    "InvoiceDate") \
.groupBy(
    col("CustomerId"), window(col("InvoiceDate"), "1 day")) \
.sum("total_cost") \
.show(5)

produces output that differs from the one shown in the chapter, because it is missing a "sorting line": without an explicit sort, show(5) returns five rows in no guaranteed order.

I think the correct code should be:

from pyspark.sql.functions import window, column, desc, col
staticDataFrame\
.selectExpr( # variant of select that accepts SQL expressions
    "CustomerId",
    "(UnitPrice * Quantity) as total_cost",
    "InvoiceDate") \
.groupBy(
    col("CustomerId"), window(col("InvoiceDate"), "1 day")) \
.sum("total_cost") \
.sort(desc("sum(total_cost)")) \
.show(5)
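
To illustrate why the sort matters, here is a rough sketch in plain Python (no Spark needed). The rows are made up for illustration and are not the book's retail dataset; the names `rows`, `totals`, `unsorted_top` and `sorted_top` are hypothetical. It mimics the same steps: compute the cost per row, sum it per (customer, day) group, then take the first 5 groups with and without a descending sort.

```python
from collections import defaultdict

# Hypothetical sample rows (not the book's data):
# (CustomerId, invoice day, UnitPrice, Quantity)
rows = [
    (1, "2011-01-01", 2.0, 3),
    (2, "2011-01-01", 10.0, 5),
    (1, "2011-01-01", 1.5, 4),
    (3, "2011-01-02", 4.0, 1),
    (2, "2011-01-02", 0.5, 2),
]

# Sum UnitPrice * Quantity per (customer, day) group,
# roughly what groupBy(...).sum("total_cost") does with a 1-day window.
totals = defaultdict(float)
for customer_id, day, unit_price, quantity in rows:
    totals[(customer_id, day)] += unit_price * quantity

# No sort: the "first 5" are simply whichever groups come first.
# Here that is dict insertion order; Spark gives no ordering guarantee.
unsorted_top = list(totals.items())[:5]

# Descending sort on the summed cost, then take 5:
# the plain-Python analogue of .sort(desc("sum(total_cost)")).show(5).
sorted_top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:5]

for (customer_id, day), total in sorted_top:
    print(customer_id, day, total)
```

Only the sorted version is guaranteed to put the largest totals first, which is what the chapter's output appears to show.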

go ahead and make a pull request to fix this please

just made the pull request