databricks/Spark-The-Definitive-Guide

samples chapter 5?


In chapter 5 the following dataframe is used:

val df = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("dbfs:/mnt/defg/streaming/*.csv")
  .coalesce(5)

I cannot find a /streaming subdirectory in the /data subdirectory.

When I use retail-data/all/online-retail-dataset.csv instead, InvoiceDate is not recognized as a Date.

To get the window function, rollup, cube, and pivot examples to work, I am using the retail-data/all/online-retail-dataset.csv file with the following code to convert InvoiceDate:

import org.apache.spark.sql.functions.{col, to_date, udf}

// Convert the "bad" date string into yyyy-MM-dd form.
// getDate("7/28/2011 8:21") returns "2011-07-28"
def getDate(s: String): String = {
  val DateRegex = """([0-9]{1,2})/([0-9]{1,2})/([0-9]{4}) ([0-9]{1,2}):([0-9]{1,2})""".r
  // Left-pad single-digit months and days with a zero.
  def addBlank(s: String): String = {
    if (s.length == 1) "0" + s else s
  }
  s match {
    case DateRegex(month, day, year, _, _) => year + "-" + addBlank(month) + "-" + addBlank(day)
    // Non-matching input becomes null rather than throwing a MatchError.
    case _ => null
  }
}

val getDateUdf = udf(getDate(_: String): String)

val dfWithDate = df.withColumn("date", to_date(getDateUdf(col("InvoiceDate"))))

dfWithDate.printSchema

dfWithDate.show(5)

dfWithDate.createOrReplaceTempView("dfWithDate")
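
A possible alternative to the UDF, as a minimal sketch: Spark's built-in to_date also accepts a format string (available since Spark 2.2), so the regex step may not be needed. The pattern "M/d/yyyy H:mm" is an assumption based on the single example value "7/28/2011 8:21" quoted above, and the sample rows and the app name are hypothetical stand-ins for the real CSV.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

// In spark-shell or a notebook a session already exists; getOrCreate() reuses it.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("invoice-date-sketch") // hypothetical name
  .getOrCreate()
import spark.implicits._

// Hypothetical sample rows shaped like the InvoiceDate column.
val sample = Seq("7/28/2011 8:21", "12/1/2010 8:26").toDF("InvoiceDate")

// Assumed pattern: month/day/year hour:minute, with no zero padding on month, day, or hour.
val withDate = sample.withColumn("date", to_date(col("InvoiceDate"), "M/d/yyyy H:mm"))

withDate.printSchema()
withDate.show()
```

If the real column contains values in another layout, the pattern would need adjusting. How unparseable values are handled (null versus an error) depends on the Spark version and its ANSI setting.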

I think this was an older version of the code, and we resolved this in newer versions of the book. I'll close this out for now; let me know if you're still having trouble.