databricks/Spark-The-Definitive-Guide

Code samples in repo

mrandrewandrade opened this issue · 11 comments

I'm going through the guide, and it's a bit tedious to type out all of the code examples. Copy-pasting is tedious as well because of the line endings and the curly quotes (“”) that appear instead of straight quotes ("").

Would you accept a PR that includes the code examples from the book? Many textbooks have the code examples available on their website.

Hey! Thanks for the thought. As we just finalized the entire book, I think it's a bit premature to get all of the code samples in here. I certainly empathize with the challenge that you're facing :).

Can you give me another week or so to figure out the best approach and then I can follow up?

Thanks,
Bill

I also added the code to a gist here if it's helpful!

Hey @mrandrewandrade, the book has finally gone to print! Sorry for the delay on this. Now that it's gone to print, we're ready to start adding the code samples!

What I'm going to do is try to pull all of them out of the book now in order to post them here. I don't expect it to take too long, but feel free to let me know if you have any thoughts!

Will the code samples used in the book be in this repo, or will they be located in another repository?

They will be in this repo.

I was going through your free preview copy and ran into an issue with the Python code that selects the top 5. I had to make the following changes to get the top 5 rows; otherwise, the 'ORDER BY' doesn't work:
old:
from pyspark.sql.functions import col, window

# Parentheses let the method chain span several lines
purchaseByCustomerPerHour = (
    streamingDataFrame
    .selectExpr(
        'CustomerId',
        '(UnitPrice * Quantity) as total_cost',
        'InvoiceDate')
    .groupBy(
        col('CustomerId'),
        window(col('InvoiceDate'), '1 day'))
    .sum('total_cost')
)

new:
purchaseByCustomerPerHour = (
    streamingDataFrame
    .selectExpr(
        'CustomerId',
        '(UnitPrice * Quantity) as total_cost',
        'InvoiceDate')
    .groupBy(
        col('CustomerId'),
        window(col('InvoiceDate'), '1 day'))
    .sum('total_cost')
    # Rename the aggregate so SQL can refer to it without quoting
    .withColumnRenamed('sum(total_cost)', 'sum_total_cost')
)

old:
spark.sql("""
SELECT *
FROM customer_purchases
ORDER BY 'sum(total_cost)' DESC
""").show(5)

new:
spark.sql("""
SELECT *
FROM customer_purchases
ORDER BY sum_total_cost DESC
""").show(5)

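For anyone hitting the same thing, a likely explanation: in Spark SQL, single quotes produce a string literal, so ORDER BY 'sum(total_cost)' orders by a constant rather than by the column, while backticks quote an identifier. Below is a minimal sketch of that difference. It uses Python's built-in sqlite3 as a stand-in so it runs without Spark (SQLite reads single quotes and backticks the same way in this position, as far as I know), and the table contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical stand-in for customer_purchases. The column is deliberately
# named "sum(total_cost)" to mirror the default name of the aggregate column.
conn.execute(
    'CREATE TABLE customer_purchases (CustomerId INTEGER, "sum(total_cost)" REAL)'
)
conn.executemany(
    "INSERT INTO customer_purchases VALUES (?, ?)",
    [(1, 10.0), (2, 30.0), (3, 20.0)],  # made-up rows
)


def totals(order_by):
    """Return the total column in the order produced by the given ORDER BY term."""
    query = f"SELECT * FROM customer_purchases ORDER BY {order_by} DESC"
    return [row[1] for row in conn.execute(query)]


# Single quotes: a string literal, the same constant for every row,
# so this clause does not sort by the column.
by_literal = totals("'sum(total_cost)'")

# Backticks: a quoted identifier, so the rows are sorted by the column.
by_identifier = totals("`sum(total_cost)`")

print(by_literal)
print(by_identifier)
```

If that is what is going on, writing the column in backticks in the spark.sql query should be an alternative to renaming it with withColumnRenamed, though renaming also keeps the rest of the SQL easier to read.
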
I think that version is likely old. The book has changed quite a bit since then.

@anabranch I see, thanks.

Code is posted! It's under the code folder. Let me know if you see any issues!