Data Analysis with Python and PySpark

This is the companion repository for the Data Analysis with Python and PySpark book (Manning, estimated publication date: 2022). It contains the source code and, where pertinent, the data download scripts.

Get the data

The complete data set for the book is around 1 GB. To keep the repository small enough to clone comfortably, I moved the data sources to Dropbox. The book assumes the data is under ./data.
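
Since the book's code expects the data under ./data, a quick check before running any example can save a confusing "file not found" error. The following is a minimal sketch, not part of the book's code: the helper name `data_dir_ready` is hypothetical, and it only verifies that the directory exists and is not empty, not that any particular data set is present.

```python
from pathlib import Path


def data_dir_ready(data_dir: str = "./data") -> bool:
    """Return True if data_dir is an existing directory with at least one entry.

    Hypothetical helper for illustration; it does not check which
    data sets are present, only that the directory is not empty.
    """
    path = Path(data_dir)
    return path.is_dir() and any(path.iterdir())


if __name__ == "__main__":
    if data_dir_ready():
        print("Found data under ./data")
    else:
        print("No data found under ./data; download the data sets first.")
```

Run it from the root of the repository so that the relative path ./data resolves to the location the book assumes.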

Mistakes or omissions

If you encounter mistakes in the book manuscript (including the printed source code), please use the Manning platform to provide feedback.