This is the code repository for PySpark Cookbook, published by Packt.
Over 60 recipes for implementing big data processing and analytics using Apache Spark and Python
Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. The PySpark Cookbook presents effective and time-saving recipes for leveraging the power of Python and putting it to use in the Spark ecosystem.
This book covers the following exciting features:
- Configure a local instance of PySpark in a virtual environment
- Install and configure Jupyter in local and multi-node environments
- Create DataFrames from JSON and a dictionary using pyspark.sql (see the sketch after this list)
- Explore regression and clustering models available in the ML module
- Use DataFrames to transform data used for modeling
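To give a flavor of the DataFrame recipes above, here is a minimal sketch of creating and transforming DataFrames with pyspark.sql. The file name `people.json`, the column names, and the sample values are illustrative assumptions, not examples taken from the book.

```python
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("cookbook-sketch").getOrCreate()

# DataFrame from a JSON file (one JSON object per line);
# people.json is a hypothetical file name used only for illustration.
json_df = spark.read.json("people.json")

# DataFrame from Python dictionaries, wrapped in Rows so the schema is explicit.
people = [Row(name="Alice", age=34), Row(name="Bob", age=29)]
dict_df = spark.createDataFrame(people)

# A simple transformation of the kind used to prepare data for modeling.
dict_df.where(dict_df.age >= 30).select("name").show()

spark.stop()
```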
If you feel this book is for you, get your copy today!
All of the code is organized into folders. For example, Chapter02.
The code will look like the following:
if [ "${_check_R_req}" = "true" ]; then
checkR
fi
Following is what you need for this book: The PySpark Cookbook is for you if you are a Python developer looking for hands-on recipes for using the Apache Spark 2.x ecosystem in the best possible way. A thorough understanding of Python (and some familiarity with Spark) will help you get the best out of the book.
With the following software and hardware list, you can run all of the code files present in the book (Chapters 1-8).
Chapter | Software required | OS required
--- | --- | ---
1-8 | Apache Spark, Python, Jupyter, Cloudera QuickStart VM | Linux (preferably Ubuntu 14.04 or later)
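If you want to confirm that a local installation matches this list before starting, a quick version check such as the following can help. The `local[*]` master and the assumption that PySpark is already installed (for example, via `pip install pyspark`) are ours, not the book's exact setup.

```python
import sys
from pyspark.sql import SparkSession

# Start a throwaway local session and report the versions in use.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("environment-check")
         .getOrCreate())

print("Python version:", sys.version.split()[0])
print("Spark version:", spark.version)

spark.stop()
```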
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. Click here to download it.
Denny Lee is a technology evangelist at Databricks. He is a hands-on data science engineer with more than 15 years of experience. His key focus is solving complex large-scale data problems, providing not only architectural direction but also hands-on implementation of such systems. He has extensive experience in building greenfield teams as well as acting as a turnaround/change catalyst. Prior to joining Databricks, he was a senior director of data science engineering at Concur and was part of the incubation team that built Hadoop on Windows and Azure (currently known as HDInsight).
Tomasz Drabas is a data scientist specializing in data mining, deep learning, machine learning, choice modeling, natural language processing, and operations research. He is the author of Learning PySpark and Practical Data Analysis Cookbook. He holds a PhD from the University of New South Wales School of Aviation. His research areas are machine learning and choice modeling for airline revenue management.
Click here if you have any feedback or suggestions.
If you have already purchased a print or Kindle version of this book, you can get a DRM-free PDF version at no cost.
Simply click on the link to claim your free PDF.