Learning PySpark

Code base for the Learning PySpark book by Tomasz Drabas and Denny Lee.

Available from Packt and Amazon.

Introduction

It is estimated that in 2013 the whole world produced around 4.4 zettabytes of data; that is, 4.4 billion terabytes! By 2020, we (as a human race) are expected to produce ten times that. With data getting larger literally by the second, there is a growing appetite for making sense of it.

In this book, we will guide you through the latest incarnation of Apache Spark using Python. We will show you how to read structured and unstructured data, how to use some fundamental data types available in PySpark, how to build machine learning models, operate on graphs, read streaming data, and deploy your models in the cloud. Each chapter will tackle a different problem, and by the end of the book we hope you will be knowledgeable enough to solve other problems we did not have space to cover here.

Table of contents:

  1. Understanding Spark
  2. Resilient Distributed Dataset
  3. DataFrames
  4. Preparing Data for Modeling
  5. Introducing MLlib
  6. Introducing the ML Package
  7. GraphFrames
  8. TensorFrames
  9. Polyglot Persistence with Blaze
  10. Structured Streaming
  11. Packaging Spark Applications

About the authors

Tomasz Drabas is a Data Scientist working for Microsoft and currently residing in the Seattle area. He has over 13 years of experience in data analytics and data science in numerous fields: advanced technology, airlines, telecommunications, finance, and consulting, gained while working on three continents: Europe, Australia, and North America. While in Australia, Tomasz worked on his PhD in Operations Research with a focus on choice modeling and revenue management applications in the airline industry.

At Microsoft, Tomasz works with big data on a daily basis, solving machine learning problems such as anomaly detection, churn prediction, and pattern recognition using Spark.

Tomasz has also authored the Practical Data Analysis Cookbook published by Packt Publishing in 2016; you can purchase that book on Amazon, Packt and O’Reilly.

Denny Lee is a Principal Program Manager at Microsoft for the Azure DocumentDB team – Microsoft’s blazing fast, planet-scale managed document store service. He is a hands-on distributed systems and data sciences engineer with more than 18 years of experience developing internet-scale infrastructure, data platforms, and predictive analytics systems for both on-premises and cloud environments.

He has extensive experience in building greenfield teams as well as serving as a turnaround/change catalyst. Prior to joining the Azure DocumentDB team, Denny worked as a Technology Evangelist at Databricks; he has been working with Apache Spark since version 0.5. He was also the Senior Director of Data Sciences Engineering at Concur, and was on the incubation team that built Microsoft’s Hadoop on Windows and Azure service (currently known as HDInsight). Denny also has a Master's in Biomedical Informatics from Oregon Health and Sciences University and has architected and implemented powerful data solutions for enterprise healthcare customers for the last fifteen years.