SparkSQL

In this notebook, we will learn how to use the DataFrame API and SparkSQL to perform simple data analytics tasks.

Goals

The main goals of this notebook are the following:

  1. Understand the advantages and disadvantages of using DataFrames over RDDs
  2. Analyze the airline data with the DataFrame API and SparkSQL

Steps

  • First, in section 1, we give a short introduction to the DataFrame API, with a small example that shows how we can use it and how it compares to the low-level RDD abstraction.
  • In section 2, we delve into the details of the use case of this notebook: we provide the context and introduce the data.
  • In section 3, we perform data exploration and analysis.