This repository demonstrates some of the mechanics necessary to load a sample Parquet formatted file from an AWS S3 bucket. A Python job is submitted to a local Apache Spark instance, which uses a SQLContext to load the Parquet file contents into a DataFrame and register it as a temporary table. SQL queries can then be run against the in-memory temporary table. SparkSQL has a lot to explore, and this repo will serve as a cool place to check things out.
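To make that flow concrete, here is a minimal sketch written against the Spark 1.6-style PySpark API. It is not the repo's script; the app name, Parquet path, and table name are placeholders.

```python
# Minimal sketch of the flow described above (Spark 1.6-style API).
# The app name, Parquet path, and table name are placeholders.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="parquet-sql-demo")
sqlContext = SQLContext(sc)

# Load the Parquet file into a DataFrame.
df = sqlContext.read.parquet("path/to/sample.parquet")

# Register the DataFrame as an in-memory temporary table.
df.registerTempTable("sample_table")

# Run SQL queries against the temporary table.
sqlContext.sql("SELECT * FROM sample_table LIMIT 10").show()

sc.stop()
```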
The sample Parquet file was pulled from the following repository. Thanks a bunch!
The following script can be copied and pasted into a Zeppelin notebook running on AWS EMR (a minimal sketch of such a paragraph is shown after the steps below).
- AWS Account created
- AWS access key ID and secret access key stored in the ~/.aws/credentials file
- EMR cluster configured with Spark 1.6.1 and Apache Zeppelin
- Copy the Parquet file to an S3 bucket in your AWS account.
- Configure the Spark Interpreter in Zeppelin.
- Copy the script into a new Zeppelin Notebook.
- Run the script with the "arrow button".
- Profit and play around with PySpark in the safety of the Zeppelin notebook.
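For reference, a Zeppelin paragraph along these lines could be pasted in. This is only a sketch, not the repo's actual script; the bucket name, file key, and table name are placeholders you would replace with your own. Zeppelin's Spark interpreter on EMR normally pre-creates `sc` and `sqlContext` for you.

```python
%pyspark
# Sketch of a Zeppelin paragraph (not the repo's exact script).
# Zeppelin on EMR pre-creates sc and sqlContext for the %pyspark interpreter.
# Replace the bucket and key below with wherever you copied the Parquet file.
df = sqlContext.read.parquet("s3://your-bucket/path/to/nations.parquet")

# Register the in-memory temporary table and query it with SparkSQL.
df.registerTempTable("nations")
sqlContext.sql("SELECT * FROM nations LIMIT 10").show()
```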
The following output is from one of my sample runs ...
The following script has been configured to run against a local instance of Spark. The Parquet file is also served locally rather than from S3. Other than that, the script does the same thing as the AWS script, and the output will be the same. A minimal sketch of the local flow is shown after the steps below.
- Apache Spark 1.6.1 installed locally.
- Python 2.7.11
- From the repository root, cd into pyspark-scripts
- Run python nations-parquet-sql-local.py
- Once again profit and play around.
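As a rough idea of what the local run does, here is a hedged sketch along the lines of nations-parquet-sql-local.py. The master setting, file path, and table name are guesses, so check the actual script in pyspark-scripts for the real values.

```python
# Sketch of a local-mode run (not the actual nations-parquet-sql-local.py).
# Assumes Spark 1.6.x locally; the Parquet path below is a guess -- point it
# at wherever the sample file sits in your checkout.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setMaster("local[*]").setAppName("nations-parquet-sql-local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Load the locally served Parquet file into a DataFrame.
df = sqlContext.read.parquet("../sample_data/nations.parquet")

# Register the temporary table and run a query that needs no column names.
df.registerTempTable("nations")
sqlContext.sql("SELECT COUNT(*) AS nation_count FROM nations").show()

# Inspect the schema to see which columns are available to play with.
df.printSchema()

sc.stop()
```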