This repository demonstrates some of the mechanics necessary to load a sample Parquet formatted file from an AWS S3 bucket. A Python job is submitted to a local Apache Spark instance, which uses a SQLContext to load the Parquet file contents into a DataFrame and register it as a temporary table. SQL queries can then be run against the in-memory temporary table. SparkSQL has a lot to explore, and this repo will serve as a cool place to check things out.
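To make that flow concrete, here is a minimal sketch written against the Spark 1.6-style PySpark API. It is not the repo's script; the app name, Parquet path, and table name are placeholders.

```python
# Minimal sketch of the flow described above (Spark 1.6-style API).
# The app name, Parquet path, and table name are placeholders.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="parquet-sql-demo")
sqlContext = SQLContext(sc)

# Load the Parquet file into a DataFrame.
df = sqlContext.read.parquet("path/to/sample.parquet")

# Register the DataFrame as an in-memory temporary table.
df.registerTempTable("sample_table")

# Run SQL queries against the temporary table.
sqlContext.sql("SELECT * FROM sample_table LIMIT 10").show()

sc.stop()
```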
The sample Parquet file was pulled from the following repository. Thanks a bunch!
The following script can be copied and pasted into a Zeppelin notebook running on AWS EMR (a minimal sketch of such a paragraph is shown after the steps below).
- AWS Account created
- AWS access key ID and secret access key stored in the ~/.aws/credentials file
- EMR cluster configured with Spark 1.6.1 and Apache Zeppelin
- Copy the Parquet file to an S3 bucket in your AWS account.
- Configure the Spark Interpreter in Zeppelin.
- Copy the script into a new Zeppelin Notebook.
- Run the script with the "arrow button".
- Profit and play around with PySpark in the safety of the Zeppelin notebook.
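For reference, a Zeppelin paragraph along these lines could be pasted in. This is only a sketch, not the repo's actual script; the bucket name, file key, and table name are placeholders you would replace with your own. Zeppelin's Spark interpreter on EMR normally pre-creates `sc` and `sqlContext` for you.

```python
%pyspark
# Sketch of a Zeppelin paragraph (not the repo's exact script).
# Zeppelin on EMR pre-creates sc and sqlContext for the %pyspark interpreter.
# Replace the bucket and key below with wherever you copied the Parquet file.
df = sqlContext.read.parquet("s3://your-bucket/path/to/nations.parquet")

# Register the in-memory temporary table and query it with SparkSQL.
df.registerTempTable("nations")
sqlContext.sql("SELECT * FROM nations LIMIT 10").show()
```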
The following output is from one of my sample runs ...
The following script has been configured to run against a local instance of Spark. The Parquet file is also served locally rather than from S3. Other than that, the script does the same thing as the AWS script, and the output will be the same. A minimal sketch of the local flow is shown after the steps below.
- Apache Spark 1.6.1 installed locally.
- Python 2.7.11
- From the repository root, cd into pyspark-scripts
- Run python nations-parquet-sql-local.py
- Once again profit and play around.
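As a rough idea of what the local run does, here is a hedged sketch along the lines of nations-parquet-sql-local.py. The master setting, file path, and table name are guesses, so check the actual script in pyspark-scripts for the real values.

```python
# Sketch of a local-mode run (not the actual nations-parquet-sql-local.py).
# Assumes Spark 1.6.x locally; the Parquet path below is a guess -- point it
# at wherever the sample file sits in your checkout.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setMaster("local[*]").setAppName("nations-parquet-sql-local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Load the locally served Parquet file into a DataFrame.
df = sqlContext.read.parquet("../sample_data/nations.parquet")

# Register the temporary table and run a query that needs no column names.
df.registerTempTable("nations")
sqlContext.sql("SELECT COUNT(*) AS nation_count FROM nations").show()

# Inspect the schema to see which columns are available to play with.
df.printSchema()

sc.stop()
```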