Presentation for PyData Berlin September meetup
- Why Spark?
- Why PySpark?
- Why Not [Py]Spark?
- Getting Started
- Core Concepts
- ETL Example
- Machine Learning Example
- Unit Testing
- Performance
- Gotchas
- [Py]Spark Alternatives
- References
- Large data sets
- Cost of scaling up >> cost of scaling out
- Batch processing, stream processing, graph processing, SQL and machine learning
- In memory (sometimes)
- Programming model
- Generic framework
- ScalaData?
- Existing platform
- Team - existing and future
- Performance
- Complexity
- Troubleshooting
- Necessary?
- Small community?
- Driver / Workers
- RDDs
- Immutable collection
- Resilient
- Distributed / partitioned and can control partitioning
- In-memory (at times)
- Loading data
- Files on local filesystem, HDFS, S3, Redshift, Hive, etc.
- CSV, JSON, Parquet, etc.
- Transforms
  - map / reduce
  - filter
  - aggregate
  - joins
- Actions
  - saveAsTextFile
  - count
  - take / first
  - collect
- DataFrames
- Higher-level concept
- Based on RDD
- Structured - like a table and with a schema (which can be inferred)
- Faster
- Easier to work with
- API or SQL
- findspark
export SPARK_HOME="..."
- spark-testing-base
  - class SparkTestingBase(TestCase)
  - class SparkTestingBaseReuse(TestCase)
export PYSPARK_SUBMIT_ARGS="… pyspark-shell"
export SPARK_MASTER="yarn-client"
- Cache / Persist
- Double serialization cost
- Cython and/or compiled libraries
- Potential to call Scala/Java code?
- Holden Karau - Improving PySpark Performance: Spark performance beyond the JVM
- Zeppelin
- Databricks / Livy
- Methods are distributed by pickling their enclosing class (seemingly including static methods)
- Spurious error messages
- Examples
Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader
You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o23)
16/06/14 14:46:20 INFO latency: StatusCode=[404], Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: 334AFFEECBCB0CC9)
java.lang.IllegalArgumentException: Invalid S3 URI: hostname does not appear to be a valid S3 endpoint:
- In some cases these aren't errors at all. In others they're masking the real errors - look elsewhere in the console / log
- Examples
  - For some (e.g. running locally and disconnected) use-cases, `HiveContext` is less stable than `SQLContext` (though the community generally recommends the former)
- Distributing Python files to the workers
  - `--py-files` and/or `--zip-files` seem not always to work as expected
  - Packaging the files and installing them on the servers (e.g. in a bootstrap action) seems more reliable
- Select syntax quirks
  - Use bitwise operators such as `~` on columns
  - Other random errors can often be fixed with the addition of brackets
- spark-csv
  - In some cases you need to set an escape character, and neither `None` nor the empty string works; weird unicode characters seem to work
  - When seeing problems such as `java.lang.NoClassDefFoundError` or `java.lang.NoSuchMethodError`, check you're using the version built for the appropriate version of Scala (2.10 vs. 2.11)
  - `sqlContext.read.load` fails with the following error when reading CSV files if `format='csv'` is not specified (which is not required for `sqlContext.load`):
    Caused by: java.io.IOException: Could not read footer: java.lang.RuntimeException: file:/Users/Richard/src/earnest/preprocessing/storage/local/mnt/3m-panel/card/20160120_YODLEE_CARD_PANEL.txt is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [46, 50, 48, 10]
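A sketch of the read call in question; this is a fragment, not runnable on its own, assuming spark-csv is on the classpath (e.g. launched with `--packages`) and an existing `sqlContext`, with an illustrative path and options:

```python
# Assumes an existing sqlContext and the spark-csv package on the classpath;
# path and options here are illustrative
df = sqlContext.read.load(
    "data/bank.csv",
    format="csv",   # omit this and read.load tries to parse the file as Parquet
    header="true",
    inferSchema="true",
)
```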
- Redshift data source's behavior can challenge expectations
- Be careful with schemas and be aware of when it's rewriting them
- For longer text fields do not allow the datasource to [re]create the table
- Pre-actions don't seem to work on some builds
- Remember to set-up a cleanup policy for the transfer directory on S3
- Scala Spark
- Beam / Flink / Apex / ...
- Pig etc.
- AWK / sed?
- Python
- Pandas?
- Threads?
- AsyncIO/Tornado/etc.
- Multiprocessing
- Parallel Python
- IPython Parallel
- Cython
- Queues / pub/sub (NPQ, Celery, SQS, etc.)
- Gearman
- PyRes
- Distarray / Blaze
- Dask
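To make the comparison with the last alternative concrete, a rough Dask analogue of a PySpark word count; assumes the `dask` package is installed and runs in-process rather than on a cluster:

```python
# Dask bags offer a similar lazy, functional API to RDDs;
# compute() plays the role of a Spark action
import dask.bag as db

bag = db.from_sequence(["spark", "dask", "spark"])
counts = dict(bag.frequencies().compute())
```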
- Dataset used in examples
- [Moro et al., 2011] S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology.
- In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS.
- http://hdl.handle.net/1822/14838
- http://www3.dsi.uminho.pt/pcortez/bib/2011-esm-1.txt
- Modified to add names and addresses using Faker
- DataFrames
- https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
- https://medium.com/@chris_bour/6-differences-between-pandas-and-spark-dataframes-1380cec394d2#.xe6wm0h56
- https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/
- https://databricks.com/blog/2015/08/12/from-pandas-to-apache-sparks-dataframe.html
- Adapters
- Performance
- https://stackoverflow.com/questions/31684842/how-to-use-java-scala-function-from-an-action-or-a-transformation
- http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
- http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
- https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html
- Zeppelin
- ML
- Books
- Courses on edX
- richdutton on pythonberlin Slack
- richdutton on GitHub
- https://de.linkedin.com/in/duttonrichard
- http://earnestresearch.com