/introduction-to-pyspark

Presentation for PyData Berlin meetup

Primary LanguagePython

Introduction to PySpark

Presentation for PyData Berlin September meetup

Agenda

  • Why Spark?
  • Why PySpark?
  • Why Not [Py]Spark?
  • Getting Started
  • Core Concepts
  • ETL Example
  • Machine Learning Example
  • Unit Testing
  • Performance
  • Gotchas
  • [Py]Spark Alternatives
  • References

Why Spark?

  • Large data sets
  • Cost of scaling up >> cost of scaling out
  • Batch processing, stream processing, graph processing, SQL and machine learning
  • In memory (sometimes)
  • Programming model
  • Generic framework

Why PySpark?

  • ScalaData?
  • Existing platform
  • Team - existing and future

Why Not [Py]Spark?

  • Performance
  • Complexity
  • Troubleshooting
  • Necessary?
  • Small community?

Getting Started

  • Local
  • Cloud
    • Databricks
    • AWS
    • ZeppelinHub, Microsoft, Google, etc.

Core concepts

  • Driver / Workers
  • RDDs
    • Immutable collection
    • Resilient
    • Distributed / partitioned and can control partitioning
    • In-memory (at times)
  • Loading data
    • Files on local filesystem, HDFS, S3, RedShift, Hive, etc.
    • CSV, JSON, Parquet, etc.
  • Transforms
    • map / reduce
    • filter
    • aggregate
    • joins
  • Actions
    • writeTextFile
    • count
    • take / first
    • collect
  • DataFrames
    • Higher-level concept
    • Based on RDD
    • Structured - like a table and with a schema (which can be inferred)
    • Faster
    • Easier to work with
    • API or SQL

ETL Example

Databricks notebook

ML Example

Unit Testing

  • findspark
    • export SPARK_HOME="..."
  • spark-testing-base
    • class SparkTestingBase: TestCase
    • class SparkTestingBaseReuse: TestCase
  • export PYSPARK_SUBMIT_ARGS=“… pyspark-shell"
  • export SPARK_MASTER=“yarn-client"

Performance

Gotchas

  • Pickling of class when distributing methods (seemingly including statics)
  • Spurious error messages
    • Examples
      • Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader
      • You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o23)
      • 16/06/14 14:46:20 INFO latency: StatusCode=[404], Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: 334AFFEECBCB0CC9)
      • java.lang.IllegalArgumentException: Invalid S3 URI: hostname does not appear to be a valid S3 endpoint:
    • In some cases these aren't errors at all. In others they're masking the real errors - look elsewhere in the console / log
  • For some (e.g. running locally and disconnected) use-cases, HiveContext is less stable than SQLContext (though community generally recommends the former)
  • Distributing Python files to the workers
    • --py-files and/or --zip-files seem not always to work as expected
    • Packaging files and installing on servers (e.g. in bootstrap) seems more reliable
  • Select syntax quirks
    • Use bitwise operators such as ~ on columns
    • Other random errors can often be fixed with the addition of brackets
  • spark-csv
    • In some cases need to set an escape character and neither None nor the empty string work. Weird unicode characters seem to work

    • When seeing problems such as java.lang.NoClassDefFoundError or java.lang.NoSuchMethodError, check you're using the version built for the appropriate version of Scala (2.10 vs.2.11)

    • sqlContext.read.load fails with the following error, when reading CSV files, if format='csv' is not specified (which is not required for sqlContext.load:

      Caused by: java.io.IOException: Could not read footer: java.lang.RuntimeException: file:/Users/Richard/src/earnest/preprocessing/storage/local/mnt/3m-panel/card/20160120_YODLEE_CARD_PANEL.txt is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [46, 50, 48, 10]

  • Redshift data source's behavior can challenge expectations
    • Be careful with schemas and be aware of when it's rewriting them
    • For longer text fields do not allow the datasource to [re]create the table
    • Pre-actions don't seem to work on some builds
    • Remember to set-up a cleanup policy for the transfer directory on S3

[Py]Spark Alternatives

  • Scala Spark
  • Beam / Flink / Apex / ...
  • Pig etc.
  • AWK / sed?
  • Python
    • Pandas?
    • Threads?
    • AsyncIO/Tornado/etc.
    • Multiprocessing
    • Parallel Python
    • IPython Parallel
    • Cython
    • Queues / pub/sub (NPQ, Celery, SQS, etc.)
    • Gearman
    • PyRes
    • Distarray / Blaze
    • Dask

References

Contact