Presentation for PyData Berlin September meetup
- Why Spark?
- Why PySpark?
- Why Not [Py]Spark?
- Getting Started
- Core Concepts
- ETL Example
- Machine Learning Example
- Unit Testing
- Performance
- Gotchas
- [Py]Spark Alternatives
- References
- Large data sets
- Cost of scaling up >> cost of scaling out
- Batch processing, stream processing, graph processing, SQL and machine learning
- In memory (sometimes)
- Programming model
- Generic framework
- ScalaData?
- Existing platform
- Team - existing and future
- Performance
- Complexity
- Troubleshooting
- Necessary?
- Small community?
- Driver / Workers
- RDDs
- Immutable collection
- Resilient
- Distributed / partitioned and can control partitioning
- In-memory (at times)
- Loading data
- Files on local filesystem, HDFS, S3, Redshift, Hive, etc.
- CSV, JSON, Parquet, etc.
- Transforms
  - map / reduce
  - filter
  - aggregate
  - joins
- Actions
  - saveAsTextFile
  - count
  - take / first
  - collect
- DataFrames
- Higher-level concept
- Based on RDD
- Structured - like a table and with a schema (which can be inferred)
- Faster
- Easier to work with
- API or SQL
- findspark
export SPARK_HOME="..."
- spark-testing-base
  - class SparkTestingBase(TestCase)
  - class SparkTestingBaseReuse(TestCase)
export PYSPARK_SUBMIT_ARGS="… pyspark-shell"
export SPARK_MASTER="yarn-client"
- Cache / Persist
- Double serialization cost
- Cython and/or compiled libraries
- Potential to call Scala/Java code?
- Holden Karau - Improving PySpark Performance: Spark performance beyond the JVM
- Zeppelin
- Databricks / Livy
- Methods are distributed by pickling their enclosing class (seemingly including static methods)
- Spurious error messages
- Examples
Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader
You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o23)
16/06/14 14:46:20 INFO latency: StatusCode=[404], Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: 334AFFEECBCB0CC9)
java.lang.IllegalArgumentException: Invalid S3 URI: hostname does not appear to be a valid S3 endpoint:
- In some cases these aren't errors at all. In others they're masking the real errors - look elsewhere in the console / log
- Examples
  - For some (e.g. running locally and disconnected) use-cases, `HiveContext` is less stable than `SQLContext` (though the community generally recommends the former)
- Distributing Python files to the workers
  - `--py-files` and/or `--zip-files` seem not always to work as expected
  - Packaging the files and installing them on the servers (e.g. in a bootstrap action) seems more reliable
- Select syntax quirks
  - Use bitwise operators such as `~` on columns
  - Other random errors can often be fixed with the addition of brackets
- spark-csv
  - In some cases you need to set an escape character, and neither `None` nor the empty string works; weird unicode characters seem to work
  - When seeing problems such as `java.lang.NoClassDefFoundError` or `java.lang.NoSuchMethodError`, check you're using the version built for the appropriate version of Scala (2.10 vs. 2.11)
  - `sqlContext.read.load` fails with the following error when reading CSV files if `format='csv'` is not specified (which is not required for `sqlContext.load`):
    Caused by: java.io.IOException: Could not read footer: java.lang.RuntimeException: file:/Users/Richard/src/earnest/preprocessing/storage/local/mnt/3m-panel/card/20160120_YODLEE_CARD_PANEL.txt is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [46, 50, 48, 10]
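A sketch of the read call in question; this is a fragment, not runnable on its own, assuming spark-csv is on the classpath (e.g. launched with `--packages`) and an existing `sqlContext`, with an illustrative path and options:

```python
# Assumes an existing sqlContext and the spark-csv package on the classpath;
# path and options here are illustrative
df = sqlContext.read.load(
    "data/bank.csv",
    format="csv",   # omit this and read.load tries to parse the file as Parquet
    header="true",
    inferSchema="true",
)
```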
- Redshift data source's behavior can challenge expectations
- Be careful with schemas and be aware of when it's rewriting them
- For longer text fields do not allow the datasource to [re]create the table
- Pre-actions don't seem to work on some builds
- Remember to set-up a cleanup policy for the transfer directory on S3
- Scala Spark
- Beam / Flink / Apex / ...
- Pig etc.
- AWK / sed?
- Python
- Pandas?
- Threads?
- AsyncIO/Tornado/etc.
- Multiprocessing
- Parallel Python
- IPython Parallel
- Cython
- Queues / pub/sub (NPQ, Celery, SQS, etc.)
- Gearman
- PyRes
- Distarray / Blaze
- Dask
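To make the comparison with the last alternative concrete, a rough Dask analogue of a PySpark word count; assumes the `dask` package is installed and runs in-process rather than on a cluster:

```python
# Dask bags offer a similar lazy, functional API to RDDs;
# compute() plays the role of a Spark action
import dask.bag as db

bag = db.from_sequence(["spark", "dask", "spark"])
counts = dict(bag.frequencies().compute())
```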
- Dataset used in examples
- [Moro et al., 2011] S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology.
- In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS.
- http://hdl.handle.net/1822/14838
- http://www3.dsi.uminho.pt/pcortez/bib/2011-esm-1.txt
- Modified to add names and addresses using Faker
- DataFrames
- https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
- https://medium.com/@chris_bour/6-differences-between-pandas-and-spark-dataframes-1380cec394d2#.xe6wm0h56
- https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/
- https://databricks.com/blog/2015/08/12/from-pandas-to-apache-sparks-dataframe.html
- Adapters
- Performance
- https://stackoverflow.com/questions/31684842/how-to-use-java-scala-function-from-an-action-or-a-transformation
- http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
- http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
- https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html
- Zeppelin
- ML
- Books
- Courses on edX
- richdutton on pythonberlin Slack
- richdutton on GitHub
- https://de.linkedin.com/in/duttonrichard
- http://earnestresearch.com