Tips and Tricks
This repo contains a random collection of Spark code, written mostly in Python (using the PySpark API). I have also included code and scripts in Scala and SparkR. Feel free to copy and use as-is. Let me know if you have any questions or feedback regarding any of the code.
Zeppelin Notebook Hub (can be used to view Zeppelin notebooks in JSON format): https://www.zeppelinhub.com/viewer/
Spark Tuning & Best Practices Reference: https://github.com/zaratsian/HDP_Tuning_Unofficial
Spark Tuning Tool: https://github.com/zaratsian/Spark/blob/master/spark_tuning_tool.py
Machine Learning Cheatsheets:
• SKLearn - Choosing the right estimator
• Keras Cheatsheet
• SAS - ML Algorithms
• MS Azure - ML Algorithms
• Kaggle ML Solutions
References:
• Apache Spark Quickstart
• Spark PySpark (Python) API
• Databricks - Guide
• Databricks - Developer Resources
• Spark Tuning Guide
• Spark Tuning - Garbage Collection
• Hortonworks - Spark Reference
• Anaconda Hortonworks Management Packs
• Apache Spark - Best Practices & Tuning
• PySpark Cheatsheet