/Spark

Apache Spark (Scala, PySpark, SparkR) Code, Tricks, and References

Tips and Tricks

This repo contains an assorted collection of Spark code, written mostly in Python (using the PySpark API). I have also included code and scripts in Scala and SparkR. Feel free to copy and use as-is. Let me know if you have any questions or feedback regarding any of the code.

Zeppelin Notebook Hub (a viewer for Zeppelin notebooks saved in JSON format): https://www.zeppelinhub.com/viewer/

Spark Tuning & Best Practices Reference: https://github.com/zaratsian/HDP_Tuning_Unofficial
Spark Tuning Tool: https://github.com/zaratsian/Spark/blob/master/spark_tuning_tool.py

Machine Learning Cheatsheets:
    • SKLearn - Choosing the right estimator
    • Keras Cheatsheet
    • SAS - ML Algorithms
    • MS Azure - ML Algorithms
    • Kaggle ML Solutions

References:
    • Apache Spark Quickstart
    • Spark PySpark (Python) API
    • Databricks - Guide
    • Databricks - Developer Resources
    • Spark Tuning Guide
    • Spark Tuning - Garbage Collection
    • Hortonworks - Spark Reference
    • Anaconda Hortonworks Management Packs
    • Apache Spark - Best Practices & Tuning
    • PySpark Cheatsheet