LearnPyspark

The PySpark course repo.

Let's learn PySpark!

Set up the environment

  1. Set up Scala 2.11.8:
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin
  2. Set up Spark 2.1.0:
export SPARK_HOME=/path/to/your-spark/spark-2.1.0-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin
export PYTHONPATH=$SPARK_HOME/python/lib/pyspark.zip:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON=ipython
  3. Start PySpark from the shell or from Jupyter (a quick sanity check follows this list):
  • pyspark shell
pyspark
  • Jupyter (see notebook.sh)
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook" 
pyspark
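Once the shell is up, a sanity check like the one below confirms the installation. This is a minimal sketch of our own (the file name sanity_check.py and the app name are arbitrary, not part of the course); it can also be pasted into the pyspark shell, where spark and sc already exist, skipping the import and builder lines. Run it standalone with spark-submit sanity_check.py.

# sanity_check.py -- verify the environment set up above.
from pyspark.sql import SparkSession

# In the pyspark shell this step is unnecessary: `spark` and `sc` are predefined.
spark = SparkSession.builder.appName("sanity-check").getOrCreate()
sc = spark.sparkContext

print("Spark version:", sc.version)        # expect 2.1.0

# Tiny RDD job: sum the numbers 0..99.
print(sc.parallelize(range(100)).sum())    # 4950

spark.stop()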

Course syllabus

  • Week 1 - Introduction to Spark and RDDs (Roger)
  • Week 2 - Spark DataFrames, SQL, and interacting with Hive (冠穎、采襄); a small RDD vs. DataFrame preview follows this list
  • Week 3 - Spark configuration: partitions, YARN mode, and so on (Miles)
  • Week 4~ - Case study (hackathon, digital, Hive tables, text, or working with Kafka)
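As a taste of Weeks 1 and 2, here is a small preview sketch (the sample data and names are ours, not course material) showing the same data handled first with the RDD API and then as a DataFrame queried with SQL:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("preview").getOrCreate()
sc = spark.sparkContext

# Week 1 flavor: low-level RDD transformations.
words = sc.parallelize(["spark", "rdd", "dataframe", "sql"])
print(words.map(lambda w: (w, len(w))).collect())

# Week 2 flavor: the same data as a DataFrame, queried with SQL.
df = spark.createDataFrame([(w, len(w)) for w in ["spark", "rdd", "dataframe", "sql"]],
                           ["word", "length"])
df.createOrReplaceTempView("words")
spark.sql("SELECT word, length FROM words WHERE length > 3").show()

spark.stop()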